This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can work in different configurations. Finally, we investigate how a single configurable model can be used to perform both first-pass streaming recognition and second-pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs. latency trade-off than fixed masking, both with and without FastEmit. We also show that variable masking improves accuracy by up to 8% relative in the acoustic rescoring scenario.
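The difference between the three masking schemes described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `left`/`right`/`left_chunks` parameters, and the uniform sampling over chunk sizes are all assumptions made for illustration, since the abstract only says that masks are sampled from a target distribution. In each mask, entry `[q, k]` is `True` when query frame `q` may attend to key frame `k`.

```python
import numpy as np


def fixed_mask(num_frames, left, right):
    """Same window at every frame: `left` past and `right` future frames.

    Hypothetical helper; the parameter names are illustrative assumptions.
    """
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # key index minus query index
    return (offset >= -left) & (offset <= right)


def chunked_mask(num_frames, chunk_size, left_chunks):
    """Window set by chunk boundaries: a frame sees its whole chunk
    plus `left_chunks` preceding chunks (an assumed parameterisation)."""
    chunk = np.arange(num_frames) // chunk_size  # chunk id of each frame
    diff = chunk[:, None] - chunk[None, :]  # query chunk minus key chunk
    return (diff >= 0) & (diff <= left_chunks)


def sample_variable_mask(num_frames, chunk_sizes, rng, left_chunks=1):
    """Draw a chunk size for one training step.

    Uniform sampling is an assumption, not the distribution from the paper.
    """
    chunk_size = int(rng.choice(chunk_sizes))
    return chunk_size, chunked_mask(num_frames, chunk_size, left_chunks)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(fixed_mask(4, left=1, right=0).astype(int))
    print(chunked_mask(4, chunk_size=2, left_chunks=0).astype(int))
    size, mask = sample_variable_mask(8, [1, 2, 4], rng)
    print(size, mask.shape)
```

With a fixed mask, the look-ahead is identical for every frame. With a chunked mask, frames early in a chunk see more future context than frames at the end of it, while no frame ever looks past its chunk boundary.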