In embedding-matching acoustic-to-word (A2W) ASR, every word in the vocabulary is represented by a fixed-dimension embedding vector that can be added or removed independently of the rest of the system. The approach is potentially an elegant solution to the dynamic out-of-vocabulary (OOV) word problem, where speaker- and context-dependent named entities such as contact names must be incorporated into the ASR on the fly for every speech utterance at test time. Challenges remain, however, in improving the overall accuracy of embedding-matching A2W. In this paper, we contribute two methods that improve the accuracy of embedding-matching A2W. First, we propose internally producing multiple embeddings, instead of a single embedding, at each instant in time, which allows the A2W model to propose a richer set of hypotheses over multiple time segments in the audio. Second, we propose using word pronunciation embeddings rather than word orthography embeddings to reduce ambiguities introduced by words that have more than one pronunciation. We show that these ideas give a significant accuracy improvement, with the same training data and nearly identical model size, in scenarios where dynamic OOV words play a crucial role. On a dataset of queries to a speech-based digital assistant that includes many user-dependent contact names, we observe up to an 18% decrease in word error rate using the proposed improvements.
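The matching idea described above can be illustrated with a small sketch. It is not the paper's model: the class name `EmbeddingVocabulary`, the use of cosine similarity as the matching score, the three-dimensional toy vectors, and the contact name "anaya" are all assumptions chosen for illustration. It shows only the two properties the abstract names: vocabulary entries can be added or removed without touching anything else, and several acoustic embeddings can be scored for the same point in time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class EmbeddingVocabulary:
    """Hypothetical word table; each entry is independent of the others."""

    def __init__(self):
        self._table = {}

    def add(self, word, embedding):
        # Adding a word requires no retraining in this sketch: it is just a new row.
        self._table[word] = list(embedding)

    def remove(self, word):
        self._table.pop(word, None)

    def match(self, acoustic_embeddings):
        """Return the (word, score) pair that best matches any hypothesis embedding."""
        best_word, best_score = None, -math.inf
        for word, emb in self._table.items():
            # Score a word by its best similarity across all hypotheses.
            score = max(cosine(emb, a) for a in acoustic_embeddings)
            if score > best_score:
                best_word, best_score = word, score
        return best_word, best_score


vocab = EmbeddingVocabulary()
vocab.add("call", [1.0, 0.0, 0.0])
vocab.add("mom", [0.0, 1.0, 0.0])

# Two made-up acoustic embeddings emitted for the same point in time.
hypotheses = [[0.0, 0.2, 0.9], [0.3, 0.6, 0.1]]

before, _ = vocab.match(hypotheses)

# A contact name is added on the fly for this utterance (assumed embedding).
vocab.add("anaya", [0.0, 0.1, 1.0])
after, _ = vocab.match(hypotheses)

# Removing it restores the original vocabulary.
vocab.remove("anaya")
restored, _ = vocab.match(hypotheses)
```

In this toy setup, the best match changes once the contact name is in the table and reverts once it is removed, without any other entry being modified. How the actual system produces its embeddings and combines hypotheses across time segments is not specified in the abstract.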