Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden-Unit BERT (HuBERT) have enabled generating lexical and acoustic representations that benefit speech recognition applications. We investigated the use of pre-trained model representations for estimating dimensional emotions, such as activation, valence, and dominance, from speech. We observed that while valence may rely heavily on lexical representations, activation and dominance rely mostly on acoustic information. In this work, we used multi-modal fusion representations from pre-trained models to achieve state-of-the-art speech emotion estimation, showing a 100% and a 30% relative improvement in concordance correlation coefficient (CCC) on valence estimation compared to standard acoustic and lexical baselines, respectively. Finally, we investigated the robustness of pre-trained model representations to noise and reverberation degradation and observed that lexical and acoustic representations are impacted differently. We found that lexical representations are more robust to distortions than acoustic representations, and we demonstrated that knowledge distillation from a multi-modal model helps improve the noise robustness of acoustic-based models.
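For readers unfamiliar with the evaluation metric, the concordance correlation coefficient rewards predictions that match the labels in both correlation and scale. The following is a minimal NumPy sketch of the standard CCC formula, not the authors' evaluation code:

```python
import numpy as np

def ccc(pred: np.ndarray, label: np.ndarray) -> float:
    """Concordance correlation coefficient between predictions and labels."""
    m_p, m_l = pred.mean(), label.mean()
    v_p, v_l = pred.var(), label.var()              # population variance
    cov = ((pred - m_p) * (label - m_l)).mean()     # population covariance
    # Penalizes both decorrelation and mean/scale mismatch.
    return 2 * cov / (v_p + v_l + (m_p - m_l) ** 2)
```

A CCC of 1 means perfect agreement, so a relative improvement in CCC directly reflects better agreement with the annotated emotion dimensions.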
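To make the multi-modal fusion idea concrete, here is an illustrative PyTorch sketch of one common approach: concatenating utterance-level acoustic and lexical embeddings and regressing the three emotion dimensions. The architecture, layer sizes, and 768-dimensional embeddings are assumptions for illustration, not the paper's actual model:

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Illustrative late-fusion head: concatenate acoustic and lexical
    embeddings, then regress activation, valence, and dominance."""
    def __init__(self, acoustic_dim: int = 768, lexical_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(acoustic_dim + lexical_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 3),  # activation, valence, dominance
        )

    def forward(self, acoustic_emb: torch.Tensor,
                lexical_emb: torch.Tensor) -> torch.Tensor:
        # acoustic_emb: e.g. mean-pooled HuBERT features for the utterance
        # lexical_emb: e.g. a BERT sentence embedding of the transcript
        return self.head(torch.cat([acoustic_emb, lexical_emb], dim=-1))
```

A fused input of this kind gives the regressor access to both lexical cues (which the findings above tie to valence) and acoustic cues (which drive activation and dominance).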