The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage and performs at competitive speed when compared to the default batched computation. As a highlight, we manage to compute the transducer loss and gradients for a batch size of 1024 and an audio length of 40 seconds, using only 6 GB of memory.
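To give a rough sense of why batched transducer training is memory-hungry, the sketch below estimates the size of the joint network output, which in a standard transducer has one score per (frame, label position, vocabulary entry) for every sample. The function name and all concrete values (1000 encoder frames for a 40-second utterance, 100 labels, a vocabulary of 1024, 32-bit floats) are illustrative assumptions, not settings taken from the work above, and the estimate ignores every other tensor held during training, so it is not expected to reproduce the 6 GB figure.

```python
def joint_tensor_bytes(batch, frames, labels, vocab, bytes_per_value=4):
    """Bytes for a joint output of shape (batch, frames, labels + 1, vocab).

    The "+ 1" accounts for the start-of-sequence position that the
    prediction network consumes in a standard transducer.
    """
    return batch * frames * (labels + 1) * vocab * bytes_per_value


# Hypothetical setup: these numbers are assumptions for illustration only.
BATCH = 1024     # batch size mentioned in the abstract
FRAMES = 1000    # assumed encoder frames for 40 s of audio (40 ms per frame)
LABELS = 100     # assumed label sequence length
VOCAB = 1024     # assumed output vocabulary size

# Materializing the joint output for the whole batch at once.
batched_bytes = joint_tensor_bytes(BATCH, FRAMES, LABELS, VOCAB)

# Materializing it for one sample at a time, as in a sample-wise scheme.
per_sample_bytes = joint_tensor_bytes(1, FRAMES, LABELS, VOCAB)

GIB = 1024 ** 3
print(f"batched joint output:    {batched_bytes / GIB:.1f} GiB")
print(f"per-sample joint output: {per_sample_bytes / GIB:.2f} GiB")
```

Under these assumptions the per-sample tensor is smaller than the batched one by exactly the batch size, which is the intuition behind processing the loss and gradients one sample at a time.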