Spark in me - Internet, data science, math, deep learning, philosophy(@snakers4). Digest 2021-11 # Speech Towards Building ASR Systems for the Next Billion Users

Digest 2021-11

# Speech

Towards Building ASR Systems for the Next Billion Users - http://arxiv.org/abs/2111.03945
- 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains (YouTube + newsonair)
- Pretrain several variants of wav2vec-style models for 40 Indian languages
- Fine-tune these models for downstream ASR in 9 languages
- WER reported for Indic ASR is significantly higher and depends sensitively on the availability of resources: pretraining corpus, fine-tuning data, and task-specific language information
- youtube-dl + py-webrtcvad => 16 8kHz, 1.5 TB in wav format
- 24 GPUs to train, 10-24 to tune, not feasible without English pre-training
- Search with KenLM models helps a lot, rescoring helps a bit
- Just adding more annotated data helps a lot

Pseudo-Labeling for Massively Multilingual Speech Recognition - http://arxiv.org/abs/2111.00161
- Data - CV (3.5k hours) and Vox Populi (384k, unlabeled), 60 languages
- Base model (275M), they also train larger models (1.06B), with CTC loss and classifier
- Base - 16 GPUs with dynamic batching using 200s of audio per batch per GPU
- Large - 64 GPUs with 50s of audio per GPU
- slimIPL, an iterative approach: a number of updates on labeled data, followed by continuous training on labeled data and pseudo-labeled data stored in a dynamic cache, which is periodically updated with pseudo-labels (PLs) re-generated by the current model state
- First fine-tune the trained multilingual model by training only on CV data for that language for 10k updates, then run slimIPL using the corresponding VP data
- Utterances of VP (average duration of 30s) are much longer than those of CV (average duration of 5.3s)
- TLDR - VP pre-training / tuning improves quality for very low-resource languages, but the quality on CV becomes worse for high-resource languages

#digest
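For the "200s of audio per batch per GPU" part, here is a rough sketch of what duration-based dynamic batching can look like. This is not the paper's code: the function name `batch_by_duration`, the `(utt_id, duration)` input format and the sort-then-greedy packing are all my assumptions, just to illustrate the idea that the batch is capped by seconds of audio, not by the number of utterances.

```python
from typing import Iterable, List, Tuple


def batch_by_duration(
    utterances: Iterable[Tuple[str, float]],
    max_seconds: float = 200.0,
) -> List[List[str]]:
    """Greedily pack (utt_id, duration_s) pairs into batches of <= max_seconds.

    Hypothetical helper: an utterance longer than max_seconds still gets
    a batch of its own instead of being dropped.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_total = 0.0
    # Sorting by duration keeps similar-length utterances together (less padding).
    for utt_id, duration in sorted(utterances, key=lambda u: u[1]):
        # Close the current batch if this utterance would exceed the budget.
        if current and current_total + duration > max_seconds:
            batches.append(current)
            current, current_total = [], 0.0
        current.append(utt_id)
        current_total += duration
    if current:
        batches.append(current)
    return batches
```

With ~5s CV clips such a budget holds dozens of utterances per batch, with ~30s VP clips only a handful, which is why a per-duration cap is more convenient here than a fixed batch size.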
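The slimIPL recipe from the second paper (labeled-only updates first, then mixed training with a dynamic cache of pseudo-labels) can be sketched roughly like this. It is a simplified illustration under my own assumptions, not the authors' implementation: `train_step` and `transcribe` are hypothetical callables standing in for the real model, and `p_replace` / `labeled_ratio` are illustrative knobs with made-up defaults.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (audio id or path, transcript)


def slimipl(
    labeled: Sequence[Sample],
    unlabeled: Sequence[str],
    train_step: Callable[[Sample], None],   # hypothetical: one model update
    transcribe: Callable[[str], str],       # hypothetical: current model's PL
    warmup_updates: int,
    total_updates: int,
    cache_size: int,
    p_replace: float = 0.5,      # assumed chance to refresh a used cache entry
    labeled_ratio: float = 0.5,  # assumed share of updates on labeled data
    seed: int = 0,
) -> List[Sample]:
    rng = random.Random(seed)

    # Phase 1: a number of updates on labeled data only.
    for _ in range(warmup_updates):
        train_step(rng.choice(labeled))

    # Phase 2: fill the cache with pseudo-labels from the current model state.
    cache: List[Sample] = []
    for _ in range(cache_size):
        audio = rng.choice(unlabeled)
        cache.append((audio, transcribe(audio)))

    # Phase 3: continuous training on labeled + cached pseudo-labeled data.
    for _ in range(total_updates - warmup_updates):
        if rng.random() < labeled_ratio:
            train_step(rng.choice(labeled))
        else:
            idx = rng.randrange(len(cache))
            train_step(cache[idx])
            # Sometimes swap the used entry for a freshly re-labeled utterance,
            # so the cache tracks the current model state.
            if rng.random() < p_replace:
                audio = rng.choice(unlabeled)
                cache[idx] = (audio, transcribe(audio))
    return cache
```

In the setup described above, `labeled` would be the CV data for one language and `unlabeled` the corresponding VP audio, with the warm-up playing the role of the 10k CV-only updates.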