Digest 2021-11
# Speech
Towards Building ASR Systems for the Next Billion Users
- http://arxiv.org/abs/2111.03945
- 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains (YouTube + newsonair)
- Pretrain several variants of wav2vec style models for 40 Indian languages
- Fine-tune this model for downstream ASR for 9 languages
- Reported WER for Indic ASR is significantly higher and depends sensitively on the available resources: pretraining corpus, fine-tuning data, and task-specific language information
- youtube-dl + py-webrtcvad => 16 8kHz, 1.5 TB in wav format
- 24 GPUs to pretrain, 10-24 to fine-tune; not feasible without English pretraining
- Beam search with KenLM models helps a lot, rescoring helps a bit
- Just adding more annotated data helps a lot
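
Since the results above are reported as WER, a minimal sketch of the metric may be useful: word-level edit distance divided by the number of reference words. The function name `wer` and the whitespace tokenization are assumptions for illustration, not the paper's evaluation script.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()   # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.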
Pseudo-Labeling for Massively Multilingual Speech Recognition
- http://arxiv.org/abs/2111.00161
- Data - CV (3.5k hours) and unlabeled VoxPopuli (VP, 384k hours), 60 languages
- Base model (275M), we also train larger models (1.06B) with CTC loss and classifier
- Base - 16 GPUs with dynamic batching using 200s of audio per batch per GPU
- Large - 64 GPUs with 50s of audio per GPU
- slimIPL, an iterative approach: first a number of updates on labeled data only
- Followed by continuous training using labeled data and pseudo-labeled data stored in a dynamic cache which is periodically updated with pseudo-labels (PLs) re-generated by the current model state
- First fine-tune the trained multilingual model by training only on CV data for that language for 10k updates, and then run slimIPL using the corresponding VP data
- Utterances of VP (average duration of 30s) are much longer than those of CV (average duration of 5.3s)
- TLDR - VP pre-training / tuning improves quality for very low resource languages, but the quality on CV becomes worse for high-resource languages
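
The slimIPL loop noted above (labeled warm-up, then joint training with a dynamic pseudo-label cache) can be sketched as a toy simulation. Everything here is hypothetical and for illustration only: `ToyModel`, `slimipl_sketch`, and the parameters `cache_size`, `replace_prob`, and `unlabeled_per_step` are assumed names, and the refresh rule (replace a used cache entry with some probability) is one plausible reading of "periodically updated", not the paper's exact schedule.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Stand-in for an acoustic model: counts updates instead of learning."""
    steps: int = 0

    def update(self, audio, text):
        self.steps += 1  # a real model would take a CTC gradient step here

    def transcribe(self, audio):
        # The fake pseudo-label records which model state produced it.
        return f"pl@{self.steps}:{audio}"


def slimipl_sketch(model, labeled, unlabeled, warmup=3, steps=20,
                   cache_size=4, replace_prob=0.5, unlabeled_per_step=1, seed=0):
    rng = random.Random(seed)

    # 1) A number of updates on labeled data only.
    for i in range(warmup):
        audio, text = labeled[i % len(labeled)]
        model.update(audio, text)

    # 2) Fill the cache with pseudo-labels from the current model state.
    pool = itertools.cycle(unlabeled)
    cache = [(a, model.transcribe(a)) for a in itertools.islice(pool, cache_size)]

    # 3) Continue training on labeled data plus cached pseudo-labeled data.
    for step in range(steps):
        audio, text = labeled[step % len(labeled)]
        model.update(audio, text)
        for _ in range(unlabeled_per_step):
            idx = rng.randrange(len(cache))
            cached_audio, pseudo_label = cache[idx]
            model.update(cached_audio, pseudo_label)
            # Assumed refresh rule: sometimes swap the used entry for a new
            # utterance labeled by the *current* model state.
            if rng.random() < replace_prob:
                new_audio = next(pool)
                cache[idx] = (new_audio, model.transcribe(new_audio))
    return cache
```

The point of the cache is that pseudo-labels are regenerated on the fly by the model being trained, so no separate teacher model or offline relabeling pass is needed.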
#digest