Spark in me - Internet, data science, math, deep learning, philosophy(@snakers4). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
- [Illustration](https://scontent-arn2-1.xx.fbcdn.net/v/t39.2365-6/271815807_4636921079718503_8613393990345138136_n.gif?_nc_cat=107&ccb=1-5&_nc_sid=ad8a9d&_nc_ohc=yn27DielBOYAX8rk045&_nc_ht=scontent-arn2-1.xx&oh=00_AT8ueSOOllDdunQw26KIBUYwyoOq_b1leSPKrmSfZoeazA&oe=61F26871)
- [Link](ai.facebook.com/blog/th…and-text)
- These are actually 3 separate models (!) - marketing lies as usual
- Compute is not clearly stated: the NLP model uses 16 GPUs, the others are not specified
- The first high-performance self-supervised algorithm that works for speech, vision, and text
- Trained by predicting the model representations of the full input data given a partial view of the input
- Standard Transformer architecture with a modality-specific encoding
- The encoding of the unmasked training sample is parameterized by an exponential moving average (EMA) of the model parameters
- Training targets are built from the outputs of the top K blocks of the teacher network, for the time-steps that are masked in student mode
- Each block's output is normalized before the top K blocks are averaged
- For speech representations, we use instance normalization
- For NLP and vision we found parameter-less layer normalization to be sufficient
- 800 epochs, 86M parameters and 307M parameters
- Smooth L1 loss

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- arXiv 2106.07447
- Offline clustering step to provide aligned target labels for a BERT-like prediction loss
- The prediction loss is applied over the masked regions only
- Relies on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels
- Acoustic unit discovery models provide frame-level targets
- How to mask and where to apply the prediction loss:
  - p% of the timesteps are randomly selected as start indices, and spans of l steps are masked
  - Cross-entropy loss is computed over masked and unmasked timesteps and weighted by the α parameter
  - α = 1 (loss on masked timesteps only) is more resilient to the quality of cluster targets, which is demonstrated in our experiments
- Multiple clusterings, iterative refinement starting with MFCC
- Convolutional waveform encoder, a BERT encoder, a projection layer and a code embedding layer
- BASE, LARGE, and X-LARGE - 95M, 317M, 964M
- ![image](user-images.githubusercontent.com/1251544…24ce.png)
- Convolutional encoder generates a feature sequence at a 20ms framerate for audio sampled at 16kHz (CNN encoder down-sampling factor is 320x)
- After pre-training, CTC loss is used for ASR fine-tuning of all model weights except the convolutional audio encoder, which remains frozen
- CTC target vocabulary includes 26 English chars + space + apostrophe + CTC blank
- 960h of LibriSpeech + 60kh of Libri-light
- First iteration labels: 960 hour LibriSpeech training set, k-means clustering with 100 clusters on 39-dimensional MFCC features (13 coefficients plus their first- and second-order derivatives)
- For the subsequent iterations, k-means clustering with 500 clusters on the latent features from the HuBERT model pre-trained in the previous iteration
- MiniBatchKMeans
- BASE - two iterations on the 960h on 32 GPUs (batch size of at most 87.5 seconds of audio per GPU), 250k steps
- LARGE and X-LARGE - one iteration on 60kh on 128 and 256 GPUs, respectively, for 400k steps

#digest
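
The HuBERT masking and loss weighting noted above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the helper names (`span_mask`, `weighted_loss`, `num_frames`) are hypothetical, `p = 0.08` and `l = 10` are illustrative values not taken from the notes, the loss is written as `alpha * L_masked + (1 - alpha) * L_unmasked` with per-frame cross-entropy summed over frames, and the uniform "predictions" are dummy data.

```python
import math
import random

SAMPLE_RATE_HZ = 16_000  # audio sampling rate from the notes
DOWNSAMPLING = 320       # CNN encoder down-sampling factor from the notes

# 320 samples per frame at 16 kHz gives the 20 ms frame rate mentioned above.
FRAME_MS = 1000 * DOWNSAMPLING / SAMPLE_RATE_HZ


def num_frames(num_samples):
    """Approximate frame count; ignores convolution edge effects (assumption)."""
    return num_samples // DOWNSAMPLING


def span_mask(num_steps, p, span_len, rng):
    """Pick about p * num_steps start indices and mask span_len steps from each.

    Spans may overlap and are clipped at the end of the sequence.
    """
    num_starts = max(1, round(p * num_steps))
    starts = rng.sample(range(num_steps), num_starts)
    mask = [False] * num_steps
    for start in starts:
        for t in range(start, min(start + span_len, num_steps)):
            mask[t] = True
    return mask


def weighted_loss(frame_probs, targets, mask, alpha):
    """alpha * CE over masked frames + (1 - alpha) * CE over unmasked frames."""
    loss_masked = 0.0
    loss_unmasked = 0.0
    for probs, target, is_masked in zip(frame_probs, targets, mask):
        ce = -math.log(probs[target])  # cross-entropy for one frame
        if is_masked:
            loss_masked += ce
        else:
            loss_unmasked += ce
    return alpha * loss_masked + (1 - alpha) * loss_unmasked


# Dummy example: one second of audio, 4 cluster labels, uniform predictions.
rng = random.Random(0)
steps = num_frames(SAMPLE_RATE_HZ)
mask = span_mask(steps, p=0.08, span_len=10, rng=rng)
frame_probs = [[0.25, 0.25, 0.25, 0.25] for _ in range(steps)]
targets = [0] * steps
loss_masked_only = weighted_loss(frame_probs, targets, mask, alpha=1.0)
print(FRAME_MS, steps, sum(mask), loss_masked_only)
```

With `alpha=1.0` the unmasked frames contribute nothing, which matches the masked-only setting the notes describe as more resilient to noisy cluster targets.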