PIXEL
Tokenization is one of the main preprocessing steps applied to text before it is fed to a model. Language models are defined over a finite vocabulary of inputs, which creates a vocabulary bottleneck: a tokenization scheme is hard to extend when you want to work with many languages. How can this problem be overcome?
We can treat text as an image and work with patches instead of tokens, as in image processing. The idea is based on ViT-MAE, a Transformer-based encoder-decoder trained to reconstruct the pixels in masked image patches. The text is rendered as an image and passed to the encoder as patches, without going through a token embedding layer tied to a vocabulary. In theory, such a model can therefore support any language, including text with emojis or other pictorial symbols (as in the screenshot), because any patch can be masked and reconstructed regardless of what it depicts.
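To make the patch idea concrete, here is a minimal sketch in plain Python of the two steps that replace tokenization: cutting a rendered strip of text into square patches and choosing some of them to mask. It is not the PIXEL code. The synthetic pixel values stand in for a real rendered image, and the names `patchify` and `random_mask`, the 16-pixel patch size, and the 0.25 mask ratio are assumptions chosen for illustration.

```python
import random

def patchify(image, patch):
    """Split a 2D grayscale image (a list of rows) into non-overlapping
    patch x patch squares, scanning left to right, top to bottom."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches

def random_mask(num_patches, ratio, seed=0):
    """Pick the indices of the patches to hide (hypothetical helper).
    Uniform random sampling is used here purely for simplicity."""
    rng = random.Random(seed)
    num_masked = int(num_patches * ratio)
    return sorted(rng.sample(range(num_patches), num_masked))

# Stand-in for a rendered line of text: a strip 16 px tall and 128 px wide.
# A real pipeline would draw glyphs here; these values are synthetic.
HEIGHT, WIDTH, PATCH = 16, 128, 16
image = [[(x * 7 + y * 3) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]

patches = patchify(image, PATCH)                 # the "tokens" of the image
masked = random_mask(len(patches), ratio=0.25)   # patches the model must reconstruct
visible = [p for i, p in enumerate(patches) if i not in masked]

print(len(patches), "patches,", len(masked), "masked,", len(visible), "visible")
```

Nothing in this sketch depends on a vocabulary: whatever was drawn into the strip, whether Latin letters, another script, or an emoji, ends up as the same kind of pixel patch.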
The paper: https://arxiv.org/pdf/2207.06991.pdf
The code: https://github.com/xplip/pixel
Pretrained models on 🤗: https://huggingface.co/Team-PIXEL