Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
Remember the BLOOM 🌸 model? Now there are BLOOM datasets: multimodal, multilingual datasets covering 363 languages across 32 language families 💪!
Four datasets are released:
* bloom-lm for language modeling in 351 languages;
* bloom-captioning for image-to-text or text-to-image tasks in 351 languages;
* bloom-vist for visual storytelling in 351 languages;
* bloom-speech for speech-to-text and text-to-speech tasks in 56 languages.
See the original paper for full details on the collection process and the datasets.
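If you want to try one of the four datasets, loading might look roughly like the sketch below. The dataset names, tasks, and language counts come from the list above. Everything else is an assumption for illustration: the `sil-ai` organisation prefix, the helper names `repo_id` and `load_bloom`, and the idea that each language is exposed as a separate configuration. Check the dataset cards for the actual repository IDs, configuration names, and any access terms before relying on it.

```python
# Names, tasks, and language counts are taken from the announcement list.
BLOOM_DATASETS = {
    "bloom-lm": {"task": "language modeling", "languages": 351},
    "bloom-captioning": {"task": "image-to-text / text-to-image", "languages": 351},
    "bloom-vist": {"task": "visual storytelling", "languages": 351},
    "bloom-speech": {"task": "speech-to-text / text-to-speech", "languages": 56},
}

# ASSUMPTION: the hosting organisation on the Hugging Face Hub.
HUB_ORG = "sil-ai"


def repo_id(name: str) -> str:
    """Build a Hub repository ID for one of the four datasets (hypothetical helper)."""
    if name not in BLOOM_DATASETS:
        raise KeyError(f"unknown dataset {name!r}; expected one of {sorted(BLOOM_DATASETS)}")
    return f"{HUB_ORG}/{name}"


def load_bloom(name: str, language_config: str, split: str = "train"):
    """Load one language configuration of a dataset (hypothetical helper).

    ASSUMPTION: each language is a separate configuration whose name is
    documented on the dataset card. The import is done lazily so the rest
    of this sketch runs without the `datasets` library installed.
    """
    from datasets import load_dataset

    return load_dataset(repo_id(name), language_config, split=split)


if __name__ == "__main__":
    # Print a short overview without touching the network.
    for name, info in BLOOM_DATASETS.items():
        print(f"{repo_id(name)}: {info['task']} ({info['languages']} languages)")
```

Calling `load_bloom("bloom-lm", language_config)` with a configuration name taken from the dataset card would then download that language's split. The configuration name is deliberately left as a parameter here because the announcement does not list any.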