Scaling Instruction-Finetuned Language Models
TL;DR Additional fine-tuning of T5 or PaLM models on about 1.8k (!) tasks makes them better on evaluation tasks, lets them cover more languages, and helps them generalize better to new, unseen tasks.
The Google Brain team experimented with new methods of fine-tuning Large Language Models. The main recipes for better LLMs:
* the more tasks you have for fine-tuning, the better;
* smarter prompts also help. By "smarter" we mean the use of instructions and Chain-of-thought (see screenshots). In plain language, the more clues you give the model in the request, the more precise the answer you will receive. The Chain-of-thought concept is quite interesting; the original paper on it is here.
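To make the prompt styles above concrete, here is a minimal sketch of how a plain instruction prompt, a few-shot prompt, and a Chain-of-thought prompt could be assembled. The `Exemplar` class, the `build_prompt` helper, and the exact wording of the templates are assumptions made for illustration; they are not the templates used in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Exemplar:
    """One solved example shown to the model (hypothetical helper type)."""
    question: str
    answer: str
    rationale: Optional[str] = None  # worked reasoning, used only for chain-of-thought


def build_prompt(
    instruction: str,
    question: str,
    exemplars: Sequence[Exemplar] = (),
    chain_of_thought: bool = False,
) -> str:
    """Assemble a prompt: instruction, optional solved exemplars, then the question."""
    if chain_of_thought:
        # Ask for intermediate reasoning instead of a bare answer.
        instruction = instruction.rstrip(".") + " by reasoning step-by-step."
    parts = [instruction]
    for ex in exemplars:
        if chain_of_thought and ex.rationale:
            # Show the reasoning first, then the final answer.
            answer = f"{ex.rationale} So the answer is {ex.answer}."
        else:
            answer = ex.answer
        parts.append(f"Q: {ex.question}\nA: {answer}")
    # The model is expected to continue after the final "A:".
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = Exemplar(
        question="Is 7 a prime number?",
        answer="yes",
        rationale="7 is divisible only by 1 and itself.",
    )
    print(build_prompt(
        "Answer the following yes/no question.",
        "Is 9 a prime number?",
        exemplars=[demo],
        chain_of_thought=True,
    ))
```

Calling `build_prompt` with no exemplars and `chain_of_thought=False` gives the simplest case, a zero-shot instruction prompt; each extra argument adds one more kind of "clue" for the model.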
The optimal number of fine-tuning tasks is still an open research question (in their experiments the authors jumped from 282 tasks directly to 1,836 tasks, which leaves quite a gap to explore).
But, in the end, if we want to solve a new task and we write smarter prompts for it, in the same style the model saw during fine-tuning, zero-shot performance improves significantly.
The original paper has all the details and a lot of tables and examples of performance on different tasks.
🤗 model cards: all variations of T5; flan-t5-base for illustration.
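For readers who want to try the flan-t5-base checkpoint, here is a minimal sketch of zero-shot inference with the Hugging Face `transformers` library. It assumes `transformers` and PyTorch are installed and that the checkpoint is published under the Hub id `google/flan-t5-base`. The helper names `zero_shot_prompt` and `generate_answer` and the instruction wording are hypothetical, chosen only for this example.

```python
def zero_shot_prompt(question: str, step_by_step: bool = False) -> str:
    """Wrap a question in an instruction (illustrative wording, not from the paper)."""
    instruction = "Answer the following question"
    if step_by_step:
        # Chain-of-thought variant: ask for the reasoning as well.
        instruction += " by reasoning step-by-step"
    return f"{instruction}.\n{question}"


def generate_answer(
    prompt: str,
    model_name: str = "google/flan-t5-base",  # assumed Hub id of the checkpoint
    max_new_tokens: int = 64,
) -> str:
    """Run one prompt through a seq2seq checkpoint and return the decoded text."""
    # Imported lazily so zero_shot_prompt works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate_answer(zero_shot_prompt("What is the boiling point of water in Celsius?", step_by_step=True))` downloads the weights on first use. The quality of the answer depends on the checkpoint size, so the base model is best treated as a quick illustration rather than a benchmark.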