Towards NLP (@towards_nlp)

GALACTICA

The number of papers being published every month, week, and even day is now overwhelming. In May 2022, an average of 516 papers per day were submitted to arXiv. How nice would it be to have a tool that helps researchers find papers for review more precisely, summarize them, and organize research better? Now it is possible💪

Researchers from Meta AI introduced a new language model, Galactica. What makes this model so good at working with equations, chemical sequences, references, code, plain text, and other symbolic sequences?

* Dataset: The Galactica Corpus. It contains 48 million papers; 106 billion tokens in total from papers, reference material, encyclopedias, and other scientific sources.
* Tokenization: specialized tokenization with separate wrapping tokens for each type of sequence: citations, mathematics, chemical sequences, and others.
* Working Memory Token: the chain-of-thought concept was introduced recently. In this work, the authors go further and add a working memory token, <work>, that wraps the step-by-step reasoning part of a prompt.
* Prompt Pre-Training (similar to FLAN) on a range of tasks: QA, summarization, NER, reasoning, dialogue, and others.
* Architecture: a Transformer in a decoder-only setup.

With the demo, you could search by reference, by a short description of a paper's main idea, or even by formula, and ask for a summary. Thanks to the Twitter community, the demo is now shut down🫣 However, as always, the research presented is still interesting in itself. In the meantime, we will wait to test the model at its full power again.

[link] The main page
[link] The paper about Galactica LLM
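To make the tokenization and working memory ideas from the post more concrete, here is a minimal sketch of wrapping each sequence type in its own pair of marker tokens. The `<work>` token is named in the post; the closing `</work>` form, the other marker strings, and the helper names `wrap` and `build_prompt` are assumptions made for illustration only, not the model's confirmed vocabulary or API.

```python
from __future__ import annotations

# Illustrative marker pairs. Only "<work>" appears in the post; the closing
# form and every other string here are assumed placeholders for this sketch.
MARKERS = {
    "citation": ("[START_REF]", "[END_REF]"),
    "chemistry": ("[START_SMILES]", "[END_SMILES]"),
    "work": ("<work>", "</work>"),
}


def wrap(kind: str, text: str) -> str:
    """Surround text with the open/close markers for its sequence type."""
    start, end = MARKERS[kind]
    return f"{start}{text}{end}"


def build_prompt(question: str, steps: list[str]) -> str:
    """Place the step-by-step reasoning inside the working memory markers."""
    reasoning = "\n".join(steps)
    return f"Question: {question}\n\n{wrap('work', reasoning)}\n\nAnswer:"


if __name__ == "__main__":
    print(wrap("citation", "Attention Is All You Need"))
    print(build_prompt(
        "What is the average of 43, 29, 51, 13?",
        ["43 + 29 + 51 + 13 = 136", "136 / 4 = 34"],
    ))
```

The point of the design is that the model sees an explicit boundary around each modality and around its intermediate reasoning, so the final answer can be read off after the closing marker.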