Towards NLP (@towards_nlp)

Language models can explain neurons in language models

What about using GPT-4 to automatically write explanations for the behavior of neurons in large language models, and to score those explanations?

* Explain: Generate an explanation of the neuron’s behavior by showing the explainer model (token, activation) pairs from the neuron’s responses to text excerpts.
* Simulate: Use the simulator model to simulate the neuron’s activations based on the explanation.
* Score: Automatically score the explanation based on how well the simulated activations match the real activations.

Blog from ~~Closed~~ OpenAI: [link]
Paper: [link]
Code and collected dataset of explanations: [link]
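
To make the simulate and score steps concrete, here is a minimal toy sketch. It is not the released code: `toy_simulator` is a hypothetical stand-in for the simulator model (it just fires on tokens the explanation names), the activations are made-up illustration values, and using Pearson correlation as the "how well do they match" measure is an assumption of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def toy_simulator(explanation_keywords, tokens):
    """Hypothetical simulator: predict 1.0 for tokens the explanation names, else 0.0.

    In the real pipeline a language model reads the explanation text and
    predicts an activation per token; this keyword lookup only mimics that.
    """
    return [1.0 if tok.lower() in explanation_keywords else 0.0 for tok in tokens]


def score_explanation(simulated, real):
    """Score = agreement between simulated and real activations (assumed: correlation)."""
    return pearson(simulated, real)


# Made-up (token, activation) pairs for a neuron that seems to like animal words.
tokens = ["the", "cat", "sat", "on", "the", "dog"]
real_activations = [0.0, 0.9, 0.1, 0.0, 0.0, 0.8]

# Two candidate explanations, reduced to keyword sets for this toy example.
good_simulated = toy_simulator({"cat", "dog"}, tokens)  # "fires on animals"
bad_simulated = toy_simulator({"the"}, tokens)          # "fires on articles"

good_score = score_explanation(good_simulated, real_activations)
bad_score = score_explanation(bad_simulated, real_activations)

print(f"animal explanation score:  {good_score:.3f}")
print(f"article explanation score: {bad_score:.3f}")
```

The explanation that actually describes the neuron gets a score near 1, while the wrong one scores below zero, which is the signal the Score step relies on to rank explanations.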