Language models can explain neurons in language models
Can GPT-4 be used to automatically write explanations for the behavior of neurons in large language models, and to score those explanations? The approach has three steps:
* Explain: Generate an explanation of the neuron’s behavior by showing the explainer model (token, activation) pairs from the neuron’s responses to text excerpts.
* Simulate: Use the simulator model to simulate the neuron's activations based on the explanation.
* Score: Automatically score the explanation based on how well the simulated activations match the real activations.
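The three steps above can be sketched in a few lines. This is only a toy illustration, not the released code: `toy_simulate` is a hypothetical keyword-matching stand-in for the simulator model (which in the real pipeline is an LLM prompted with the explanation), and scoring by Pearson correlation is an assumption about what "how well the simulated activations match" means here.

```python
import math
from typing import Sequence, Set


def toy_simulate(tokens: Sequence[str], keywords: Set[str]) -> list[float]:
    """Hypothetical stand-in for the simulator model.

    Treats the explanation as a set of keywords and predicts activation 1.0
    for matching tokens and 0.0 otherwise. A real simulator would be a
    language model reading a natural-language explanation.
    """
    return [1.0 if tok.lower() in keywords else 0.0 for tok in tokens]


def correlation_score(real: Sequence[float], simulated: Sequence[float]) -> float:
    """Pearson correlation between real and simulated activations.

    Assumption: correlation is used as the match metric. Returns 0.0 when
    either sequence is constant, since correlation is undefined there.
    """
    if len(real) != len(simulated):
        raise ValueError("activation sequences must have the same length")
    n = len(real)
    if n < 2:
        raise ValueError("need at least two activations to score")
    mean_r = sum(real) / n
    mean_s = sum(simulated) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_r == 0 or var_s == 0:
        return 0.0
    return cov / math.sqrt(var_r * var_s)


# Made-up (token, activation) pairs for one imaginary neuron.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
real_activations = [0.0, 0.9, 0.1, 0.0, 0.0, 0.8]

# Explain: pretend the explainer concluded "fires on 'cat' and 'mat'".
explanation_keywords = {"cat", "mat"}

# Simulate, then score the explanation against the real activations.
simulated_activations = toy_simulate(tokens, explanation_keywords)
score = correlation_score(real_activations, simulated_activations)
print(f"explanation score: {score:.3f}")
```

A score near 1 means the explanation predicts where the neuron fires; a score near 0 means it carries little information about the neuron's behavior.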
Blog from Closed OpenAI: [link]
Paper: [link]
Code and collected dataset of explanations: [link]
We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.