
MIB Splash

A benchmark for systematic comparison of featurization and localization methods.
circuits · causal variables · localization · featurization · faithfulness · interchange interventions · SAEs · counterfactuals


Overview

This repository documents the Mechanistic Interpretability Benchmark (MIB). Here, you can find links to the code for both tracks, information about and links to our datasets, information about the MIB leaderboard, instructions for submission, and a citation that we ask you to use if you use any of the resources provided here or at the links below.

Tracks

MIB contains two tracks. The circuit localization track benchmarks methods that aim to locate graphs of causal dependencies in neural networks. The causal variable localization track benchmarks methods that aim to locate specific human-interpretable causal variables in neural networks.

Circuit Localization

Overview of the circuit localization track.

This track benchmarks circuit discovery methods—i.e., methods for locating graphs of causal dependencies in neural networks. Most circuit discovery pipelines look something like this:

  1. Compute importance scores for each component or each edge between components.
  2. Ablate all components from the network except those that surpass some importance threshold, or those in the top k%.
  3. Evaluate how well the circuit (model with only the most important components not ablated) performs, or replicates the full model's behavior.
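As a rough illustration of these three steps, here is a minimal sketch of a threshold-based discovery loop. The helper names, the random scoring rule, and the top-k cutoff are placeholders rather than the benchmark's reference implementation; the actual tooling lives in the circuit localization repository.

```python
# Minimal sketch of steps 1-3 with placeholder helpers; not the MIB reference code.
import numpy as np

def score_edges(model, dataset):
    """Step 1: assign an importance score to every edge.
    Real methods use e.g. attribution or activation patching;
    here we return random scores purely for illustration."""
    rng = np.random.default_rng(0)
    return {f"edge_{i}": rng.normal() for i in range(100)}

def build_circuit(edge_scores, top_k_percent=10.0):
    """Step 2: keep the top-k% highest-scoring edges; everything else
    would be ablated (e.g. mean-ablated or patched with counterfactuals)."""
    cutoff = np.percentile(list(edge_scores.values()), 100 - top_k_percent)
    return {edge for edge, score in edge_scores.items() if score >= cutoff}

# Step 3 would run the ablated model and measure (a) task performance and
# (b) agreement with the full model's behavior -- see below.
scores = score_edges(model=None, dataset=None)
circuit = build_circuit(scores, top_k_percent=10.0)
print(f"circuit keeps {len(circuit)} of {len(scores)} edges")
```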

In the circuit localization track's repository, we provide code for discovering and evaluating circuits.

Notice that step (3) mentioned two distinct evaluation criteria: how well the circuit performs, and how well it replicates the full model's behavior. Past work often implicitly conflates these two, whether by discovering a circuit using one criterion and then evaluating it with the other, or by not clarifying the precise goal. We believe these are complementary but separate concepts, so we split them into two separate evaluation metrics. See the circuit localization repo or the paper for more details.
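To make the distinction concrete, here is one illustrative way to operationalize the two criteria. These are not MIB's official metrics (those are defined in the repo and the paper); this sketch only shows why the two questions can come apart.

```python
# Illustrative only: "does the circuit do the task?" vs. "does it match the model?"
import torch
import torch.nn.functional as F

def task_performance(circuit_logits, correct_token_ids):
    """Criterion 1: how well the circuit performs the task (here, accuracy)."""
    return (circuit_logits.argmax(dim=-1) == correct_token_ids).float().mean().item()

def behavioral_agreement(circuit_logits, full_model_logits):
    """Criterion 2: how closely the circuit replicates the full model's behavior
    (here, KL divergence between output distributions; lower means closer)."""
    return F.kl_div(
        F.log_softmax(circuit_logits, dim=-1),
        F.log_softmax(full_model_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    ).item()
```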

Causal Variable Localization

Overview of the causal variable localization track.

This track benchmarks featurization methods—i.e., methods for transforming model activations into a space where it's easier to isolate a given causal variable. Most pipelines under this paradigm look like this:

  1. Curate a dataset of contrastive pairs, where each pair differs only with respect to the targeted causal variable.
  2. If using a supervised method, train the featurization method using the contrastive pairs.
  3. To evaluate: feed the model an input from a pair, use the featurizer to transform an activation vector, intervene in the transformed space, transform back out, and see whether the model's new behavior aligns with what is expected under the intervention.
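As a shape-of-the-computation sketch (the real API, hook points, and featurizer classes are defined in the track repository), an interchange intervention through a linear featurizer might look roughly like this:

```python
# Sketch of an interchange intervention through a (linear) featurizer.
# All names and shapes here are illustrative assumptions, not the track's API.
import torch

d_model, d_feature = 64, 8
featurizer = torch.nn.Linear(d_model, d_feature, bias=False)          # into feature space
inverse_featurizer = torch.nn.Linear(d_feature, d_model, bias=False)  # back out

def interchange(base_act, counterfactual_act):
    """Swap the targeted variable's feature-space value from the counterfactual
    run into the base run, leaving the rest of the activation untouched."""
    base_feat = featurizer(base_act)
    counter_feat = featurizer(counterfactual_act)
    return base_act - inverse_featurizer(base_feat) + inverse_featurizer(counter_feat)

base = torch.randn(d_model)
counterfactual = torch.randn(d_model)
patched = interchange(base, counterfactual)
# `patched` would be written back into the model at the chosen layer and token
# position, and the model's new output compared against the behavior expected
# under the intervention.
```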

In the causal variable localization track's repository, we provide code for training and evaluating featurizers.

We provide the results from our baseline experiments in this Google Drive folder. These results were generated using the scripts provided in this repo.

Data and Models

Our benchmark consists of five datasets: IOI, MCQA, Arithmetic, ARC, and RAVEL. These were chosen to represent (1) a mixture of commonly studied and unstudied tasks, (2) tasks of varying formats, and (3) tasks of varying difficulty levels. Each dataset comes with a train, validation, and public test set. We also hold out a private test set, which can only be evaluated by submitting to the leaderboard.
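If you want to inspect the splits programmatically, the Hugging Face `datasets` library is the natural route. Note that the dataset identifier below is only an assumption for illustration; check the track repositories for the exact dataset names and configurations.

```python
# The repo id below is a placeholder/assumption; see the track repos for real names.
from datasets import load_dataset

dataset = load_dataset("mib-bench/ioi")   # hypothetical identifier
print(dataset)                            # lists the available splits
print(dataset["train"][0])                # inspect one training example
```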

Mechanistic interpretability comparisons are only valid for a given task-and-model pair. Thus, we choose four realistic language models of varying sizes and capability levels to standardize comparisons: GPT-2 Small, Qwen-2.5 (0.5B), Gemma-2 (2B), and Llama-3.1 (8B). We also include an InterpBench model, which we train to encode a ground-truth circuit for the IOI task.

InterpBench: A Ground-truth Circuit

Language models, when trained with a next-token prediction objective, encode unpredictable mechanisms and concepts. This makes it difficult to define metrics that capture any notion of ground truth. For the circuit localization track, we therefore also include an InterpBench model, which we train to encode a known mechanism that we specify. Because we know the ground-truth nodes and edges of this circuit, we can compute more meaningful metrics, like the AUROC!
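Concretely, if every edge carries a binary ground-truth label (in or out of the trained-in circuit) and a discovery method assigns it an importance score, AUROC falls out directly. A toy example with scikit-learn (the numbers below are made up, not InterpBench data):

```python
# Toy AUROC computation: ground-truth edge membership vs. a method's edge scores.
from sklearn.metrics import roc_auc_score

in_circuit = [1, 1, 1, 0, 0, 0, 0, 0]                    # 1 = edge is in the known circuit
edge_scores = [0.9, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1, 0.05]  # a method's importance scores

print(roc_auc_score(in_circuit, edge_scores))  # 1.0 here: every true edge outranks every non-edge
```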

For more detail on how InterpBench models are trained, please see the InterpBench paper.

Leaderboard

To encourage participation, we have created a leaderboard, hosted on HuggingFace. This leaderboard shows scores on the private test set. We have set a strict rate limit of 2 submissions per user per week to discourage hill-climbing on the private test set.

Our hope is that the public validation and test sets will enable fast iteration on mechanistic interpretability methods, while the private test set will remain a more stable and meaningful measure of the state of the art.

Submission

Submit to the MIB leaderboard at this link. You will need the following:

  • Circuit localization: 9 circuits per model/task combination. These should be of varying sizes and satisfy the criteria described in the circuit localization repo. See here for an example submission with one .pt file per model/task, and here for an example submission with separate circuit files for each circuit size threshold per model/task. See the circuit localization track repository for more details on the format of these files; a loosely hedged sketch of one possible layout follows this list.
  • Causal variable localization: a featurization function that follows the API format specified in the causal variable localization repo, a token position function specifying where the featurizer should be applied, and a folder containing trained (inverse) featurizers and token indices. See here for an example submission, and see the causal variable localization track repository for more details on the format of these files. The Jupyter notebook MIB-causal-variable-track/ioi_example_submission.ipynb shows how to produce the trained featurizer and token-indices files for the IOI task (the track repository also includes an example covering the rest of the tasks), and the README MIB-causal-variable-track/README.md explains how to structure the folder for a submission to this track.
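As a loose illustration of the circuit-track layout only (treat every name, threshold value, and the saved dictionary's schema below as assumptions; the authoritative spec is in the circuit localization repository):

```python
# Hypothetical example of saving nine circuits of varying sizes as .pt files.
# The real file schema and naming requirements are defined in the circuit
# localization track repository; this only conveys the general shape.
import torch

def discover_circuit(model_task, keep_fraction):
    """Stand-in for your circuit discovery method; returns a set of edge names."""
    n_edges = 100
    return {f"edge_{i}" for i in range(max(1, int(keep_fraction * n_edges)))}

model_task = "gpt2_ioi"                                                  # placeholder identifier
keep_fractions = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5]  # nine sizes

for k in keep_fractions:
    circuit = {"edges": sorted(discover_circuit(model_task, k))}
    torch.save(circuit, f"{model_task}_k{k}.pt")                         # one file per size
```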

The leaderboard submission portal will verify that your submission is in the correct format. For the circuit localization track, this is done on our backend.

For the causal variable localization track, please ensure that your submission is valid using our automated submission checker script. Once you have verified it, please provide the requested HF repository linking to your files. This should be a model repository, not a dataset repository.

Citation

If you use any of the MIB datasets or code, please cite our paper:

@article{mib-2025,
	title = {{MIB}: A Mechanistic Interpretability Benchmark},
	author = {Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv{\'a}n Arcuschin and Adam Belfki and Yik Siu Chan and Jaden Fiotto-Kaufman and Tal Haklay and Michael Hanna and Jing Huang and Rohan Gupta and Yaniv Nikankin and Hadas Orgad and Nikhil Prakash and Anja Reusch and Aruna Sankaranarayanan and Shun Shao and Alessandro Stolfo and Martin Tutek and Amir Zur and David Bau and Yonatan Belinkov},
	year = {2025},
	journal = {CoRR},
	volume = {arXiv:2504.13151},
	url = {https://arxiv.org/abs/2504.13151v1}
}

License

We release the content in this repository and all sub-repositories under an Apache 2.0 license.
