RAVEL: Evaluating interpretability methods on disentangling language model representations
Individual neurons participate in the representation of multiple high-level concepts. To what
extent can different interpretability methods successfully disentangle these roles? To help
address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in
Language Models), a dataset that enables tightly controlled, quantitative comparisons
between a variety of existing interpretability methods. We use the resulting conceptual
framework to define the new method of Multi-task Distributed Alignment Search (MDAS) …
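
The abstract names Multi-task Distributed Alignment Search (MDAS), but the snippet cuts off before describing it. As background, DAS-style methods learn an orthogonal rotation of a hidden representation and perform interchange interventions on a candidate subspace in the rotated basis. The following is a minimal PyTorch sketch of that core operation only; the class name, dimensions, and use of the orthogonal parametrization are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class SubspaceInterchange(nn.Module):
    """Minimal sketch of a DAS-style interchange intervention (illustrative,
    not the paper's code). A learned orthogonal map R sends hidden states
    into a basis whose first k coordinates are hypothesized to encode one
    attribute (e.g. a city's country). Intervening swaps those coordinates
    from a "source" run into a "base" run, then rotates back."""

    def __init__(self, hidden_dim: int, k: int):
        super().__init__()
        # Parametrize the weight to stay orthogonal during training, so
        # rotating back is just multiplication by the weight itself.
        self.rot = orthogonal(nn.Linear(hidden_dim, hidden_dim, bias=False))
        self.k = k

    def forward(self, h_base: torch.Tensor, h_source: torch.Tensor) -> torch.Tensor:
        r_base, r_source = self.rot(h_base), self.rot(h_source)
        # Swap the candidate attribute subspace (first k rotated coordinates).
        patched = torch.cat([r_source[..., :self.k], r_base[..., self.k:]], dim=-1)
        # Invert the rotation: for orthogonal W, (x @ W.T) @ W == x.
        return patched @ self.rot.weight

# Toy usage: patch a 64-dimensional candidate subspace of 768-d hidden states.
layer = SubspaceInterchange(hidden_dim=768, k=64)
h_base, h_source = torch.randn(2, 768), torch.randn(2, 768)
h_patched = layer(h_base, h_source)  # would be fed back into the model's forward pass

In a multi-task setting such as the one MDAS's name suggests, the rotation would presumably be trained so that patching the subspace changes the targeted attribute's prediction while leaving other attributes unchanged; that training objective is omitted here.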
J. Huang, Z. Wu, C. Potts, M. Geva, and A. Geiger. 2024. RAVEL: Evaluating interpretability methods on disentangling language model representations. https://arxiv.org/abs/2402.17700