RAVEL: Evaluating interpretability methods on disentangling language model representations
Individual neurons participate in the representation of multiple high-level concepts. To what
extent can different interpretability methods successfully disentangle these roles? To help
address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in
Language Models), a dataset that enables tightly controlled, quantitative comparisons
between a variety of existing interpretability methods. We use the resulting conceptual
framework to define the new method of Multi-task Distributed Alignment Search (MDAS) …
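
The abstract names Multi-task Distributed Alignment Search (MDAS), but the snippet cuts off before describing it. As background, DAS-style methods learn an orthogonal rotation of a hidden representation and perform interchange interventions on a candidate subspace in the rotated basis. The following is a minimal PyTorch sketch of that core operation only; the class name, dimensions, and use of the orthogonal parametrization are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class SubspaceInterchange(nn.Module):
    """Minimal sketch of a DAS-style interchange intervention (illustrative,
    not the paper's code). A learned orthogonal map R sends hidden states
    into a basis whose first k coordinates are hypothesized to encode one
    attribute (e.g. a city's country). Intervening swaps those coordinates
    from a "source" run into a "base" run, then rotates back."""

    def __init__(self, hidden_dim: int, k: int):
        super().__init__()
        # Parametrize the weight to stay orthogonal during training, so
        # rotating back is just multiplication by the weight itself.
        self.rot = orthogonal(nn.Linear(hidden_dim, hidden_dim, bias=False))
        self.k = k

    def forward(self, h_base: torch.Tensor, h_source: torch.Tensor) -> torch.Tensor:
        r_base, r_source = self.rot(h_base), self.rot(h_source)
        # Swap the candidate attribute subspace (first k rotated coordinates).
        patched = torch.cat([r_source[..., :self.k], r_base[..., self.k:]], dim=-1)
        # Invert the rotation: for orthogonal W, (x @ W.T) @ W == x.
        return patched @ self.rot.weight

# Toy usage: patch a 64-dimensional candidate subspace of 768-d hidden states.
layer = SubspaceInterchange(hidden_dim=768, k=64)
h_base, h_source = torch.randn(2, 768), torch.randn(2, 768)
h_patched = layer(h_base, h_source)  # would be fed back into the model's forward pass

In a multi-task setting such as the one MDAS's name suggests, the rotation would presumably be trained so that patching the subspace changes the targeted attribute's prediction while leaving other attributes unchanged; that training objective is omitted here.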
J. Huang, Z. Wu, C. Potts, M. Geva, and A. Geiger. 2024. RAVEL: Evaluating interpretability methods on disentangling language model representations. https://arxiv.org/abs/2402.17700