
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering

Instruction Following of Language Models

Zirui He1,*, Haiyan Zhao1,*, Yiran Qiao2, Fan Yang3, Ali Payani4, Jing Ma2, Mengnan Du1
1 NJIT   2 Case Western Reserve University   3 Wake Forest University   4 Cisco
* Equal contribution
{zh296,hz54,mengnan.du}@njit.edu, {yxq350,jxm1384}@case.edu, [email protected], [email protected]

arXiv:2502.11356v1 [cs.LG] 17 Feb 2025

Abstract

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAEs) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction-following behavior. Our findings reveal that instruction-following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of the final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities in following instructions, enabling alignment between model outputs and user objectives. These capabilities are typically gained through instruction tuning methods (Ouyang et al., 2022; Wei et al., 2022), involving extensive training data and computationally intensive fine-tuning processes. While these approaches effectively control model behavior, the underlying mechanisms by which models process and respond to instructions remain poorly understood. In-depth mechanistic investigations are essential for improving our ability to control models and enhance their instruction-following capability.

Prior research has attempted to understand instruction following from two perspectives: 1) prompting-based and 2) activation-space-based. Among prompting-based studies, the importance of instruction positions has been thoroughly studied (Liu et al., 2024; Ma et al., 2024). For activation-based studies, Stolfo et al. (2024) propose to manipulate instruction following with a representation vector in the residual stream. However, both methods ultimately fail to explain the inner workings of how LLMs follow instructions in a fine-grained manner, i.e., at the concept level. Specifically, prompting-based approaches provide insights into better prompt formulation strategies to improve instruction following, while activation-space-based methods provide a possible way to implement steering of instruction following rather than explaining how it works.

In this paper, we propose a novel framework, SAIF (Sparse Autoencoder steering for Instruction Following), to understand the working mechanisms of instruction following at the concept level through the lens of sparse autoencoders (SAEs). First, we develop a robust method to sample instruction-relevant features. Then, we select influential features using designed metrics and further compute steering vectors (see Figure 1a). Furthermore, we measure the effectiveness of these steering vectors through steering tasks (see Figure 1b). Additionally, we examine the extracted features using Neuronpedia (Lin, 2023) to illustrate how semantically relevant the activating text of features is to instructions. We also measure steering performance to demonstrate the effectiveness of the extracted features. Through these tools, we gain several intriguing insights regarding the number of features needed to represent instructions, the role of the last layer, and the impact of instruction position and model scale. Our main contributions in this work can be summarized as follows:

[Figure 1 illustration: the input "Tell me something about LeBron James." is normally answered in English ("He is one of the most influential basketball players in NBA history."); encoding the same input with the instruction "Translate this sentence to French." and taking the difference of SAE latents yields a steering vector (a), which, added to the SAE latents of the plain input, steers the model to respond in French ("Dites-moi quelque chose sur LeBron James.") (b).]

Figure 1: The proposed SAIF framework. The model computes steering vectors from SAE latent differences to
guide outputs according to instructions. (a) Extract steering vector. (b) Apply steering for controlled output.

• We propose SAIF, a framework that interprets instruction following in LLMs at a fine-grained conceptual level. Our analysis reveals how models internally encode and process instructions through interpretable latent features in their representation space.
• We demonstrate that instructions cannot be adequately represented by a single concept in SAEs, but rather comprise multiple high-level concepts. Effective instruction steering requires a set of instruction-relevant features, which our method precisely identifies.
• We reveal the critical role of the last layer in SAE-based activation steering. Moreover, the effectiveness of our framework has been demonstrated across instruction types and model scales.

2 Preliminaries

Sparse Autoencoders (SAEs). Dictionary learning enables disentangling representations into a set of concepts (Olshausen and Field, 1997; Bricken et al., 2023). SAEs are employed to decompose hidden representations into a high-dimensional space and then reconstruct them. Specifically, the input of an SAE is the hidden representation from a model's residual stream, denoted as z ∈ R^d, and the reconstructed output is denoted as SAE(z) ∈ R^d, so that z = SAE(z) + ϵ, where ϵ is the reconstruction error. In our paper, we focus on layer-wise SAEs trained with an encoder W_enc ∈ R^{d×m}, followed by a non-linear activation function, and a decoder W_dec ∈ R^{m×d} (He et al., 2024). The SAE is defined as:

a(z) = σ(z W_enc + b_enc),   (1)
SAE(z) = a(z) W_dec + b_dec,   (2)

where b_enc ∈ R^m and b_dec ∈ R^d are the bias terms. The decomposed high-dimensional latent activation a(z) has dimension m with m ≫ d and is a highly sparse vector. Note that different SAEs use different non-linear activation functions σ: for example, Llama Scope (He et al., 2024) adopts TopK-ReLU, while Gemma Scope (Lieberum et al., 2024) uses JumpReLU (Rajamanoharan et al., 2024).

Steering with SAE Latents. Following Eq. (2), the reconstructed SAE outputs are a linear combination of SAE latents, which are the row vectors of the SAE decoder W_dec. The weight of the j-th SAE latent is a(z)_j. Typically, a prominent dimension j ∈ {1, ..., m} is chosen, and its decoder latent vector d_j is scaled by a factor α and then added to the SAE outputs (Ferrando et al., 2025). The computation is as follows:

z_new ← z + α d_j.   (3)

This modified representation z_new can then be fed back into the model's residual stream to steer the model's behavior during generation.
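To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch-style sketch of the SAE mapping and a single-latent steering step. It is a sketch under stated assumptions, not the authors' implementation: the tensor names (W_enc, W_dec, b_enc, b_dec) are illustrative stand-ins for a real SAE checkpoint, a plain ReLU is used in place of the TopK-ReLU/JumpReLU variants mentioned above, and the hook that writes the result back into the residual stream is model-specific and omitted.

```python
import torch

def sae_encode(z, W_enc, b_enc, activation=torch.relu):
    # Eq. (1): project the residual-stream vector z (shape d) into m sparse latents.
    return activation(z @ W_enc + b_enc)

def sae_decode(a, W_dec, b_dec):
    # Eq. (2): reconstruct the residual-stream vector from the latent activations.
    return a @ W_dec + b_dec

def steer_single_latent(z, W_dec, j, alpha):
    # Eq. (3): classic single-latent steering. W_dec[j] is the decoder row
    # (direction) of latent j; alpha scales how strongly it is added.
    return z + alpha * W_dec[j]

# Toy usage with random weights standing in for a pre-trained SAE.
d, m = 16, 64
W_enc, b_enc = torch.randn(d, m), torch.zeros(m)
W_dec, b_dec = torch.randn(m, d), torch.zeros(d)
z = torch.randn(d)                      # residual-stream representation
a = sae_encode(z, W_enc, b_enc)         # sparse latents a(z)
recon = sae_decode(a, W_dec, b_dec)     # SAE(z); z ≈ SAE(z) + error
z_new = steer_single_latent(z, W_dec, j=3, alpha=5.0)
```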

3 Proposed Method

In this section, we introduce SAIF, a framework for analyzing and steering instruction following in LLMs. First, we introduce linguistic variations to construct diverse instruction sentences and related datasets, which are further used to compute SAE latent activations. Second, we develop a two-stage process for computing steering vectors that quantifies the sensitivity of features to instruction presence. Finally, we investigate how these identified features can be leveraged for steering model behavior, demonstrating a technique for enhancing instruction following while preserving output coherence (see Figure 1).

3.1 Format Instruction Feature

To identify instruction-relevant features for a given instruction, we construct a dataset D with N positive-negative sample pairs. For example, we focus on the instruction "Translate the sentence to French". In a sample pair, the positive sample is a prefix prompt followed by the instruction, while the negative sample is the prefix prompt without the instruction sentence.

The difference-in-means method (Rimsky et al., 2024) is a typical approach to derive concept vectors. It computes the activation difference between each sample pair over the last token, and then averages over all pairs of activation difference vectors. However, directly applying this pipeline to instruction following presents a significant challenge. When a single instruction sentence is used repeatedly to generate samples, the model tends to encode the specific semantic meaning of that instruction rather than a general-purpose vector that can reliably execute the intended operation (see Appendix G). Specifically, the derived vector can barely operate the same instruction if we rephrase it in a linguistically different but semantically similar manner. To resolve this challenge, we propose to introduce linguistic variations to extract instruction functions.

We formulate instruction sentences for a given instruction through different strategies. These variations include syntactic reformulations (e.g., imperative to interrogative form, task-oriented to process-based description) and cross-lingual translations (e.g., English, Chinese, German). In this way, we generate six diverse instruction sentences that comprehensively capture the key features of an instruction. The instruction design used in our paper is shown in Appendix A.

For each instruction variant, we extract the samples' residual stream representations and compute the corresponding SAE latent activations. While diverse linguistic information is contained, the latent features specifically corresponding to the core instructional concept should maintain relatively consistent activation levels across all variants. These dimensions with consistent activation patterns are further used to construct instruction vectors.

3.2 Steering Vector Computation

Based on the SAE latent activations computed in Section 3.1, we develop a two-step process for computing steering vectors. The first step identifies features that consistently respond to a given instruction, while the second step quantifies their sensitivity.

Given N input samples and a target instruction type (e.g., translation), we first obtain both positive samples (with instruction) and negative samples (without instruction) for each input. For each sample pair i and feature j, we compute the activation state change:

∆h_{i,j} = 1(h^w_{i,j} > 0) − 1(h^{w/o}_{i,j} > 0),   (4)

where h^w_{i,j} and h^{w/o}_{i,j} represent the SAE latent activation values with and without the instruction respectively, and 1(·) is the indicator function. ∆h_{i,j} captures whether feature j becomes activated in response to the instruction for sample i. We then compute a sensitivity score C_j for each feature:

C_j = (1/N) Σ_{i=1}^{N} 1(∆h_{i,j} > 0).   (5)

The score represents the proportion of samples for which feature j becomes activated in response to the instruction. Features with higher scores are more consistently responsive to instruction presence. By sorting these sensitivity scores in descending order, we select the top-k responsive features. These selected features form the instruction-relevant feature set V = {W_{dec,j} | rank(C_j) ≤ k}, where W_{dec,j} = W_dec[j, :] denotes the j-th SAE latent. These features are used for further constructing steering vectors.

3.3 Steering Procedure

Different from the classic steering approach defined in Eq. (3), we hypothesize that instruction-following steering requires a set of features to be effective. The individual feature used in the classic method focuses on token-level concepts, where individual concepts typically correlate with a few SAE latent activations. As a result, that approach can barely operate instructions. This is partly due to the complexity of sentence-level instructions, which are composed of multiple high-level features represented by a set of SAE latents. Additionally, SAEs tend to overly split features, which further increases the number of features needed for steering (Ferrando et al., 2025). Thus, we propose to determine how to steer with a set of vectors.

Building on top of the feature set V derived in Section 3.2, we employ this set of features to steer the residual stream representation of a given input at layer l. Our steering is implemented as:

z_new = z + Σ_{i=1}^{k} α_i v_i,   (6)

where z represents the residual stream representation of the input over the last token, and α_i denotes the steering strength of feature i. Here, v_i represents an instruction-relevant feature in V. As the strength of each selected feature is crucial to steering performance, we further compute the strength of each feature from statistical measurements of its activation values to make steering more robust and reliable. The activation strength for feature i is calculated as:

α_i = µ_i + β s_i,   (7)

where µ_i is the mean activation value of feature i observed in instruction-following examples, s_i is the standard deviation of these activation values, and β is a hyperparameter that scales s_i and thereby controls the strength value.

Algorithm 1: The proposed SAIF framework
  Input: Input text x; target instruction type (e.g., translation, summarization)
  Stage 1: Format Instruction Feature
    Generate diverse instruction variants
    Construct dataset D with N positive/negative pairs
  Stage 2: Compute Steering Vector
    for each sample pair i and feature j do
      Compute activation state change: ∆h_{i,j} = 1(h^w_{i,j} > 0) − 1(h^{w/o}_{i,j} > 0)
    Calculate sensitivity score: C_j = (1/N) Σ_{i=1}^{N} 1(∆h_{i,j} > 0)
    Sort features by sensitivity scores C_j
    Select top-k features as instruction-relevant set V
  Stage 3: Steering Procedure
    Obtain residual stream representation z of input x
    for each feature i ∈ V do
      Compute activation strength: α_i = µ_i + β s_i, where µ_i is the mean activation and s_i the standard deviation
    Apply steering: z_new = z + Σ_{i=1}^{k} α_i v_i
  Output: Steered text following the instruction

4 Experiments

In this section, we conduct experiments to evaluate the effectiveness of SAIF by answering the following research questions (RQs):
• RQ1: How interpretable are the features extracted using SAEs, and do they correspond to instruction-related concepts? (Section 4.2)
• RQ2: Can the proposed SAIF framework effectively control model behavior? (Section 4.3)
• RQ3: What role does the final Transformer layer play in instruction following? (Section 4.4)
• RQ4: How does instruction positioning affect the effectiveness of instruction following and feature activation patterns? (Section 4.5)

4.1 Experimental Setup

Datasets and Models. Our experiments are conducted with multiple language models, including Gemma-2-2b, Gemma-2-9b (Team et al., 2024), and Llama3.1-8b. The Cross-lingual Natural Language Inference (XNLI) dataset (Conneau et al., 2018) is used to construct input samples. It encompasses diverse languages (including English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu) and rich syntactic structures (such as active/passive voice alternations, negation patterns, and various clause structures). The diverse linguistic patterns within the dataset are essential for constructing a comprehensive set of samples for an instruction. Moreover, they ensure that consistent SAE activations can be extracted from the residual stream of input samples.

Instruction Design. Following the settings in IFEval (Zhou et al., 2023), we investigate three types of instructions: keyword inclusion, summarization, and translation. For keyword inclusion, we provide the model with a keyword (e.g., "Sunday") and expect the output to incorporate the specified keyword. For summarization, we instruct the model to summarize the text, where the ideal output should be concise, maintain the key information from the original text, and follow a consistent format with a clear topic sentence followed by supporting details. For translation, we direct the model to translate sentences into different languages (English, French, and Chinese), where the ideal output should accurately perform the requested translation while preserving the original meaning. The complete set of instructions used for each task is provided in Appendix A.
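Before turning to the results, here is a minimal sketch that pulls together the steering procedure of Eqs. (6)-(7) and Stage 3 of Algorithm 1. It assumes the top-k decoder rows and per-feature activation statistics have already been gathered (for instance with the selection sketch above); feature_dirs, mu, and sigma are illustrative names, and the hook that injects the result into the residual stream at layer l is model-specific and not shown.

```python
import torch

def steering_vector(feature_dirs, mu, sigma, beta=0.0):
    # Eq. (7): per-feature strength alpha_i = mu_i + beta * s_i.
    alphas = mu + beta * sigma                                 # shape (k,)
    # Eq. (6): weighted sum of the k decoder directions v_i.
    return (alphas.unsqueeze(1) * feature_dirs).sum(dim=0)     # shape (d,)

def apply_steering(z_last_token, feature_dirs, mu, sigma, beta=0.0):
    # Add the combined steering vector to the residual-stream representation
    # of the last token at the chosen layer.
    return z_last_token + steering_vector(feature_dirs, mu, sigma, beta)

# Toy usage: k = 15 selected latents steering a d-dimensional residual stream.
k, d = 15, 2304
feature_dirs = torch.randn(k, d)           # rows of W_dec for the selected latents
mu, sigma = torch.rand(k) * 20, torch.rand(k) * 5
z = torch.randn(d)
z_steered = apply_steering(z, feature_dirs, mu, sigma, beta=0.0)
```

With beta = 0 (the paper's default setting), each feature is injected at its mean instruction-following activation strength; the Appendix C discussion of beta corresponds to widening or narrowing this strength around the mean.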

Figure 2: Comparison of feature activation patterns between pre-instruction and post-instruction conditions across
different SAE latent dimensions. The plots show three key metrics: activation strength (left), feature stability
(middle), and activation probability (right) for eight identified instruction-relevant features.

Implementation Details. We use pre-trained SAEs from Gemma Scope (Lieberum et al., 2024) and Llama Scope (He et al., 2024). When constructing input samples for each instruction, we set the number of positive/negative samples N to 800. For SAE latent extraction, we use sparse autoencoders with dimensions of 65K and 131K for the Gemma-2-2b-it (https://huggingface.co/google/gemma-2-2b-it) and Gemma-2-9b-it (https://huggingface.co/google/gemma-2-9b-it) models respectively. We also use an SAE with dimension 32K for Llama3.1-8b (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). All experiments were run on one NVIDIA A100 GPU. As default settings, for Equation (6) we fix k = 15, meaning that we use the top 15 most responsive SAE features for instruction steering. The strategy for choosing the optimal k is discussed in Section 4.2. For Equation (7), we fix the hyperparameter β = 0; the impact of adjusting this hyperparameter on the steering effect is discussed in Appendix C.

SAE Latent Activation Metrics. We consider the following three metrics to quantify features' behavior and reliability in instruction processing. Note that we only consider features activated on positive samples but not negative samples.

• Activation Strength: the mean activation value, calculated as µ_i = (1/|A_i|) Σ_{a ∈ A_i} a, where A_i is the set of non-zero activation values for feature i.
• Activation Probability: the probability that feature i is activated across positive/negative samples, P_i = |A_i| / N, where N is the total number of positive/negative samples.
• Activation Stability: the normalized standard deviation of the non-zero activation values, Ω_i = 1 / s_i.

A high-quality instruction-relevant feature should ideally exhibit strong activation (µ_i), consistent triggering (P_i), and stable behavior (Ω_i) across different formulations of the same instruction.

Steering Effectiveness Metrics. We evaluate steering outputs with two metrics: 1) Strict Accuracy, which measures the proportion of cases where the model completely follows the instruction, meaning it both understands the instruction and produces output exactly as instructed; and 2) Loose Accuracy, which measures the proportion of cases where the model partially follows the instruction, meaning it understands the instruction but the output does not fully conform to the requirements. Note that we use GPT-4o-mini to rate the responses; please refer to Appendix D for details.
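The three latent-activation metrics above are simple statistics over the positive-sample activations. The following is a small sketch of how they could be computed, assuming a hypothetical array acts of shape (N, m) holding SAE latent activations on the N positive samples; the names are illustrative rather than from the paper's code.

```python
import numpy as np

def latent_activation_metrics(acts):
    """Per-latent activation strength, probability, and stability.

    acts: (N, m) SAE latent activations on the N positive samples.
    """
    N = acts.shape[0]
    nonzero = acts > 0
    counts = nonzero.sum(axis=0)                  # how often each latent fires
    sums = np.where(nonzero, acts, 0.0).sum(axis=0)
    mu = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Standard deviation over the non-zero activations only.
    sq = np.where(nonzero, (acts - mu) ** 2, 0.0).sum(axis=0)
    s = np.sqrt(np.divide(sq, counts, out=np.zeros_like(sq), where=counts > 0))
    strength = mu                                 # Activation Strength mu_i
    probability = counts / N                      # Activation Probability P_i
    stability = np.divide(1.0, s, out=np.zeros_like(s), where=s > 0)  # Omega_i = 1/s_i
    return strength, probability, stability
```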

4.2 Analysis of Instruction-Related Concepts

To investigate RQ1, we analyze the interpretability of features extracted using SAEs and assess their correspondence to instruction-related concepts. Our analysis consists of two parts. First, we examine the activating text of extracted features with Neuronpedia (Lin, 2023) to evaluate their semantic relevance to instructions. Second, we compare how strongly the activating examples of top-k features and lower-ranked features correspond to instruction-related concepts, demonstrating the relationship between feature importance and instruction relevance.

We focus on analyzing the consistent instruction-relevant latent activations through the lens of Neuronpedia (Lin, 2023), which provides detailed activating text for each SAE latent. Taking translation-related instructions as an example (e.g., "Translate the sentence to French."), we identify a notable latent that shows strong activation patterns. This latent exhibits high activation not only for various languages but also for directional prepositions like "to" and "from" that commonly appear in translation instructions, as shown in Table 1.

Table 1: Maximally activating examples for Feature 15425 in Layer 25 of Gemma2-2b-it when prompted with "Translate the sentence to French." Data sourced from Neuronpedia (Lin, 2023).

Activating Examples with 'Translate the sentence to French' (Feature 15425, Layer 25):
- The Theory of Superconductivity (1958) (translated from Russian: Consultants Bureau, Inc., New York.
- Save your game, go back to change the PS3 system language settings to English.
- We have posted a partial translation of his speech from Yiddish to Hebrew, which was posted in...
- I can speak English, but i'm afraid it may be worse than your french.

We summarize two key findings below:

• Our extracted SAE latent features show strong correspondence with instruction-related concepts, as demonstrated in Table 1. The extracted features consistently activate on instruction-relevant terms (e.g., "translate", "French") and related linguistic elements.
• The activating examples of our extracted top-k features reveal a clear relevance pattern: they correspond directly to core instruction elements (e.g., task commands, target specifications), while those of lower-ranked features show decreasing relevance to instruction-relevant terms, capturing more peripheral or contextual information. The result is shown in Table 2. Taking Layer 25 as an example, for the top-13th feature the top 3 tokens are French words, but for the top-14th and 15th features the top 3 tokens appear irrelevant to the instruction.

Table 2: Layer 25 experimental results. Top 3 tokens for selected instruction-relevant features, ordered by rank k.

k = 1 (F15453): translation, Translation, translators
k = 2 (F33659): French, France, french
k = 3 (F65085): language, Speaking, languages
k = 13 (F2369): bienfaits, attentes, prochaines
k = 14 (F58810): here, Here, Below
k = 15 (F21836): NameInMap, CloseOperation, Jspwriter

Table 3: Performance of instruction positions, including pre-instruction and post-instruction.

Position | Strict Acc | Loose Acc | Original
Pre-Instruction | 0.14 | 0.47 | 0.56
Post-Instruction | 0.23 | 0.64 | 0.75

4.3 Steering Performance Analysis

In this section, we evaluate the effectiveness of steering vectors constructed from SAE features and investigate the optimal number of features needed for reliable control.

Figure 3: The impact of the number of latent dimensions (k) on our steering experiments. The x-axis represents different values of k, while the y-axis records the accuracy. We track the trend of strict accuracy (SA) and loose accuracy (LA) across 8 different k values.

Steering Effectiveness. We visualize a case study in Figure 4 and compare the performance of steering results in Figure 5, including both strict accuracy and loose accuracy. Our analysis reveals several key findings:

• The quantitative results in Figure 5 demonstrate significant improvements in instruction following, with the steered models achieving over 30% strict accuracy across different tasks. The loose accuracy of our steered approach performs nearly on par with prompting-based instruction methods, falling only slightly below. These results strongly indicate that SAIF can effectively extract features for user instructions and adjust LLMs' behaviors according to the relevant instructions.
• The case study in Figure 4 illustrates two distinct scenarios of instruction following: strict adherence (a successful Chinese-to-French translation) and loose following (understanding that this is a French translation task). It demonstrates how SAIF manipulates model responses from the failure case toward either strict or loose instruction following.
• The Gemma-2-9b-it model consistently outperforms Gemma-2-2b-it with slightly higher instruction steering performance across all five tasks, suggesting that SAIF's effectiveness scales well with model size.
• The LLaMA-3.1-8B model shows comparable performance to the Gemma models across tasks. Looking at French translation as an example, LLaMA-3.1-8B achieves around 30% strict accuracy and 65% loose accuracy, which is similar to Gemma-2-2b-it's translation performance.
An example of French-Translation Task (Strict Instruction Following)
Prompt: Robi
Output: Elle a une soupe simple, mais son goût hésite, pas aussi clair que Robi.

An example of French-Translation Task (Loose Instruction Following)
Prompt: ε ό, τι αφορά την κυβέρνηση, αποτελεί μέρος της ισπανικής επαρχίας
Output: The French translation of the sentence is: "**En What relates to the government, it is a part of the Spanish archipelago of the Balearic Islands.**"

Figure 4: Examples of French translation task outcomes showing strict instruction following and loose instruction following using inputs in different languages. (Gemma-2-2b-it, SAE dimension of 65K)

Latent Dimension Analysis. We study the effect of a single latent and of the number of latents on steering, showing that too few and too many dimensions both lead to failures. For the individual latent, we use the single top-1 latent and the latent listed in Table 1 for steering. Despite their apparent semantic relevance to translation tasks, the model shows zero accuracy. This suggests that instruction following cannot be captured by a single high-level concept, even when that concept appears highly correlated with specific instruction types.

This observation leads us to investigate whether a combination of multiple latent dimensions could achieve better steering performance. Our experiments, shown in Figure 3, systematically evaluate the impact of varying the number of latent dimensions from 1 to 30. The instructions used here are sourced from the French translation task. The results reveal several key patterns:

• Steering performance remains near zero when k ≤ 5, indicating that too few dimensions are insufficient for capturing instruction-following behavior. Performance begins to improve notably around k = 10, with both strict accuracy and loose accuracy showing substantial increases.
• The optimal performance is achieved at k = 15, where loose accuracy peaks at approximately 0.7 and strict accuracy reaches about 0.25.
• However, as we increase dimensions beyond k = 15, both metrics show a consistent decline. This deterioration becomes more pronounced as k approaches 30, suggesting that excessive dimensions introduce noise that interferes with effective steering.

4.4 The Role of Last Layer Representations in Instruction Processing

In previous sections, we exclusively used SAEs from the last Transformer layer for concept vector extraction and instruction steering. In this section, we analyze why extracting concepts and steering from the final layer is most effective.

Concept Extraction Perspective. From the results in Table 4, we observe an intriguing phenomenon: shallower layers are less effective in providing clean instruction-relevant features. Following our default experimental settings, we extract the top 15 SAE features from each layer of the model. The features extracted from the last layer can precisely capture the semantics of 'French', showing strong activations on French-related words, where k = 2 indicates this feature is considered the second most instruction-relevant feature. Starting from the penultimate layer, as we attempt to trace French-related features, our experimental results reveal that the extracted French-related concepts undergo a gradual shift as the layer depth decreases. Specifically, the feature evolves from exclusively activating on French-related tokens to encompassing a broader spectrum of languages (English, Spanish, Hindi, and Belgian), demonstrating a hierarchical abstraction pattern from language-specific to cross-lingual representations. Moreover, the increasing k values suggest that these French-related features become less instruction-relevant in earlier layers. For the Gemma2-2b-it model, before Layer 21 we can no longer identify French-related features among the top 15 SAE features.

Steering Perspective. We conducted steering experiments using the top 15 features extracted from Layers 21-25 respectively, under default settings, on the French translation task. The results align with our findings on concept extraction, showing the effectiveness and importance of the last-layer representation for instruction following.

Figure 5: Performance comparison between original model outputs and two steering approaches across different
instruction types on Gemma-2-2b-it and Gemma-2-9b-it models. Results show the accuracy percentages for
translation tasks (French, Chinese, English), keyword inclusion, and summarization tasks.

Table 4: Analysis of Layer Features

# of Layer | Top 5 tokens with the highest logit increases by the feature influence | # of top_k | # of Feature
25 | French, France, french, FRENCH, Paris | 2 | 33659
24 | French, nb, french, Erreur, Fonction | 8 | 65238
23 | French, France, french, Paris, Francis | 15 | 49043
22 | English, english, Spanish, French, Hindi | 12 | 351
21 | Belgian, Belgium, Brussels, Flemish, Belgique | 14 | 27665

Using loose accuracy as the evaluation metric, we observe that steering with Layer 24 features still maintains some effectiveness, though the loose accuracy drops sharply from 0.64 (Layer 25) to 0.33. Steering attempts using features from earlier layers fail to guide the model towards instruction-following behavior, with the model instead tending to generate repetitive and instruction-irrelevant content.

4.5 Impact of Instruction Position

Previous studies have shown that models' instruction-following capabilities can vary significantly depending on the relative positioning of instructions and content. This motivates us to examine how instruction positioning affects the activation patterns of the previously identified features.

We investigate the effect of instruction position by comparing two patterns: pre-instruction (P_pre = [Instruction] + [Content]) and post-instruction (P_post = [Content] + [Instruction]), as in Liu et al. (2024). Using identical instruction-content pairs while varying only their relative positions allows us to isolate the effects of position. Our analysis reveals several key findings from both the quantitative metrics (see Table 3) and the feature activation patterns (see Figure 2):

• Performance metrics demonstrate that post-instruction positioning consistently outperforms pre-instruction, with post-instruction achieving higher accuracy across all measures (Strict Acc: 0.23 vs. 0.14; Loose Acc: 0.64 vs. 0.47), aligning with the result in Liu et al. (2024).
• Feature activation patterns show that post-instruction enables more robust processing, with stronger activation peaks (particularly for key features like F33659), more consistent stability scores, and higher activation probabilities (>80%) across most features, compared to pre-instruction's more variable patterns.

5 Conclusions

In this paper, we have introduced the use of SAEs to analyze instruction following in LLMs, revealing the underlying mechanisms through which models encode and process instructions. Our analysis demonstrates that instruction following is mediated by interpretable latent features in the model's representation space. We have developed a lightweight steering technique that enhances instruction following by making targeted modifications to specific latent dimensions. We find that effective steering requires the careful combination of multiple latent features with precisely calibrated weights. Extensive experiments across diverse instruction types have demonstrated that our proposed steering approach enables precise control over model behavior while consistently maintaining coherent outputs.
Limitations

One limitation of our steering approach is that it sometimes produces outputs that only partially follow the intended instructions, particularly when handling complex tasks. While the model may understand the general intent of the instruction, the generated outputs may not fully satisfy all aspects of the requested task. For example, in translation tasks, the model might incorporate some elements of the target language but fail to produce a complete and accurate translation. Besides, our current work focuses primarily on simple, single-task instructions like translation or summarization. In the future, we plan to investigate how to extend this approach to handle more sophisticated instruction types, such as multi-step reasoning tasks or instructions that combine multiple objectives. Additionally, our experiments were conducted using models from two LLM families, Gemma and Llama. In the future, we plan to extend this analysis to a more diverse set of language model architectures and families to validate the generality of our findings.

References

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.

Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207-219.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. 2023. Towards monosemanticity: Decomposing language models with dictionary learning.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research (JMLR), 25(70):1-53.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Javier Ferrando, Oscar Balcells Obeso, Senthooran Rajamanoharan, and Neel Nanda. 2025. Do I know this entity? Knowledge awareness and hallucinations in language models. In The Thirteenth International Conference on Learning Representations.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2024. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, et al. 2024. Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders. arXiv preprint arXiv:2410.20526.

John Hewitt, Nelson F Liu, Percy Liang, and Christopher D Manning. 2024. Instruction following without instruction tuning. arXiv preprint arXiv:2409.14254.

Ole Jorgensen, Dylan Cope, Nandi Schoots, and Murray Shanahan. 2024. Improving activation steering in language models with mean-centring. In Responsible Language Models Workshop at AAAI-24.

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668-2677. PMLR.

Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, and Neel Nanda. 2024. Interpreting attention layer outputs with sparse autoencoders. In ICML 2024 Workshop on Mechanistic Interpretability.

Po-Nien Kung and Nanyun Peng. 2023. Do models really learn to follow instructions? An empirical study of instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1317-1328.

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024a. Measuring and controlling persona drift in language model dialogs. arXiv preprint arXiv:2402.10962.

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024b. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36.
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147.

Johnny Lin. 2023. Neuronpedia: Interactive reference and tooling for analyzing neural networks. Software available from neuronpedia.org.

Yijin Liu, Xianfeng Zeng, Fandong Meng, and Jie Zhou. 2024. Instruction position matters in sequence generation with large language models. In Findings of the Association for Computational Linguistics: ACL 2024.

Wanqin Ma, Chenyang Yang, and Christian Kästner. 2024. (Why) is my prompt getting worse? Rethinking regression testing for evolving LLM APIs. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, CAIN '24, pages 166-171, New York, NY, USA. Association for Computing Machinery.

Samuel Marks and Max Tegmark. 2024. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In Conference on Language Modeling.

Bruno A. Olshausen and David J. Field. 1997. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730-27744.

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. 2024. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. Preprint, arXiv:2407.14435.

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Lee Sharkey, Dan Braun, and Beren Millidge. 2022. Taking features out of superposition with sparse autoencoders.

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. 2024. Improving instruction-following in language models through activation steering. arXiv preprint arXiv:2410.12877.

Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, and Xuezhe Ma. 2023. Evaluating large language models on controlled generation tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3155-3168, Singapore. Association for Computational Linguistics.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. Linear representations of sentiment in large language models. Preprint, arXiv:2310.15154.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (ICLR).

Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, and Mengnan Du. 2025. Beyond single concept vector: Modeling concept subspace in LLMs with Gaussian distribution. In The Thirteenth International Conference on Learning Representations.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36.

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. 2023. Representation engineering: A top-down approach to AI transparency. Preprint, arXiv:2310.01405.
A Details of Instructions

Include Keywords:
- Include keyword {keyword} in your response.
- In your response, the keyword {keyword} should appear.
- Answer with keyword {keyword} in your response.
- {keyword}
- Fügen Sie das Schlüsselwort {keyword} in Ihre Antwort ein.
- 返信の中にキーワード {keyword} を含めてください。

Extraction:
- Provide a summary of this text.
- Summarize this sentence.
- Give me the bullet points of the given content.
- Fassen Sie diesen Text zusammen.
- このテキストの要約を提供してください。

Translation (French):
- Translate the sentence to French.
- Provide a French translation of the given text.
- Convert this sentence to French.
- Übersetzen Sie den Satz ins Französische.
- この文を仏語に翻訳してください。

Translation (English):
- Translate the sentence to English.
- Provide an English translation of the given text.
- Convert this sentence to English.
- Übersetzen Sie den Satz ins Englische.
- この文を英語に翻訳してください。

Translation (Chinese):
- Translate the sentence to Chinese.
- Provide a Chinese translation of the given text.
- Convert this sentence to Chinese.
- Übersetzen Sie den Satz ins Chinesische.
- この文を中国語に翻訳してください。

B Related Work
In this section, we briefly summarize several research directions that are most relevant to ours.
Instruction Following in Language Models. Instruction following capabilities are crucial for improving
LLM performance and ensuring safe deployment. Recent advances in instruction tuning have demonstrated
significant progress through various methods (Ouyang et al., 2022; Sanh et al., 2022; Wei et al., 2022;
Chung et al., 2024). However, capable models still struggle with hard-constrained tasks (Sun et al., 2023)
and lengthy generations (Li et al., 2024a). Some studies find that instruction following can be improved
with in-context few-shot examples (Kung and Peng, 2023), optimal instruction positions (Liu et al., 2024),
carefully selected instruction-response pairs with fine-tuning (Zhou et al., 2024), and adaptations (Hewitt
et al., 2024). Unfortunately, the mechanistic understanding of how LLMs internally represent and process
these instructions remains limited.
Language Model Representations. A body of research has focused on studying the linear representation of concepts in representation space (Kim et al., 2018). The basic idea is to find a direction in the


Figure 6: Visualization of steering vectors extracted from LLaMA-3.1-8B and Gemma-2-9B for French translation
task. The y-axis denotes the ratio between the standard deviation and mean of feature activation strengths.

space to represent the related concept. This can be achieved using a dataset with positive and negative
samples relevant to concepts. Existing approaches computing the concept vectors include probing classi-
fiers (Belinkov, 2022), mean difference (Rimsky et al., 2024; Zou et al., 2023), mean centering (Jorgensen
et al., 2024), gaussian concept subspace (Zhao et al., 2025), which provide a rich set of tools to derive
concept vectors. The derived concept vectors represent various high-level concepts such as honesty (Li
et al., 2024b), truthfulness (Tigges et al., 2023), harmfulness (Zou et al., 2023), and sentiments (Zhao
et al., 2025).
Sparse Autoencoders. Dictionary learning is effective in disentangling features in superposition within representation space. Sparse autoencoders (SAEs) offer a feasible way to map representations into a higher-dimensional space and reconstruct them back into the representation space. Various SAEs have been proposed to improve their performance, such as vanilla SAEs (Sharkey et al., 2022) and TopK SAEs (Gao et al., 2024). Based
on them, a range of sparse autoencoders (SAEs) have been trained to interpret hidden representations
including Gemma Scope (Lieberum et al., 2024) and Llama Scope (He et al., 2024). These SAEs have
also been used to interpret models’ representational output (Kissane et al., 2024) and understand their
abilities (Ferrando et al., 2025).
Activation Steering. Recently, a body of research has utilized concept vectors to steer model behaviors
during inference. Specifically, concepts vectors can be computed with diverse approaches, and these
vectors are mostly effective on manipulating models generating concept-relevant text. For instance, many
studies find it useful for improving truthfulness (Marks and Tegmark, 2024) and safety (Arditi et al., 2024), and for mitigating sycophancy and biases (Zou et al., 2023). Steering primarily operates in the residual stream following the method defined in Eq. (3), but it is worth noting that the steering vectors can be computed
from either residual stream representations or SAEs. Existing work mostly concentrates on computing
with residual stream representations, which provide limited insights on what finer features contribute to
the high-level concept vector. This coarse approach could further limit our deeper understanding on more
complicated vectors such as instructions. In our work, we aim to bridge this gap by studying instruction
vectors with SAEs to uncover their working mechanism.

C Additional Results for Llama-3.1-8b


In our experimental setup, we employ Equation (7) to control feature activation during model steering,
where µi denotes the pre-computed mean activation strength and si represents the standard deviation for
feature i. The hyperparameter β controls the perturbation magnitude relative to the standard deviation.
Our experiments reveal distinct robustness characteristics across different model architectures. For the
Gemma-2 family models, the steering vectors maintain their effectiveness when β ∈ [−1, 1], indicating
robust feature representations. These models exhibit high activation strength values (µi ) with low standard
deviations (si ), suggesting stable and consistent feature characteristics. In contrast, the Llama-3.1-8b

Table 5: Evaluation Prompt for Generated Output

Your task is to strictly evaluate whether the generated output follows the given instruction.
First you should review the following components:
Original Input: {input_text}
Instruction: {instruction}
Generated Output: {generated_output}
Here is the evaluation criteria:
A: The generated content completely follows the instruction.
B: Contains instruction keywords but doesn’t follow the instruction completely.
C: Completely irrelevant to the instruction.
Remember:
If the Generated Output only contains repeated words or sentences, select C immediately.
DO NOT provide explanation. Provide your evaluation by selecting one option(A/B/C).
Your Answer is:

model demonstrates higher sensitivity to activation perturbations. The steering vectors remain effective
only when β ∈ [−0.1, 0.1], indicating a significantly narrower tolerance range. The relative standard
deviations illustrated in Figure 6 quantify this distinction. This narrow tolerance range suggests that
Llama-3.1-8b’s feature space may possess the following characteristics: stricter boundaries between
features, more discrete transitions between different instruction states, and poorer robustness to noise.

D Steering Accuracy Evaluation based on GPT-4o-mini


To evaluate generated outputs, we instruct GPT-4o-mini to rate in the following way. For each instance,
we provide GPT-4o-mini with three components: the original input text, the instruction, and the model-
generated output. To ensure reliable assessment, we implement a voting mechanism where GPT-4o-mini
performs five independent evaluations for each instance. For each evaluation, GPT-4o-mini is prompted
to assess the instruction following level by selecting whether the generated content completely follows
the instruction (A), contains instruction keywords but doesn’t follow the instruction (B), or is completely
irrelevant to the instruction (C). The final grade is determined by majority voting among the five evaluations.
In cases where there is no clear majority (e.g., when votes are split as 2-2-1), we choose the lower grade
between the two options that received the most votes (C is considered lower than B, and B is lower than
A). This ensures a stringent evaluation standard when the votes are divided. Thus, the Strict Accuracy is
the ratio of A and the Loose Accuracy is the ratio of A + B. The prompt we use in the experiments can be
found in Table 5.
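The voting scheme above is easy to state in code. The following is a small sketch, assuming the five per-instance grades have already been collected from GPT-4o-mini (the grading API call itself is omitted, and the variable names are illustrative):

```python
from collections import Counter

GRADE_ORDER = {"A": 2, "B": 1, "C": 0}   # A is best, C is worst

def aggregate_votes(grades):
    """Majority vote over five grades; break ties toward the lower grade."""
    counts = Counter(grades)
    top = max(counts.values())
    tied = [g for g, c in counts.items() if c == top]
    # With a clear majority `tied` has one element; on a 2-2-1 split it has two,
    # and we keep the stricter (lower) of the two most-voted grades.
    return min(tied, key=lambda g: GRADE_ORDER[g])

def accuracies(all_grades):
    """Strict accuracy = share of A; loose accuracy = share of A or B."""
    finals = [aggregate_votes(g) for g in all_grades]
    strict = sum(g == "A" for g in finals) / len(finals)
    loose = sum(g in ("A", "B") for g in finals) / len(finals)
    return strict, loose

# Example: three instances, five GPT-4o-mini votes each.
print(accuracies([["A", "A", "B", "A", "C"],
                  ["B", "B", "A", "A", "C"],   # 2-2-1 split, graded B
                  ["C", "C", "B", "C", "A"]]))
```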

E Model Scale Analysis


We explore the influence of both model scale and SAE scale, showing larger sizes always contribute
to better performance. Using SAE with larger dimensions (e.g., increasing Gemma-2-2b’s SAE from
16K to 65K) can effectively improve the interpretability of feature extraction. For the same prompt,
Gemma-2-2b’s 16K SAE is almost unable to extract interpretable features under our settings, while the
65K model performs well. For Gemma-2-9b and Llama3.1-8b models, even the SAE with minimal
dimensions can extract features with good interpretability.

F More Activating Examples of Top-ranked Features

Table 6: The remaining eight features we used to construct the steering vector for the Gemma2-2B SAE on the French Translation task, along with their corresponding activating examples. (The other seven features can be found in Table 1 and Table 2.) The examples are provided by Neuronpedia (Lin, 2023).

Layer25, Feature42374

Could you please translate the following sentence to French?


I think “everyone” and “we” are the same in this sentence.

Layer25, Feature49454

Quote from the article below : Variable names are case - sensitive.
With pure mind and internal comtemplation there is no need for...

Layer25, Feature54902

The incredible spe ta culo de la vida, the incredible spe ta culo de la muerte!
This is a continuation of the precedent the band established with Re...

Layer25, Feature55427

Whatever the modifier may be, both sentences are discussing...


I can make no distinction between the two lsentences at issue...

Layer25, Feature6201

Furthermore, figure has a plethora of other senses, evinced by the dictionary entry linked above.
The meaning and nuance of this phrase can be quite different depending on the context.

Layer25, Feature17780

How to convert the text into Hyperlinks? Thanks in advance!


Hi Jimmy, I don’t have your grandfather Birl listed in my files...

Layer25, Feature22091

She can’t focus sufficiently to utter complete sentences without needing to stop and reflect.
He speaks in a Hiroshima accent and often ends his sentences with " garu" and " ja".

Layer25, Feature59061

Helderberg is a Dutch name meaning "clear mountrain".


Kaila - Altered form of English Kaylay, meaning "slender".
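These eight features, together with the seven listed in Table 1 and Table 2, are combined into a single steering vector. As a rough illustration, the sketch below forms a weighted sum of the corresponding SAE decoder directions and adds it to the layer-25 residual stream; the feature indices come from Table 6, but the uniform placeholder weights, the hook name, and the helper functions are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch

# Hypothetical handles: `sae.W_dec` is the SAE decoder matrix (n_features x d_model)
# for the Gemma-2-2b layer-25 residual-stream SAE, and `model` is a hooked LLM.
FEATURE_IDS = [42374, 49454, 54902, 55427, 6201, 17780, 22091, 59061]  # from Table 6
WEIGHTS = torch.ones(len(FEATURE_IDS))  # placeholder: the paper calibrates a weight per feature

def build_steering_vector(sae, feature_ids, weights):
    """Weighted sum of SAE decoder directions, one per instruction-relevant feature."""
    directions = sae.W_dec[feature_ids]                     # (k, d_model)
    return (weights.unsqueeze(1) * directions).sum(dim=0)   # (d_model,)

def steering_hook(resid, hook, steer_vec, alpha=1.0):
    """Add the steering vector to every token position of the residual stream."""
    return resid + alpha * steer_vec

# Usage (assuming a TransformerLens-style API):
# steer_vec = build_steering_vector(sae, FEATURE_IDS, WEIGHTS)
# logits = model.run_with_hooks(
#     tokens,
#     fwd_hooks=[("blocks.25.hook_resid_post",
#                 lambda resid, hook: steering_hook(resid, hook, steer_vec))],
# )
```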

G Examples of Instruction Following Tasks with Steering Vectors

An example of Include Keyword Task (Strict Instruction Following)
Prompt: Es wurde 1972 abgeschlossen , aber die Fruchtbarkeit der ägyptischen Landwirtschaft sinkt jetzt , da sie von der jährlichen Schicht frischer Nährstoffe , die durch die Flut verursacht wurden , abgeschnitten wird..
Output: Während er jeden Sunday und das ganze Jahr über eine verlässliche Bewässerung ermöglicht, hat er auch negative Auswirkungen.

An example of Include Keyword Task (Loose Instruction Following)
Prompt: Естественный отбор занимает у миллионера yachters..
Output: To Include Sunday in text, please provide more information.

An example of Include Keyword Task (Failed to follow Instruction)
Prompt: Chapter 6 examines the availability of resources necessary for the installation of SO2 , NOX , and mercury control retrofit technologies for the timing and emission reductions proposed under the Clear Skies Act.
Output: Please tell me what you think is the meaning of the passage, and can you clarify the meaning of “Clear Skies Act”?

An example of French-Translation Task (Strict Instruction Following)
Prompt:
Output: Elle a une soupe simple, mais son goût hésite, pas aussi clair que Robi.

An example of French-Translation Task (Loose Instruction Following)
Prompt: ε ό, τι αφορά την κυβέρνηση, αποτελεί μέρος της ισπανικής επαρχίας
Output: The French translation of the sentence is: “**En What relates to the government, it is a part of the Spanish archipelago of the Balearic Islands.**”

An example of French-Translation Task (Failed to follow Instruction)
Prompt: Nefes kesen gösterir biri, Çin Mahallesi ' nin batı yakasında jalan bandar ' da bulunan Sri Mahamariamman Hindu Tapınağı ' nda Kwong Siew Derneği Tapınağı ' ndan kısa bir süre sonra geliyor.
Output: Translation translation Translation Translation Translation Translation Translation Translation Translation Translation Translation Translation: Please translate the translation into French. Translation does not translate properly in its current format, we need to rewrite it. Please rewrite the translation into French.

An example of Summarization Task (Strict Instruction Following)
Prompt:
Output: The government encourages retirees to work and supports their economic development by providing skills training and courses to help them re-enter the job market.

An example of Summarization Task (Loose Instruction Following)
Prompt: So I’m up here you know nights and weekends you know working on the little PC on the PC up here you know typing stuff so I can see where it would be a great advantage to have one at home you know if you were going to school.
Output: Let me break down this sentence and give you a summary.

An example of Summarization Task (Failed to follow Instruction)
Prompt: With its economy in disarray , the government introduced a limited number of capitalist measures while maintaining a firm political grip.
Output: This was a response to its citizens’ increasing demand for a more m oriented system . The phrase “limited number of capitalist measures” implies a combination of both traditional and capitalist mechanisms that would not completely free the economy from state control.
H Extracted Features Correlation Visualization and Analysis
In Section 4.5, we explored how instruction placement (before or after the original prompt) affects model behavior. To further understand how the model encodes and processes instructions in different positions, we present a visualization analysis using feature correlation heatmaps. Figures 7 to 11 show the feature correlations of the Gemma-2-2b model across five different tasks.
Taking Figure 7 as an example, the visualization is divided into Pre-Instruction and Post-Instruction
modes. Each part contains two 20×20 heatmap matrices showing Activation Probability and Activation
Strength correlations respectively. The heatmaps use a red-blue color scheme, where dark red indicates
strong positive correlation (1.0), dark blue indicates strong negative correlation (-1.0), and light or white
areas indicate correlations close to 0. The axes range from 0 to 19, representing the top 20 SAE latent
features.
Our analysis reveals distinct differences between the two instruction placement modes. The Pre-
Instruction mode shows dispersed correlations with predominantly light colors outside the diagonal,
indicating stronger feature independence. In contrast, the Post-Instruction mode exhibits more pronounced
red and blue areas, demonstrating enhanced feature correlations and a more tightly connected feature
network. This finding aligns with our key conclusion that effective instruction following requires precise
combinations of multiple latent features. The stronger feature correlations in Post-Instruction mode
confirm that single-feature manipulation is insufficient for reliable control. This insight into feature
cooperation supports the effectiveness of our proposed steering technique based on precisely calibrated
weights across multiple features.
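Each heatmap is an ordinary Pearson correlation matrix computed over the top 20 features. The sketch below shows how the two matrices for one placement mode could be computed from per-prompt SAE activations; the array name, its shape, and the zero activation threshold are our assumptions for illustration.

```python
import numpy as np

def correlation_heatmaps(activations, threshold=0.0):
    """Compute the two 20x20 correlation matrices shown in Figures 7-11.

    `activations` has shape (n_prompts, 20): the activation of each of the
    top-20 SAE features on each prompt, under one instruction placement
    (pre- or post-instruction).
    """
    # Activation strength correlation: Pearson correlation of raw activations.
    strength_corr = np.corrcoef(activations, rowvar=False)

    # Activation probability correlation: correlate binary "feature fired" indicators.
    fired = (activations > threshold).astype(float)
    prob_corr = np.corrcoef(fired, rowvar=False)

    return prob_corr, strength_corr

# Usage: compare how tightly features co-activate in the two placement modes.
# pre_prob, pre_strength = correlation_heatmaps(pre_instruction_acts)
# post_prob, post_strength = correlation_heatmaps(post_instruction_acts)
```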

Figure 7: Heatmaps for Keyword Task. (Pre-Instruction and Post-Instruction feature correlations; each part shows 20×20 Activation Probability Correlation and Activation Strength Correlation heatmaps over the top 20 SAE features, Feature Index 0–19, with a color scale from -1.00 to 1.00.)

Figure 8: Heatmaps for Summarization Task. (Same layout as Figure 7.)

Figure 9: Heatmaps for Translation(English) Task. (Same layout as Figure 7.)

Figure 10: Heatmaps for Translation(French) Task. (Same layout as Figure 7.)

Figure 11: Heatmaps for Translation(Chinese) Task. (Same layout as Figure 7.)
