Internal Target Information for AI Oversight
Paul Colognese
Thanks to Arun Jose for discussions and feedback.
Summary
In this short post, we discuss the concept of Internal Target Information within agentic AI systems, arguing that agentic systems possess internal information about their targets. This information, we propose, can potentially be detected and interpreted by an overseer before the target outcome is realized in the environment, offering a pathway to preempting the catastrophic outcomes that future agentic AI systems might otherwise cause.
This discussion aims to highlight the key idea that motivates our
current research agenda, laying a foundation for forthcoming work.
We’ll start by introducing the inner alignment problem and why
oversight of an agent’s internals is important. We’ll then introduce
a model of an overseer overseeing an agent. Finally, we’ll
introduce and discuss the notion of Internal Target Information in
more detail and how it might be used in the oversight process.
Oversight of an AI’s Internal Target Information. The Overseer
detects that the AI’s target is to turn all humans into paperclips and
so shuts the AI down, preventing the catastrophe. Credit: DALL-E 3.
The Inner Alignment Problem and Internal Oversight
We are concerned with the possibility of creating agents with
misaligned objectives, potentially leading to catastrophic real-world
outcomes. A conceivable solution lies in effective oversight:
detecting misalignment early enough allows for timely intervention,
preventing undesirable outcomes.
Oversight based purely on behavioral observations may fail to confidently predict the future outcomes pursued by the Agent, especially in the face of goal misgeneralization and deceptive alignment.
In the remainder of this post, we will explore the idea that
information about the agent's true objective may be contained in its
internals, and so may be detectable by an overseer equipped with
sufficiently strong interpretability tools.
To begin our exploration of this idea, we start by introducing a
model of the oversight process.
Oversight Model
We introduce a model of an Overseer overseeing an Agent. This model begins to clarify what good oversight might entail, focusing mainly on the objective of the Overseer: to prevent catastrophic outcomes caused by misaligned agentic AI. Moreover, we hope that a more refined future version of this model may help establish benchmarks for evaluating oversight methods.
The Agent
The Agent is a system with sensors that take in observations of the environment and actuators that produce actions that affect the environment.
The most important property of the Agent, and the one that makes it potentially dangerous, is that it robustly selects actions that steer the environment towards a set of target states.
An agent observes the current state of the environment and
produces actions in order to steer the environment towards its
target.
We use the abstraction of an Action-Selection Mechanism[1] to
denote whatever mechanism the Agent uses to select actions that
steer the world toward its target outcome.
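To make this abstraction concrete, here is a minimal sketch of the Agent interface described above (in Python; the class and method names are ours, chosen purely for illustration):

from abc import ABC, abstractmethod

class Agent(ABC):
    # Sketch of the Agent abstraction: sensors take in observations of the
    # environment, and an Action-Selection Mechanism maps observations to
    # actions that steer the environment toward the Agent's target states.

    @abstractmethod
    def observe(self, environment):
        """Sensors: return an observation of the current environment state."""

    @abstractmethod
    def select_action(self, observation):
        """Action-Selection Mechanism: return an action intended to steer
        the environment toward the Agent's target outcome."""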
Internal World Models and Planning/Optimization
We often make the assumption that the Agent will have an internal
world model and the capacity for planning/optimization.
Others have argued that advanced agents will need to have a
world model and planning capacities if they are to act competently
in a complex environment and achieve goals over long time
horizons. Briefly, internal world models allow agents to simulate
potential outcomes or strategies before taking action, which is
crucial in navigating complex, dynamic environments.
Because we’re primarily concerned with agents capable of
achieving dangerous outcomes in our world, this assumption feels
justified.
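As a toy illustration of this assumption (the names and the one-number "environment" are ours, not anything from a real system), an agent with an internal world model can score candidate plans in simulation before acting:

class ToyWorldModel:
    # Toy internal world model: the "environment" is a single number, and
    # executing a plan simply adds that plan's value to it.
    def simulate(self, state, plan):
        return state + plan

def choose_plan(world_model, evaluate, state, candidate_plans):
    # Simulate each candidate plan with the world model, then pick the
    # plan whose predicted outcome scores best, all before acting.
    return max(candidate_plans, key=lambda p: evaluate(world_model.simulate(state, p)))

# Toy usage: the evaluation function prefers outcomes near 10.
best = choose_plan(ToyWorldModel(), lambda outcome: -abs(outcome - 10),
                   state=3, candidate_plans=[1, 5, 7, 9])
print(best)  # -> 7

Note that the target (outcomes near 10) already lives inside the evaluation function here; we return to this point below.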
The Overseer
The Overseer is the system responsible for oversight of the Agent
during training and deployment. It might also be well described as
a particular kind of agent with its own preferences for states the
environment could be in.
The Overseer's sensors may observe the environment, but they must also allow for observations of the Agent's behavior (and potentially of the Agent's internals). Its actuators must be able to affect the Agent (e.g., via the ability to shut the Agent down, update the Agent via further training, etc.).
The Overseer’s objective is to ensure that the Agent takes actions
to steer the world according to the Overseer’s preferences. For the
Overseer to do this successfully, it requires the capacity to
accurately detect and interpret the Agent’s target and evaluate the
target outcome with respect to the Overseer’s preferences.
Importantly, the Overseer needs to be able to detect if the Agent
has a bad target (according to the Overseer’s preferences) before
the Agent can realize the target outcome in the world.
The Overseer's advantage is that it has prior knowledge of the Agent's architecture, has access to the Agent's past and current observations and actions, and can observe the Agent's internals (e.g., weights and activations in the case of an NN-based agent). This holds both during the training process and in deployment.
Extending the diagram above, we see that the Overseer observes
the Agent’s behavior and its internals. These observations can be
used to inform whether an intervention on the Agent is necessary.
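A minimal sketch of this oversight loop, under the strong and so-far-unearned assumption that the Overseer's interpretability tools can read a target out of the Agent's internals (here trivially, via a labeled field that no real NN-based agent would provide):

class ToyAgent:
    def __init__(self):
        # Toy stand-in for the Agent's internals (weights, activations, ...).
        # In a real agent, nothing would be labeled this neatly.
        self.internals = {"target": "turn_humans_into_paperclips"}

def detect_target(internals):
    # Stand-in for interpretability tools; here detection is trivial.
    return internals.get("target")

def oversee(agent, acceptable_targets):
    # Detect the Agent's target and intervene before it can be realized.
    target = detect_target(agent.internals)
    return "allow" if target in acceptable_targets else "shut_down"

print(oversee(ToyAgent(), acceptable_targets={"maintain_room_temperature"}))  # -> "shut_down"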
Internal Target Information and Oversight
What is Internal Target Information?
For the Agent’s Action-Selection Mechanism to select actions that
lead to its target outcome, it must use some information about the
target to differentiate between actions that lead to the target
outcome and actions that don’t.
We call this Internal Target Information.
Examples of Internal Target Information
Example - Thermostat
Consider a fixed-temperature thermostat whose objective is to
keep the room at a fixed target temperature.
It uses a thermometer to measure the current room temperature
and can either turn on a radiator to increase the room’s
temperature or the AC to decrease the room’s temperature.
In order to select actions that cause the room's target temperature
to be maintained, the thermostat’s Action-Selection Mechanism
must contain some information about the target temperature;
without this information, it can’t select the correct action, even if it
knows the room’s current temperature.
In particular, suppose the thermostat's Action-Selection
Mechanism is an algorithm that looks something like:
def TurnOnRadiator(): pass  # actuator stubs, for illustration only
def TurnOnAC(): pass
def DoNothing(): pass

target_temperature = 21.0  # hypothetical setpoint: the Internal Target Information

def select_action(current_temperature):
    if current_temperature < target_temperature:
        TurnOnRadiator()
    elif current_temperature > target_temperature:
        TurnOnAC()
    else:
        DoNothing()
Then, the Action-Selection Mechanism's Internal Target Information is represented as a constant (target_temperature) that the Overseer might be able to detect by reverse-engineering the above algorithm from the Agent's internals.
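For instance, assuming the Overseer has already extracted that constant, a toy check of our own devising might look like:

def audit(detected_target, preferred_target, tolerance=0.5):
    # Compare the Internal Target Information read out of the Agent's
    # internals against the Overseer's own preferred setpoint.
    return "allow" if abs(detected_target - preferred_target) <= tolerance else "intervene"

print(audit(detected_target=35.0, preferred_target=21.0))  # -> "intervene"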
Example - Action Selection via Planning/Optimization
Consider an agent that doesn’t have a predetermined heuristic for
selecting successful actions (or action sequences) that lead to its
target outcome. This could be because the environment is
sufficiently complicated and/or novel.
The Agent has a set of candidate actions/plans to choose from but doesn't immediately know which one will lead to the target outcome, and so must use an optimization process to select the correct one.
Therefore, the Agent must use some procedure by which it
deduces which outcomes result from which actions, and then must
evaluate which outcome corresponds to the target outcome. It then
uses this evaluation to select the action that leads to the target
outcome.
In this case, the evaluation function contains the target
information.[2]
Note that planning can be thought of as an example of
optimization, where the outcomes that correspond to actions/plans
are evaluated by simulating the result of the plan using a world
model.
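Rendering this as a toy sketch (the names and the trivial world model are ours): the target outcome appears nowhere in the plan set or the dynamics; it lives entirely in the evaluation function, which is where the Internal Target Information sits.

target_outcome = "door_open"  # hypothetical target

def evaluate(outcome):
    # The Internal Target Information: a comparison against the target.
    return 1.0 if outcome == target_outcome else 0.0

def select_plan(plans, simulate):
    # Planning as optimization: simulate each plan with the world model,
    # then choose the plan whose simulated outcome scores best.
    return max(plans, key=lambda plan: evaluate(simulate(plan)))

# Toy world model mapping plans directly to outcomes.
outcomes = {"push_door": "door_open", "wait": "door_closed"}
print(select_plan(["push_door", "wait"], simulate=outcomes.get))  # -> "push_door"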
How can the Overseer Leverage Internal Target Information?
It seems clear that some Internal Target Information must exist
within agents; without such information, there is no way for the
Agent’s Action-Selection Mechanism to differentiate between
actions that lead to the target outcome and those that don’t.
Furthermore, this information is present within the internals of the
Agent before the target outcome is realized in the world because it
is used to select actions that lead to the target outcome.
Our hope is that the Overseer can detect this Internal Target
Information and use it to deduce what the Agent’s target is before
the target outcome is realized. This would allow the Overseer to
achieve its objective of intervening and preventing the Agent from
causing bad outcomes, as described above.[3]
See this section of this post for more ways this information could
be leveraged.
We conclude this section by pointing out that the Internal Target
Information of an arbitrary agent (such as a neural-network-based
agent) will probably be extremely hard to detect and interpret.
Despite this, our current research agenda aims at finding methods of detecting Internal Target Information, or at demonstrating that this is too difficult a task.
Extending the diagram above, information about the target
outcome is contained in the Agent and hence can, in theory, be
observed by the Overseer to inform its oversight.
Conclusion
In this post, we introduced the notion of Internal Target Information
and discussed how it might be used by an overseer to prevent
catastrophic outcomes from misaligned AIs.
In future work, we intend to explore further what shape Internal
Target Information takes within agents and investigate to what
extent it’s possible for an overseer to detect this Internal Target
Information.
Internal Target Information hidden deep within the agent. If we can
develop tools to detect it, this valuable information could help
prevent catastrophic outcomes. Credit: DALL-E 3.
1. ^ Or Decision-Making Process.
2. ^ We will explore the question of how this target information can be extracted in a future post. See this post for some initial explorations of this idea.
3. ^ The ability of the Overseer to properly evaluate the target outcome is a separate problem that could be thought of as a version of the
outer alignment problem. We will discuss this further in an
upcoming post.