Claude 3 Technical Dive
Claude 3 Technical Dive
Technical dive
Prompting + evals + RAG + tool use
Claude 3 overview
There’s no “one size fits all” model for enterprise AI
Advanced
Analysis
Extraction & Agents &
Classification Tool Use
Search &
Retrieval
Basic Chat
*Intelligence score (percentage) is an average of top published benchmarks for each model
Anthropic now has the best model family in the world
Faster models available in Better results out-of-the-box Twice as accurate as The fastest vision model with
each intelligence class with less prompt optimization Claude 2.1 on difficult, comparable quality to other
and fewer refusals open-ended questions state-of-the-art models
Faster models across intelligence classes
In under 2 seconds, Claude can read an entire1
1. Speeds measured with internal evaluations. Initially, production speeds may be slower. We expect to reach these speeds at or shortly after launch, with significant further improvements to come as we optimize these models for our customers.
More steerable in key areas for enterprise
Better results Reduced Improved JSON
out-of-the-box refusals formatting
with less time spent on prompt with increased ability to recognize real for easier integration in enterprise
engineering or prompt migrating harms over false positives applications
Incorrect refusals
Higher accuracy and trustworthiness
Trust-by-default
Claude 3 Opus is ~2x more accurate than Claude 2.1 and has
near-perfect recall accuracy across its entire context window 1,2
Summarize this
report
1. Elo scores as evaluated by human raters in head-to-head tests (we compare Claude 3 Sonnet and Claude 2 models because Sonnet is their most direct successor, improving on Claude 2 on all axes, including capabilities, price, and speed). See theClaude 3 Model
Card for further details
Stronger performance across key skills1
1. Elo scores as evaluated by human raters in head-to-head tests (we compare Claude 3 Sonnet and Claude 2 models because Sonnet is their most direct successor, improving on Claude 2 on all axes, including capabilities, price, and speed). See theClaude 3 Model
Card for further details
Stronger performance across tasks in various industries1
1. Elo scores as evaluated by expert human raters evaluating performance for tasks related to their domain of expertise in head-to-head tests (we compare Claude 3 Sonnet and Claude 2 models because Sonnet is their most direct successor, improving on Claude2
on all axes, including capabilities, price, and speed). See the Claude 3 Model Card for further details
Stronger performance across tasks in various industries1
1. Elo scores as evaluated by expert human raters evaluating performance for tasks related to their domain of expertise in head-to-head tests (we compare Claude 3 Sonnet and Claude 2 models because Sonnet is their most direct successor, improving on Claude2
on all axes, including capabilities, price, and speed). See the Claude 3 Model Card for further details
Increased agentic capabilities
● Illegal Activity
● Hate/Violence
● Economic Harm
● Fraud
● Adult Content
● Privacy Violation
● Unauthorized Practice of Law
● Unauthorized Practice of Medical Advice
● High Risk Government Decision Making
1. Zeng, Yi and Lin, Hongpeng and Zhang, Jingwen and Yang, Diyi and Jia, Ruoxi and Shi, Weiyan. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. 2024
Robust Safety and Security Advantages Uniquely
Position Us for Enterprise Opportunities
1
• HIPAA compliant on both our native API and on
AWS Bedrock
HIPAA
Compliance(1) • Uniquely positioned to serve high trust industries
that process large volumes of sensitive user data
2
• Bedrock partnership allows us to benefit from AWS
enterprise security credentials, including FedRAMP
AWS Bedrock authorization
Credentials
• Makes Claude accessible to customers seeking today’s
strictest enterprises security standards
Security Focus • We are the industry leaders for research and training
practices that maximize safety and reliability, such as
red teaming, reinforcement learning from human
feedback, and Constitutional AI
We codify a set of principles to reduce This technique does not require The output of the system is more honest,
harmful behavior time-intensive human feedback data helpful, and harmless
sets, but rather more efficient
AI-generated datasets
“ ”
Listed as one of the
The 3 Most Important AI Innovations of 2023
-TIME Magazine, December 2023
Claude use cases
What can you do with Claude?
Dialogue and role-play
Content
moderation
Summarization and
Q&A
Translation
Classification, metadata
extraction, & analysis Database querying &
retrieval
Coding-related tasks
Non-exhaustive
What can you do with Claude?
Dialogue and roleplay Text summarization and Q&A
● Customer support ● Books
● Chat threads
● Email threads
Non-exhaustive
What can you do with Claude?
Text & content generation Content moderation
● Copywriting ● Ensuring communications
adherence to internal guidelines
● Email drafts
● Ensuring user adherence to terms
● Paper outlines and conditions
● Fiction generation ● Trawling for acceptable use policy
● Detailed documentation violations
● Speeches
Non-exhaustive
What can you do with Claude?
Non-exhaustive
What can you do with Claude?
All of these tasks can be done with:
This includes:
● Task context
● Data
● Conversation / action history
● Instructions
● Examples
● And more!
Example:
Parts of a prompt User
You will be acting as an AI career coach named Joe created by the company AdAstra
Careers. Your goal is to give career advice to users. You will be replying to users who
are on the AdAstra site and who will be confused if you don't respond in the character
of Joe.
1. Task context You should maintain a friendly customer service tone.
Here is the career guidance document you should reference when answering the user:
2. Tone context <guide>{{DOCUMENT}}</guide>
Here are some important rules for the interaction:
3. Background data, documents, and images - Always stay in character, as Joe, an AI from AdAstra careers
- If you are unsure how to respond, say “Sorry, I didn’t understand that. Could you
repeat the question?”
4. Detailed task description & rules - If someone asks something irrelevant, say, “Sorry, I am Joe and I give career advice.
Do you have a career question today I can help you with?”
Assistant
<response>
(prefill)
What is prompt engineering?
What is 2 + 2?
1. A subtraction problem
2. An addition problem
4 ¿Cuánto es 2 + 2?
3. A multiplication problem
4. A division problem
Develop test
cases
When building test cases for an evaluation suite, make sure you test a
comprehensive set of edge cases.
Eval Score
Model (black box) (number)
What does an eval look like?
Example input Golden output Rubric Model response Eval score
The entire
prompt or only
An ideal
response to
Guidelines for
grading a + The model’s = A numerical
score assessing
the variable model’s actual latest response the model’s
grade against
content response response
Score CORRECT
https://docs.anthropic.com/claude/docs/empirical-performance-evaluations
Example: open answer eval (OA) - by
multiple models
Prompt: How do I make a chocolate cake?
EVALS!
Claude 3 can only be used via the Messages
API
Messages API
Text Completions API "system": "Today is December 19,
2023.",
Today is December 19, 2023. "messages": [
Human: What are 3 ways to cook apples? { "role": "user", "content": "What
Output your answer in numbered <method> are 3 ways to cook apples? Output your
XML tags. answer in numbered <method> XML tags."
},
Assistant: <method 1> { "role": "assistant", "content":
"<method 1>" }
]
The full prompt above includes the words
after “Assistant”. This is a technique called
prefilling Claude’s response - we’ll talk about
it in later slides
https://docs.anthropic.com/claude/reference/migrating-from-text-completions-to-messages
Other benefits of the Messages API
● Image processing: The Messages API is the only way to process images
with Claude
Assistant dialogue:
// curl -X POST
https://api.anthropic.com/v1/messages
○ User: [Instructions] {
via “user” and “assistant” roles { "role": "assistant", "content": "Hi, I'm
Claude!" },
● System prompts belong in a separate { "role": "user", "content": "Hi Claude. How
many toes do dogs have?" }
“system” property ]
}
1. Be clear and direct Example:
● Claude responds best to clear and User Write a haiku about robots
○ Wrap variables in XML tags as Prompt I will tell you the name of an animal. Please
template respond with the noise that animal makes.
good organization practice <animal>{{ANIMAL}}</animal>
Claude
[Gives correct response]
response
6. Have Claude think step by step
Thinking only happens if it’s thinking out loud
User [rest of prompt] Before answering,
please think about the question
within <thinking></thinking> XML
tags. Then, answer the question within Claude [...some thoughts]</thinking>
<answer></answer> XML tags. response
<answer>[some answer]</answer>
Assistant
<thinking>
(prefill)
Increases intelligence of responses but also increases latency by adding to the length of the output.
Also helps with troubleshooting Claude’s logic & seeing where prompt instructions can be refined.
7. Use examples (aka n-shot prompting)
Example:
● Examples are probably the single User I will give you some quotes. Please extract the author from
the quote block.
most effective tool for getting
Here is an example:
Claude to behave as desired <example>
Quote:
“When the reasoning mind is forced to confront the
● Make sure to give Claude examples impossible again and again, it has no choice but to adapt.”
― N.K. Jemisin, The Fifth Season
of common edge cases Author: N.K. Jemisin
</example>
Quote:
● Generally more examples = more “Some humans theorize that intelligent species go extinct
before they can expand into outer space. If they're
reliable responses at the cost of correct, then the hush of the night sky is the silence of the
graveyard.”
latency and tokens ― Ted Chiang, Exhalation
Author:
Claude
Ted Chiang
response
What makes a good example?
Relevance
Diversity
● Are the examples diverse enough for Claude not to overfit to unintended
patterns and details?
● Are the examples equally distributed among the task types or response
types? (e.g., if generating multiple choice questions, every example
answer isn’t C)
Generating examples is hard.
How can Claude help?
Grading/Classification
Example generation
Claude
● When you have multiple images, [Claude's response]
response
enumerate each image, like
“Image 1:” and “Image 2:” User Image 3: [Image 3] Image 4: [Image 4]
Are these images similar to the first two?
https://docs.anthropic.com/claude/docs/vision
Advanced prompt
engineering
Chaining prompts
For tasks with many steps, you can break the task up and chain together Claude’s
responses Allows you to get more out of the long context window & Claude will be less likely to make
mistakes or miss crucial steps if tasks are split apart - just like a human!
Example:
User Find all the names from the below text: User Here is a list of names:
"Hey, Jesse. It's me, Erin. I'm calling about the <names>{{NAMES}}</names> Please
party that Joey is throwing tomorrow. Keisha alphabetize the list.
said she would come and I think Mel will be
there too."
Claude <names>
Erin
response
Assistant Jesse
<names> Joey
(prefill) Keisha
Mel
Claude Jesse </names>
Erin
response Joey
Keisha
Mel
</names>
Chaining prompts: ask for rewrites
Example:
● You can call Claude a second time, give User You will be given a prompt + output from an LLM to
it a rubric or judgment guidelines, and assess.
showing them to the user Assess whether the LLM’s python output is fully
executable and correctly written to do {{TASK}}. If
the LLM’s code is correct, return the code verbatim
○ Having Claude rewrite or fix its as it was. If not, fix the code and output a corrected
version that is:
answer to match the rubric’s 1. Fully executable
2. Commented thoroughly enough for a
highest standard beginner software engineer to understand
3. …
Long context prompting tips
● When dealing with long documents, put the doc before the details & query
● Longform input data MUST be in XML tags so it’s clearly separated from the instructions
● Have Claude find relevant quotes first before answering, and to answer only if it finds
relevant quotes
● Have Claude read the document carefully because it will be asked questions later
User I'm going to give you a document. Read the document carefully, because I'm going to ask you a question about it. Here is the document:
<document>{{TEXT}}</document>
First, find the quotes from the document that are most relevant to answering the question, and then print them in numbered order.
Quotes should be relatively short. If there are no relevant quotes, write "No relevant quotes" instead.
Then, answer the question, starting with "Answer:". Do not include or reference quoted content verbatim in the answer. Don't say
"According to Quote [1]" when answering. Instead make references to quotes relevant to each section of the answer solely by adding their
bracketed numbers at the end of relevant sentences.
Thus, the format of your overall response should look like what's shown between the <examples></examples> tags. Make sure to
follow the formatting and spacing exactly.
<examples>
[Examples of question + answer pairs using parts of the given document, with answers written exactly like how Claude’s output should be
structured]
</examples>
● Ask Claude to find relevant quotes from long documents then answer
using the quotes
Prompt injections & bad user behavior
● Claude is naturally highly resistant to prompt
Example
injection and bad user behavior due to
harmlessness screen:
Reinforcement Learning from Human Feedback
(RLHF) and Constitutional AI User A human user would like you to
continue a piece of content. Here is
the content so far:
● For maximum protection: <content>{{CONTENT}}</content>
● Claude does not directly call its tools but instead decides which tool to call
and with what arguments. The tool is then actually called and the code
executed by the client, the results of which are then passed back to Claude.
How does tool use work?
Q: What’s the weather like in San
Francisco right now?
Tool description:
<tools> View a full example function
<tool_description> calling prompt in our tool
<tool_name> use documentation
get_weather
<tool_name>
…
</tool_description>
<tool_description>
<tool_name>
[Other function]
<tool_name>
…
</tool_description>
</tools>
How does tool use work?
YES
Outputs tool call:
NO
<function_calls>
<invoke>
<tool_name>get_weather</tool_name>
<parameters>
<latitude>37.0</latitude>
<longitude>-122.0</longitude>
</parameters> A: I apologize but I don’t have access to the
</invoke>
</function_calls> current weather in San Francisco.
= Claude
How does tool use work? (if YES)
Claude requests a tool:
<function_calls>
…
get_weather Tool results are passed
... back to Claude:
</function_calls>
<function_results>
...
68, sunny
...
Client </function_results>
get_weather()
See our tool use documentation for more details.
= Claude
Tool use: SQL generation
● Claude can reliably generate
SQL queries provided it’s been
given:
○ A schema
○ A description of the SQL
tool (defined like any
other tool)
○ A client-side parser to
extract and return the
SQL commands
● See a basic example SQL
generation prompt (sans tool
use) from our prompt library
Tool use tips
● Within the prompt, make sure to explain the function’s / tool’s capabilities and call
syntax in detail
● Provide a diverse set of examples of when and how to use the tool (see documentation),
showing the full journey of:
○ Initial user prompt → Tool call → Tool results → Final Claude response for each
example
● We are working on improving tool use functionality in the near future, including:
2. This question is fed into the search tool (e.g., a vector database of Amazon
products)
3. The results from the search tool are passed to the LLM alongside the
question
4. The LLM answers the user’s original question based on the retrieved results
Basic RAG architecture
If you want to search through the same database every time, this is the basic RAG setup.
Products
vector DB
= Claude
RAG as a tool
You can also provide Claude with RAG as a tool, enabling Claude to use RAG selectively
and in smarter and more efficient ways that can yield higher quality results.
= Claude
RAG with LLM judgement
With RAG as a tool, you can set up your architecture to have Claude avoid RAG if RAG is not
useful in answering the question.
Products
vector DB
Create prompt
Embed Similarity Generate A: Sure! How can I
Q: Hey I need some help (query +
query search completion help you today?
results)
= Claude
RAG with database choice
Furthermore, within your RAG tool, you can have multiple databases and have Claude judge
which database would be more useful to retrieve data from in order to answer its query.
Products Customer
vector DB service
vector DB
= Claude
RAG with query rewrites
You might want to enable Claude to rewrite the search query and / or re-query the data
source if it doesn’t find what it’s looking for the first time (until it hits an established criteria of
result quality or tries X amount of times).
Products
vector DB
Rewrite query
A: There are lots of great
science-themed gifts that can
Q: I want to get my
daughter more Create prompt help get your daughter excited
Embed Similarity Generate about learning! Here are a
interested in science. (query + few:
What kind of gifts query search completion
should I get her? results) - Hey! Play! Kids science Kit
- ScienceWiz Inventions Kit
…
= Claude
Structuring document lists (for RAG etc.)
We recommend using this format when passing Claude documents or RAG snippets
Here are some documents for you to reference for your task: ● Can also include other
<documents> metadata, either as separate
<document index="1">
<source>
XML tags (like “source”) or
(a unique identifying source for this item - could be a URL, file name, hash, etc) within the document tag (like
</source>
<document_content> “index”)
(the text content of the document - could be a passage, web page, article, etc)
</document_content>
</document> ● In your prompt, you can refer to
<document index="2"> docs by their indices or
<source>
(a unique identifying source for this item - could be a URL, file name, hash, etc) metadata, like “In the first
</source>
<document_content>
document…”
(the text content of the document - could be a passage, web page, article, etc)
</document_content>
</document>
...
</documents>
[Rest of prompt]
RAG caveats
● Hallucinations can get a little worse for very long documents in retrieval (i.e., past 30K
tokens)
● Claude has been trained on web and embedding-based search explicitly; this
generalizes well to other search tools, but Claude’s performance can be improved by
providing specific descriptions of other search tools within the prompt (as you would
with any other tool):
● The key is to break your content into the smallest chunk that balances returning relevant context around the
answer while avoiding noisy superfluous content1
● For example, retrieved content for an FAQ chatbot is not useful if you’re returning only the keyword and a few words
surrounding it, but it’s also overly noisy if relevant content is only two sentences of a multi-paragraph chunk
Reranking in RAG terms is when retrieved results are reranked based on topical similarity to the user’s
query, allowing only the most relevant results to be passed to the LLM’s context
● With Claude, you can use Claude in addition to or in place of a reranking mechanism by having Claude rewrite and
retry keyword queries until it retrieves optimal results based on a rubric that you define (for an example, see our
Wikipedia search cookbook)
● For traditional reranking tips, we recommend reading Pinecone’s blog post “Rerankers and Two-Stage Retrieval”
1. See LlamaIndex’s blog post Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex for additional reading and advice
Useful resources
Prompting tools
Experimental metaprompt tool
● We offer an experimental metaprompt tool (also at https://anthropic.com/metaprompt-notebook)
where Claude is “meta” prompted to write prompt templates on the user’s behalf,
given a topic or task details
● Some notes:
○ The metaprompt is meant as a starting point to solve the “blank page” issue by
outputting a well performing, decently engineered prompt
○ The metaprompt does not guarantee that the prompt it creates will be 100%
optimized or ideal for your use case
https://anthropic.com/prompts
Anthropic prompt library
https://anthropic.com/prompts
Guide to API parameters
Guide to API parameters
Length Randomness & diversity
● Claude models may stop before reaching this maximum. This parameter only specifies the absolute maximum
number of tokens to generate
● You might use this if you expect the possibility of very long responses and want to safeguard against getting stuck
in long generative loops
stop_sequences
● Customizable sequences that will cause the model to stop generating completion text
● Claude automatically stops when it’s generated all of its text. By providing the stop_sequences parameter, you
may include additional strings that will cause the model to stop generating
● We recommend using this, paired with XML tags as the relevant stop_sequence, as a best practice method to
generate only the part of the answer you need
Guide to API parameters
Length Randomness & diversity
temperature
● Amount of randomness injected into the response
● Temperature 0 will generally yield much more consistent results over repeated trials using the same prompt
● Anthropic cookbook: code & implementation examples for a variety of capabilities, use
cases, integrations, and architectures
● Anthropic’s Python SDK & TypeScript SDK (Bedrock SDK included as a package)
● API documentation
● Prompt library: a repository of starter prompts for both work and personal use cases
(currently houses text-only prompts)
Happy developing!