Retrieval-Augmented Fine-Tuning (RAFT)

Last Updated : 06 Nov, 2025

LLMs tend to perform well on general NLP tasks, but they usually underperform in specialized domains such as medicine or law. Retrieval-Augmented Fine-Tuning (RAFT) bridges this gap by bringing together retrieval-augmented generation and fine-tuning. This hybrid method enables the model to recall domain-specific knowledge and reason over retrieved context, and it surpasses the more conventional methods on specialized datasets.

How RAFT Improves on RAG and Fine-Tuning

LLMs are typically trained on very large public datasets and are therefore good at generic tasks. However, in domains such as medicine, law or enterprise APIs, precision and context matter more than general knowledge. Two major paths have been explored to address this challenge: Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT).

  • Retrieval-Augmented Generation (RAG): RAG adds a retrieval layer that lets LLMs fetch relevant documents before answering, much like an open-book exam. It enables real-time lookup but struggles to reason over or adapt to domain-specific content.
  • Fine-Tuning (SFT): Direct fine-tuning trains the model on domain data itself, much like studying from the book. While this internalizes specialized knowledge, it fails when confronted with new or evolving data. It is also resource-intensive and likely to become outdated.

RAFT fuses the best of both worlds: the external retrieval of RAG and the deep domain learning of fine-tuning. It trains LLMs to retrieve, understand and reason over domain-specific documents effectively, yielding models that are more reliable, adaptive and high-performing in knowledge-heavy fields such as biomedicine, law and software documentation.

Understanding the RAFT Approach

  • Fine-tuning is like taking a closed-book exam: you rely only on what you remember.
  • RAG is like an open-book exam: you can consult materials but might not know which pages matter.
  • RAFT is like studying the book in depth and then taking an open-book exam: you know the subject matter and where to find the right answers. RAFT brings these two worlds together to create models that not only "know the book" but also "know how to use it."

Why RAFT Works Better?

  • Blends RAG and Fine-Tuning: Combines retrieval reasoning with domain learning for better context understanding.
  • Smart Student Analogy: Like a student who masters the book before an open-book exam, RAFT models both know the material and use it wisely.
  • Split Training: Mixes relevant and noisy data to strengthen reasoning and reduce over-reliance on retrieval.
  • Retrieval Robustness: Irrelevant or incorrect retrieved documents are handled gracefully.
  • Domain Adaptation: Particularly excels in domain-specific sectors such as biomedicine, law and APIs.
  • Evidence-Based Learning: It focuses on reasoning from retrieved content rather than memorization.
  • Proven Results: Outperforms both standard RAG and plain fine-tuned models on domain-specific benchmarks.

Key Components

  • Question (Q): The primary input that elicits a domain-specific response.
  • Document Set (Dk): Includes Relevant Documents (D*) with accurate info and Distractor Documents (Di) that teach the model to ignore noise.
  • Chain-of-Thought Answer (A*): Reasoning-based response showing how to connect retrieved facts logically.
  • Balanced Training Mix: Combines Relevant and Distractor data to identify key info and Distractor-Only data to strengthen independent reasoning; a sketch of how one such example is assembled follows this list.
  • Fine-Tuning: SFT trains for structured, step-by-step reasoning using the data provided.
  • Retrieval Awareness: Improves comprehension of the content retrieved and resilience to mistakes.
  • Inference: Uses retrieved documents at runtime to craft comprehensive, evidence-based answers.
  • Objective: Integrate retrieval accuracy with deep reasoning for context-aware domain-specific responses.
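
A minimal sketch of how these components can be assembled into a single training example is shown below. The helper build_raft_example, its arguments (oracle_doc for D*, distractor_docs for Di, cot_answer for A*) and the 80/20 oracle/distractor split are illustrative assumptions, not code from the implementation that follows.
Python
import random

def build_raft_example(question, oracle_doc, distractor_docs, cot_answer, p_oracle=0.8):
    # With probability (1 - p_oracle) the oracle document D* is left out,
    # forcing the model to rely on its own domain knowledge despite the distractors.
    docs = list(distractor_docs)
    if random.random() < p_oracle:
        docs.append(oracle_doc)
    random.shuffle(docs)  # the position of the relevant document should not matter
    context = "\n\n".join(docs)
    return {
        "input_text": f"Context:\n{context}\n\nQuestion: {question}",
        "target_text": cot_answer,  # chain-of-thought answer grounded in D*
    }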

Step-By-Step Implementation

1. Fine-Tuning the LLM

Step 1: Import Required Libraries

  • Import essential libraries from Hugging Face and LangChain.
  • transformers for model fine-tuning, langchain for embeddings and vector storage and datasets for dataset management.
Python
import os
import json
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

Step 2: Load Dataset and Prepare Question-Answer Pairs

  • Load the Dataset file containing Q&A data.
  • Each data point contains a question and its answer.
  • Combine them into a unified format suitable for fine-tuning.
Python
data_path = "Dataset.json"

with open(data_path, "r") as f:
    qa_data = json.load(f)

pairs = [
    {
        "question": item["question"],
        "answer": item["answer"],
        "text": f"Question: {item['question']}\nAnswer: {item['answer']}",
    }
    for item in qa_data
]
all_text = "\n".join([p["answer"] for p in pairs])
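
For reference, the code above assumes Dataset.json holds a JSON list of question-answer objects. A hypothetical sample (written here as the equivalent Python literal, not the actual file used in this article) would look like this:
Python
# Hypothetical contents of Dataset.json, shown as a Python literal
sample_qa_data = [
    {
        "question": "What did Einstein contribute to modern physics?",
        "answer": "Einstein developed the theory of relativity and explained the photoelectric effect."
    },
    {
        "question": "Who formulated the laws of motion?",
        "answer": "Isaac Newton formulated the three laws of motion."
    }
]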

Step 3: Create Vector Database Using Embeddings

  • Split the combined answer text into chunks and use HuggingFaceEmbeddings to convert each chunk into an embedding.
  • Store the embeddings in a Chroma vector database for similarity search.
Python
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_text(all_text)
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_texts(chunks, embedding_model)

Output:

The embedding model loads and the text chunks are stored in the Chroma vector database.

Step 4: Retrieve Context for Each Question

  • Define a function to fetch the top-k relevant text chunks for a question.
  • This context will later be used to enhance model inputs.
Python
def retrieve_context(question, k=2):
    results = db.similarity_search(question, k=k)
    return " ".join([r.page_content for r in results])

Step 5: Create RAFT Training Dataset

  • Each training example includes context and question as input and answer as target.
  • Helps the model learn to use context efficiently.
Python
raft_data = []
for item in pairs:
    context = retrieve_context(item["question"])
    raft_data.append({
        "input_text": f"Context: {context}\nQuestion: {item['question']}",
        "target_text": item["answer"]
    })

dataset = Dataset.from_list(raft_data)

Step 6: Load Model and Tokenizer

  • Load a language model (distilgpt2).
  • Set padding token to avoid training errors.
Python
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

Step 7: Tokenize the Dataset

  • Tokenize both input and target text for the model.
  • Labels correspond to the tokenized answers.
Python
def tokenize(batch):
    inputs = tokenizer(batch["input_text"], truncation=True, padding="max_length", max_length=256)
    labels = tokenizer(batch["target_text"], truncation=True, padding="max_length", max_length=256)["input_ids"]
    inputs["labels"] = labels
    return inputs

tokenized_ds = dataset.map(tokenize, batched=True)

Step 8: Define Training Parameters

  • Configure fine-tuning parameters: epochs, batch size, learning rate and checkpoint strategy.
  • Trainer API handles model training.
Python
training_args = TrainingArguments(
    output_dir="./raft_finetuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="no",
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=10,
    logging_dir="./logs",
)

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_ds)
trainer.train()

Step 9: Save Fine-Tuned RAFT Model

Save both the fine-tuned model and tokenizer locally for reuse

Python
model.save_pretrained("./raft_finetuned_model")
tokenizer.save_pretrained("./raft_finetuned_model")

2. RAG over the Fine-Tuned Model

Step 10: Load RAFT Fine-Tuned Model for RAG

  • Reload the fine-tuned model and tokenizer for RAG integration.
  • Define a corpus (knowledge base) for context retrieval.
Python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

model_path = "./raft_finetuned_model"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token

corpus_text = """......................"""  # the domain knowledge-base text used for retrieval

Step 11: Split and Embed Corpus for RAG

Convert corpus into retrievable chunks and store in Chroma vector DB

Python
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_text(corpus_text)

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = Chroma.from_texts(chunks, embedding_model)

Step 12: Define RAG Retrieval Function

Retrieve relevant text segments from the knowledge base

Python
def retrieve_documents(query, k=2):
    results = vector_db.similarity_search(query, k=k)
    return " ".join([r.page_content for r in results])

Step 13: Generate Contextual Answers with RAG

  • Use retrieved context and question as input to the fine-tuned model.
  • Generate answers that are both factual and context-aware.
Python
def rag_generate_answer(question):
    context = retrieve_documents(question)
    input_text = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, top_p=0.9, temperature=0.7, repetition_penalty=1.2)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

Step 14: Ask a Question Using the RAFT Pipeline

  • The function ask_question_with_raft() takes a user’s question as input and prints it for reference.
  • It then calls rag_generate_answer(question), which retrieves relevant context from the database and generates an accurate, context-aware answer using the fine-tuned RAFT model.
Python
def ask_question_with_raft(question):
    print(f"\n Question: {question}")
    answer = rag_generate_answer(question)
    print("\n Generated Answer:")
    return answer

question = "What did Einstein contribute to modern physics?"
print(ask_question_with_raft(question))

Output:

RAFT model generating a context-based answer using RAG.


Applications

  1. Healthcare and Biomedicine: Interprets medical research and clinical notes, improving answer precision on specialized biomedical datasets.
  2. Software and API: Retrieves and reasons over documentation to generate executable API calls for frameworks like TensorFlow or HuggingFace.
  3. Legal and Finance: Summarises regulations, retrieves precedents and supports compliance-based reasoning.
  4. Enterprise Knowledge Management: Powers chatbots and assistants to give accurate, context-aware responses from internal data.
  5. Scientific Research: Helps with literature reviews, extraction of insights and hypothesis validation from domain papers.
  6. Customer Support: Integrates product knowledge with real-time retrieval for fast and accurate query handling.
  7. Education and e-Learning: Personalizes tutoring through referencing relevant materials and reasoning over student questions.

Challenges

  • Complex Data Setup: Creating and maintaining domain-specific retrieval databases is expensive and time-consuming.
  • Balance Issues: Over-reliance on retrieval hurts generalization, whereas under-use weakens domain precision.
  • High Resource Consumption: The two-stage pipeline of retrieval plus fine-tuning is computationally expensive.
  • Evaluation Difficulty: It is hard to isolate RAFT's gains from those of standard RAG or fine-tuning alone.
  • Knowledge Drift: Outdated data degrades retrieval accuracy over time.
  • Integration Overhead: Adopting RAFT in an existing enterprise system is technically challenging.
  • Lack of Standards: There is no uniform framework yet, so implementations vary across teams.
