{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f96b815e",
   "metadata": {},
   "source": [
    "# Semantic search and Retrieval augmented generation using Elasticsearch and OpenAI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0f537af",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/openai/openai-KNN-RAG.ipynb)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "349e0e74",
   "metadata": {},
   "source": [
    "This notebook demonstrates how to: \n",
    "- Index the OpenAI Wikipedia vector dataset into Elasticsearch \n",
    "- Embed a question with the OpenAI [`embeddings`](https://platform.openai.com/docs/api-reference/embeddings) endpoint\n",
    "- Perform semantic search on the Elasticsearch index using the encoded question\n",
    "- Send the top search results to the OpenAI [Chat Completions](https://platform.openai.com/docs/guides/gpt/chat-completions-api) API endpoint for retrieval augmented generation (RAG)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa9576ca",
   "metadata": {},
   "source": [
    "## Install packages and import modules "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c304b93",
   "metadata": {},
   "outputs": [],
   "source": [
    "# install packages\n",
    "\n",
    "!python3 -m pip install -qU openai pandas wget elasticsearch\n",
    "\n",
    "# import modules\n",
    "\n",
    "from getpass import getpass\n",
    "from elasticsearch import Elasticsearch, helpers\n",
    "import wget, zipfile, pandas as pd, json, openai"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de32a789",
   "metadata": {},
   "source": [
    "## Connect to Elasticsearch\n",
    "\n",
    "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook.\n",
    "If you don't already have an Elastic deployment, you can sign up for a free [Elastic Cloud trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=openai-cookbook).\n",
    "\n",
    "To connect to Elasticsearch, you need to create a client instance with the Cloud ID and password for your deployment.\n",
    "\n",
    "Find the Cloud ID for your deployment by going to https://cloud.elastic.co/deployments and selecting your deployment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a57b6a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
    "\n",
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
    "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
    "\n",
    "client = Elasticsearch(cloud_id=ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY)\n",
    "\n",
    "# Test connection to Elasticsearch\n",
    "print(client.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80b55952",
   "metadata": {},
   "source": [
    "## Download the dataset \n",
    "\n",
    "In this step we download the OpenAI Wikipedia embeddings dataset, and extract the zip file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c584f15c",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n",
    "wget.download(embeddings_url)\n",
    "\n",
    "with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\", \"r\") as zip_ref:\n",
    "    zip_ref.extractall(\"data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9654ac08",
   "metadata": {},
   "source": [
    "##  Read CSV file into a Pandas DataFrame\n",
    "\n",
    "Next we use the Pandas library to read the unzipped CSV file into a DataFrame. This step makes it easier to index the data into Elasticsearch in bulk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76347d10",
   "metadata": {},
   "outputs": [],
   "source": [
    "wikipedia_dataframe = pd.read_csv(\n",
    "    \"data/vector_database_wikipedia_articles_embedded.csv\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6af9f5ad",
   "metadata": {},
   "source": [
    "## Create index with mapping\n",
    "\n",
    "Now we need to create an Elasticsearch index with the necessary mappings. This will enable us to index the data into Elasticsearch.\n",
    "\n",
    "We use the `dense_vector` field type for the `title_vector` and  `content_vector` fields. This is a special field type that allows us to store dense vectors in Elasticsearch.\n",
    "\n",
    "Later, we'll need to target the `dense_vector` field for kNN search.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "681989b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "index_mapping = {\n",
    "    \"properties\": {\n",
    "        \"title_vector\": {\n",
    "            \"type\": \"dense_vector\",\n",
    "            \"dims\": 1536,\n",
    "            \"index\": \"true\",\n",
    "            \"similarity\": \"cosine\",\n",
    "        },\n",
    "        \"content_vector\": {\n",
    "            \"type\": \"dense_vector\",\n",
    "            \"dims\": 1536,\n",
    "            \"index\": \"true\",\n",
    "            \"similarity\": \"cosine\",\n",
    "        },\n",
    "        \"text\": {\"type\": \"text\"},\n",
    "        \"title\": {\"type\": \"text\"},\n",
    "        \"url\": {\"type\": \"keyword\"},\n",
    "        \"vector_id\": {\"type\": \"long\"},\n",
    "    }\n",
    "}\n",
    "client.indices.create(index=\"wikipedia_vector_index\", mappings=index_mapping)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2fb582e",
   "metadata": {},
   "source": [
    "## Index data into Elasticsearch \n",
    "\n",
    "The following function generates the required bulk actions that can be passed to Elasticsearch's Bulk API, so we can index multiple documents efficiently in a single request.\n",
    "\n",
    "For each row in the DataFrame, the function yields a dictionary representing a single document to be indexed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efee9b97",
   "metadata": {},
   "outputs": [],
   "source": [
    "def dataframe_to_bulk_actions(df):\n",
    "    for index, row in df.iterrows():\n",
    "        yield {\n",
    "            \"_index\": \"wikipedia_vector_index\",\n",
    "            \"_id\": row[\"id\"],\n",
    "            \"_source\": {\n",
    "                \"url\": row[\"url\"],\n",
    "                \"title\": row[\"title\"],\n",
    "                \"text\": row[\"text\"],\n",
    "                \"title_vector\": json.loads(row[\"title_vector\"]),\n",
    "                \"content_vector\": json.loads(row[\"content_vector\"]),\n",
    "                \"vector_id\": row[\"vector_id\"],\n",
    "            },\n",
    "        }"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8164b38",
   "metadata": {},
   "source": [
    "As the dataframe is large, we will index data in batches of `100`. We index the data into Elasticsearch using the Python client's [helpers](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/client-helpers.html#bulk-helpers) for the bulk API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aacb5e9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "start = 0\n",
    "end = len(wikipedia_dataframe)\n",
    "batch_size = 100\n",
    "for batch_start in range(start, end, batch_size):\n",
    "    batch_end = min(batch_start + batch_size, end)\n",
    "    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]\n",
    "    actions = dataframe_to_bulk_actions(batch_dataframe)\n",
    "    helpers.bulk(client, actions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "091ffc51",
   "metadata": {},
   "source": [
    "Let's test the index with a simple match query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ccc8955",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    client.search(\n",
    "        index=\"wikipedia_vector_index\",\n",
    "        query={\"match\": {\"text\": {\"query\": \"Hummingbird\"}}},\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "992b6804",
   "metadata": {},
   "source": [
    "## Encode a question with OpenAI embedding model\n",
    "\n",
    "To perform semantic search, we need to encode queries with the same embedding model used to encode the documents at index time.\n",
    "In this example, we need to use the `text-embedding-ada-002` model.\n",
    "\n",
    "You'll need your OpenAI [API key](https://platform.openai.com/account/api-keys) to generate the embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57385c69",
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://platform.openai.com/api-keys\n",
    "OPENAI_API_KEY = getpass(\"OpenAI API key: \")\n",
    "\n",
    "# Set API key\n",
    "openai.api_key = OPENAI_API_KEY\n",
    "\n",
    "# Define model\n",
    "EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
    "\n",
    "# Define question\n",
    "question = \"How big is the Atlantic ocean?\"\n",
    "\n",
    "# Create embedding\n",
    "question_embedding = openai.Embedding.create(input=question, model=EMBEDDING_MODEL)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7e6bf5d",
   "metadata": {},
   "source": [
    "## Run semantic search queries\n",
    "\n",
    "Now we're ready to run queries against our Elasticsearch index using our encoded question. We'll be doing a k-nearest neighbors search, using the Elasticsearch [kNN query](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html) option.\n",
    "\n",
    "First, we define a small function to pretty print the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1291434",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to pretty print Elasticsearch results\n",
    "\n",
    "\n",
    "def pretty_response(response):\n",
    "    for hit in response[\"hits\"][\"hits\"]:\n",
    "        id = hit[\"_id\"]\n",
    "        score = hit[\"_score\"]\n",
    "        title = hit[\"_source\"][\"title\"]\n",
    "        text = hit[\"_source\"][\"text\"]\n",
    "        pretty_output = f\"\\nID: {id}\\nTitle: {title}\\nSummary: {text}\\nScore: {score}\"\n",
    "        print(pretty_output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed8a497f",
   "metadata": {},
   "source": [
    "Now let's run our `kNN` query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc834fdd",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = client.search(\n",
    "    index=\"wikipedia_vector_index\",\n",
    "    knn={\n",
    "        \"field\": \"content_vector\",\n",
    "        \"query_vector\": question_embedding[\"data\"][0][\"embedding\"],\n",
    "        \"k\": 10,\n",
    "        \"num_candidates\": 100,\n",
    "    },\n",
    ")\n",
    "pretty_response(response)\n",
    "top_hit_summary = response[\"hits\"][\"hits\"][0][\"_source\"][\n",
    "    \"text\"\n",
    "]  # Store content of top hit for final step"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "276c1147",
   "metadata": {},
   "source": [
    "Success! Now you know how to use Elasticsearch as a vector database to store embeddings, encode queries by calling the OpenAI [`embeddings`](https://platform.openai.com/docs/api-reference/embeddings) endpoint, and run semantic search using [kNN search](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html) to find the top results.\n",
    "\n",
    "Play around with different queries, and if you want to try with your own data, you can experiment with different embedding models.\n",
    "\n",
    "Now we can use the [Chat completions](https://platform.openai.com/docs/api-reference/chat) API to work some generative AI magic using the top search result as additional context.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8abac103",
   "metadata": {},
   "source": [
    "## Use Chat Completions API for retrieval augmented generation\n",
    "\n",
    "\n",
    "Now we can send the question and the text to OpenAI's [Chat completions](https://platform.openai.com/docs/api-reference/chat) API.\n",
    "\n",
    "Using a LLM model together with a retrieval model is known as retrieval augmented generation (RAG). We're using Elasticsearch to do what it does best, retrieve relevant documents. Then we use the LLM to do what it does best, tasks like generating summaries and answering questions, using the retrieved documents as context. \n",
    "\n",
    "The model will generate a response to the question, using the top kNN hit as context. Use the `messages` list to shape your prompt to the model. In this example, we're using the `gpt-3.5-turbo` model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = openai.ChatCompletion.create(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": \"Answer the following question:\"\n",
    "            + question\n",
    "            + \"by using the following text:\"\n",
    "            + top_hit_summary,\n",
    "        },\n",
    "    ],\n",
    ")\n",
    "\n",
    "choices = summary.choices\n",
    "\n",
    "for choice in choices:\n",
    "    print(\"------------------------------------------------------------\")\n",
    "    print(choice.message.content)\n",
    "    print(\"------------------------------------------------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Code explanation\n",
    "\n",
    "Here's what that code does:\n",
    "\n",
    "- Uses OpenAI's model to generate a response\n",
    "- Sends a conversation containing a system message and a user message to the model\n",
    "- The system message sets the assistant's role as \"helpful assistant\"\n",
    "- The user message contains a question as specified in the original kNN query and some input text\n",
    "- The response from the model is stored in the `summary.choices` variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "Now you know how to use Elasticsearch as a vector database to store embeddings, encode queries by calling the OpenAI [`embeddings`](https://platform.openai.com/docs/api-reference/embeddings) endpoint, and run semantic search using [kNN search](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html) to find the top results.\n",
    "\n",
    "That was just one example of how to combine Elasticsearch with the power of OpenAI's models, to store embeddings, run a semantic search and enable retrieval augmented generation. RAG allows you to avoid the costly and complex process of training or fine-tuning models, by leveraging out-of-the-box models, enhanced with additional context.\n",
    "\n",
    "Use this as a blueprint for your own experiments.\n",
    "\n",
    "To adapt the conversation for different use cases, customize the system message to define the assistant's behavior or persona. Adjust the user message to specify the task, such as summarization or question answering, along with the desired format of the response."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.11.3 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}