# Similarity search with LangChain and OpenAI
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/langchain/langchain-vector-store.ipynb)


This notebook shows an example of similarity search for a query and demonstrates metadata filtering. First we split the documents into chunks using `langchain`, then index them into Elasticsearch through [`ElasticsearchStore.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).


## Install packages and import modules


In [1]:
# install packages
!python3 -m pip install -qU langchain langchain-elasticsearch openai tiktoken

# import modules
from getpass import getpass
from langchain_elasticsearch import ElasticsearchStore
from langchain.embeddings.openai import OpenAIEmbeddings
from urllib.request import urlopen
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.


We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment. It makes it easy to create the index and load data into it. In the `ElasticsearchStore` instance, we set `embedding` to [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html) to embed the texts, and we set the name of the Elasticsearch index used in this example.

In [78]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# https://platform.openai.com/api-keys
OPENAI_API_KEY = getpass("OpenAI API key: ")

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vector_store = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="workplace_index",
    embedding=embeddings,
)

## Download the dataset 

Let's download the sample dataset and deserialize the document.

In [101]:
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

response = urlopen(url)

workplace_docs = json.loads(response.read())
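
Before going further, it can help to confirm that each record carries the fields used later in this notebook (`content`, `name`, `summary`, `rolePermissions`). The sketch below shows one way to do that. The `missing_fields` helper and the `sample` records are hypothetical stand-ins, not part of the dataset or of any library.

```python
# Fields that the later cells read from every record.
REQUIRED_FIELDS = ("content", "name", "summary", "rolePermissions")


def missing_fields(records, required=REQUIRED_FIELDS):
    """Return {record index: [missing field names]} for incomplete records."""
    problems = {}
    for i, record in enumerate(records):
        missing = [field for field in required if field not in record]
        if missing:
            problems[i] = missing
    return problems


# Hypothetical sample records standing in for workplace_docs.
sample = [
    {"content": "text", "name": "Doc A", "summary": "s", "rolePermissions": ["manager"]},
    {"content": "text", "name": "Doc B"},
]
print(missing_fields(sample))
```

An empty dictionary from `missing_fields(workplace_docs)` would mean every record has the expected keys.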

## Split Documents into Passages


We will chunk these documents into passages of up to 512 tokens with an overlap of 256 tokens, using `RecursiveCharacterTextSplitter` with a tiktoken-based length function.


In [None]:
metadata = []
content = []

for doc in workplace_docs:
    content.append(doc["content"])
    metadata.append(
        {
            "name": doc["name"],
            "summary": doc["summary"],
            "rolePermissions": doc["rolePermissions"],
        }
    )

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
docs = text_splitter.create_documents(content, metadatas=metadata)
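
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences). The `sliding_chunks` helper and the whitespace tokenization are assumptions made only for illustration.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: whitespace "tokens" rather than real tiktoken tokens.
tokens = "a b c d e f g h i j".split()
for chunk in sliding_chunks(tokens, chunk_size=4, chunk_overlap=2):
    print(chunk)
```

With `chunk_size=512, chunk_overlap=256` as in the cell above, neighboring passages can share up to half of their tokens, so text cut at one boundary is likely to appear whole in an adjacent passage.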

## Index data into Elasticsearch

Next we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents). We will reuse the Cloud ID, API key, and index name values set in the `Connect to Elasticsearch` step.




In [131]:
documents = vector_store.from_documents(
    docs,
    embeddings,
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="workplace_index",
)

## Results function
Next, we will create a small function that prints the results of a query in a human-readable form. We will use it in the examples below to display the results.

In [159]:
def showResults(output):
    # Print the number of hits, then each returned document on its own line.
    print("Total results: ", len(output))
    for index in range(len(output)):
        print(output[index])

## Querying the dataset with similarity_search

Now that we have indexed our sample data into Elasticsearch, we will run a similarity search for the query `How does the compensation work?`. By default, `similarity_search` returns the top `4` documents.

In [160]:
query = "How does the compensation work?"
results = documents.similarity_search(query)

showResults(results)

Total results: 4
page_content='Compensation Bands:\nBased on the job levels, the following compensation bands have been established:\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\n\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\n\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.' metadata={'name': 'Compensation Framework For It Teams', 'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation b
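
Conceptually, a similarity search embeds the query and ranks the stored passage vectors by how close they are to the query vector. The sketch below illustrates this with brute-force cosine similarity over toy 3-dimensional vectors. The vectors, the passage names, and the `top_k` helper are made up for illustration, and Elasticsearch typically answers such queries with an approximate kNN index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, passages, k=4):
    """Return the names of the k passages most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in passages.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy embeddings; real OpenAI embeddings have far more dimensions.
passages = {
    "compensation bands": [0.9, 0.1, 0.0],
    "vacation policy": [0.1, 0.9, 0.0],
    "sales strategy": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]
print(top_k(query_vector, passages, k=2))
```

The `k` parameter here plays the same role as the `k` argument of `similarity_search`: it caps how many of the best-scoring passages are returned.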

## Querying the dataset for the top 10 documents

Now we will set `k=10` and run the same query to see the top 10 documents.



In [161]:
query = "How does the compensation work?"
results = documents.similarity_search(query, k=10)

showResults(results)

Total results: 10
page_content='Compensation Bands:\nBased on the job levels, the following compensation bands have been established:\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\n\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\n\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.' metadata={'name': 'Compensation Framework For It Teams', 'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation 

## Querying the dataset with metadata filtering
We will now add metadata filtering at query time, so that only documents whose `rolePermissions` field matches `manager` are returned.

In [163]:
query = "How does the compensation work?"
results = documents.similarity_search(
    query, filter=[{"match": {"metadata.rolePermissions": "manager"}}]
)

showResults(results)

Total results: 4
page_content='Compensation Bands:\nBased on the job levels, the following compensation bands have been established:\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\n\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\n\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.' metadata={'name': 'Compensation Framework For It Teams', 'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation b
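
The `filter` argument takes a list of Elasticsearch query clauses that restrict which passages are eligible. A `match` clause analyzes the text before comparing, whereas a `term` clause on a keyword field requires an exact value. The `role_filter` helper below is a hypothetical convenience that builds either form. The `.keyword` subfield is an assumption that holds only if the index uses Elasticsearch's default dynamic mapping for string fields.

```python
def role_filter(role, exact=False):
    """Build a filter list restricting search results to one role."""
    if exact:
        # Assumes a `.keyword` subfield exists (default dynamic mapping).
        return [{"term": {"metadata.rolePermissions.keyword": role}}]
    # Analyzed full-text match, as used in the cell above.
    return [{"match": {"metadata.rolePermissions": role}}]


print(role_filter("manager"))
print(role_filter("manager", exact=True))
```

The returned list can be passed straight to the search call, for example `documents.similarity_search(query, filter=role_filter("manager"))`.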