This repository was archived by the owner on May 7, 2026. It is now read-only.
Merged
reduce number of complaints to 5000
Henry J Solberg committed Nov 8, 2023
commit bbbf66d98120c6e41f67a967ad327a224856f2b4
10 changes: 5 additions & 5 deletions notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb
@@ -59,7 +59,7 @@
"\n",
"The goal of this notebook is to demonstrate a comment characterization algorithm for an online business. We will accomplish this using [Google's PaLM 2](https://ai.google/discover/palm2/) and [KMeans clustering](https://en.wikipedia.org/wiki/K-means_clustering) in three steps:\n",
"\n",
-"1. Use PaLM2TextEmbeddingGenerator to [generate text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) for each of 10000 complaints sent to an online bank. If you're not familiar with what a text embedding is, it's a list of numbers that are like coordinates in an imaginary \"meaning space\" for sentences. (It's like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding), but for more general text.) The important point for our purposes is that similar sentences are close to each other in this imaginary space.\n",
+"1. Use PaLM2TextEmbeddingGenerator to [generate text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) for each of 5000 complaints sent to an online bank. If you're not familiar with what a text embedding is, it's a list of numbers that are like coordinates in an imaginary \"meaning space\" for sentences. (It's like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding), but for more general text.) The important point for our purposes is that similar sentences are close to each other in this imaginary space.\n",
"2. Use KMeans clustering to group together complaints whose text embeddings are near to eachother. This will give us sets of similar complaints, but we don't yet know _why_ these complaints are similar.\n",
"3. Prompt PaLM2TextGenerator in English asking what the difference is between the groups of complaints that we got. Thanks to the power of modern LLMs, the response might give us a very good idea of what these complaints are all about, but remember to [\"understand the limits of your dataset and model.\"](https://ai.google/responsibility/responsible-ai-practices/#:~:text=Understand%20the%20limitations%20of%20your%20dataset%20and%20model)\n",
"\n",
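Note on step 1 of the notebook text in this hunk: the claim that similar sentences sit close together in "meaning space" can be illustrated with a small sketch. The three-dimensional vectors and the helper names `euclidean_distance` and `nearest_neighbor` below are hypothetical, chosen by hand for illustration; they are not output from PaLM2TextEmbeddingGenerator, whose real embeddings have far more dimensions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical hand-picked "embeddings" (assumption, not model output).
toy_embeddings = {
    "My card was charged twice": [0.9, 0.1, 0.0],
    "I was billed two times for one purchase": [0.8, 0.2, 0.1],
    "The mobile app crashes on login": [0.1, 0.9, 0.3],
}


def nearest_neighbor(text, embeddings):
    """Return the other text whose vector is closest to the vector of `text`."""
    others = (t for t in embeddings if t != text)
    return min(
        others,
        key=lambda t: euclidean_distance(embeddings[text], embeddings[t]),
    )


# The two billing complaints are neighbors; the app complaint is far away.
print(nearest_neighbor("My card was charged twice", toy_embeddings))
```

With these toy numbers the two billing complaints are each other's nearest neighbor, which is the property the clustering step relies on.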
@@ -394,8 +394,8 @@
},
"outputs": [],
"source": [
-"# Choose 10,000 complaints randomly and store them in a column in a DataFrame\n",
-"downsampled_issues_df = issues_df.sample(n=10000)"
+"# Choose 5,000 complaints randomly and store them in a column in a DataFrame\n",
+"downsampled_issues_df = issues_df.sample(n=5000)"
]
},
{
@@ -429,7 +429,7 @@
},
"outputs": [],
"source": [
-"# Will take ~5 minutes to compute the embeddings\n",
+"# Will take ~3 minutes to compute the embeddings\n",
"predicted_embeddings = model.predict(downsampled_issues_df)\n",
"# Notice the lists of numbers that are our text embeddings for each complaint\n",
"predicted_embeddings.head() "
@@ -494,7 +494,7 @@
},
"outputs": [],
"source": [
-"# Use KMeans clustering to calculate our groups. Will take ~5 minutes.\n",
+"# Use KMeans clustering to calculate our groups. Will take ~3 minutes.\n",
"cluster_model.fit(combined_df[[\"text_embedding\"]])\n",
"clustered_result = cluster_model.predict(combined_df[[\"text_embedding\"]])\n",
"# Notice the CENTROID_ID column, which is the ID number of the group that\n",
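Note on the final hunk: the `fit` then `predict` pattern, and the centroid ID assigned to each row, can be sketched in plain Python with Lloyd's algorithm. This is a minimal illustration under stated assumptions, not how `cluster_model` computes its result. The function names, the 2-D toy points, the evenly spaced initialization, and the 0-based centroid IDs are all choices made for this sketch.

```python
import math


def closest_centroid(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans_fit(points, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps.

    Assumption for this sketch: initial centroids are taken at evenly
    spaced positions in `points` so the run is deterministic. Practical
    implementations usually use random or k-means++ initialization.
    """
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its group.
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a group is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def kmeans_predict(points, centroids):
    """Centroid ID (0-based list index here) for each point."""
    return [closest_centroid(p, centroids) for p in points]


# Toy stand-ins for text embeddings: two well-separated groups.
points = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
centroids = kmeans_fit(points, k=2)
labels = kmeans_predict(points, centroids)
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Halving the sample from 10,000 to 5,000 rows shrinks the assignment step proportionally in a sketch like this, since every iteration visits every point once per centroid.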