I'm building a chatbot for our crop databases. We'd like users to interact with the databases using natural language. The plan is to fine-tune a model (with t5-large as the baseline) and use RAG to retrieve the relevant schema, table, and column names from our relational database (Postgres in this case). The fine-tuned model then maps the NL question to the corresponding SQL query and outputs the result along with its complete reasoning, so the answer is clear and informative for the user.
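Here's a rough sketch of the flow I have in mind (simplified; the model path and table names are placeholders, and the real retriever would be embedding-based rather than keyword matching):

```python
# Rough sketch of the NL-to-SQL pipeline: retrieve schema context, then generate.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Schema snippets the RAG step would normally pull from Postgres
# (information_schema) and rank by similarity to the question.
SCHEMA_DOCS = {
    "crop_yield": "crop_yield(crop_id, season, region, yield_tonnes)",
    "crop_master": "crop_master(crop_id, crop_name, crop_family)",
}

def retrieve_schema(question: str) -> str:
    # Placeholder retrieval: keyword overlap instead of vector search.
    hits = [doc for name, doc in SCHEMA_DOCS.items()
            if name.split("_")[0] in question.lower()]
    return "\n".join(hits or SCHEMA_DOCS.values())

tokenizer = AutoTokenizer.from_pretrained("our-finetuned-t5-large")  # placeholder path
model = AutoModelForSeq2SeqLM.from_pretrained("our-finetuned-t5-large")

def nl_to_sql(question: str) -> str:
    prompt = f"translate to SQL given schema:\n{retrieve_schema(question)}\nquestion: {question}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(nl_to_sql("What was the crop yield per region last season?"))
```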
My question: we have almost 200 tables across 16 schemas in this one database, and so far I've prepared a training dataset covering 15 tables and the relationships between them. I know this dataset is small, but I'd still like to try it. What factors should I focus on to improve the model's performance?
My current challenge: I'm spending significant time preparing the training dataset. What's the best way to build more diversified NL-to-SQL query pairs, including edge cases specific to our database?
What's the best way to move forward? Please let me know which tools I can use to build a more robust training dataset.
I love this community. Thanks in advance for all your effort and input!
Hi @Legmys_Buddy,
Welcome to the Google Cloud Community!
I'm glad to hear you find the community helpful. If you'd like to build better datasets for chatbot training, there are several ways you can achieve this effectively on Google Cloud.
I'm building a chatbot for our crop databases. We'd like users to interact with the databases using natural language. The plan is to fine-tune a model (with t5-large as the baseline) and use RAG to retrieve the relevant schema, table, and column names from our relational database (Postgres in this case). The fine-tuned model then maps the NL question to the corresponding SQL query and outputs the result along with its complete reasoning, so the answer is clear and informative for the user.
I’d like to understand your use case better. Can you share a specific example of a prompt and its expected output?
My current challenge: I'm spending significant time preparing the training dataset. What's the best way to build more diversified NL-to-SQL query pairs, including edge cases specific to our database?
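One approach many teams use is to generate synthetic NL-SQL pairs from your schema metadata with templates, then paraphrase them for linguistic diversity and spot-check a sample by hand. Here's a minimal sketch; the schema entries, templates, and column names below are illustrative, not taken from your database:

```python
# Minimal sketch of template-based NL-SQL pair generation. In practice you
# would read the schema from information_schema and add paraphrases
# (e.g., with an LLM) for linguistic diversity.
import json, random

SCHEMA = {
    "crop_yield": {"columns": ["region", "season", "yield_tonnes"], "measure": "yield_tonnes"},
    "soil_sample": {"columns": ["region", "ph", "nitrogen_ppm"], "measure": "nitrogen_ppm"},
}

TEMPLATES = [
    ("What is the average {measure} per {dim} in {table}?",
     "SELECT {dim}, AVG({measure}) FROM {table} GROUP BY {dim};"),
    ("Show the top 5 {dim} values by {measure} in {table}.",
     "SELECT {dim}, {measure} FROM {table} ORDER BY {measure} DESC LIMIT 5;"),
    # Edge cases are worth covering explicitly: NULLs, joins, date ranges, etc.
    ("How many rows in {table} have no {dim} recorded?",
     "SELECT COUNT(*) FROM {table} WHERE {dim} IS NULL;"),
]

def generate_pairs(n: int = 10) -> list[dict]:
    pairs = []
    for _ in range(n):
        table, meta = random.choice(list(SCHEMA.items()))
        dim = random.choice([c for c in meta["columns"] if c != meta["measure"]])
        nl_tpl, sql_tpl = random.choice(TEMPLATES)
        fields = {"table": table, "dim": dim, "measure": meta["measure"]}
        pairs.append({"question": nl_tpl.format(**fields), "sql": sql_tpl.format(**fields)})
    return pairs

print(json.dumps(generate_pairs(3), indent=2))
```

A useful extra step is to validate every generated query with `EXPLAIN` against a staging copy of your database, so syntactically broken pairs never enter the training set.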
For querying your database with natural language, I recommend checking out these articles:
Google Cloud also offers AlloyDB for PostgreSQL, a fully managed, PostgreSQL-compatible database service that delivers high performance and reliability for demanding workloads through a Google-built engine and multi-node cloud architecture. It also features AlloyDB AI natural language, which lets you build user-facing generative AI applications that query the database in natural language. For more info, check out the following documentation:
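If it helps to see what that looks like from application code, here is a hedged sketch. The `alloydb_ai_nl.get_sql` function, its named parameters, and the configuration id are assumptions to verify against the AlloyDB AI natural language documentation for your version, and the connection details are placeholders:

```python
# Hedged sketch of calling AlloyDB AI natural language from application code.
# The alloydb_ai_nl.get_sql call, its parameter names, and the config id are
# assumptions -- verify them against the documentation for your version.
import psycopg2

conn = psycopg2.connect(host="10.0.0.5", dbname="crops", user="app", password="...")  # placeholders
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT alloydb_ai_nl.get_sql(nl_config_id => %s, nl_question => %s)",
        ("crop_chatbot_config", "Which region had the highest yield last season?"),
    )
    print(cur.fetchone()[0])
```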
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.