
SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL

Data Link · Leaderboard · Python 3.10 · vllm 0.9.1 · openai 1.93.0

This is the official repository for the paper "SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL", accepted to the ACL 2025 Main Conference.

✨ Overview

In this work, we propose an assistant-based framework in which generator LLMs create initial outputs and then perform self-correction guided by assistants. Our primary contribution, SHARE (SLM-based Hierarchical Action CorREction Assistant), orchestrates three specialized Small Language Models (SLMs), each under 8B parameters, in a sequential pipeline. Specifically, the Base Action Model (BAM) transforms raw SQL queries into action trajectories that capture reasoning paths; the Schema Augmentation Model (SAM) and the Logic Optimization Model (LOM) then perform orchestrated inference to rectify schema-related and logical errors, respectively, within those trajectories. SHARE improves error-detection precision and correction efficacy while reducing computational overhead compared to conventional LLM-based approaches. We also incorporate a novel hierarchical self-evolution strategy that enhances data efficiency during training.
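
Conceptually, the pipeline composes the three SLMs in sequence. The following Python sketch illustrates the data flow only; the function names and stub bodies are ours, not the repository's (the actual implementation lives in ./src/infer/share_models.py):

# Conceptual sketch of the SHARE pipeline; the three model calls are
# stubbed out here, while the real fine-tuned SLMs are loaded in
# ./src/infer/share_models.py.

def bam(sql: str) -> str:
    """Base Action Model: turn a raw SQL query into an action trajectory."""
    return f"[action trajectory derived from: {sql}]"  # stub

def sam(trajectory: str, schema: str) -> str:
    """Schema Augmentation Model: rectify schema-related errors."""
    return trajectory  # stub

def lom(trajectory: str, question: str) -> str:
    """Logic Optimization Model: rectify logical errors."""
    return trajectory  # stub

def share_correct(sql: str, schema: str, question: str) -> str:
    # Sequential pipeline: BAM -> SAM -> LOM; the refined trajectory is then
    # fed back to the generator LLM as a self-correction signal.
    return lom(sam(bam(sql), schema), question)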

✨ Environment Setup

• Use the following commands to configure the local environment:

 $ conda create -n share python=3.10
 $ conda activate share
 $ pip3 install -r requirements.txt

• Set environment variables for the Azure OpenAI API, or modify your own OpenAI configuration in ./src/call_models/call_apis.py:

export OPENAI_API_BASE="YOUR_OPENAI_API_BASE"
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
export ENGINE_ID="YOUR_ENGINE_ID"
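
For reference, here is a minimal sketch of how these variables could be consumed with the openai 1.x Azure client; the actual logic in ./src/call_models/call_apis.py may differ, and the api_version value below is an assumption:

import os
from openai import AzureOpenAI

# Assumption: an Azure-style client; adapt if you use the plain OpenAI API.
client = AzureOpenAI(
    azure_endpoint=os.environ["OPENAI_API_BASE"],
    api_key=os.environ["OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption: use the version your deployment supports
)

response = client.chat.completions.create(
    model=os.environ["ENGINE_ID"],  # the Azure deployment name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)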

• Set an environment variable to specify which GPU devices are visible to your program (e.g., "0,1"):

export CUDA_VISIBLE_DEVICES="[AVAILABLE_DEVICE_IDS]"

✨ Data Preparation

📍 Bird Dataset

The BIRD dataset used in the paper can be downloaded directly from the BIRD Leaderboard. After downloading and unzipping, place the contents into the following directories: ./data/bird/train/ and ./data/bird/dev/.

📍 Spider Dataset

The Spider dataset can be downloaded from the Spider Leaderboard. After downloading and extracting it, place the contents into ./data/spider/.

📍 Column Meaning JSON

The column_meaning.json file for each dataset is stored under the corresponding path in ./data/. It is a JSON object (loaded as a Python dictionary) whose keys follow the structure {database_id}|{table_name}|{column_name}; each value holds key information about the corresponding column, including summaries of its values derived from the raw CSV files. This file can be used directly during training and inference. A minimal loading sketch is shown below.
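
The path and example key in this sketch are hypothetical; they only illustrate the key format described above:

import json

# Path is an assumption; see ./configs/data_config.json for the expected layout.
with open("./data/bird/dev/column_meaning.json") as f:
    column_meaning = json.load(f)

# Hypothetical key following the {database_id}|{table_name}|{column_name} format.
key = "california_schools|schools|County"
print(column_meaning.get(key, "key not found"))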

For the expected file structure under each dataset directory, please refer to: ./configs/data_config.json.

✨ Infer Results

To run inference directly with the fine-tuned SHARE models, you can either load them via their model cards under Models on HuggingFace, or download the models and place them in the ./model/ directory for local loading.

You can then run inference with the following command (adjust paths and parameters to your preference):

$ sh ./scripts/run_inference.sh

In this script, as described in Section 3.5, we execute the SHARE workflow to generate refined action trajectories for each instance. These refined trajectories then serve as self-correction signals, enabling the language model to regenerate more accurate and contextually appropriate SQL queries. Outputs during inference are stored in ./outputs/infer/.
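
If you want to query one of the SLMs directly outside run_inference.sh, a minimal vLLM sketch follows; the model path and prompt are placeholders, and the batched logic in ./src/call_models/vllm_infer.py may differ:

from vllm import LLM, SamplingParams

# "./model/bam" is a hypothetical local path to a downloaded SHARE model.
llm = LLM(model="./model/bam")
params = SamplingParams(temperature=0.0, max_tokens=512)

# Placeholder prompt; the real prompts are built in ./src/prompts/for_infer.py.
outputs = llm.generate(["<BAM prompt: raw SQL + schema>"], params)
print(outputs[0].outputs[0].text)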

✨ Train From Scratch

To train the three models used in SHARE from scratch, you can either use the corresponding processed training data available on HuggingFace Datasets, or generate the training data locally by running the provided data-processing scripts. For example, to generate the training data for the BAM module, run:

$ sh ./scripts/get_bam_data.sh

All generated data is stored in ./ft_data/. We adopt LlamaFactory as the primary framework for model training. An example training script is provided in ./scripts/example_train.sh; customize its parameters and file paths to suit your needs. A sketch of how locally generated data can be registered for LlamaFactory follows.
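
The record fields and file name in this sketch are illustrative, not the actual SHARE training data; consult ./ft_data/dataset_info.json for the entries the scripts actually produce:

import json

# Alpaca-style record, a format LlamaFactory commonly consumes.
records = [{
    "instruction": "Convert the SQL query into an action trajectory.",  # hypothetical
    "input": "SELECT name FROM schools WHERE county = 'Alameda';",
    "output": "Action 1: ...",
}]
with open("./ft_data/bam_train.json", "w") as f:  # hypothetical file name
    json.dump(records, f, indent=2)

# Register the dataset so a training script can reference it by name.
with open("./ft_data/dataset_info.json") as f:
    info = json.load(f)
info["bam_train"] = {"file_name": "bam_train.json"}
with open("./ft_data/dataset_info.json", "w") as f:
    json.dump(info, f, indent=2)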

✨ Evaluation

We follow the official evaluation methods proposed by BIRD and Spider. The corresponding evaluation scripts can be obtained from their official code repositories.

Model Performance Results with SHARE on BIRD

We will post updates here on the performance of the latest advanced models when paired with SHARE-8B.

Model     Simple   Moderate   Challenging   Total
🔥 Dev Set
GPT-5     71.57    57.45      51.71         65.45
🔥 Test Set
GPT-5     77.87    68.47      58.25         71.83

✨ Project Structure

├─data/
|  ├─bird/ 
|  |  ├─dev/ # data of BIRD dev databases
|  |  ├─train/ # data of BIRD train databases
|  ├─spider/ # data of SPIDER
├─ft_data/
|  ├─dataset_info.json  # Dataset information for training data
├─LLaMa-Factory/
├─model/ # Directory for the fine-tuned models
├─outputs/ 
|  ├─infer/  # Outputs generated during inference
|  ├─train/  # Outputs generated during training data processing
├─scripts/
|  ├─example_train.sh   # Launches a sample training run using LlamaFactory
|  ├─get_bam_data.sh    # Prepares training data for the BAM module
|  ├─get_sam_data.sh    # Prepares training data for the SAM module
|  ├─get_lom_data.sh    # Prepares training data for the LOM module
|  ├─run_inference.sh   # Runs the SHARE inference workflow
├─src/
|  ├─call_models/
|  |  ├─call_apis.py     # OpenAI API calls
|  |  ├─vllm_infer.py    # Batch inference using vLLM
|  ├─infer/
|  |  ├─run.py           # Entry point for inference
|  |  ├─share_models.py  # Models of SHARE
|  ├─prompts/
|  |  ├─for_infer.py     # Prompt templates for inference
|  |  ├─for_train.py     # Prompt templates for training data construction
|  ├─train/
|  |  ├─prepare_data_for_bam.py  # Constructs the training data for BAM
|  |  ├─prepare_data_for_sam.py  # Constructs the training data for SAM
|  |  ├─prepare_data_for_lom.py  # Constructs the training data for LOM
|  |  ├─sql2sr_for_bam.py        # Leverages BAM to synthesize data for subsequent model training
|  |  ├─sql2sr_for_teacher.py    # Leverages a strong teacher model to synthesize data for BAM training
|  |  ├─sr_validation.py         # Validates the synthesized trajectories
|  |  ├─sr2sql_convertion.py     # Converts teacher-generated trajectories back to SQL queries for validation
|  ├─utils.py            # Shared utility functions
├─README.md
├─requirements.txt

✨ Citation

Please cite our paper if you find this work helpful.

@inproceedings{qu-etal-2025-share,
    title = "{SHARE}: An {SLM}-based Hierarchical Action {C}or{RE}ction Assistant for Text-to-{SQL}",
    author = "Qu, Ge  and
      Li, Jinyang  and
      Qin, Bowen  and
      Li, Xiaolong  and
      Huo, Nan  and
      Ma, Chenhao  and
      Cheng, Reynold",
    year = "2025",
    publisher = "Association for Computational Linguistics"
}
