This is the official repository for the paper "SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL", accepted to the ACL 2025 Main Conference.
In this work, we propose an assistant-based framework in which generator LLMs produce initial outputs and then perform self-correction guided by assistants. Our primary contribution, SHARE (SLM-based Hierarchical Action CorREction Assistant), orchestrates three specialized Small Language Models (SLMs), each under 8B parameters, in a sequential pipeline. Specifically, the Base Action Model (BAM) transforms raw SQL queries into action trajectories that capture reasoning paths; the Schema Augmentation Model (SAM) and the Logic Optimization Model (LOM) then perform orchestrated inference to rectify schema-related and logical errors, respectively, within those trajectories. SHARE improves error-detection precision and correction efficacy while reducing computational overhead compared with conventional LLM-based approaches. We also incorporate a novel hierarchical self-evolution strategy that enhances data efficiency during training.
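For intuition, the sequential correction flow can be sketched as follows; the function names, signatures, and the `share_correct` helper are illustrative placeholders rather than the repository's actual API:

```python
# Illustrative sketch of SHARE's sequential correction flow (hypothetical API,
# not the code in ./src/infer/share_models.py).
from typing import Callable

def share_correct(
    sql: str,
    question: str,
    schema: str,
    bam: Callable[[str, str, str], str],  # SQL -> action trajectory
    sam: Callable[[str, str, str], str],  # corrects schema-related actions
    lom: Callable[[str, str, str], str],  # corrects logical actions
) -> str:
    """Return a refined action trajectory used as a self-correction signal."""
    trajectory = bam(sql, question, schema)         # Base Action Model
    trajectory = sam(trajectory, question, schema)  # Schema Augmentation Model
    trajectory = lom(trajectory, question, schema)  # Logic Optimization Model
    return trajectory
```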
• Use the following commands to configure the local environment:
$ conda create -n share python=3.10
$ conda activate share
$ pip3 install -r requirements.txt
• Set environment variables for the Azure OpenAI API, or modify your own OpenAI configuration in ./src/call_models/call_apis.py (a minimal sketch of how these variables might be consumed is shown after the exports below):
export OPENAI_API_BASE="YOUR_OPENAI_API_BASE"
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
export ENGINE_ID="YOUR_ENGINE_ID"
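```python
# Minimal sketch (not the actual ./src/call_models/call_apis.py logic) of using
# the exported variables with the Azure client from the openai package.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["OPENAI_API_BASE"],
    api_key=os.environ["OPENAI_API_KEY"],
    api_version="2024-02-01",  # assumption: use the version your deployment supports
)

response = client.chat.completions.create(
    model=os.environ["ENGINE_ID"],  # Azure deployment name
    messages=[{"role": "user", "content": "Return the SQL keyword SELECT."}],
)
print(response.choices[0].message.content)
```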
• Set the environment variable that controls which GPU devices are visible to your program:
export CUDA_VISIBLE_DEVICES="[AVAILABLE_DEVICE_IDS]"
The BIRD dataset used in the paper can be downloaded directly from the BIRD Leaderboard. After downloading and unzipping it, place the contents into ./data/bird/train/ and ./data/bird/dev/.
The Spider dataset can be downloaded from the Spider Leaderboard. After downloading and extracting it, place the contents into: ./data/spider/
The column_meaning.json file for each dataset is stored under the corresponding path in ./data/. It is a JSON dictionary whose keys follow the structure {database_id}|{table_name}|{column_name}. Each value contains key information about the corresponding column, including summaries of its values derived from the raw CSV files. This file can be used directly during training and inference.
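For example, a column description can be looked up as follows; the file path and the database/table/column names are placeholders, so adjust them to your local layout:

```python
# Look up the description of a single column in column_meaning.json.
# The path and the example key below are placeholders.
import json

with open("./data/bird/dev/column_meaning.json", encoding="utf-8") as f:
    column_meaning = json.load(f)

key = "california_schools|schools|City"  # format: {database_id}|{table_name}|{column_name}
print(column_meaning.get(key, "No description available for this column."))
```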
For the expected file structure under each dataset directory, please refer to: ./configs/data_config.json.
To directly use the fine-tuned models in SHARE for inference, you can either load them via their model cards from Models on HuggingFace, or download the models and place them in the ./model/ directory for local loading.
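If you prefer to load a checkpoint programmatically rather than through the provided scripts, a minimal sketch with `transformers` looks like the following; the path ./model/bam is a placeholder for whichever model you downloaded:

```python
# Minimal sketch of loading one SHARE model with Hugging Face transformers.
# "./model/bam" is a placeholder; substitute the model card id or the local
# directory you actually downloaded.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./model/bam"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
```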
You can then run inference with the following command (you may need to adjust paths and parameters to your preference):
$ sh ./scripts/run_inference.sh
In this script, as described in Section 3.5 of the paper, we execute the SHARE workflow to generate a refined action trajectory for each instance. These refined trajectories then serve as self-correction signals, enabling the generator LLM to regenerate more accurate and contextually appropriate SQL queries. Inference outputs are stored in ./outputs/infer/.
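A quick way to inspect what was produced is sketched below; it assumes JSON output files, since the exact file names and record structure are not fixed here:

```python
# Convenience sketch for inspecting inference outputs; assumes JSON files,
# adjust to the actual output format produced by run_inference.sh.
import json
from pathlib import Path

for path in sorted(Path("./outputs/infer").glob("*.json")):
    with path.open(encoding="utf-8") as f:
        records = json.load(f)
    print(f"{path.name}: {len(records)} records")
```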
To train the three models used in SHARE from scratch, you can either use the corresponding processed training data available on HuggingFace Datasets, or generate the training data locally by running the provided data processing scripts. For example, to generate the training data for the BAM module, run:
$ sh ./scripts/get_bam_data.sh
All the generated data will be stored in ./ft_data/. We adopt LlamaFactory as the primary framework for model training. An example training script is provided in ./scripts/example_train.sh, which can be customized by modifying parameters and file paths to suit your specific needs.
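Before launching a training run, it can help to sanity-check the generated data; the file name and record layout below are assumptions (instruction-tuning style), so adapt them to what the data scripts actually produce:

```python
# Sanity-check a generated training file before fine-tuning with LlamaFactory.
# "bam_train.json" and the record layout are hypothetical placeholders.
import json

with open("./ft_data/bam_train.json", encoding="utf-8") as f:
    samples = json.load(f)

print(f"{len(samples)} training samples")
print(json.dumps(samples[0], indent=2, ensure_ascii=False)[:500])
```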
We follow the official evaluation protocols of BIRD and Spider. The corresponding evaluation scripts can be obtained from their official code repositories.
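For intuition, the core of execution-based evaluation compares the result sets of the predicted and gold SQL on the same database; the snippet below illustrates the idea only and is not a replacement for the official evaluators:

```python
# Illustration of execution-based matching (NOT the official BIRD/Spider
# evaluator): run predicted and gold SQL on the same SQLite database and
# compare their result sets.
import sqlite3

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = set(conn.execute(pred_sql).fetchall())
        gold_rows = set(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return False
    finally:
        conn.close()
    return pred_rows == gold_rows
```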
Updates on the performance of the latest advanced models equipped with SHARE-8B will be provided here. The numbers below are execution accuracies (%) on the BIRD dev and test sets.
| Model | Simple | Moderate | Challenging | Total |
|---|---|---|---|---|
| 🔥 Dev Set | ||||
| GPT-5 | 71.57 | 57.45 | 51.71 | 65.45 |
| 🔥 Test Set | ||||
| GPT-5 | 77.87 | 68.47 | 58.25 | 71.83 |
├─data/
| ├─bird/
| | ├─dev/ # data of BIRD dev databases
| | ├─train/ # data of BIRD train databases
| ├─spider/ # data of SPIDER
├─ft_data/
| ├─dataset_info.json # Dataset information for training data
├─LLaMa-Factory/
├─model/ # Directory for the fine-tuned models
├─outputs/
| ├─infer/ # Outputs generated during inference
| ├─train/ # Outputs generated during training data processing
├─scripts/
| ├─example_train.sh # Launches a sample training run using LlamaFactory
| ├─get_bam_data.sh # Preprocesses and prepares training data for the BAM module
| ├─get_sam_data.sh # Preprocesses and prepares training data for the SAM module
| ├─get_lom_data.sh # Preprocesses and prepares training data for the LOM module
| ├─run_inference.sh # Runs the SHARE inference workflow
├─src/
| ├─call_models/
| | ├─call_apis.py # OpenAI API call
| | ├─vllm_infer.py # Batch inference using vLLM
| ├─infer/
| | ├─run.py # Inference entry point
| | ├─share_models.py # The SHARE models (BAM, SAM, LOM)
| ├─prompts/
| | ├─for_infer.py # Prompt templates used during inference
| | ├─for_train.py # Prompt templates used during training data construction
| ├─train/
| | ├─prepare_data_for_bam.py # Constructs the training data for BAM
| | ├─prepare_data_for_sam.py # Constructs the training data for SAM
| | ├─prepare_data_for_lom.py # Constructs the training data for LOM
| | ├─sql2sr_for_bam.py # Leverages BAM to synthesize data for subsequent model training
| | ├─sql2sr_for_teacher.py # Leverages a strong teacher model to synthesize data for BAM training
| | ├─sr_validation.py # Validates the synthesized trajectories
| | ├─sr2sql_convertion.py # Converts trajectories generated by the teacher model back to SQL queries for validation
| ├─utils.py
├─README.md
├─requirements.txt
Please cite our paper if you find this work helpful:
@inproceedings{qu-etal-2025-share,
title = "{SHARE}: An {SLM}-based Hierarchical Action {C}or{RE}ction Assistant for Text-to-{SQL}",
author = "Qu, Ge and
Li, Jinyang and
Qin, Bowen and
Li, Xiaolong and
Huo, Nan and
Ma, Chenhao and
Cheng, Reynold",
year = "2025",
publisher = "Association for Computational Linguistics"
}
