Skip to content

Commit

Permalink
Merge pull request #121 from eosphoros-ai/refactor
Browse files Browse the repository at this point in the history
docs , update the evaluate method ,give  explanation about our metric
  • Loading branch information
csunny authored Nov 5, 2023
2 parents fe90150 + 85c8bba commit 180220b
Show file tree
Hide file tree
Showing 3 changed files with 19 additions and 9 deletions.
15 changes: 10 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,11 +51,11 @@

DB-GPT-Hub is an experimental project utilizing LLMs (Large Language Models) to achieve Text-to-SQL parsing. The project primarily encompasses data collection, data preprocessing, model selection and building, and fine-tuning of weights. Through this series of processes, we aim to enhance Text-to-SQL capabilities while reducing the model training costs, allowing more developers to contribute to the improvement of Text-to-SQL accuracy. Our ultimate goal is to realize automated question-answering capabilities based on databases, enabling users to execute complex database queries through natural language descriptions.

So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project.
So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project.

As of October 10, 2023, by fine-tuning an open-source model of 13 billion parameters using this project, **the execution accuracy on the Spider evaluation dataset has surpassed that of GPT-4!**

Part of the experimental results have been compiled into the [document](docs/eval_llm_result.md) in this project. By utilizing this project and combining more related data, the execution accuracy on the Spider evaluation set has already reached **0.825**.
As of 20231010, we used this project to fine-tune the open source 13B size model, combined with more relevant data, and under the zero-shot prompt, Spider-based [test-suite](https://github.com/taoyds/test-suite -sql-eval), the execution accuracy of the database (size-1.27G) can reach **0.764**, and the execution accuracy of the database (size-95M) pointed to by the Spider official [website](https://yale-lily.github.io/spider) is 0.825.


## 2. Fine-tuning Text-to-SQL

Expand Down Expand Up @@ -232,7 +232,8 @@ Run the following command:
```bash
python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
```
You can find the results of our latest review and part of experiment results [here](docs/eval_llm_result.md)
You can find the results of our latest review and part of experiment results [here](docs/eval_llm_result.md)
**Note**: The database pointed to by the default code is a 95M database downloaded from [Spider official website] (https://yale-lily.github.io/spider). If you need to use Spider database (size 1.27G) in [test-suite](https://github.com/taoyds/test-suite-sql-eval), please download the database in the link to the custom directory first, and run the above evaluation command which add parameters and values ​​like `--db Your_download_db_path`.

## 4. RoadMap

Expand Down Expand Up @@ -263,7 +264,7 @@ The whole process we will divide into three phases:

## 5. Contributions

We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. Feel free to open issues or PRs and we'll actively respond.Before submitting the code, please format it using the black style.
We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. Feel free to open issues or PRs and we'll actively respond.Before submitting the code, please format it using the black style in command `black .`.

## 6. Acknowledgements

Expand All @@ -282,6 +283,10 @@ Our work is primarily based on the foundation of numerous open-source contributi
* [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval)
* [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)

Thanks for all contributors !
**20231104** ,especially thanks a lot for @[JBoRu](https://github.com/JBoRu) raised the [issue](https://github.com/eosphoros-ai/DB-GPT-Hub/issues/119) which remind us the add the updated evaluate way,like the paper 《SQL-PALM: IMPROVED LARGE LANGUAGE MODEL ADAPTATION FOR TEXT-TO-SQL》 mentioned, "We consider two commonly-used evaluation metrics: execution accuracy (EX) and test-suite accuracy (TS) [32]. EX measures whether SQL execution outcome matches ground truth (GT), whereas TS measures whether the SQL passes all EX evaluation for multiple tests, generated by database-augmentation. Since EX contains false positives, we consider TS as a more reliable evaluation metric" .


## 7、Licence

The MIT License (MIT)
Expand Down
10 changes: 7 additions & 3 deletions README.zh.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,8 +50,8 @@

DB-GPT-Hub是一个利用LLMs实现Text-to-SQL解析的实验项目,主要包含数据集收集、数据预处理、模型选择与构建和微调权重等步骤,通过这一系列的处理可以在提高Text-to-SQL能力的同时降低模型训练成本,让更多的开发者参与到Text-to-SQL的准确度提升工作当中,最终实现基于数据库的自动问答能力,让用户可以通过自然语言描述完成复杂数据库的查询操作等工作。
目前我们已经基于多个大模型打通从数据处理、模型SFT训练、预测输出和评估的整个流程,**代码在本项目中均可以直接复用**
截止20231010,我们利用本项目基于开源的13B大小的模型微调后,在Spider的评估集上的执行准确率,**已经超越GPT-4!**
部分实验结果已汇总到了本项目的相关[文档](docs/eval_llm_result.md)利用本项目结合更多相关数据在Spider评估集上的执行准确率已经可以达到**0.825**.
截止20231010,我们利用本项目基于开源的13B大小的模型微调,结合更多相关数据,在零样本提示下,基于Spider的[test-suite](https://github.com/taoyds/test-suite-sql-eval)中的数据库(大小1.27G)执行准确率可以达到**0.764**,基于Spider[官方网站](https://yale-lily.github.io/spider)指向的数据库(大小95M)的执行准确率为0.825。
部分实验结果已汇总到了本项目的相关[文档](docs/eval_llm_result.md)可供参考。

## 二、Text-to-SQL微调

Expand Down Expand Up @@ -219,6 +219,8 @@ sh ./dbgpt_hub/scripts/export_merge.sh
python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
```
你可以在[这里](docs/eval_llm_result.md)找到我们最新的评估和实验结果。
**注意**: 默认的代码中指向的数据库为从[Spider官方网站](https://yale-lily.github.io/spider)下载的大小为95M的database,如果你需要使用基于Spider的[test-suite](https://github.com/taoyds/test-suite-sql-eval)中的数据库(大小1.27G),请先下载链接中的数据库到自定义目录,并在上述评估命令中增加参数和值,形如`--db Your_download_db_path`

## 四、发展路线
整个过程我们会分为三个阶段:

Expand Down Expand Up @@ -248,7 +250,7 @@ python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file

## 五、贡献

欢迎更多小伙伴在数据集、模型微调、效果评测、论文推荐与复现等方面参与和反馈,如提issues或者pr反馈,我们会积极给出回应。提交代码前请先将代码按black格式化。
欢迎更多小伙伴在数据集、模型微调、效果评测、论文推荐与复现等方面参与和反馈,如提issues或者pr反馈,我们会积极给出回应。提交代码前请先将代码按black格式化,运行下`black .`

## 六、感谢

Expand All @@ -267,6 +269,8 @@ python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
* [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval)
* [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)

非常感谢所有的contributors!
**20231104** ,尤其感谢 @[JBoRu](https://github.com/JBoRu) 提的[issue](https://github.com/eosphoros-ai/DB-GPT-Hub/issues/119), 指出我们的之前按照官方网站的95M的数据库去评估的方式的不足,如论文《SQL-PALM: IMPROVED LARGE LANGUAGE MODEL ADAPTATION FOR TEXT-TO-SQL》 指出的 "We consider two commonly-used evaluation metrics: execution accuracy (EX) and test-suite accuracy (TS) [32]. EX measures whether SQL execution outcome matches ground truth (GT), whereas TS measures whether the SQL passes all EX evaluation for multiple tests, generated by database-augmentation. Since EX contains false positives, we consider TS as a more reliable evaluation metric" 。

## 七、Licence

Expand Down
3 changes: 2 additions & 1 deletion docs/eval_llm_result.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,8 @@

This doc aims to summarize the performance of publicly available big language models when evaluated on the spider dev dataset. We hope it will provide a point of reference for folks using these big models for Text-to-SQL tasks. We'll keep sharing eval results from models we've tested and seen others use, and very welcome any contributions to make this more comprehensive.

## 1.LLMs Text-to-SQL capability evaluation
## LLMs Text-to-SQL capability evaluation before 20231104
the follow our experiment execution accuracy of Spider is base on the database which is download from the Spider official [website](https://yale-lily.github.io/spider) ,size only 95M.
| name | Execution Accuracy | reference |
| ------------------------------ | ------------------ | ---------------------------------------------------------------------------------- |
| **GPT-4** | **0.762** | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |
Expand Down

0 comments on commit 180220b

Please sign in to comment.