From 85c8bbac521966350edb71ed35571c81c0e78742 Mon Sep 17 00:00:00 2001 From: wangzaistone Date: Sat, 4 Nov 2023 23:25:30 +0800 Subject: [PATCH] docs: update the evaluate method, give explanation about our metric --- README.md | 15 ++++++++++----- README.zh.md | 10 +++++++--- docs/eval_llm_result.md | 3 ++- 3 files changed, 19 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 2939eac..33b2468 100644 --- a/README.md +++ b/README.md @@ -51,11 +51,11 @@ DB-GPT-Hub is an experimental project utilizing LLMs (Large Language Models) to achieve Text-to-SQL parsing. The project primarily encompasses data collection, data preprocessing, model selection and building, and fine-tuning of weights. Through this series of processes, we aim to enhance Text-to-SQL capabilities while reducing the model training costs, allowing more developers to contribute to the improvement of Text-to-SQL accuracy. Our ultimate goal is to realize automated question-answering capabilities based on databases, enabling users to execute complex database queries through natural language descriptions. -So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project. +So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project. -As of October 10, 2023, by fine-tuning an open-source model of 13 billion parameters using this project, **the execution accuracy on the Spider evaluation dataset has surpassed that of GPT-4!** -Part of the experimental results have been compiled into the [document](docs/eval_llm_result.md) in this project. 
By utilizing this project and combining more related data, the execution accuracy on the Spider evaluation set has already reached **0.825**. +As of October 10, 2023, we fine-tuned an open-source 13B model with this project, combined with more related data. Under a zero-shot prompt, the execution accuracy on the database (size 1.27G) from the Spider-based [test-suite](https://github.com/taoyds/test-suite-sql-eval) reaches **0.764**, while the execution accuracy on the database (size 95M) pointed to by the official Spider [website](https://yale-lily.github.io/spider) reaches **0.825**. + ## 2. Fine-tuning Text-to-SQL @@ -232,7 +232,8 @@ Run the following command: ```bash python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file ``` -You can find the results of our latest review and part of experiment results [here](docs/eval_llm_result.md) +You can find our latest evaluation and part of the experiment results [here](docs/eval_llm_result.md). +**Note**: The default code points to the 95M database downloaded from the [Spider official website](https://yale-lily.github.io/spider). If you need the Spider database (size 1.27G) from [test-suite](https://github.com/taoyds/test-suite-sql-eval), first download it from that link to a directory of your choice, then run the evaluation command above with the additional argument `--db Your_download_db_path`. ## 4. RoadMap @@ -263,7 +264,7 @@ The whole process we will divide into three phases: ## 5. Contributions -We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. Feel free to open issues or PRs and we'll actively respond.Before submitting the code, please format it using the black style. +We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. 
Feel free to open issues or PRs and we'll actively respond. Before submitting code, please format it with black by running `black .`. ## 6. Acknowledgements @@ -282,6 +283,10 @@ Our work is primarily based on the foundation of numerous open-source contributi * [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval) * [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) +Thanks to all contributors! + **20231104**: special thanks to @[JBoRu](https://github.com/JBoRu) for raising the [issue](https://github.com/eosphoros-ai/DB-GPT-Hub/issues/119) that reminded us to add this updated evaluation method. As the paper "SQL-PALM: Improved Large Language Model Adaptation for Text-to-SQL" notes: "We consider two commonly-used evaluation metrics: execution accuracy (EX) and test-suite accuracy (TS) [32]. EX measures whether SQL execution outcome matches ground truth (GT), whereas TS measures whether the SQL passes all EX evaluation for multiple tests, generated by database-augmentation. Since EX contains false positives, we consider TS as a more reliable evaluation metric." + + ## 7、Licence The MIT License (MIT) diff --git a/README.zh.md b/README.zh.md index 7cdf039..e18e096 100644 --- a/README.zh.md +++ b/README.zh.md @@ -50,8 +50,8 @@ DB-GPT-Hub是一个利用LLMs实现Text-to-SQL解析的实验项目,主要包含数据集收集、数据预处理、模型选择与构建和微调权重等步骤,通过这一系列的处理可以在提高Text-to-SQL能力的同时降低模型训练成本,让更多的开发者参与到Text-to-SQL的准确度提升工作当中,最终实现基于数据库的自动问答能力,让用户可以通过自然语言描述完成复杂数据库的查询操作等工作。 目前我们已经基于多个大模型打通从数据处理、模型SFT训练、预测输出和评估的整个流程,**代码在本项目中均可以直接复用**。 -截止20231010,我们利用本项目基于开源的13B大小的模型微调后,在Spider的评估集上的执行准确率,**已经超越GPT-4!** -部分实验结果已汇总到了本项目的相关[文档](docs/eval_llm_result.md) ,利用本项目结合更多相关数据在Spider评估集上的执行准确率已经可以达到**0.825**. 
+截止20231010,我们利用本项目基于开源的13B大小的模型微调,结合更多相关数据,在零样本提示下,基于Spider的[test-suite](https://github.com/taoyds/test-suite-sql-eval)中的数据库(大小1.27G)执行准确率可以达到**0.764**,基于Spider[官方网站](https://yale-lily.github.io/spider)指向的数据库(大小95M)的执行准确率为**0.825**。 +部分实验结果已汇总到了本项目的相关[文档](docs/eval_llm_result.md),可供参考。 ## 二、Text-to-SQL微调 @@ -219,6 +219,8 @@ sh ./dbgpt_hub/scripts/export_merge.sh python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file ``` 你可以在[这里](docs/eval_llm_result.md)找到我们最新的评估和实验结果。 +**注意**: 默认的代码中指向的数据库为从[Spider官方网站](https://yale-lily.github.io/spider)下载的大小为95M的数据库,如果你需要使用基于Spider的[test-suite](https://github.com/taoyds/test-suite-sql-eval)中的数据库(大小1.27G),请先下载链接中的数据库到自定义目录,并在上述评估命令中增加参数,形如`--db Your_download_db_path`。 + ## 四、发展路线 整个过程我们会分为三个阶段: @@ -248,7 +250,7 @@ python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file ## 五、贡献 -欢迎更多小伙伴在数据集、模型微调、效果评测、论文推荐与复现等方面参与和反馈,如提issues或者pr反馈,我们会积极给出回应。提交代码前请先将代码按black格式化。 +欢迎更多小伙伴在数据集、模型微调、效果评测、论文推荐与复现等方面参与和反馈,如提issues或者pr反馈,我们会积极给出回应。提交代码前请先用black将代码格式化,运行 `black .` 即可。 ## 六、感谢 @@ -267,6 +269,8 @@ python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file * [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval) * [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) +非常感谢所有的contributors! + **20231104**,尤其感谢 @[JBoRu](https://github.com/JBoRu) 提的[issue](https://github.com/eosphoros-ai/DB-GPT-Hub/issues/119),指出我们之前按照官方网站95M数据库进行评估的方式的不足。如论文《SQL-PALM: IMPROVED LARGE LANGUAGE MODEL ADAPTATION FOR TEXT-TO-SQL》所指出的:"We consider two commonly-used evaluation metrics: execution accuracy (EX) and test-suite accuracy (TS) [32]. EX measures whether SQL execution outcome matches ground truth (GT), whereas TS measures whether the SQL passes all EX evaluation for multiple tests, generated by database-augmentation. 
Since EX contains false positives, we consider TS as a more reliable evaluation metric."。 ## 七、Licence The MIT License (MIT) diff --git a/docs/eval_llm_result.md b/docs/eval_llm_result.md index afc1340..2fdb99c 100644 --- a/docs/eval_llm_result.md +++ b/docs/eval_llm_result.md @@ -2,7 +2,8 @@ This doc aims to summarize the performance of publicly available big language models when evaluated on the spider dev dataset. We hope it will provide a point of reference for folks using these big models for Text-to-SQL tasks. We'll keep sharing eval results from models we've tested and seen others use, and very welcome any contributions to make this more comprehensive. -## 1.LLMs Text-to-SQL capability evaluation +## LLMs Text-to-SQL capability evaluation before 20231104 +The following execution-accuracy results of our experiments on Spider are based on the database downloaded from the Spider official [website](https://yale-lily.github.io/spider), which is only 95M in size. | name | Execution Accuracy | reference | | ------------------------------ | ------------------ | ---------------------------------------------------------------------------------- | | **GPT-4** | **0.762** | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |