From 85c8bbac521966350edb71ed35571c81c0e78742 Mon Sep 17 00:00:00 2001 From: wangzaistone Date: Sat, 4 Nov 2023 23:25:30 +0800 Subject: [PATCH] docs: update the evaluate method, give explanation about our metric --- README.md | 15 ++++++++++----- README.zh.md | 10 +++++++--- docs/eval_llm_result.md | 3 ++- 3 files changed, 19 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 2939eac..33b2468 100644 --- a/README.md +++ b/README.md @@ -51,11 +51,11 @@ DB-GPT-Hub is an experimental project utilizing LLMs (Large Language Models) to achieve Text-to-SQL parsing. The project primarily encompasses data collection, data preprocessing, model selection and building, and fine-tuning of weights. Through this series of processes, we aim to enhance Text-to-SQL capabilities while reducing the model training costs, allowing more developers to contribute to the improvement of Text-to-SQL accuracy. Our ultimate goal is to realize automated question-answering capabilities based on databases, enabling users to execute complex database queries through natural language descriptions. -So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project. +So far, we have successfully integrated multiple large models and established a complete workflow, including data processing, model SFT (Supervised Fine-Tuning) training, prediction output, and evaluation. The code is readily reusable within this project. -As of October 10, 2023, by fine-tuning an open-source model of 13 billion parameters using this project, **the execution accuracy on the Spider evaluation dataset has surpassed that of GPT-4!** -Part of the experimental results have been compiled into the [document](docs/eval_llm_result.md) in this project. 
By utilizing this project and combining more related data, the execution accuracy on the Spider evaluation set has already reached **0.825**. +As of October 10, 2023, we fine-tuned an open-source 13B model with this project, combined with more related data. Under a zero-shot prompt, the execution accuracy on the database (size 1.27G) from the Spider-based [test-suite](https://github.com/taoyds/test-suite-sql-eval) reaches **0.764**, while the execution accuracy on the database (size 95M) pointed to by the official Spider [website](https://yale-lily.github.io/spider) reaches **0.825**. + ## 2. Fine-tuning Text-to-SQL @@ -232,7 +232,8 @@ Run the following command: ```bash python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file ``` -You can find the results of our latest review and part of experiment results [here](docs/eval_llm_result.md) +You can find our latest evaluation and part of the experiment results [here](docs/eval_llm_result.md). +**Note**: The default code points to the 95M database downloaded from the [Spider official website](https://yale-lily.github.io/spider). If you need the Spider database (size 1.27G) from [test-suite](https://github.com/taoyds/test-suite-sql-eval), first download it from that link to a directory of your choice, then run the evaluation command above with the additional argument `--db Your_download_db_path`. ## 4. RoadMap @@ -263,7 +264,7 @@ The whole process we will divide into three phases: ## 5. Contributions -We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. Feel free to open issues or PRs and we'll actively respond.Before submitting the code, please format it using the black style. +We welcome more folks to participate and provide feedback in areas like datasets, model fine-tuning, performance evaluation, paper recommendations, code reproduction, etc. 
Feel free to open issues or PRs and we'll actively respond. Before submitting code, please format it with black by running `black .`. ## 6. Acknowledgements @@ -282,6 +283,10 @@ Our work is primarily based on the foundation of numerous open-source contributi * [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval) * [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) +Thanks to all contributors! + **20231104**: special thanks to @[JBoRu](https://github.com/JBoRu) for raising the [issue](https://github.com/eosphoros-ai/DB-GPT-Hub/issues/119) that reminded us to add this updated evaluation method. As the paper "SQL-PALM: Improved Large Language Model Adaptation for Text-to-SQL" notes: "We consider two commonly-used evaluation metrics: execution accuracy (EX) and test-suite accuracy (TS) [32]. EX measures whether SQL execution outcome matches ground truth (GT), whereas TS measures whether the SQL passes all EX evaluation for multiple tests, generated by database-augmentation. Since EX contains false positives, we consider TS as a more reliable evaluation metric." + + ## 7、Licence The MIT License (MIT) diff --git a/README.zh.md b/README.zh.md index 7cdf039..e18e096 100644 --- a/README.zh.md +++ b/README.zh.md @@ -50,8 +50,8 @@ DB-GPT-Hub是一个利用LLMs实现Text-to-SQL解析的实验项目,主要包含数据集收集、数据预处理、模型选择与构建和微调权重等步骤,通过这一系列的处理可以在提高Text-to-SQL能力的同时降低模型训练成本,让更多的开发者参与到Text-to-SQL的准确度提升工作当中,最终实现基于数据库的自动问答能力,让用户可以通过自然语言描述完成复杂数据库的查询操作等工作。 目前我们已经基于多个大模型打通从数据处理、模型SFT训练、预测输出和评估的整个流程,**代码在本项目中均可以直接复用**。 -截止20231010,我们利用本项目基于开源的13B大小的模型微调后,在Spider的评估集上的执行准确率,**已经超越GPT-4!** -部分实验结果已汇总到了本项目的相关[文档](docs/eval_llm_result.md) ,利用本项目结合更多相关数据在Spider评估集上的执行准确率已经可以达到**0.825**. 
+截止20231010,我们利用本项目基于开源的13B大小的模型微调,结合更多相关数据,在零样本提示下,基于Spider的[test-suite](https://github.com/taoyds/test-suite-sql-eval)中的数据库(大小1.27G)执行准确率可以达到**0.764**,基于Spider[官方网站](https://yale-lily.github.io/spider)指向的数据库(大小95M)的执行准确率为**0.825**。 +部分实验结果已汇总到了本项目的相关[文档](docs/eval_llm_result.md),可供参考。 ## 二、Text-to-SQL微调 @@ -219,6 +219,8 @@ sh ./dbgpt_hub/scripts/export_merge.sh python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file ``` 你可以在[这里](docs/eval_llm_result.md)找到我们最新的评估和实验结果。 +**注意**: 默认的代码中指向的数据库为从[Spider官方网站](https://yale-lily.github.io/spider)下载的大小为95M的数据库,如果你需要使用基于Spider的[test-suite](https://github.com/taoyds/test-suite-sql-eval)中的数据库(大小1.27G),请先下载链接中的数据库到自定义目录,并在上述评估命令中增加参数,形如`--db Your_download_db_path`。 + ## 四、发展路线 整个过程我们会分为三个阶段: @@ -248,7 +250,7 @@ python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file ## 五、贡献 -欢迎更多小伙伴在数据集、模型微调、效果评测、论文推荐与复现等方面参与和反馈,如提issues或者pr反馈,我们会积极给出回应。提交代码前请先将代码按black格式化。 +欢迎更多小伙伴在数据集、模型微调、效果评测、论文推荐与复现等方面参与和反馈,如提issues或者pr反馈,我们会积极给出回应。提交代码前请先用black将代码格式化,运行 `black .` 即可。 ## 六、感谢 @@ -267,6 +269,8 @@ python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file * [test-suite-sql-eval](https://github.com/taoyds/test-suite-sql-eval) * [LLaMa-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning) +非常感谢所有的contributors! + **20231104**,尤其感谢 @[JBoRu](https://github.com/JBoRu) 提的[issue](https://github.com/eosphoros-ai/DB-GPT-Hub/issues/119),指出我们之前按照官方网站95M数据库进行评估的方式的不足。如论文《SQL-PALM: IMPROVED LARGE LANGUAGE MODEL ADAPTATION FOR TEXT-TO-SQL》所指出的:"We consider two commonly-used evaluation metrics: execution accuracy (EX) and test-suite accuracy (TS) [32]. EX measures whether SQL execution outcome matches ground truth (GT), whereas TS measures whether the SQL passes all EX evaluation for multiple tests, generated by database-augmentation. 
Since EX contains false positives, we consider TS as a more reliable evaluation metric."。 ## 七、Licence The MIT License (MIT) diff --git a/docs/eval_llm_result.md b/docs/eval_llm_result.md index afc1340..2fdb99c 100644 --- a/docs/eval_llm_result.md +++ b/docs/eval_llm_result.md @@ -2,7 +2,8 @@ This doc aims to summarize the performance of publicly available big language models when evaluated on the spider dev dataset. We hope it will provide a point of reference for folks using these big models for Text-to-SQL tasks. We'll keep sharing eval results from models we've tested and seen others use, and very welcome any contributions to make this more comprehensive. -## 1.LLMs Text-to-SQL capability evaluation +## LLMs Text-to-SQL capability evaluation before 20231104 +The following execution-accuracy results of our experiments on Spider are based on the database downloaded from the Spider official [website](https://yale-lily.github.io/spider), which is only 95M in size. | name | Execution Accuracy | reference | | ------------------------------ | ------------------ | ---------------------------------------------------------------------------------- | | **GPT-4** | **0.762** | [numbersstation-eval-res](https://www.numbersstation.ai/post/nsql-llama-2-7b) |