diff --git a/README.md b/README.md index c651bb0..70206f6 100644 --- a/README.md +++ b/README.md @@ -4,27 +4,21 @@ ## 1. What is DB-GPT-Hub -Text-to-SQL is a very important research direction in the direction of database development, with a very strong business requirement. Currently, various teams and companies are actively involved in the research work in this direction. However, in the actual progress, some of the main problems encountered are as follows: - -* They have a certain amount of SQL data corpus in their hands but are weak in the logic of the underlying algorithm of LLM. -* Algorithm students who are familiar with the underlying algorithm and framework of LLM have some depth of algorithm, but lack of understanding and depth of database, and lack of SQL corpus, Text2SQL fine-tuning understanding is not enough. -* There are relatively few integrated teams with both database and algorithm theory, leading to a high barrier to entry in this field. - -DB-GPT-Hub is an experimental project to implement Text-to-SQL parsing using LLMs, which mainly includes steps of dataset collection, data pre-processing, model selection and construction, and fine-tuning weights, etc. Through this series of processing, we can reduce the model training cost while improving Text-to-SQL capability, and allow more students to participate in Text-to This series of processing can improve the Text-to-SQL capability while reducing the training cost of the model, allowing more students to participate in the work of improving the accuracy of Text-to-SQL, and finally realizing the automatic database based question and answer capability, allowing users to complete complex database query operations through natural language descriptions, etc. +DB-GPT-Hub is an experimental project that implements Text-to-SQL parsing with LLMs, covering dataset collection, data pre-processing, model selection and construction, and fine-tuning of model weights.
Through this pipeline we can improve Text-to-SQL capability while reducing model training cost, allowing more developers to participate in improving Text-to-SQL accuracy, and ultimately enabling automatic question answering over databases, so that users can complete complex database query operations through natural language descriptions. ## 2. Fine-tuning Text-to-SQL -Large Language Models (LLMs) have achieved impressive results in existing benchmark tests of Text-to-SQL. However, these models remain challenging in the face of large databases and noisy content, and the mysteries behind the huge database values need external knowledge and reasoning to be revealed. We enhance Text-to-SQL based on a large language model sustained SFT +Large Language Models (LLMs) have achieved impressive results on existing Text-to-SQL benchmarks. However, these models still struggle with large databases and noisy content, and unlocking the knowledge hidden in huge database values requires external knowledge and reasoning. We enhance Text-to-SQL through continuous supervised fine-tuning (SFT) of large language models. ### 2.1. Dataset -The following publicly available text2sql datasets were used for this project: +The following publicly available Text-to-SQL datasets are used for this project: - [WikiSQL:](https://github.com/salesforce/WikiSQL) A large semantic parsing dataset consisting of 80,654 natural language questions and SQL annotations over 24,241 tables. Each query in WikiSQL is limited to a single table and does not include complex operations such as sorting, grouping, subqueries, etc. - [SPIDER](https://yale-lily.github.io/spider): A complex cross-domain text2sql dataset containing 10,181 natural language queries and 5,693 unique SQL queries distributed across 200 separate databases, covering 138 different domains.
- [CHASE](https://xjtu-intsoft.github.io/chase/): A cross-domain, multi-turn interactive Chinese text2sql dataset containing 5,459 multi-turn question sequences (17,940 question-SQL pairs) across 280 databases from different domains. -- [BIRD-SQL:](https://bird-bench.github.io/) dataset is a large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 occupational domains. The BIRD-SQL dataset bridges the gap between text-to-SQL research and real-world applications by exploring three additional challenges, namely dealing with large and messy database values, external knowledge inference and optimising SQL execution efficiency. +- [BIRD-SQL:](https://bird-bench.github.io/) A large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 occupational domains. The BIRD-SQL dataset bridges the gap between text-to-SQL research and real-world applications by exploring three additional challenges, namely dealing with large and messy database values, external knowledge inference and optimising SQL execution efficiency.
+- [CoSQL:](https://yale-lily.github.io/cosql) A corpus for building cross-domain conversational text-to-SQL systems. It is a conversational version of the Spider and SParC tasks. CoSQL consists of 30k+ dialogue turns and 10k+ annotated SQL queries, collected in a Wizard-of-Oz setting from 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise provide information. ### 2.2. Model @@ -142,7 +136,7 @@ SQL_PROMPT_DICT = { ### 3.3. Model fine-tuning -Model fine-tuning uses the qlora method, where we can run the following command to fine-tune the model: +Model fine-tuning uses the QLoRA method. Run the following command to fine-tune the model: ```bash python src/train/train_qlora.py --model_name_or_path @@ -158,7 +152,7 @@ Run the following command to generate the final merged model: python src/utils/merge_peft_adapters.py --base_model_name_or_path ``` -## 4. The development path +## 4. 
Roadmap We will divide the whole process into three phases: diff --git a/README.zh.md b/README.zh.md index db189a9..fd71292 100644 --- a/README.zh.md +++ b/README.zh.md @@ -4,13 +4,7 @@ ## 一、什么是DB-GPT-Hub -当前Text-to-SQL是大语言模型在围绕数据库发展方向上一个非常重要的研究方向, 有非常强大的业务诉求。当前在各个团队、公司也在积极参与到此方向的研究工作中。但是在实际的进展中,主要遇到以下一些问题: - -* 对SQL更理解的是DBA、BI、业务运营等同学, 他们手里有一定的SQL数据语料但是在LLM底层算法逻辑上比较薄弱, 在困难突破与攻克方面, 需要不断去理解底层算法原理与微调手段来提升模型效果。 -* 对LLM底层算法与框架熟悉的算法同学,具备一定的算法深度,但是缺少数据库这个领域的理解与深度,同时也缺少SQL语料,Text2SQL的微调理解不够。 -* 同时具备数据库和算法理论的综合团队相对较少,导致该领域存在较高的进入门槛。 - -DB-GPT-Hub是一个利用LLMs实现Text-to-SQL解析的实验项目,主要包含数据集收集、数据预处理、模型选择与构建和微调权重等步骤,通过这一系列的处理可以在提高Text-to-SQL能力的同时降低模型训练成本,让更多的同学参与到Text-to-SQL的准确度提升工作当中,最终实现基于数据库的自动问答能力,让用户可以通过自然语言描述完成复杂数据库的查询操作等工作。 +DB-GPT-Hub是一个利用LLMs实现Text-to-SQL解析的实验项目,主要包含数据集收集、数据预处理、模型选择与构建和微调权重等步骤,通过这一系列的处理可以在提高Text-to-SQL能力的同时降低模型训练成本,让更多的开发者参与到Text-to-SQL的准确度提升工作当中,最终实现基于数据库的自动问答能力,让用户可以通过自然语言描述完成复杂数据库的查询操作等工作。 ## 二、Text-to-SQL微调
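A hunk header in the patch above references a `SQL_PROMPT_DICT` used during data pre-processing. For reviewers unfamiliar with this pattern, here is a minimal sketch of an Alpaca-style prompt dictionary for Text-to-SQL training data. The template wording and field names (`instruction`, `input`) are illustrative assumptions, not copied from DB-GPT-Hub's actual source:

```python
# Hypothetical Alpaca-style prompt templates for Text-to-SQL SFT data.
# Examples with schema context use "prompt_input"; examples without use
# "prompt_no_input". Field names are illustrative, not DB-GPT-Hub's own.
SQL_PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a Text-to-SQL task, paired "
        "with an input that provides the database schema. Write a SQL query "
        "that completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n"
        "### Input:\n{input}\n\n"
        "### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a Text-to-SQL task. "
        "Write a SQL query that completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n"
        "### Response:"
    ),
}


def build_prompt(example: dict) -> str:
    """Pick a template based on whether the example carries schema context."""
    if example.get("input"):
        return SQL_PROMPT_DICT["prompt_input"].format(**example)
    return SQL_PROMPT_DICT["prompt_no_input"].format(**example)


prompt = build_prompt({
    "instruction": "List the names of all singers older than 30.",
    "input": "Table singer(name, age, country)",
})
print(prompt)
```

During fine-tuning, the model is trained to emit the target SQL after the `### Response:` marker, so inference prompts must end with that same marker.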