[Refactor] Refactor the preprocessing of datasets (#158)
* support system for template

* fix alpaca map_fn

* refactor map_fns

* refactor internlm-7b cfgs

* fix moss_sft

* fix templates

* fix cfgs

* add system for evaluate_chat_hook

* use template system

* fix bugs

* add task

* fix bugs

* rename

* fix

* update templates

* update cfgs

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update single_turn_conversation.md

* update

* fix pre-commit

* update

* add toc

* chat supports system

* Update README.md

* Update README_zh-CN.md

* fix typo

* remove chat docs

* Update README_zh-CN.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* fix pre-commit

* fix language

* update help msg

* add eos_token for qwen

* fix

* Update single_turn_conversation.md

* Update single_turn_conversation.md
LZHgrla authored Oct 12, 2023
1 parent d118ac4 commit f7a5c9c
Showing 292 changed files with 1,755 additions and 1,562 deletions.
12 changes: 10 additions & 2 deletions README.md
@@ -195,10 +195,18 @@ XTuner provides tools to chat with pretrained / fine-tuned LLMs.
xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter {NAME_OR_PATH_TO_ADAPTER} [optional arguments]
```

For example, we can start the chat with Llama2-7b with adapter trained from MOSS-003-SFT by
For example, we can start the chat with

InternLM-7B with the adapter trained on Alpaca-enzh:

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template internlm_chat --system-template alpaca
```

Llama2-7b with the adapter trained on MOSS-003-SFT:

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --system-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

For more examples, please see [chat.md](./docs/en/user_guides/chat.md).
12 changes: 10 additions & 2 deletions README_zh-CN.md
@@ -194,10 +194,18 @@ XTuner provides tools to chat with large language models.
xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter {NAME_OR_PATH_TO_ADAPTER} [optional arguments]
```

For example, chat with Llama2-7b + the MOSS-003-SFT adapter:
For example:

Chat with InternLM-7B + the Alpaca-enzh adapter:

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template internlm_chat --system-template alpaca
```

Chat with Llama2-7b + the MOSS-003-SFT adapter:

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --system-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

For more examples, please refer to the [documentation](./docs/zh_cn/user_guides/chat.md).
128 changes: 1 addition & 127 deletions docs/en/user_guides/chat.md
@@ -1,129 +1,3 @@
# Chat with fine-tuned LLMs

## Chat with [InternLM](https://github.com/InternLM/InternLM)

- InternLM-7B, oasst1

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-oasst1 --prompt-template openassistant
```

- InternLM-7B, Arxiv Gentitle

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-arxiv-gentitle --prompt-template title
```

- InternLM-7B, Colorist

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-colorist --prompt-template colorist
```

- InternLM-7B, Coder

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-coder --prompt-template code
```

- InternLM-7B, SQL

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-sql --prompt-template sql
```

- InternLM-7B, Lawyer

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-lawyer --prompt-template lawyer
```

- InternLM-7B, Open-Platypus

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-open-platypus --prompt-template alpaca
```

- InternLM-7B, Alpaca-enzh

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template alpaca
```

## Chat with [Llama2](https://github.com/facebookresearch/llama)

> Don't forget to use `huggingface-cli login` and input your access token first to access Llama2! See [here](https://huggingface.co/docs/hub/security-tokens#user-access-tokens) to learn how to obtain your access token.
- Llama2-7B, MOSS-003-SFT **(plugins!)**

```shell
export SERPER_API_KEY="xxx" # Please get the key from https://serper.dev to support google search!
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

- Llama2-7B, Arxiv Gentitle

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-arxiv-gentitle --prompt-template title
```

- Llama2-7B, Colorist

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-colorist --prompt-template colorist
```

## Chat with [Qwen](https://github.com/QwenLM)

- Qwen-7B, MOSS-003-SFT **(plugins!)**

```shell
export SERPER_API_KEY="xxx" # Please get the key from https://serper.dev to support google search!
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-moss-003-sft --bot-name Qwen --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>"
```

- Qwen-7B, oasst1

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-oasst1 --prompt-template openassistant --answer-stop-word '<|endoftext|>'
```

- Qwen-7B, Arxiv Gentitle

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-arxiv-gentitle --prompt-template title --answer-stop-word '<|endoftext|>'
```

- Qwen-7B, Alpaca-enzh

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-alpaca-enzh --prompt-template alpaca --answer-stop-word '<|endoftext|>'
```

## Chat with [Baichuan](https://github.com/baichuan-inc)

- Baichuan-7B, oasst1

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-oasst1 --prompt-template openassistant
```

- Baichuan-7B, Arxiv Gentitle

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-arxiv-gentitle --prompt-template title --no-streamer
```

- Baichuan-7B, Alpaca-enzh

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-alpaca-enzh --prompt-template alpaca
```

## Chat with [CodeLlama](https://github.com/facebookresearch/codellama)

- CodeLlama-7B, Instruct

```shell
xtuner chat codellama/CodeLlama-7b-Instruct-hf --prompt-template code_llama_chat
```
Coming soon.
64 changes: 39 additions & 25 deletions docs/en/user_guides/dataset_format.md
@@ -1,18 +1,26 @@
# Dataset Format

- [Incremental Pre-training Dataset Format](#incremental-pre-training-dataset-format)
- [Single-turn Dialogue Dataset Format](#single-turn-dialogue-dataset-format)
- [Multi-turn Dialogue Dataset Format](#multi-turn-dialogue-dataset-format)
- [Method 1](#method-1)
- [Method 2](#method-2)
- [Method in XTuner](#method-in-xtuner)

The Supervised Finetune (SFT) of large language models aims to improve the performance of pre-trained models on specific tasks through supervised fine-tuning. To support as many downstream tasks as possible, XTuner supports three dataset formats: incremental pre-training, single-turn dialogue, and multi-turn dialogue.

- The incremental pre-training dataset is used to enhance the model's capabilities in a specific domain or task.
- Single-turn and multi-turn dialogue datasets are often used in the instruction tuning stage to enhance the model's ability to respond to specific instructions.

In the instruction tuning phase, our goal is to train the language model to answer based on human instructions. **Therefore, generally only the loss of the response part (Output) is used for gradient backpropagation, while the loss of the instruction part (Input) is not used for weight updates.** Based on this, we introduce "input" and "output" fields when preprocessing the dataset. The "input" field is used to save fields that do not need to compute loss, such as user instructions, whereas the "output" field is used to save fields that do need to compute loss, such as the GroundTruth answers corresponding to input instructions.
In the instruction tuning phase, our goal is to train the language model to answer based on human instructions. **Therefore, generally only the loss of the response part (Output) is used for gradient backpropagation, while the loss of the instruction part (System, Input) is not used for weight updates.** Based on this, we introduce "system", "input" and "output" fields when preprocessing the dataset. The "system" and "input" fields store content on which no loss is computed, such as the system prompt and user instructions, whereas the "output" field stores content on which loss is computed, such as the GroundTruth answers corresponding to the input instructions.

To unify the incremental pre-training, single-turn dialogue, and multi-turn dialogue dataset formats, we set the dataset format to the following form:

```json
[{
"conversation":[
{
"system": "xxx",
"input": "xxx",
"output": "xxx"
}
@@ -21,6 +29,7 @@ To unify the incremental pre-training, single-turn dialogue, and multi-turn dial
{
"conversation":[
{
"system": "xxx",
"input": "xxx",
"output": "xxx"
},
@@ -32,22 +41,23 @@ To unify the incremental pre-training, single-turn dialogue, and multi-turn dial
}]
```

Throughout the training phase, we amalgamate several "input" and "output" pairs from a single data instance, which we then feed into the model. Loss is computed concurrently at each position, yet only the loss associated with the "output" component participates in the gradient backpropagation process. This process is elucidated in the figure below.
Throughout the training phase, we amalgamate several "system", "input" and "output" pairs from a single data instance, which we then feed into the model. Loss is computed concurrently at each position, yet only the loss associated with the "output" component participates in the gradient backpropagation process. This process is elucidated in the figure below.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/d5d696de-c026-494c-8b95-b1ba4b492939" alt="Image" width="700" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/5ac1ef47-e7e3-43c3-b6b5-5df1aceef970" alt="Image" width="700" />
</div>

Note that the <BOS> token and <EOS> token are used to indicate the start and end of a sentence or text.
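
As a concrete illustration of this masking scheme, the following is a minimal sketch (not XTuner's actual implementation; it only assumes a HuggingFace-style tokenizer) of how one "system"/"input"/"output" triple could be tokenized, with every non-output position set to the conventional `-100` ignore index so that it contributes no gradient:

```python
# Minimal sketch (illustrative only): tokenize one {"system", "input", "output"} triple
# and mask every non-output position so that it contributes no gradient.
IGNORE_INDEX = -100  # ignore index used by PyTorch's CrossEntropyLoss

def encode_triple(tokenizer, system, user_input, output):
    prompt_ids = tokenizer.encode(system + user_input, add_special_tokens=False)
    output_ids = tokenizer.encode(output, add_special_tokens=False)

    input_ids = [tokenizer.bos_token_id] + prompt_ids + output_ids + [tokenizer.eos_token_id]
    # Loss is computed only on the output tokens and the trailing <EOS>.
    labels = [IGNORE_INDEX] * (1 + len(prompt_ids)) + output_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}
```

For multi-turn data, the same idea applies: the triples of each round are concatenated in order, and only the output spans keep real labels.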

## Incremental Pre-training Dataset Format

As incremental pre-training is intended to help the model learn language knowledge and expressive abilities tailored for specific downstream tasks, the loss corresponding to the entire content of the dataset should be used for gradient backpropagation. Therefore, the "input" of the dataset is left empty, while the "output" consists of an entire piece of corpus data. The dataset format corresponding to the incremental pre-training task is shown as follows:
As incremental pre-training is intended to help the model learn language knowledge and expressive abilities tailored for specific downstream tasks, the loss corresponding to the entire content of the dataset should be used for gradient backpropagation. Therefore, the "system" and "input" of the dataset are left empty, while the "output" consists of an entire piece of corpus data. The dataset format corresponding to the incremental pre-training task is shown as follows:

```json
[{
"conversation":[
{
"system": "",
"input": "",
"output": "I am an artificial intelligence (AI) assistant named Puyu. I was created by the Shanghai AI Laboratory and my purpose is to assist users with various tasks through natural language processing technology."
}
@@ -56,6 +66,7 @@ As incremental pre-training is intended to help the model learn language knowled
{
"conversation":[
{
"system": "",
"input": "",
"output": "I am an artificial intelligence programmed to assist with various types of tasks, including answering questions, providing information, and performing automated processes."
}
@@ -69,38 +80,39 @@ As incremental pre-training is intended to help the model learn language knowled

## Single-turn Dialogue Dataset Format

The single-turn dialogue dataset typically consists of a single instruction (or question) and its corresponding GroundTruth answer. Since only the answer part should be used for gradient backpropagation, the "input" field of the dataset is the input instruction, and the "output" field is the corresponding answer. The format of the single-turn dialogue dataset is shown as follows:
The single-turn dialogue dataset typically consists of a single instruction (or question) and its corresponding GroundTruth answer. Since only the answer part should be used for gradient backpropagation, the "system" and "input" fields of the dataset hold the system prompt and the input instruction, while the "output" field holds the corresponding answer. The format of the single-turn dialogue dataset is shown as follows:

```json
[{
"conversation":
[
{
"input": "Give three tips for staying healthy.",
"output": "1.Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
"conversation":[
{
"system": "You are an AI asssistant."
"input": "Give three tips for staying healthy.",
"output": "1.Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
},
{
"conversation":
[
{
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
"conversation":[
{
"system": "You are an AI asssistant."
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
}]
```

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/91499b4e-faa2-4e7c-92ee-2fe614a8243f" alt="Image" width="700" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/6eed31aa-70e4-47c7-bfdb-20fa7a1312ea" alt="Image" width="700" />
</div>
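
For example, a record from an Alpaca-style dataset can be converted into this single-turn format with a simple map function. The sketch below is illustrative only: the `instruction` / `input` / `output` field names follow the Alpaca dataset, and the system prompt shown here is an assumption rather than XTuner's exact template.

```python
# Illustrative sketch of an Alpaca-style map function (not necessarily XTuner's exact map_fn).
ALPACA_SYSTEM = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def alpaca_style_map_fn(example):
    # Alpaca records may carry an extra "input" field; append it to the instruction.
    instruction = example["instruction"]
    if example.get("input"):
        instruction = f"{instruction}\n{example['input']}"
    return {
        "conversation": [
            {
                "system": ALPACA_SYSTEM,
                "input": instruction,
                "output": example["output"],
            }
        ]
    }
```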

## Multi-turn Dialogue Dataset Format

The multi-turn dialogue dataset typically consists of multiple rounds of instructions (or questions) and their corresponding GroundTruth answers. Suppose we have a piece of multi-turn dialogue data. For ease of introduction, for the nth round of dialogue, we set the output corresponding to User and Assistant as Usern and Assistantn.
The multi-turn dialogue dataset typically consists of multiple rounds of instructions (or questions) and their corresponding GroundTruth answers. Suppose we have a piece of multi-turn dialogue data. For ease of explanation, we denote the outputs of the User and the Assistant in the nth round of dialogue as UserN and AssistantN.

```text
System: You are an AI assistant.
User1: Hello?
Assistant1: Hello! How can I help you?
User2: What's the date today?
@@ -113,10 +125,10 @@ How can we use the above multi-turn dialogue data to train large models? Current

### Method 1

The text of User1, Assistant1, User2, Assistant2, and User3 is all considered as the input part of the model, while the text of Assistant3 is viewed as the prediction part of the model. Only the loss from the Assistant3 part is involved in the weight update.
The text of System, User1, Assistant1, User2, Assistant2, and User3 is all considered as the input part of the model, while the text of Assistant3 is viewed as the prediction part of the model. Only the loss from the Assistant3 part is involved in the weight update.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/ff4a44c4-43d7-45a7-8749-19b545f90207" alt="Image" width=1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/ce869cd5-c1ca-4bc8-9bc3-14f63abb7a5f" alt="Image" width=1100" />
</div>

The downside of this method is that the content of Assistant1 and Assistant2 does not participate in model training, so the multi-turn dialogue training data is poorly utilized.
@@ -126,7 +138,7 @@ The downside of this method is that it does not fully utilize the multi-turn dia
Split a piece of multi-turn dialogue data into multiple pieces of data. For example, the above instance can be split into the following three pieces of data.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/c0efbf9b-94bc-46ce-b500-e062c2cb59f7" alt="Image" width=1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/9fd714fc-20bd-4d4c-a4cf-3f95712f1db8" alt="Image" width=1100" />
</div>

Compared to Method 1, Method 2 can fully utilize the data from each round of dialogue, but it requires splitting one piece of data containing n rounds of dialogue into n pieces of data, which reduces training efficiency to 1/n of the original.
@@ -136,7 +148,7 @@ Compared to Method 1, Method 2 can fully utilize the data from each round of dia
When XTuner trains multi-turn dialogue models, it adopts a more comprehensive and efficient method, as shown in the figure below.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/caaac51f-e982-46db-8f68-6ce28f343183" alt="Image" width=1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/ec67b610-a3b2-4fa7-91ad-a9a235fdb820" alt="Image" width=1100" />
</div>

We concatenate multi-turn dialogues, then input them into the model. The loss at each position is computed in parallel, but only the loss from the Output part participates in backpropagation. Therefore, the format of the multi-turn dialogue dataset in XTuner is shown as follows:
@@ -145,6 +157,7 @@ We concatenate multi-turn dialogues, then input them into the model. The loss at
[{
"conversation":[
{
"system": "You are an AI asssistant."
"input": "Hello?",
"output": "Hello! How can I help you?"
},
@@ -161,6 +174,7 @@ We concatenate multi-turn dialogues, then input them into the model. The loss at
{
"conversation":[
{
"system": "You are an AI asssistant."
"input": "Hello?",
"output": "Hello! How can I help you?"
},
6 changes: 6 additions & 0 deletions docs/en/user_guides/dataset_prepare.md
@@ -1,5 +1,11 @@
# Dataset Prepare

- [HuggingFace datasets](#huggingface-datasets)
- [Others](#others)
- [Arxiv Gentitle](#arxiv-gentitle)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)

## HuggingFace datasets

For datasets on HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), you can quickly utilize them. For more details, please refer to [single_turn_conversation.md](./single_turn_conversation.md) and [multi_turn_conversation.md](./multi_turn_conversation.md).
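
For instance, the alpaca dataset can be pulled straight from the Hub with the `datasets` library. The snippet below is a minimal sketch; in practice the training configs typically handle this for you.

```python
# Minimal sketch: load a HuggingFace Hub dataset and inspect one record.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0])  # expected keys: 'instruction', 'input', 'output', 'text'
```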
