[Refactor] Refactor the preprocessing of datasets (#158)
* support system for template

* fix alpaca map_fn

* refactor map_fns

* refactor internlm-7b cfgs

* fix moss_sft

* fix templates

* fix cfgs

* add system for evaluate_chat_hook

* use template system

* fix bugs

* add task

* fix bugs

* rename

* fix

* update templates

* update cfgs

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update dataset_format.md

* Update single_turn_conversation.md

* update

* fix pre-commit

* update

* add toc

* chat supports system

* Update README.md

* Update README_zh-CN.md

* fix typo

* remove chat docs

* Update README_zh-CN.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* fix pre-commit

* fix language

* update help msg

* add eos_token for qwen

* fix

* Update single_turn_conversation.md

* Update single_turn_conversation.md
LZHgrla authored Oct 12, 2023
1 parent d118ac4 commit f7a5c9c
Showing 292 changed files with 1,755 additions and 1,562 deletions.
12 changes: 10 additions & 2 deletions README.md
@@ -195,10 +195,18 @@ XTuner provides tools to chat with pretrained / fine-tuned LLMs.
xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter {NAME_OR_PATH_TO_ADAPTER} [optional arguments]
```

For example, we can start the chat with Llama2-7b with adapter trained from MOSS-003-SFT by
For example, we can start the chat with

InternLM-7B with the adapter trained on Alpaca-enzh:

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template internlm_chat --system-template alpaca
```

Llama2-7b with the adapter trained on MOSS-003-SFT:

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --system-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

For more examples, please see [chat.md](./docs/en/user_guides/chat.md).
12 changes: 10 additions & 2 deletions README_zh-CN.md
@@ -194,10 +194,18 @@ XTuner provides tools to chat with large language models.
xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter {NAME_OR_PATH_TO_ADAPTER} [optional arguments]
```

For example, chat with Llama2-7b + the MOSS-003-SFT adapter:
For example:

Chat with InternLM-7B + the Alpaca-enzh adapter:

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template internlm_chat --system-template alpaca
```

Chat with Llama2-7b + the MOSS-003-SFT adapter:

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --system-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

For more examples, please refer to the [documentation](./docs/zh_cn/user_guides/chat.md).
128 changes: 1 addition & 127 deletions docs/en/user_guides/chat.md
@@ -1,129 +1,3 @@
# Chat with fine-tuned LLMs

## Chat with [InternLM](https://github.com/InternLM/InternLM)

- InternLM-7B, oasst1

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-oasst1 --prompt-template openassistant
```

- InternLM-7B, Arxiv Gentitle

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-arxiv-gentitle --prompt-template title
```

- InternLM-7B, Colorist

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-colorist --prompt-template colorist
```

- InternLM-7B, Coder

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-coder --prompt-template code
```

- InternLM-7B, SQL

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-sql --prompt-template sql
```

- InternLM-7B, Lawyer

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-lawyer --prompt-template lawyer
```

- InternLM-7B, Open-Platypus

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-open-platypus --prompt-template alpaca
```

- InternLM-7B, Alpaca-enzh

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template alpaca
```

## Chat with [Llama2](https://github.com/facebookresearch/llama)

> Don't forget to use `huggingface-cli login` and input your access token first to access Llama2! See [here](https://huggingface.co/docs/hub/security-tokens#user-access-tokens) to learn how to obtain your access token.
- Llama2-7B, MOSS-003-SFT **(plugins!)**

```shell
export SERPER_API_KEY="xxx" # Please get the key from https://serper.dev to support google search!
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

- Llama2-7B, Arxiv Gentitle

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-arxiv-gentitle --prompt-template title
```

- Llama2-7B, Colorist

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-colorist --prompt-template colorist
```

## Chat with [Qwen](https://github.com/QwenLM)

- Qwen-7B, MOSS-003-SFT **(plugins!)**

```shell
export SERPER_API_KEY="xxx" # Please get the key from https://serper.dev to support google search!
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-moss-003-sft --bot-name Qwen --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>"
```

- Qwen-7B, oasst1

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-oasst1 --prompt-template openassistant --answer-stop-word '<|endoftext|>'
```

- Qwen-7B, Arxiv Gentitle

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-arxiv-gentitle --prompt-template title --answer-stop-word '<|endoftext|>'
```

- Qwen-7B, Alpaca-enzh

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-alpaca-enzh --prompt-template alpaca --answer-stop-word '<|endoftext|>'
```

## Chat with [Baichuan](https://github.com/baichuan-inc)

- Baichuan-7B, oasst1

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-oasst1 --prompt-template openassistant
```

- Baichuan-7B, Arxiv Gentitle

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-arxiv-gentitle --prompt-template title --no-streamer
```

- Baichuan-7B, Alpaca-enzh

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-alpaca-enzh --prompt-template alpaca
```

## Chat with [CodeLlama](https://github.com/facebookresearch/codellama)

- CodeLlama-7B, Instruct

```shell
xtuner chat codellama/CodeLlama-7b-Instruct-hf --prompt-template code_llama_chat
```
Coming soon.
64 changes: 39 additions & 25 deletions docs/en/user_guides/dataset_format.md
@@ -1,18 +1,26 @@
# Dataset Format

- [Incremental Pre-training Dataset Format](#incremental-pre-training-dataset-format)
- [Single-turn Dialogue Dataset Format](#single-turn-dialogue-dataset-format)
- [Multi-turn Dialogue Dataset Format](#multi-turn-dialogue-dataset-format)
- [Method 1](#method-1)
- [Method 2](#method-2)
- [Method in XTuner](#method-in-xtuner)

The Supervised Finetune (SFT) of large language models aims to improve the performance of pre-trained models on specific tasks through supervised fine-tuning. To support as many downstream tasks as possible, XTuner supports three dataset formats: incremental pre-training, single-turn dialogue, and multi-turn dialogue.

- The incremental pre-training dataset is used to enhance the model's capabilities in a specific domain or task.
- Single-turn and multi-turn dialogue datasets are often used in the instruction tuning stage to enhance the model's ability to respond to specific instructions.

In the instruction tuning phase, our goal is to train the language model to answer based on human instructions. **Therefore, generally only the loss of the response part (Output) is used for gradient backpropagation, while the loss of the instruction part (Input) is not used for weight updates.** Based on this, we introduce "input" and "output" fields when preprocessing the dataset. The "input" field is used to save fields that do not need to compute loss, such as user instructions, whereas the "output" field is used to save fields that do need to compute loss, such as the GroundTruth answers corresponding to input instructions.
In the instruction tuning phase, our goal is to train the language model to answer based on human instructions. **Therefore, generally only the loss of the response part (Output) is used for gradient backpropagation, while the loss of the instruction part (System, Input) is not used for weight updates.** Based on this, we introduce "system", "input" and "output" fields when preprocessing the dataset. The "system" and "input" fields store content on which no loss is computed, such as the system prompt and user instructions, whereas the "output" field stores content on which loss is computed, such as the GroundTruth answers corresponding to the input instructions.

To unify the incremental pre-training, single-turn dialogue, and multi-turn dialogue dataset formats, we set the dataset format to the following form:

```json
[{
"conversation":[
{
"system": "xxx",
"input": "xxx",
"output": "xxx"
}
@@ -21,6 +29,7 @@ To unify the incremental pre-training, single-turn dialogue, and multi-turn dial
{
"conversation":[
{
"system": "xxx",
"input": "xxx",
"output": "xxx"
},
@@ -32,22 +41,23 @@ To unify the incremental pre-training, single-turn dialogue, and multi-turn dial
}]
```

Throughout the training phase, we amalgamate several "input" and "output" pairs from a single data instance, which we then feed into the model. Loss is computed concurrently at each position, yet only the loss associated with the "output" component participates in the gradient backpropagation process. This process is elucidated in the figure below.
Throughout the training phase, we amalgamate several "system", "input" and "output" pairs from a single data instance, which we then feed into the model. Loss is computed concurrently at each position, yet only the loss associated with the "output" component participates in the gradient backpropagation process. This process is elucidated in the figure below.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/d5d696de-c026-494c-8b95-b1ba4b492939" alt="Image" width="700" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/5ac1ef47-e7e3-43c3-b6b5-5df1aceef970" alt="Image" width="700" />
</div>

Note that the <BOS> token and <EOS> token are used to indicate the start and end of a sentence or text.
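
As a concrete illustration of this masking scheme, the following is a minimal sketch (not XTuner's actual implementation; it only assumes a HuggingFace-style tokenizer) of how one "system"/"input"/"output" triple could be tokenized, with every non-output position set to the conventional `-100` ignore index so that it contributes no gradient:

```python
# Minimal sketch (illustrative only): tokenize one {"system", "input", "output"} triple
# and mask every non-output position so that it contributes no gradient.
IGNORE_INDEX = -100  # ignore index used by PyTorch's CrossEntropyLoss

def encode_triple(tokenizer, system, user_input, output):
    prompt_ids = tokenizer.encode(system + user_input, add_special_tokens=False)
    output_ids = tokenizer.encode(output, add_special_tokens=False)

    input_ids = [tokenizer.bos_token_id] + prompt_ids + output_ids + [tokenizer.eos_token_id]
    # Loss is computed only on the output tokens and the trailing <EOS>.
    labels = [IGNORE_INDEX] * (1 + len(prompt_ids)) + output_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}
```

For multi-turn data, the same idea applies: the triples of each round are concatenated in order, and only the output spans keep real labels.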

## Incremental Pre-training Dataset Format

As incremental pre-training is intended to help the model learn language knowledge and expressive abilities tailored for specific downstream tasks, the loss corresponding to the entire content of the dataset should be used for gradient backpropagation. Therefore, the "input" of the dataset is left empty, while the "output" consists of an entire piece of corpus data. The dataset format corresponding to the incremental pre-training task is shown as follows:
As incremental pre-training is intended to help the model learn language knowledge and expressive abilities tailored for specific downstream tasks, the loss corresponding to the entire content of the dataset should be used for gradient backpropagation. Therefore, the "system" and "input" of the dataset are left empty, while the "output" consists of an entire piece of corpus data. The dataset format corresponding to the incremental pre-training task is shown as follows:

```json
[{
"conversation":[
{
"system": "",
"input": "",
"output": "I am an artificial intelligence (AI) assistant named Puyu. I was created by the Shanghai AI Laboratory and my purpose is to assist users with various tasks through natural language processing technology."
}
@@ -56,6 +66,7 @@ As incremental pre-training is intended to help the model learn language knowled
{
"conversation":[
{
"system": "",
"input": "",
"output": "I am an artificial intelligence programmed to assist with various types of tasks, including answering questions, providing information, and performing automated processes."
}
@@ -69,38 +80,39 @@ As incremental pre-training is intended to help the model learn language knowled

## Single-turn Dialogue Dataset Format

The single-turn dialogue dataset typically consists of a single instruction (or question) and its corresponding GroundTruth answer. Since only the answer part should be used for gradient backpropagation, the "input" field of the dataset is the input instruction, and the "output" field is the corresponding answer. The format of the single-turn dialogue dataset is shown as follows:
The single-turn dialogue dataset typically consists of a single instruction (or question) and its corresponding GroundTruth answer. Since only the answer part should be used for gradient backpropagation, the "system" and "input" fields of the dataset hold the system prompt and the input instruction, while the "output" field holds the corresponding answer. The format of the single-turn dialogue dataset is shown as follows:

```json
[{
"conversation":
[
{
"input": "Give three tips for staying healthy.",
"output": "1.Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
"conversation":[
{
"system": "You are an AI asssistant."
"input": "Give three tips for staying healthy.",
"output": "1.Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
},
{
"conversation":
[
{
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
"conversation":[
{
"system": "You are an AI asssistant."
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
}]
```

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/91499b4e-faa2-4e7c-92ee-2fe614a8243f" alt="Image" width="700" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/6eed31aa-70e4-47c7-bfdb-20fa7a1312ea" alt="Image" width="700" />
</div>
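
For example, a record from an Alpaca-style dataset can be converted into this single-turn format with a simple map function. The sketch below is illustrative only: the `instruction` / `input` / `output` field names follow the Alpaca dataset, and the system prompt shown here is an assumption rather than XTuner's exact template.

```python
# Illustrative sketch of an Alpaca-style map function (not necessarily XTuner's exact map_fn).
ALPACA_SYSTEM = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def alpaca_style_map_fn(example):
    # Alpaca records may carry an extra "input" field; append it to the instruction.
    instruction = example["instruction"]
    if example.get("input"):
        instruction = f"{instruction}\n{example['input']}"
    return {
        "conversation": [
            {
                "system": ALPACA_SYSTEM,
                "input": instruction,
                "output": example["output"],
            }
        ]
    }
```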

## Multi-turn Dialogue Dataset Format

The multi-turn dialogue dataset typically consists of multiple rounds of instructions (or questions) and their corresponding GroundTruth answers. Suppose we have a piece of multi-turn dialogue data. For ease of introduction, for the nth round of dialogue, we set the output corresponding to User and Assistant as Usern and Assistantn.
The multi-turn dialogue dataset typically consists of multiple rounds of instructions (or questions) and their corresponding GroundTruth answers. Suppose we have a piece of multi-turn dialogue data. For ease of explanation, we denote the outputs of the User and the Assistant in the nth round of dialogue as UserN and AssistantN.

```text
System: You are an AI assistant.
User1: Hello?
Assistant1: Hello! How can I help you?
User2: What's the date today?
@@ -113,10 +125,10 @@ How can we use the above multi-turn dialogue data to train large models? Current

### Method 1

The text of User1, Assistant1, User2, Assistant2, and User3 is all considered as the input part of the model, while the text of Assistant3 is viewed as the prediction part of the model. Only the loss from the Assistant3 part is involved in the weight update.
The text of System, User1, Assistant1, User2, Assistant2, and User3 is all considered as the input part of the model, while the text of Assistant3 is viewed as the prediction part of the model. Only the loss from the Assistant3 part is involved in the weight update.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/ff4a44c4-43d7-45a7-8749-19b545f90207" alt="Image" width=1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/ce869cd5-c1ca-4bc8-9bc3-14f63abb7a5f" alt="Image" width=1100" />
</div>

The downside of this method is that the content of Assistant1 and Assistant2 does not participate in model training, so the multi-turn dialogue training data is poorly utilized.
@@ -126,7 +138,7 @@ The downside of this method is that it does not fully utilize the multi-turn dia
Split a piece of multi-turn dialogue data into multiple pieces of data. For example, the above instance can be split into the following three pieces of data.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/c0efbf9b-94bc-46ce-b500-e062c2cb59f7" alt="Image" width=1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/9fd714fc-20bd-4d4c-a4cf-3f95712f1db8" alt="Image" width=1100" />
</div>

Compared to Method 1, Method 2 can fully utilize the data from each round of dialogue, but it requires splitting one piece of data containing n rounds of dialogue into n pieces of data, which reduces training efficiency to 1/n of the original.
@@ -136,7 +148,7 @@ Compared to Method 1, Method 2 can fully utilize the data from each round of dia
When XTuner trains multi-turn dialogue models, it adopts a more comprehensive and efficient method, as shown in the figure below.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/caaac51f-e982-46db-8f68-6ce28f343183" alt="Image" width=1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/ec67b610-a3b2-4fa7-91ad-a9a235fdb820" alt="Image" width=1100" />
</div>

We concatenate multi-turn dialogues, then input them into the model. The loss at each position is computed in parallel, but only the loss from the Output part participates in backpropagation. Therefore, the format of the multi-turn dialogue dataset in XTuner is shown as follows:
@@ -145,6 +157,7 @@ We concatenate multi-turn dialogues, then input them into the model. The loss at
[{
"conversation":[
{
"system": "You are an AI asssistant."
"input": "Hello?",
"output": "Hello! How can I help you?"
},
@@ -161,6 +174,7 @@ We concatenate multi-turn dialogues, then input them into the model. The loss at
{
"conversation":[
{
"system": "You are an AI asssistant."
"input": "Hello?",
"output": "Hello! How can I help you?"
},
6 changes: 6 additions & 0 deletions docs/en/user_guides/dataset_prepare.md
@@ -1,5 +1,11 @@
# Dataset Prepare

- [HuggingFace datasets](#huggingface-datasets)
- [Others](#others)
- [Arxiv Gentitle](#arxiv-gentitle)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)

## HuggingFace datasets

For datasets on HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), you can quickly utilize them. For more details, please refer to [single_turn_conversation.md](./single_turn_conversation.md) and [multi_turn_conversation.md](./multi_turn_conversation.md).
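
For instance, the alpaca dataset can be pulled straight from the Hub with the `datasets` library. The snippet below is a minimal sketch; in practice the training configs typically handle this for you.

```python
# Minimal sketch: load a HuggingFace Hub dataset and inspect one record.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0])  # expected keys: 'instruction', 'input', 'output', 'text'
```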
