[Refactor] Refactor the preprocess of dataset #158

Merged: 45 commits, Oct 12, 2023

Changes from all commits

Commits (45)
726e400
support system for template
LZHgrla Oct 7, 2023
a574ded
fix alpaca map_fn
LZHgrla Oct 7, 2023
67da13f
refactor map_fns
LZHgrla Oct 8, 2023
7c27083
refactor internlm-7b cfgs
LZHgrla Oct 8, 2023
372046f
fix moss_sft
LZHgrla Oct 8, 2023
2f38336
fix templates
LZHgrla Oct 8, 2023
7fca1c9
fix cfgs
LZHgrla Oct 8, 2023
72ace32
add system for evaluate_chat_hook
LZHgrla Oct 8, 2023
218b58f
use template system
LZHgrla Oct 8, 2023
ec4ac45
fix bugs
LZHgrla Oct 8, 2023
dd70b76
add task
LZHgrla Oct 8, 2023
79b9f39
fix bugs
LZHgrla Oct 8, 2023
625aded
rename
LZHgrla Oct 9, 2023
06023d7
fix
LZHgrla Oct 9, 2023
e412558
update templates
LZHgrla Oct 9, 2023
8641b07
update cfgs
LZHgrla Oct 9, 2023
a60389d
Update dataset_format.md
LZHgrla Oct 10, 2023
9a533cf
Update dataset_format.md
LZHgrla Oct 10, 2023
752a798
Update dataset_format.md
LZHgrla Oct 10, 2023
7ae9a4e
Update dataset_format.md
LZHgrla Oct 10, 2023
961f0f9
Update dataset_format.md
LZHgrla Oct 10, 2023
f1d19bd
Update dataset_format.md
LZHgrla Oct 10, 2023
f4d7323
Update dataset_format.md
LZHgrla Oct 10, 2023
fee3832
Update dataset_format.md
LZHgrla Oct 10, 2023
cd941f2
Update single_turn_conversation.md
LZHgrla Oct 10, 2023
d3cbbb6
update
LZHgrla Oct 10, 2023
4ba1168
fix pre-commit
LZHgrla Oct 10, 2023
2e63bd5
update
LZHgrla Oct 10, 2023
8e5b38e
add toc
LZHgrla Oct 10, 2023
9f49204
chat supports system
LZHgrla Oct 10, 2023
13706d3
Update README.md
LZHgrla Oct 10, 2023
b68c3a2
Update README_zh-CN.md
LZHgrla Oct 10, 2023
9eb9356
fix typo
LZHgrla Oct 10, 2023
da3ffec
remove chat docs
LZHgrla Oct 10, 2023
624812b
Update README_zh-CN.md
LZHgrla Oct 10, 2023
a1427d4
Update README.md
LZHgrla Oct 10, 2023
253961d
Update README.md
LZHgrla Oct 10, 2023
d011593
Update README_zh-CN.md
LZHgrla Oct 10, 2023
618d833
fix pre-commit
LZHgrla Oct 10, 2023
37a0827
fix language
LZHgrla Oct 10, 2023
292be5f
update help msg
LZHgrla Oct 11, 2023
fe62737
add eos_token for qwen
LZHgrla Oct 11, 2023
390be58
fix
LZHgrla Oct 11, 2023
3a08d72
Update single_turn_conversation.md
LZHgrla Oct 11, 2023
3045aa4
Update single_turn_conversation.md
LZHgrla Oct 11, 2023
12 changes: 10 additions & 2 deletions README.md
@@ -195,10 +195,18 @@ XTuner provides tools to chat with pretrained / fine-tuned LLMs.
xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter {NAME_OR_PATH_TO_ADAPTER} [optional arguments]
```

For example, we can start the chat with Llama2-7b with adapter trained from MOSS-003-SFT by
For example, we can start the chat with

InternLM-7B with adapter trained from Alpaca-enzh:

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template internlm_chat --system-template alpaca
```

Llama2-7b with adapter trained from MOSS-003-SFT:

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --system-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

For more examples, please see [chat.md](./docs/en/user_guides/chat.md).
12 changes: 10 additions & 2 deletions README_zh-CN.md
@@ -194,10 +194,18 @@ XTuner provides tools to chat with large language models.
xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter {NAME_OR_PATH_TO_ADAPTER} [optional arguments]
```

For example, chat with Llama2-7b + MOSS-003-SFT adapter:
For example:

Chat with InternLM-7B + Alpaca-enzh adapter:

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template internlm_chat --system-template alpaca
```

Chat with Llama2-7b + MOSS-003-SFT adapter:

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --system-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

For more examples, please see the [documentation](./docs/zh_cn/user_guides/chat.md).
128 changes: 1 addition & 127 deletions docs/en/user_guides/chat.md
@@ -1,129 +1,3 @@
# Chat with fine-tuned LLMs

## Chat with [InternLM](https://github.com/InternLM/InternLM)

- InternLM-7B, oasst1

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-oasst1 --prompt-template openassistant
```

- InternLM-7B, Arxiv Gentitle

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-arxiv-gentitle --prompt-template title
```

- InternLM-7B, Colorist

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-colorist --prompt-template colorist
```

- InternLM-7B, Coder

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-coder --prompt-template code
```

- InternLM-7B, SQL

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-sql --prompt-template sql
```

- InternLM-7B, Lawyer

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-lawyer --prompt-template lawyer
```

- InternLM-7B, Open-Platypus

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-open-platypus --prompt-template alpaca
```

- InternLM-7B, Alpaca-enzh

```shell
xtuner chat internlm/internlm-7b --adapter xtuner/internlm-7b-qlora-alpaca-enzh --prompt-template alpaca
```

## Chat with [Llama2](https://github.com/facebookresearch/llama)

> Don't forget to use `huggingface-cli login` and input your access token first to access Llama2! See [here](https://huggingface.co/docs/hub/security-tokens#user-access-tokens) to learn how to obtain your access token.

- Llama2-7B, MOSS-003-SFT **(plugins!)**

```shell
export SERPER_API_KEY="xxx" # Please get the key from https://serper.dev to support google search!
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-sft --bot-name Llama2 --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>" --no-streamer
```

- Llama2-7B, Arxiv Gentitle

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-arxiv-gentitle --prompt-template title
```

- Llama2-7B, Colorist

```shell
xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-colorist --prompt-template colorist
```

## Chat with [Qwen](https://github.com/QwenLM)

- Qwen-7B, MOSS-003-SFT **(plugins!)**

```shell
export SERPER_API_KEY="xxx" # Please get the key from https://serper.dev to support google search!
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-moss-003-sft --bot-name Qwen --prompt-template moss_sft --with-plugins calculate solve search --command-stop-word "<eoc>" --answer-stop-word "<eom>"
```

- Qwen-7B, oasst1

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-oasst1 --prompt-template openassistant --answer-stop-word '<|endoftext|>'
```

- Qwen-7B, Arxiv Gentitle

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-arxiv-gentitle --prompt-template title --answer-stop-word '<|endoftext|>'
```

- Qwen-7B, Alpaca-enzh

```shell
xtuner chat Qwen/Qwen-7B --adapter xtuner/Qwen-7B-qlora-alpaca-enzh --prompt-template alpaca --answer-stop-word '<|endoftext|>'
```

## Chat with [Baichuan](https://github.com/baichuan-inc)

- Baichuan-7B, oasst1

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-oasst1 --prompt-template openassistant
```

- Baichuan-7B, Arxiv Gentitle

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-arxiv-gentitle --prompt-template title --no-streamer
```

- Baichuan-7B, Alpaca-enzh

```shell
xtuner chat baichuan-inc/Baichuan-7B --adapter xtuner/Baichuan-7B-qlora-alpaca-enzh --prompt-template alpaca
```

## Chat with [CodeLlama](https://github.com/facebookresearch/codellama)

- CodeLlama-7B, Instruct

```shell
xtuner chat codellama/CodeLlama-7b-Instruct-hf --prompt-template code_llama_chat
```
Coming soon.
64 changes: 39 additions & 25 deletions docs/en/user_guides/dataset_format.md
@@ -1,18 +1,26 @@
# Dataset Format

- [Incremental Pre-training Dataset Format](#incremental-pre-training-dataset-format)
- [Single-turn Dialogue Dataset Format](#single-turn-dialogue-dataset-format)
- [Multi-turn Dialogue Dataset Format](#multi-turn-dialogue-dataset-format)
- [Method 1](#method-1)
- [Method 2](#method-2)
- [Method in XTuner](#method-in-xtuner)

Supervised fine-tuning (SFT) aims to improve the performance of pre-trained large language models on specific tasks by training them on labeled instruction data. To support as many downstream tasks as possible, XTuner supports three dataset formats: incremental pre-training, single-turn dialogue, and multi-turn dialogue.

- The incremental pre-training dataset is used to enhance the model's capabilities in a specific domain or task.
- Single-turn and multi-turn dialogue datasets are often used in the instruction tuning stage to enhance the model's ability to respond to specific instructions.

In the instruction tuning phase, our goal is to train the language model to answer based on human instructions. **Therefore, generally only the loss of the response part (Output) is used for gradient backpropagation, while the loss of the instruction part (Input) is not used for weight updates.** Based on this, we introduce "input" and "output" fields when preprocessing the dataset. The "input" field is used to save fields that do not need to compute loss, such as user instructions, whereas the "output" field is used to save fields that do need to compute loss, such as the GroundTruth answers corresponding to input instructions.
In the instruction tuning phase, our goal is to train the language model to answer based on human instructions. **Therefore, generally only the loss of the response part (Output) is used for gradient backpropagation, while the loss of the instruction part (System, Input) is not used for weight updates.** Based on this, we introduce "system", "input" and "output" fields when preprocessing the dataset. The "system" and "input" fields store content that does not need to compute loss, such as the system prompt and user instructions, whereas the "output" field stores content that does need to compute loss, such as the GroundTruth answers corresponding to the input instructions.

To unify the incremental pre-training, single-turn dialogue, and multi-turn dialogue dataset formats, we set the dataset format to the following form:

```json
[{
"conversation":[
{
"system": "xxx",
"input": "xxx",
"output": "xxx"
}
@@ -21,6 +29,7 @@ To unify the incremental pre-training, single-turn dialogue, and multi-turn dial
{
"conversation":[
{
"system": "xxx",
"input": "xxx",
"output": "xxx"
},
@@ -32,22 +41,23 @@ To unify the incremental pre-training, single-turn dial
}]
```

Throughout the training phase, we amalgamate several "input" and "output" pairs from a single data instance, which we then feed into the model. Loss is computed concurrently at each position, yet only the loss associated with the "output" component participates in the gradient backpropagation process. This process is elucidated in the figure below.
Throughout the training phase, we concatenate the "system", "input" and "output" fields from a single data instance and feed the result into the model. Loss is computed at every position in parallel, yet only the loss associated with the "output" parts participates in gradient backpropagation. This process is illustrated in the figure below.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/d5d696de-c026-494c-8b95-b1ba4b492939" alt="Image" width="700" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/5ac1ef47-e7e3-43c3-b6b5-5df1aceef970" alt="Image" width="700" />
</div>

Note that the <BOS> token and <EOS> token are used to indicate the start and end of a sentence or text, respectively.
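
A minimal sketch of this masking scheme in plain Python is shown below. The whitespace "tokenizer", the placeholder special-token ids, and the use of -100 as the ignored label (the default `ignore_index` of `torch.nn.CrossEntropyLoss`) are illustrative assumptions, not XTuner's actual preprocessing code.

```python
# Illustrative sketch: turn a list of {"system", "input", "output"} turns into
# (input_ids, labels) so that only the "output" tokens contribute to the loss.
IGNORE_INDEX = -100  # label value conventionally ignored by the loss function
BOS, EOS = 1, 2      # placeholder special-token ids

_VOCAB = {}

def toy_encode(text):
    """Stand-in for a real tokenizer: map each word to a stable integer id."""
    return [_VOCAB.setdefault(tok, len(_VOCAB) + 3) for tok in text.split()]

def build_sample(turns):
    input_ids, labels = [BOS], [IGNORE_INDEX]
    for turn in turns:
        prompt_ids = toy_encode(turn.get("system", "") + " " + turn["input"])
        output_ids = toy_encode(turn["output"]) + [EOS]
        input_ids += prompt_ids + output_ids
        labels += [IGNORE_INDEX] * len(prompt_ids) + output_ids  # loss on outputs only
    return {"input_ids": input_ids, "labels": labels}

sample = build_sample([{
    "system": "You are an AI assistant.",
    "input": "Give three tips for staying healthy.",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}])
assert len(sample["input_ids"]) == len(sample["labels"])
```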

## Incremental Pre-training Dataset Format

As incremental pre-training is intended to help the model learn language knowledge and expressive abilities tailored for specific downstream tasks, the loss corresponding to the entire content of the dataset should be used for gradient backpropagation. Therefore, the "input" of the dataset is left empty, while the "output" consists of an entire piece of corpus data. The dataset format corresponding to the incremental pre-training task is shown as follows:
As incremental pre-training is intended to help the model learn language knowledge and expressive abilities tailored for specific downstream tasks, the loss corresponding to the entire content of the dataset should be used for gradient backpropagation. Therefore, the "system" and "input" of the dataset are left empty, while the "output" consists of an entire piece of corpus data. The dataset format corresponding to the incremental pre-training task is shown as follows:

```json
[{
"conversation":[
{
"system": "",
"input": "",
"output": "I am an artificial intelligence (AI) assistant named Puyu. I was created by the Shanghai AI Laboratory and my purpose is to assist users with various tasks through natural language processing technology."
}
@@ -56,6 +66,7 @@ As incremental pre-training is intended to help the model learn language knowled
{
"conversation":[
{
"system": "",
"input": "",
"output": "I am an artificial intelligence programmed to assist with various types of tasks, including answering questions, providing information, and performing automated processes."
}
@@ -69,38 +80,39 @@ As incremental pre-training is intended to help the model learn language knowled

## Single-turn Dialogue Dataset Format

The single-turn dialogue dataset typically consists of a single instruction (or question) and its corresponding GroundTruth answer. Since only the answer part should be used for gradient backpropagation, the "input" field of the dataset is the input instruction, and the "output" field is the corresponding answer. The format of the single-turn dialogue dataset is shown as follows:
The single-turn dialogue dataset typically consists of a single instruction (or question) and its corresponding GroundTruth answer. Since only the answer part should be used for gradient backpropagation, the "system" and "input" fields of the dataset hold the system prompt and the input instruction, and the "output" field holds the corresponding answer. The format of the single-turn dialogue dataset is shown as follows:

```json
[{
"conversation":
[
{
"input": "Give three tips for staying healthy.",
"output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
"conversation":[
{
"system": "You are an AI assistant.",
"input": "Give three tips for staying healthy.",
"output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
]
},
{
"conversation":
[
{
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
"conversation":[
{
"system": "You are an AI assistant.",
"input": "How to study English?",
"output": "1. Set clear goals. 2. Create a study plan. 3. Build vocabulary. 4. Practice speaking."
}
]
}]
```

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/91499b4e-faa2-4e7c-92ee-2fe614a8243f" alt="Image" width="700" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/6eed31aa-70e4-47c7-bfdb-20fa7a1312ea" alt="Image" width="700" />
</div>
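
As a concrete example of producing this format, a conversion function for an alpaca-style record might look like the sketch below. It only mirrors the spirit of the map_fns refactored in this PR; the function name, the `instruction`/`input`/`output` column names, and the system prompt text are assumptions for illustration, not XTuner's exact implementation.

```python
# Sketch: map one alpaca-style record into the single-turn conversation format.
ALPACA_SYSTEM = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def alpaca_style_map_fn(example):
    # Merge the instruction with the optional "input" column into one user turn.
    user_text = example["instruction"]
    if example.get("input"):
        user_text = f"{user_text}\n{example['input']}"
    return {
        "conversation": [{
            "system": ALPACA_SYSTEM,
            "input": user_text,
            "output": example["output"],
        }]
    }

print(alpaca_style_map_fn({
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}))
```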

## Multi-turn Dialogue Dataset Format

The multi-turn dialogue dataset typically consists of multiple rounds of instructions (or questions) and their corresponding GroundTruth answers. Suppose we have a piece of multi-turn dialogue data. For ease of introduction, for the nth round of dialogue, we set the output corresponding to User and Assistant as Usern and Assistantn.
The multi-turn dialogue dataset typically consists of multiple rounds of instructions (or questions) and their corresponding GroundTruth answers. Suppose we have a piece of multi-turn dialogue data. For ease of introduction, for the nth round of dialogue, we set the output corresponding to User and Assistant as UserN and AssistantN.

```text
System: You are an AI assistant.
User1:Hello?
Assistant1:Hello! How can I help you?
User2:What's the date today?
@@ -113,10 +125,10 @@ How can we use the above multi-turn dialogue data to train large models? Current

### Method 1

The text of User1, Assistant1, User2, Assistant2, and User3 is all considered as the input part of the model, while the text of Assistant3 is viewed as the prediction part of the model. Only the loss from the Assistant3 part is involved in the weight update.
The text of System, User1, Assistant1, User2, Assistant2, and User3 is all considered as the input part of the model, while the text of Assistant3 is viewed as the prediction part of the model. Only the loss from the Assistant3 part is involved in the weight update.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/ff4a44c4-43d7-45a7-8749-19b545f90207" alt="Image" width="1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/ce869cd5-c1ca-4bc8-9bc3-14f63abb7a5f" alt="Image" width="1100" />
</div>

The downside of this method is that it does not fully utilize the multi-turn dialogue training data because the content of Assistant1 and Assistant2 does not participate in model training, leading to a low utilization rate of training data.
@@ -126,7 +138,7 @@ The downside of this method is that it does not fully utilize the multi-turn dia
Split a piece of multi-turn dialogue data into multiple pieces of data. For example, the above instance can be split into the following three pieces of data.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/c0efbf9b-94bc-46ce-b500-e062c2cb59f7" alt="Image" width="1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/9fd714fc-20bd-4d4c-a4cf-3f95712f1db8" alt="Image" width="1100" />
</div>

Compared to Method 1, Method 2 can fully utilize the data from each round of dialogue, but it requires splitting one piece of data containing n rounds of dialogue into n pieces of data, which reduces the training efficiency by 1/n.
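
To make that cost concrete, the sketch below splits an n-round conversation into n prefix samples, each training on a single answer. It is plain illustrative Python with made-up sample data, not XTuner code.

```python
# Sketch of "Method 2": the k-th sample keeps rounds 1..k-1 as loss-free
# context and trains only on the k-th answer.
def split_multi_turn(conversation):
    samples = []
    for k in range(1, len(conversation) + 1):
        prefix = conversation[:k]
        samples.append({
            "context": prefix[:-1],     # earlier rounds, no loss computed
            "target_turn": prefix[-1],  # only this output is trained on
        })
    return samples

dialogue = [
    {"input": "Hello?", "output": "Hello! How can I help you?"},
    {"input": "Question 2", "output": "Answer 2"},
    {"input": "Question 3", "output": "Answer 3"},
]
print(len(split_multi_turn(dialogue)))  # 3 forward passes to train 3 answers
```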
@@ -136,7 +148,7 @@ Compared to Method 1, Method 2 can fully utilize the data from each round of dia
When XTuner trains multi-turn dialogue models, it adopts a more comprehensive and efficient method, as shown in the figure below.

<div align="center">
<img src="https://github.com/open-mmlab/mmrazor/assets/41630003/caaac51f-e982-46db-8f68-6ce28f343183" alt="Image" width="1100" />
<img src="https://github.com/LZHgrla/xtuner/assets/36994684/ec67b610-a3b2-4fa7-91ad-a9a235fdb820" alt="Image" width="1100" />
</div>

We concatenate multi-turn dialogues, then input them into the model. The loss at each position is computed in parallel, but only the loss from the Output part participates in backpropagation. Therefore, the format of the multi-turn dialogue dataset in XTuner is shown as follows:
@@ -145,6 +157,7 @@ We concatenate multi-turn dialogues, then input them into the model. The loss at
[{
"conversation":[
{
"system": "You are an AI assistant.",
"input": "Hello?",
"output": "Hello! How can I help you?"
},
@@ -161,6 +174,7 @@ We concatenate multi-turn dialogues, then input them into the model. The loss at
{
"conversation":[
{
"system": "You are an AI assistant.",
"input": "Hello?",
"output": "Hello! How can I help you?"
},
6 changes: 6 additions & 0 deletions docs/en/user_guides/dataset_prepare.md
@@ -1,5 +1,11 @@
# Dataset Prepare

- [HuggingFace datasets](#huggingface-datasets)
- [Others](#others)
- [Arxiv Gentitle](#arxiv-gentitle)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)

## HuggingFace datasets

For datasets on HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), you can quickly utilize them. For more details, please refer to [single_turn_conversation.md](./single_turn_conversation.md) and [multi_turn_conversation.md](./multi_turn_conversation.md).
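
For a quick look at such a dataset before converting it, the standard `datasets` library can be used. The snippet below is generic Hugging Face usage (it requires `pip install datasets` and network access), not an XTuner command, and the listed column names are what the alpaca dataset is expected to provide.

```python
from datasets import load_dataset

# Pull the alpaca dataset from the Hub and inspect its columns and first record.
ds = load_dataset("tatsu-lab/alpaca", split="train")
print(ds.column_names)       # expected: 'instruction', 'input', 'output', 'text'
print(ds[0]["instruction"])
```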