Commit 9b11481: Merge remote-tracking branch 'upstream/main' (2023-08-30)

renning22 committed Aug 30, 2023
2 parents c043a1f + 2fbfcbc
Showing 65 changed files with 2,833 additions and 844 deletions.
10 changes: 7 additions & 3 deletions README.md
@@ -9,13 +9,17 @@ Thanks to LMSYS team❤️

## Coverage of models

We are currently focused on supporting Llama 2 at scale. If you want other models supported, please contact us.

* Llama-2-13b-chat-hf
* longchat-13b-1
* falcon-7b-instruct
* codet5p-6b


## Dev Log

### 2023-08

Support Llama 2 at scale.

### 2023-07-26

Support "Llama-2-13b-chat-hf" and make it the default for API.
13 changes: 9 additions & 4 deletions docs/arena.md
@@ -1,9 +1,14 @@
# Chatbot Arena
- Chatbot Arena is an LLM benchmark platform featuring anonymous, randomized battles, available at https://arena.lmsys.org.
+ Chatbot Arena is an LLM benchmark platform featuring anonymous, randomized battles, available at https://chat.lmsys.org.
We invite the entire community to join this benchmarking effort by contributing your votes and models.

## How to add a new model
- If you want to see a specific model in the arena, you can follow the steps below.
+ If you want to see a specific model in the arena, you can follow the methods below.

- 1. Contribute code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md#how-to-support-a-new-model).
- 2. After the model is supported, we will try to schedule some computing resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
+ - Method 1: Hosted by LMSYS.
+   1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md#how-to-support-a-new-model).
+   2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
+
+ - Method 2: Hosted by 3rd party API providers or yourself.
+   1. If you have a model hosted by a 3rd party API provider or yourself, please give us an API endpoint. We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/33dca5cf12ee602455bfa9b5f4790a07829a2db7/fastchat/serve/gradio_web_server.py#L333-L358) for calling OpenAI models.
+   2. You can use FastChat's OpenAI API [server](openai_api.md) to serve your model with OpenAI-compatible APIs and provide us with the endpoint.
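
A quick way to sanity-check such an endpoint before handing it over is to call it with the OpenAI client. This is a sketch using the openai-python 0.x API; the URL, key, and model name below are placeholders:

```python
import openai

openai.api_key = "EMPTY"  # or a real key, if your endpoint enforces one
openai.api_base = "https://your-endpoint.example.com/v1"  # hypothetical endpoint URL

# a successful round trip suggests the endpoint speaks the OpenAI chat API
resp = openai.ChatCompletion.create(
    model="your-model-name",  # hypothetical: whatever model your endpoint serves
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```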
71 changes: 71 additions & 0 deletions docs/awq.md
@@ -0,0 +1,71 @@
# AWQ 4bit Inference

We integrated [AWQ](https://github.com/mit-han-lab/llm-awq) into FastChat to provide **efficient and accurate** 4bit LLM inference.

## Install AWQ

Set up the environment (please refer to [this link](https://github.com/mit-han-lab/llm-awq#install) for more details):
```bash
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
# cd /path/to/FastChat
pip install --upgrade pip # enable PEP 660 support
pip install -e . # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e . # install awq package

cd awq/kernels
python setup.py install # install awq CUDA kernels
```

## Chat with the CLI

```bash
# Download quantized model from huggingface
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq

# You can specify which quantized model to use by setting --awq-ckpt
python3 -m fastchat.serve.cli \
--model-path models/vicuna-7b-v1.3-4bit-g128-awq \
--awq-wbits 4 \
--awq-groupsize 128
```

## Benchmark

* Through **4-bit weight quantization**, AWQ lets larger language models fit within device memory limits and markedly accelerates token generation (a rough arithmetic sanity check follows the tables below). All benchmarks are done with group_size 128.

* Benchmark on NVIDIA RTX A6000:

| Model | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
| --------------- | ---- | ---------------- | ---------------- | ----------- |
| vicuna-7b | 16 | 13543 | 26.06 | / |
| vicuna-7b | 4 | 5547 | 12.43 | 2.1x |
| llama2-7b-chat | 16 | 13543 | 27.14 | / |
| llama2-7b-chat | 4 | 5547 | 12.44 | 2.2x |
| vicuna-13b | 16 | 25647 | 44.91 | / |
| vicuna-13b | 4 | 9355 | 17.30 | 2.6x |
| llama2-13b-chat | 16 | 25647 | 47.28 | / |
| llama2-13b-chat | 4 | 9355 | 20.28 | 2.3x |

* NVIDIA RTX 4090:

| Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
| --------------- | ------------------------- | --------------------- | ----------- |
| vicuna-7b | 8.61 | 19.09 | 2.2x |
| llama2-7b-chat | 8.66 | 19.97 | 2.3x |
| vicuna-13b | 12.17 | OOM | / |
| llama2-13b-chat | 13.54 | OOM | / |

* NVIDIA Jetson Orin:

| Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
| --------------- | ------------------------- | --------------------- | ----------- |
| vicuna-7b | 65.34 | 93.12 | 1.4x |
| llama2-7b-chat | 75.11 | 104.71 | 1.4x |
| vicuna-13b | 115.40 | OOM | / |
| llama2-13b-chat | 136.81 | OOM | / |
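
As a rough arithmetic sanity check on the tables above (a sketch; it assumes memory is dominated by the weight tensors, while the measured values also include activations, KV cache, and other overhead):

```python
# back-of-the-envelope check of the A6000 numbers for vicuna-7b
params = 7e9  # ~7B parameters

fp16_weights_mib = params * 2 / 2**20   # 2 bytes per param -> ~13,351 MiB
awq_weights_mib = params * 0.5 / 2**20  # 4 bits per param  -> ~3,338 MiB

print(f"fp16 weights  ~{fp16_weights_mib:,.0f} MiB (measured max: 13,543 MiB)")
print(f"4-bit weights ~{awq_weights_mib:,.0f} MiB (measured max: 5,547 MiB)")

# the AWQ Speedup column is fp16 latency divided by AWQ latency:
print(f"vicuna-7b speedup: {26.06 / 12.43:.1f}x")  # ~2.1x, matching the table
```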
38 changes: 38 additions & 0 deletions docs/commands/conv_release.md
@@ -0,0 +1,38 @@
## Chatbot Arena Conversations

1. Gather battles
```
python3 clean_battle_data.py --max-num 10 --mode conv_release
```

2. Tag OpenAI moderation
```
python3 tag_openai_moderation.py --in clean_battle_conv_20230814.json
```
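
Roughly, this pass attaches OpenAI moderation results to every record. A minimal sketch of the idea, assuming the openai-python 0.x moderation endpoint; the field names here are illustrative, and the real flags and output format live in `tag_openai_moderation.py`:

```python
import json

import openai  # assumes openai-python 0.x

with open("clean_battle_conv_20230814.json") as f:
    records = json.load(f)

for record in records:
    # "conversation_a" is an illustrative field name for one side of the battle
    text = json.dumps(record.get("conversation_a", ""))
    result = openai.Moderation.create(input=text)
    record["openai_moderation"] = result["results"][0]  # flagged + category scores

with open("clean_battle_conv_20230814_tagged.json", "w") as f:
    json.dump(records, f)
```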

3. Clean PII

4. Filter additional blocked words

```
python3 filter_bad_conv.py --in clean_battle_conv_20230630_tagged_v1_pii.json
```

5. Add additional toxicity tag


## All Conversations

1. Gather chats
```
python3 clean_chat_data.py
```

2. Sample
```
python3 conv_release_scripts/sample.py
```


## Prompt distribution

10 changes: 6 additions & 4 deletions docs/commands/local_cluster.md
@@ -3,8 +3,10 @@ node-01
```
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 10002
- CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.3 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker-address http://$(hostname):31000
- CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.3 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker-address http://$(hostname):31001
+ CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.5 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker-address http://$(hostname):31000
+ CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.5 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker-address http://$(hostname):31001
CUDA_VISIBLE_DEVICES=2,3 ray start --head
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-33b-v1.3 --model-name vicuna-33b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker-address http://$(hostname):31002 --num-gpus 2
```

@@ -13,7 +15,7 @@ node-02
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.vllm_worker --model-path meta-llama/Llama-2-13b-chat-hf --model-name llama-2-13b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker-address http://$(hostname):31000 --tokenizer meta-llama/Llama-2-7b-chat-hf
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.vllm_worker --model-path meta-llama/Llama-2-13b-chat-hf --model-name llama-2-13b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker-address http://$(hostname):31001 --tokenizer meta-llama/Llama-2-7b-chat-hf
CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.vllm_worker --model-path meta-llama/Llama-2-7b-chat-hf --model-name llama-2-7b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker-address http://$(hostname):31002 --tokenizer meta-llama/Llama-2-7b-chat-hf
- CUDA_VISIBLE_DEVICES=3 python3 -m fastchat.serve.vllm_worker --model-path TheBloke/wizardLM-13B-1.0-fp16 --model-name wizardlm-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31003 --worker-address http://$(hostname):31003
+ CUDA_VISIBLE_DEVICES=3 python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-13B-V1.1 --model-name wizardlm-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31003 --worker-address http://$(hostname):31003
```

node-03
@@ -26,7 +28,7 @@ node-04
```
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.multi_model_worker --model-path ~/model_weights/RWKV-4-Raven-14B-v12-Eng98%25-Other2%25-20230523-ctx8192.pth --model-name RWKV-4-Raven-14B --model-path lmsys/fastchat-t5-3b-v1.0 --model-name fastchat-t5-3b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker http://$(hostname):31000 --limit 4
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.multi_model_worker --model-path OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 --model-name oasst-pythia-12b --model-path mosaicml/mpt-7b-chat --model-name mpt-7b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker http://$(hostname):31001 --limit 4
- CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.multi_model_worker --model-path lmsys/vicuna-7b-v1.3 --model-name vicuna-7b --model-path THUDM/chatglm-6b --model-name chatglm-6b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker http://$(hostname):31002 --limit 4
+ CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.multi_model_worker --model-path lmsys/vicuna-7b-v1.5 --model-name vicuna-7b --model-path THUDM/chatglm-6b --model-name chatglm-6b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker http://$(hostname):31002 --limit 4
CUDA_VISIBLE_DEVICES=3 python3 -m fastchat.serve.vllm_worker --model-path ~/model_weights/alpaca-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31003 --worker-address http://$(hostname):31003
```

2 changes: 1 addition & 1 deletion docs/commands/webserver.md
@@ -27,7 +27,7 @@ cd fastchat_logs/server0
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
- python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results_20230619.pkl --leaderboard-table-file ~/elo_results/leaderboard_table_20230619.csv
+ python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results_20230802.pkl --leaderboard-table-file ~/elo_results/leaderboard_table_20230802.csv --register ~/elo_results/register_oai_models.json
python3 backup_logs.py
```
24 changes: 19 additions & 5 deletions docs/model_support.md
@@ -1,16 +1,26 @@
# Model Support

## Supported models

- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
-   - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf
+   - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
- Vicuna, Alpaca, LLaMA, Koala
  - example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.3`
- [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
- [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en#using-huggingface-transformers)
- [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
- [BlinkDL/RWKV-4-Raven](https://huggingface.co/BlinkDL/rwkv-4-raven)
  - example: `python3 -m fastchat.serve.cli --model-path ~/model_weights/RWKV-4-Raven-7B-v11x-Eng99%-Other1%-20230429-ctx8192.pth`
- [bofenghuang/vigogne-2-7b-instruct](https://huggingface.co/bofenghuang/vigogne-2-7b-instruct)
- [bofenghuang/vigogne-2-7b-chat](https://huggingface.co/bofenghuang/vigogne-2-7b-chat)
- [camel-ai/CAMEL-13B-Combined-Data](https://huggingface.co/camel-ai/CAMEL-13B-Combined-Data)
- [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf)
- [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
- [FlagAlpha/Llama2-Chinese-13b-Chat](https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat)
- [FreedomIntelligence/phoenix-inst-chat-7b](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b)
- [FreedomIntelligence/ReaLM-7b-v1](https://huggingface.co/FreedomIntelligence/Realm-7b)
- [h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b)
- [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
- [lcw99/polyglot-ko-12.8b-chang-instruct-chat](https://huggingface.co/lcw99/polyglot-ko-12.8b-chang-instruct-chat)
- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5)
- [mosaicml/mpt-7b-chat](https://huggingface.co/mosaicml/mpt-7b-chat)
@@ -20,7 +30,9 @@
- [NousResearch/Nous-Hermes-13b](https://huggingface.co/NousResearch/Nous-Hermes-13b)
- [openaccess-ai-collective/manticore-13b-chat-pyg](https://huggingface.co/openaccess-ai-collective/manticore-13b-chat-pyg)
- [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
- [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
- [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
- [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b)
- [StabilityAI/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
- [THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)
@@ -29,8 +41,7 @@
- [timdettmers/guanaco-33b-merged](https://huggingface.co/timdettmers/guanaco-33b-merged)
- [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)
- [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
- - [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
- - [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
+ - [WizardLM/WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)
- [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
- Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
- Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a
@@ -43,18 +54,21 @@

To support a new model in FastChat, you need to correctly handle its prompt template and model loading.
The goal is to make the following command run with the correct prompts.

```
python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
```

You can run this example command to learn the code logic.

```
python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.3
```

You can add `--debug` to see the actual prompt sent to the model.

### Steps

FastChat uses the `Conversation` class to handle prompt templates and the `BaseModelAdapter` class to handle model loading.

1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one, as sketched below.
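
For reference, registering a vicuna-style template might look like this. It is a sketch, not a drop-in: the field names follow the `Conversation` dataclass at this revision, so verify them against `fastchat/conversation.py`.

```python
from fastchat.conversation import (
    Conversation,
    SeparatorStyle,
    register_conv_template,
)

# "my-new-model" is a hypothetical template name for illustration
register_conv_template(
    Conversation(
        name="my-new-model",
        system_message="You are a helpful assistant.",
        roles=("USER", "ASSISTANT"),
        # vicuna-style separators: " " between turns, "</s>" after assistant replies
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep=" ",
        sep2="</s>",
    )
)
```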
7 changes: 4 additions & 3 deletions docs/openai_api.md
@@ -1,4 +1,4 @@
- # OpenAI-Compatible RESTful APIs & SDK
+ # OpenAI-Compatible RESTful APIs

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs.
The FastChat server is compatible with both the [openai-python](https://github.com/openai/openai-python) library and cURL commands.
@@ -40,7 +40,9 @@ pip install --upgrade pip
Then, interact with model vicuna:
```python
import openai
- openai.api_key = "EMPTY" # Not support yet
+ # to get proper authentication, make sure to use a valid key that's listed in
+ # the --api-keys flag. if no flag value is provided, the `api_key` will be ignored.
+ openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

model = "vicuna-7b-v1.3"
@@ -146,5 +148,4 @@ Some features to be implemented:
- [ ] Support more parameters like `logprobs`, `logit_bias`, `user`, `presence_penalty` and `frequency_penalty`
- [ ] Model details (permissions, owner and create time)
- [ ] Edits API
- - [ ] Authentication and API key
- [ ] Rate Limitation Settings
2 changes: 0 additions & 2 deletions docs/training.md
@@ -87,5 +87,3 @@ deepspeed fastchat/train/train_lora_t5.py \
--deepspeed playground/deepspeed_config_s2.json

```


