Commit 9b11481: Merge remote-tracking branch 'upstream/main' (2023-08-30)

renning22 committed Aug 30, 2023
2 parents c043a1f + 2fbfcbc
Showing 65 changed files with 2,833 additions and 844 deletions.
10 changes: 7 additions & 3 deletions README.md
@@ -9,13 +9,17 @@ Thanks to LMSYS team❤️

## Coverage of models

We are currently focused on supporting Llama 2 at scale. If you want other models supported, please contact us.

* Llama-2-13b-chat-hf
* longchat-13b-1
* falcon-7b-instruct
* codet5p-6b


## Dev Log

### 2023-08

Support Llama 2 at scale.

### 2023-07-26

Support "Llama-2-13b-chat-hf" and make it the default for API.
13 changes: 9 additions & 4 deletions docs/arena.md
@@ -1,9 +1,14 @@
# Chatbot Arena
- Chatbot Arena is an LLM benchmark platform featuring anonymous, randomized battles, available at https://arena.lmsys.org.
+ Chatbot Arena is an LLM benchmark platform featuring anonymous, randomized battles, available at https://chat.lmsys.org.
We invite the entire community to join this benchmarking effort by contributing your votes and models.

## How to add a new model
- If you want to see a specific model in the arena, you can follow the steps below.
+ If you want to see a specific model in the arena, you can follow the methods below.

- 1. Contribute code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md#how-to-support-a-new-model).
- 2. After the model is supported, we will try to schedule some computing resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
+ - Method 1: Hosted by LMSYS.
+   1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md#how-to-support-a-new-model).
+   2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
+
+ - Method 2: Hosted by 3rd party API providers or yourself.
+   1. If you have a model hosted by a 3rd party API provider or yourself, please give us an API endpoint. We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/33dca5cf12ee602455bfa9b5f4790a07829a2db7/fastchat/serve/gradio_web_server.py#L333-L358) for calling OpenAI models.
+   2. You can use FastChat's OpenAI API [server](openai_api.md) to serve your model with OpenAI-compatible APIs and provide us with the endpoint.
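
A quick way to sanity-check such an endpoint before handing it over is to call it with the OpenAI client. This is a sketch using the openai-python 0.x API; the URL, key, and model name below are placeholders:

```python
import openai

openai.api_key = "EMPTY"  # or a real key, if your endpoint enforces one
openai.api_base = "https://your-endpoint.example.com/v1"  # hypothetical endpoint URL

# a successful round trip suggests the endpoint speaks the OpenAI chat API
resp = openai.ChatCompletion.create(
    model="your-model-name",  # hypothetical: whatever model your endpoint serves
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```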
71 changes: 71 additions & 0 deletions docs/awq.md
@@ -0,0 +1,71 @@
# AWQ 4bit Inference

We integrated [AWQ](https://github.com/mit-han-lab/llm-awq) into FastChat to provide **efficient and accurate** 4bit LLM inference.

## Install AWQ

Set up the environment (please refer to [this link](https://github.com/mit-han-lab/llm-awq#install) for more details):
```bash
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
# cd /path/to/FastChat
pip install --upgrade pip # enable PEP 660 support
pip install -e . # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e . # install awq package

cd awq/kernels
python setup.py install # install awq CUDA kernels
```

## Chat with the CLI

```bash
# Download quantized model from huggingface
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq

# You can specify which quantized model to use by setting --awq-ckpt
python3 -m fastchat.serve.cli \
--model-path models/vicuna-7b-v1.3-4bit-g128-awq \
--awq-wbits 4 \
--awq-groupsize 128
```

## Benchmark

* Through **4-bit weight quantization**, AWQ lets larger language models fit within device memory limits and markedly accelerates token generation (a rough arithmetic sanity check follows the tables below). All benchmarks are done with group_size 128.

* Benchmark on NVIDIA RTX A6000:

| Model | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
| --------------- | ---- | ---------------- | ---------------- | ----------- |
| vicuna-7b | 16 | 13543 | 26.06 | / |
| vicuna-7b | 4 | 5547 | 12.43 | 2.1x |
| llama2-7b-chat | 16 | 13543 | 27.14 | / |
| llama2-7b-chat | 4 | 5547 | 12.44 | 2.2x |
| vicuna-13b | 16 | 25647 | 44.91 | / |
| vicuna-13b | 4 | 9355 | 17.30 | 2.6x |
| llama2-13b-chat | 16 | 25647 | 47.28 | / |
| llama2-13b-chat | 4 | 9355 | 20.28 | 2.3x |

* NVIDIA RTX 4090:

| Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
| --------------- | ------------------------- | --------------------- | ----------- |
| vicuna-7b | 8.61 | 19.09 | 2.2x |
| llama2-7b-chat | 8.66 | 19.97 | 2.3x |
| vicuna-13b | 12.17 | OOM | / |
| llama2-13b-chat | 13.54 | OOM | / |

* NVIDIA Jetson Orin:

| Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
| --------------- | ------------------------- | --------------------- | ----------- |
| vicuna-7b | 65.34 | 93.12 | 1.4x |
| llama2-7b-chat | 75.11 | 104.71 | 1.4x |
| vicuna-13b | 115.40 | OOM | / |
| llama2-13b-chat | 136.81 | OOM | / |
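
As a rough arithmetic sanity check on the tables above (a sketch; it assumes memory is dominated by the weight tensors, while the measured values also include activations, KV cache, and other overhead):

```python
# back-of-the-envelope check of the A6000 numbers for vicuna-7b
params = 7e9  # ~7B parameters

fp16_weights_mib = params * 2 / 2**20   # 2 bytes per param -> ~13,351 MiB
awq_weights_mib = params * 0.5 / 2**20  # 4 bits per param  -> ~3,338 MiB

print(f"fp16 weights  ~{fp16_weights_mib:,.0f} MiB (measured max: 13,543 MiB)")
print(f"4-bit weights ~{awq_weights_mib:,.0f} MiB (measured max: 5,547 MiB)")

# the AWQ Speedup column is fp16 latency divided by AWQ latency:
print(f"vicuna-7b speedup: {26.06 / 12.43:.1f}x")  # ~2.1x, matching the table
```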
38 changes: 38 additions & 0 deletions docs/commands/conv_release.md
@@ -0,0 +1,38 @@
## Chatbot Arena Conversations

1. Gather battles
```
python3 clean_battle_data.py --max-num 10 --mode conv_release
```

2. Tag OpenAI moderation
```
python3 tag_openai_moderation.py --in clean_battle_conv_20230814.json
```
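
Roughly, this pass attaches OpenAI moderation results to every record. A minimal sketch of the idea, assuming the openai-python 0.x moderation endpoint; the field names here are illustrative, and the real flags and output format live in `tag_openai_moderation.py`:

```python
import json

import openai  # assumes openai-python 0.x

with open("clean_battle_conv_20230814.json") as f:
    records = json.load(f)

for record in records:
    # "conversation_a" is an illustrative field name for one side of the battle
    text = json.dumps(record.get("conversation_a", ""))
    result = openai.Moderation.create(input=text)
    record["openai_moderation"] = result["results"][0]  # flagged + category scores

with open("clean_battle_conv_20230814_tagged.json", "w") as f:
    json.dump(records, f)
```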

3. Clean PII

4. Filter additional blocked words

```
python3 filter_bad_conv.py --in clean_battle_conv_20230630_tagged_v1_pii.json
```

5. Add additional toxicity tag


## All Conversations

1. Gather chats
```
python3 clean_chat_data.py
```

2. Sample
```
python3 conv_release_scripts/sample.py
```


## Prompt distribution

10 changes: 6 additions & 4 deletions docs/commands/local_cluster.md
@@ -3,8 +3,10 @@ node-01
```
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 10002
- CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.3 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker-address http://$(hostname):31000
- CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.3 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker-address http://$(hostname):31001
+ CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.5 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker-address http://$(hostname):31000
+ CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-13b-v1.5 --model-name vicuna-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker-address http://$(hostname):31001
CUDA_VISIBLE_DEVICES=2,3 ray start --head
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-33b-v1.3 --model-name vicuna-33b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker-address http://$(hostname):31002 --num-gpus 2
```

@@ -13,7 +15,7 @@ node-02
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.vllm_worker --model-path meta-llama/Llama-2-13b-chat-hf --model-name llama-2-13b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker-address http://$(hostname):31000 --tokenizer meta-llama/Llama-2-7b-chat-hf
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.vllm_worker --model-path meta-llama/Llama-2-13b-chat-hf --model-name llama-2-13b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker-address http://$(hostname):31001 --tokenizer meta-llama/Llama-2-7b-chat-hf
CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.vllm_worker --model-path meta-llama/Llama-2-7b-chat-hf --model-name llama-2-7b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker-address http://$(hostname):31002 --tokenizer meta-llama/Llama-2-7b-chat-hf
- CUDA_VISIBLE_DEVICES=3 python3 -m fastchat.serve.vllm_worker --model-path TheBloke/wizardLM-13B-1.0-fp16 --model-name wizardlm-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31003 --worker-address http://$(hostname):31003
+ CUDA_VISIBLE_DEVICES=3 python3 -m fastchat.serve.vllm_worker --model-path WizardLM/WizardLM-13B-V1.1 --model-name wizardlm-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31003 --worker-address http://$(hostname):31003
```

node-03
@@ -26,7 +28,7 @@ node-04
```
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.multi_model_worker --model-path ~/model_weights/RWKV-4-Raven-14B-v12-Eng98%25-Other2%25-20230523-ctx8192.pth --model-name RWKV-4-Raven-14B --model-path lmsys/fastchat-t5-3b-v1.0 --model-name fastchat-t5-3b --controller http://node-01:10002 --host 0.0.0.0 --port 31000 --worker http://$(hostname):31000 --limit 4
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.multi_model_worker --model-path OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 --model-name oasst-pythia-12b --model-path mosaicml/mpt-7b-chat --model-name mpt-7b-chat --controller http://node-01:10002 --host 0.0.0.0 --port 31001 --worker http://$(hostname):31001 --limit 4
- CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.multi_model_worker --model-path lmsys/vicuna-7b-v1.3 --model-name vicuna-7b --model-path THUDM/chatglm-6b --model-name chatglm-6b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker http://$(hostname):31002 --limit 4
+ CUDA_VISIBLE_DEVICES=2 python3 -m fastchat.serve.multi_model_worker --model-path lmsys/vicuna-7b-v1.5 --model-name vicuna-7b --model-path THUDM/chatglm-6b --model-name chatglm-6b --controller http://node-01:10002 --host 0.0.0.0 --port 31002 --worker http://$(hostname):31002 --limit 4
CUDA_VISIBLE_DEVICES=3 python3 -m fastchat.serve.vllm_worker --model-path ~/model_weights/alpaca-13b --controller http://node-01:10002 --host 0.0.0.0 --port 31003 --worker-address http://$(hostname):31003
```

2 changes: 1 addition & 1 deletion docs/commands/webserver.md
@@ -27,7 +27,7 @@ cd fastchat_logs/server0
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
- python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results_20230619.pkl --leaderboard-table-file ~/elo_results/leaderboard_table_20230619.csv
+ python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results_20230802.pkl --leaderboard-table-file ~/elo_results/leaderboard_table_20230802.csv --register ~/elo_results/register_oai_models.json
python3 backup_logs.py
```
24 changes: 19 additions & 5 deletions docs/model_support.md
@@ -1,16 +1,26 @@
# Model Support

## Supported models

- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
-   - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf
+   - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
- Vicuna, Alpaca, LLaMA, Koala
  - example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.3`
- [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
- [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en#using-huggingface-transformers)
- [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
- [BlinkDL/RWKV-4-Raven](https://huggingface.co/BlinkDL/rwkv-4-raven)
  - example: `python3 -m fastchat.serve.cli --model-path ~/model_weights/RWKV-4-Raven-7B-v11x-Eng99%-Other1%-20230429-ctx8192.pth`
- [bofenghuang/vigogne-2-7b-instruct](https://huggingface.co/bofenghuang/vigogne-2-7b-instruct)
- [bofenghuang/vigogne-2-7b-chat](https://huggingface.co/bofenghuang/vigogne-2-7b-chat)
- [camel-ai/CAMEL-13B-Combined-Data](https://huggingface.co/camel-ai/CAMEL-13B-Combined-Data)
- [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf)
- [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
- [FlagAlpha/Llama2-Chinese-13b-Chat](https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat)
- [FreedomIntelligence/phoenix-inst-chat-7b](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b)
- [FreedomIntelligence/ReaLM-7b-v1](https://huggingface.co/FreedomIntelligence/Realm-7b)
- [h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b)
- [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
- [lcw99/polyglot-ko-12.8b-chang-instruct-chat](https://huggingface.co/lcw99/polyglot-ko-12.8b-chang-instruct-chat)
- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5)
- [mosaicml/mpt-7b-chat](https://huggingface.co/mosaicml/mpt-7b-chat)
@@ -20,7 +30,9 @@
- [NousResearch/Nous-Hermes-13b](https://huggingface.co/NousResearch/Nous-Hermes-13b)
- [openaccess-ai-collective/manticore-13b-chat-pyg](https://huggingface.co/openaccess-ai-collective/manticore-13b-chat-pyg)
- [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
- [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
- [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
- [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b)
- [StabilityAI/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
- [THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)
@@ -29,8 +41,7 @@
- [timdettmers/guanaco-33b-merged](https://huggingface.co/timdettmers/guanaco-33b-merged)
- [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)
- [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
- - [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
- - [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
+ - [WizardLM/WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)
- [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
- Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
- Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a
@@ -43,18 +54,21 @@

To support a new model in FastChat, you need to correctly handle its prompt template and model loading.
The goal is to make the following command run with the correct prompts.

```
python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
```

You can run this example command to learn the code logic.

```
python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.3
```

You can add `--debug` to see the actual prompt sent to the model.

### Steps

FastChat uses the `Conversation` class to handle prompt templates and the `BaseModelAdapter` class to handle model loading.

1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one, as sketched below.
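
For reference, registering a vicuna-style template might look like this. It is a sketch, not a drop-in: the field names follow the `Conversation` dataclass at this revision, so verify them against `fastchat/conversation.py`.

```python
from fastchat.conversation import (
    Conversation,
    SeparatorStyle,
    register_conv_template,
)

# "my-new-model" is a hypothetical template name for illustration
register_conv_template(
    Conversation(
        name="my-new-model",
        system_message="You are a helpful assistant.",
        roles=("USER", "ASSISTANT"),
        # vicuna-style separators: " " between turns, "</s>" after assistant replies
        sep_style=SeparatorStyle.ADD_COLON_TWO,
        sep=" ",
        sep2="</s>",
    )
)
```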
7 changes: 4 additions & 3 deletions docs/openai_api.md
@@ -1,4 +1,4 @@
- # OpenAI-Compatible RESTful APIs & SDK
+ # OpenAI-Compatible RESTful APIs

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs.
The FastChat server is compatible with both the [openai-python](https://github.com/openai/openai-python) library and cURL commands.
@@ -40,7 +40,9 @@ pip install --upgrade pip
Then, interact with model vicuna:
```python
import openai
- openai.api_key = "EMPTY" # Not support yet
+ # to get proper authentication, make sure to use a valid key that's listed in
+ # the --api-keys flag. if no flag value is provided, the `api_key` will be ignored.
+ openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

model = "vicuna-7b-v1.3"
@@ -146,5 +148,4 @@ Some features to be implemented:
- [ ] Support more parameters like `logprobs`, `logit_bias`, `user`, `presence_penalty` and `frequency_penalty`
- [ ] Model details (permissions, owner and create time)
- [ ] Edits API
- - [ ] Authentication and API key
- [ ] Rate Limitation Settings
2 changes: 0 additions & 2 deletions docs/training.md
@@ -87,5 +87,3 @@ deepspeed fastchat/train/train_lora_t5.py \
--deepspeed playground/deepspeed_config_s2.json

```


