Merge branch 'inference' into peft
goliaro authored Nov 6, 2023
2 parents 463c757 + b0fe522 commit 1c231ba
Showing 28 changed files with 264 additions and 212 deletions.
10 changes: 5 additions & 5 deletions .github/README.md
@@ -72,7 +72,7 @@ ff.init(
Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at [supported models](#supported-llms-and-ssms).
```python
# Specify the LLM
llm = ff.LLM("decapoda-research/llama-7b-hf")
llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms=[]
@@ -116,7 +116,7 @@ ff.init(
)

# Create the FlexFlow LLM
llm = ff.LLM("decapoda-research/llama-7b-hf")
llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Create the sampling configs
generation_config = ff.GenerationConfig(
@@ -152,7 +152,7 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
* `-ll:fsize`: size of device memory on each GPU in MB
* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. FlexFlow Serve keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
* `-llm-model`: the LLM model ID from HuggingFace (e.g. "decapoda-research/llama-7b-hf")
* `-llm-model`: the LLM model ID from HuggingFace (e.g. "meta-llama/Llama-2-7b-hf")
* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
* `-cache-folder`: the folder used to cache the model weights and tokenizer files downloaded from HuggingFace
* `-data-parallelism-degree`, `-tensor-parallelism-degree` and `-pipeline-parallelism-degree`: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 will be used. (A rough sizing sketch follows this list.)
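As a quick illustration of the two constraints above (this sketch is not part of the diff; the 7B parameter count and half-precision assumption are ours), one can estimate a lower bound for `-ll:zsize` and check the parallelism-degree product as follows:

```python
# Back-of-the-envelope sizing sketch (assumptions: LLaMA-7B, half-precision weights).
num_params = 7e9          # ~7 billion parameters
bytes_per_param = 2       # fp16 / bf16
weights_mb = num_params * bytes_per_param / (1024 ** 2)
print(f"parameter replica: ~{weights_mb:.0f} MB")  # ~13351 MB, so -ll:zsize 30000 is comfortable

# The product of the three parallelism degrees must equal the number of GPUs.
data_pd, tensor_pd, pipeline_pd = 1, 4, 1
num_gpus = 4
assert data_pd * tensor_pd * pipeline_pd == num_gpus
```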
@@ -162,7 +162,7 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-68M models for speculative inference.

```bash
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model meta-llama/Llama-2-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
```
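The `-prompt` argument in the command above points to a JSON file containing a list of prompt strings (the format the Python examples load with `json.load`); a minimal sketch for producing one, with a placeholder path:

```python
import json

# Hypothetical prompt file; the path is a placeholder.
prompts = ["Three tips for staying healthy are: "]
with open("/path/to/prompt.json", "w") as f:
    json.dump(prompts, f)
```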
</details>

@@ -193,7 +193,7 @@ Below is a list of models that we have explicitly tested and for which a SSM may

| Model | Model id on HuggingFace | Boost-tuned SSMs |
| :---- | :---- | :---- |
| LLaMA-7B | decapoda-research/llama-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
6 changes: 3 additions & 3 deletions .github/workflows/gpu-ci-skip.yml
@@ -15,7 +15,7 @@ on:
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
- "tests/inference_tests.sh"
- "tests/multi_gpu_tests.sh"
- "tests/training_tests.sh"
- "tests/python_interface_test.sh"
workflow_dispatch:

@@ -44,8 +44,8 @@ jobs:
steps:
- run: 'echo "No gpu-ci required"'

gpu-ci-flexflow:
name: Single Machine, Multiple GPUs Tests
training-tests:
name: Training Tests
runs-on: ubuntu-20.04
# if: ${{ github.event_name != 'pull_request' || github.base_ref != 'inference' }}
needs: inference-tests
15 changes: 8 additions & 7 deletions .github/workflows/gpu-ci.yml
@@ -15,7 +15,7 @@ on:
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
- "tests/inference_tests.sh"
- "tests/multi_gpu_tests.sh"
- "tests/training_tests.sh"
- "tests/python_interface_test.sh"
push:
branches:
@@ -34,7 +34,7 @@ on:
- ".github/workflows/gpu-ci.yml"
- "tests/cpp_gpu_tests.sh"
- "tests/inference_tests.sh"
- "tests/multi_gpu_tests.sh"
- "tests/training_tests.sh"
- "tests/python_interface_test.sh"
workflow_dispatch:

@@ -141,7 +141,8 @@ jobs:
run:
shell: bash -l {0} # required to use an activated conda environment
env:
CONDA: "3"
CONDA: "3"
HUGGINGFACE_TOKEN: ${{ secrets.HUGGINGFACE_TOKEN }}
needs: gpu-ci-concierge
container:
image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest
@@ -185,7 +186,7 @@ jobs:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
# GPT tokenizer test
./tests/gpt_tokenizer_test.sh
# ./tests/gpt_tokenizer_test.sh
# Inference tests
source ./build/set_python_envs.sh
@@ -209,8 +210,8 @@ jobs:
if: always()
run: sudo rm -rf ~/.cache

gpu-ci-flexflow:
name: Single Machine, Multiple GPUs Tests
training-tests:
name: Training Tests
runs-on: [self-hosted, gpu]
# skip this time-consuming test for PRs to the inference branch
# if: ${{ github.event_name != 'pull_request' || github.base_ref != 'inference' }}
@@ -266,5 +267,5 @@ jobs:
# C++ tests
./tests/cpp_gpu_tests.sh 4
# Python tests
./tests/multi_gpu_tests.sh 4
./tests/training_tests.sh 4
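The `HUGGINGFACE_TOKEN` secret added to this workflow is presumably needed because `meta-llama/Llama-2-7b-hf` is a gated repository. A minimal sketch of how a test script could authenticate before downloading weights, assuming the `huggingface_hub` package is installed:

```python
import os

from huggingface_hub import login

# Gated repos such as meta-llama/Llama-2-7b-hf reject anonymous downloads,
# so log in with the token exported by the CI workflow, if present.
token = os.environ.get("HUGGINGFACE_TOKEN")
if token:
    login(token=token)
```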
6 changes: 3 additions & 3 deletions .github/workflows/multinode-test.yml
@@ -78,7 +78,7 @@ jobs:
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_btl_vader_single_copy_mechanism=none
./tests/multi_gpu_tests.sh 2 2
./tests/training_tests.sh 2 2
multinode-gpu-test-ucx:
name: Multinode GPU Test with UCX
@@ -129,7 +129,7 @@ jobs:
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_btl_vader_single_copy_mechanism=none
./tests/multi_gpu_tests.sh 2 2
./tests/training_tests.sh 2 2
multinode-gpu-test-native-ucx:
name: Multinode GPU Test with native UCX
@@ -177,7 +177,7 @@ jobs:
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export OMPI_MCA_btl_vader_single_copy_mechanism=none
./tests/multi_gpu_tests.sh 2 2
./tests/training_tests.sh 2 2
notify-slack:
name: Notify Slack in case of failure
13 changes: 13 additions & 0 deletions CMakeLists.txt
@@ -32,6 +32,19 @@ if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
STRING "Choose the type of build." FORCE)
endif()

if(INSTALL_DIR)
message(STATUS "INSTALL_DIR: ${INSTALL_DIR}")
set(CMAKE_INSTALL_PREFIX ${INSTALL_DIR} CACHE PATH "Installation directory" FORCE)
else()
# Install DIR not set. Use default, unless a conda environment is active
if (DEFINED ENV{CONDA_PREFIX} AND NOT FF_BUILD_FROM_PYPI)
set(CONDA_PREFIX $ENV{CONDA_PREFIX})
# Set CMAKE_INSTALL_PREFIX to the Conda environment's installation path
set(CMAKE_INSTALL_PREFIX ${CONDA_PREFIX} CACHE PATH "Installation directory" FORCE)
message(STATUS "Active conda environment detected. Setting CMAKE_INSTALL_PREFIX: ${CMAKE_INSTALL_PREFIX}")
endif()
endif()

# do not disable assertions even if in release mode
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -UNDEBUG")

2 changes: 1 addition & 1 deletion INSTALL.md
@@ -97,7 +97,7 @@ source ./build/set_python_envs.sh
cd "$FF_HOME"
./python/flexflow_python examples/python/native/mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize <size of gpu buffer> -ll:zsize <size of zero buffer>
```
A script to run all the Python examples is available at `tests/multi_gpu_tests.sh`
A script to run all the Python examples is available at `tests/training_tests.sh`

### Run FlexFlow C++ examples

10 changes: 5 additions & 5 deletions SERVE.md
@@ -32,7 +32,7 @@ ff.init(
Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at [supported models](#supported-llms-and-ssms).
```python
# Specify the LLM
llm = ff.LLM("decapoda-research/llama-7b-hf")
llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms=[]
@@ -78,7 +78,7 @@ ff.init(
)

# Create the FlexFlow LLM
llm = ff.LLM("decapoda-research/llama-7b-hf")
llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Create the sampling configs
generation_config = ff.GenerationConfig(
@@ -116,7 +116,7 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
* `-ll:fsize`: size of device memory on each GPU in MB
* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. FlexFlow Serve keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
* `-llm-model`: the LLM model ID from HuggingFace (e.g. "decapoda-research/llama-7b-hf")
* `-llm-model`: the LLM model ID from HuggingFace (e.g. "meta-llama/Llama-2-7b-hf")
* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
* `-cache-folder`: the folder used to cache the model weights and tokenizer files downloaded from HuggingFace
* `-data-parallelism-degree`, `-tensor-parallelism-degree` and `-pipeline-parallelism-degree`: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 will be used.
@@ -126,7 +126,7 @@ A C++ example is available at [this folder](../inference/spec_infer/). After bui
For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-68M models for speculative inference.

```bash
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model meta-llama/Llama-2-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
```
</details>

@@ -157,7 +157,7 @@ Below is a list of models that we have explicitly tested and for which a SSM may

| Model | Model id on HuggingFace | Boost-tuned SSMs |
| :---- | :---- | :---- |
| LLaMA-7B | decapoda-research/llama-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
2 changes: 1 addition & 1 deletion conda/environment.yml
@@ -3,7 +3,7 @@ channels:
- defaults
- conda-forge
dependencies:
- python>=3.6
- python>=3.6,<3.12
- cffi>=1.11.0
- Pillow
- pybind11
2 changes: 1 addition & 1 deletion conda/flexflow.yml
@@ -3,7 +3,7 @@ channels:
- defaults
- conda-forge
dependencies:
- python>=3.6
- python>=3.6,<3.12
- cffi>=1.11.0
- Pillow
- pybind11
2 changes: 1 addition & 1 deletion config/config.inc
@@ -24,7 +24,7 @@ fi

#set installation dir
if [ -n "$INSTALL_DIR" ]; then
SET_INSTALL_DIR="-DCMAKE_INSTALL_PREFIX=${INSTALL_DIR}"
SET_INSTALL_DIR="-DINSTALL_DIR=${INSTALL_DIR}"
fi

if [ "$INFERENCE_TESTS" = "ON" ]; then
9 changes: 4 additions & 5 deletions include/flexflow/ffconst.h
@@ -195,11 +195,10 @@ enum OperatorType {
enum ModelType {
UNKNOWN = 3001,
LLAMA = 3002,
LLAMA2 = 3003,
OPT = 3004,
FALCON = 3005,
STARCODER = 3006,
MPT = 3007
OPT = 3003,
FALCON = 3004,
STARCODER = 3005,
MPT = 3006
};

enum PMParameter {
2 changes: 1 addition & 1 deletion inference/MODEL_WEIGHTS.md
@@ -2,7 +2,7 @@ To convert the weights of a HuggingFace LLM to SpecInfer's weight format, we fir

```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

for name, params in model.named_parameters():
11 changes: 2 additions & 9 deletions inference/incr_decoding/incr_decoding.cc
@@ -186,14 +186,7 @@ void FlexFlow::top_level_task(Task const *task,
auto architectures = model_config["architectures"];
for (auto const &str : architectures) {
if (str == "LlamaForCausalLM" || str == "LLaMAForCausalLM") {
std::string nameOrPath = model_config["_name_or_path"];
// TODO: support LLAMA-2 models not from Meta
bool llama2 = nameOrPath.find("meta-llama/Llama-2") == 0;
if (llama2) {
model_type = ModelType::LLAMA2;
} else {
model_type = ModelType::LLAMA;
}
model_type = ModelType::LLAMA;
break;
} else if (str == "OPTForCausalLM") {
model_type = ModelType::OPT;
@@ -229,7 +222,7 @@ void FlexFlow::top_level_task(Task const *task,
rm->register_output_filepath(file_paths.output_file_path);

FFModel model(ffconfig, ffconfig.cpu_offload);
if (model_type == ModelType::LLAMA || model_type == ModelType::LLAMA2) {
if (model_type == ModelType::LLAMA) {
LLAMA::create_llama_model(model,
config_filepath,
weights_filepath,
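The change above removes the special `LLAMA2` case: any `LlamaForCausalLM` / `LLaMAForCausalLM` architecture now maps to `ModelType::LLAMA`. For illustration only, a rough Python equivalent of the detection logic in the C++ snippet (the config path is a placeholder):

```python
import json

# Read the HuggingFace config.json of the model to serve (placeholder path).
with open("/path/to/model/config.json") as f:
    model_config = json.load(f)

model_type = "UNKNOWN"
for arch in model_config.get("architectures", []):
    if arch in ("LlamaForCausalLM", "LLaMAForCausalLM"):
        model_type = "LLAMA"   # no separate LLAMA2 value anymore
        break
    elif arch == "OPTForCausalLM":
        model_type = "OPT"
        break
```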
4 changes: 2 additions & 2 deletions inference/python/incr_decoding.py
@@ -43,7 +43,7 @@ def get_configs():
# required parameters
"num_gpus": 4,
"memory_per_gpu": 14000,
"zero_copy_memory_per_node": 30000,
"zero_copy_memory_per_node": 40000,
# optional parameters
"num_cpus": 4,
"legion_utility_processors": 4,
@@ -108,7 +108,7 @@ def main():
prompts = [s for s in json.load(open(configs.prompt))]
results = llm.generate(prompts)
else:
result = llm.generate("Here are some travel tips for Tokyo:\n")
result = llm.generate("Three tips for staying healthy are: ")


if __name__ == "__main__":
8 changes: 4 additions & 4 deletions inference/python/spec_infer.py
@@ -43,7 +43,7 @@ def get_configs():
# required parameters
"num_gpus": 4,
"memory_per_gpu": 14000,
"zero_copy_memory_per_node": 30000,
"zero_copy_memory_per_node": 40000,
# optional parameters
"num_cpus": 4,
"legion_utility_processors": 4,
@@ -60,15 +60,15 @@ }
}
llm_configs = {
# required llm arguments
"llm_model": "decapoda-research/llama-7b-hf",
"llm_model": "meta-llama/Llama-2-7b-hf",
# optional llm parameters
"cache_path": "",
"refresh_cache": False,
"full_precision": False,
"ssms": [
{
# required ssm parameter
"ssm_model": "JackFram/llama-160m-base",
"ssm_model": "JackFram/llama-160m",
# optional ssm parameters
"cache_path": "",
"refresh_cache": False,
@@ -154,7 +154,7 @@ def main():
prompts = [s for s in json.load(open(configs.prompt))]
results = llm.generate(prompts)
else:
result = llm.generate("Here are some travel tips for Tokyo:\n")
result = llm.generate("Three tips for staying healthy are: ")


if __name__ == "__main__":