Replace ipex with ipex-llm (intel#10554)
* fix ipex with ipex_llm

* fix ipex with ipex_llm

* update

* update

* update

* update

* update

* update

* update

* update
Zephyr596 authored Mar 28, 2024
1 parent 0a2e820 commit 52a2135
Showing 106 changed files with 127 additions and 122 deletions.
2 changes: 1 addition & 1 deletion docker/llm/README.md
@@ -62,7 +62,7 @@ After the container is booted, you could get into the container through `docker
docker exec -it my_container bash
```

To run inference with `IPEX-LLM` on CPU, you could refer to this [documentation](https://github.com/intel-analytics/IPEX/tree/main/python/llm#cpu-int4).
To run inference with `IPEX-LLM` on CPU, you could refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#cpu-int4).
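For orientation, a minimal sketch of what CPU INT4 inference looks like with the renamed package (the model path, prompt, and install command are placeholders/assumptions, not taken from this commit):

```python
# Minimal CPU INT4 inference sketch using the renamed ipex_llm package.
# Assumes something like `pip install --pre ipex-llm[all]`; model path and prompt are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

input_ids = tokenizer.encode("What is IPEX-LLM?", return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```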


#### Getting started with chat
2 changes: 1 addition & 1 deletion docker/llm/finetune/qlora/cpu/kubernetes/Chart.yaml
@@ -1,5 +1,5 @@
apiVersion: v2
name: ipex-fintune-service
name: ipex_llm-fintune-service
description: A Helm chart for IPEX-LLM Finetune Service on Kubernetes
type: application
version: 1.1.27
2 changes: 1 addition & 1 deletion docker/llm/serving/cpu/docker/README.md
@@ -30,7 +30,7 @@ sudo docker run -itd \

After the container is booted, you could get into the container through `docker exec`.

To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex/llm/serving).
To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
You can also set environment variables and start arguments when running a container so that serving starts right away. You may need to boot several containers: one controller container and at least one worker container are needed. The API server address (host and port) and the controller address are set in the controller container; in each worker container you need to set the same controller address as above, the model path on your machine, and the worker address.
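As an aside (not part of this commit), once the controller, worker, and API server containers are up, the service can typically be queried through FastChat's OpenAI-compatible REST API; the host, port, and model name in this sketch are assumptions:

```python
# Hypothetical client call; assumes an OpenAI-compatible API server on localhost:8000
# and a worker that registered the model as "vicuna-7b-v1.5-ipex-llm".
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "vicuna-7b-v1.5-ipex-llm",
        "messages": [{"role": "user", "content": "What is IPEX-LLM?"}],
        "max_tokens": 64,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```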

To start a controller container:
2 changes: 1 addition & 1 deletion docker/llm/serving/cpu/kubernetes/README.md
@@ -10,7 +10,7 @@ To deploy IPEX-LLM-serving cpu in Kubernetes environment, please use this image:

In this document, we will use `vicuna-7b-v1.5` as the deployment model.

After downloading the model, please change name from `vicuna-7b-v1.5` to `vicuna-7b-v1.5-ipex` to use `ipex-llm` as the backend. The `ipex-llm` backend will be used if model path contains `ipex-llm`. Otherwise, the original transformer-backend will be used.
After downloading the model, please change name from `vicuna-7b-v1.5` to `vicuna-7b-v1.5-ipex-llm` to use `ipex-llm` as the backend. The `ipex-llm` backend will be used if model path contains `ipex-llm`. Otherwise, the original transformer-backend will be used.

You can download the model from [here](https://huggingface.co/lmsys/vicuna-7b-v1.5).
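A small sketch of the download-and-rename step described above, assuming `huggingface_hub` is installed; the target directory is an assumption, the only requirement being that the path contains `ipex-llm`:

```python
# Download vicuna-7b-v1.5 into a directory whose name contains "ipex-llm",
# so the ipex-llm backend is selected; /models is an assumed mount point.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.5",
    local_dir="/models/vicuna-7b-v1.5-ipex-llm",
)
```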

@@ -102,7 +102,7 @@
# Batch tokenizing
prompt = args.prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(f'cpu:{local_rank}')
# ipex model needs a warmup, then inference time can be accurate
# ipex-llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict,
use_cache=True)
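The warmup comment above reflects a pattern shared by these examples: the first `generate` call pays one-time optimization and memory-allocation costs, so only a second, timed run is representative. A minimal sketch of that pattern, reusing `model`, `input_ids`, and `args` from the excerpt (the timing code is illustrative, not part of the diff):

```python
import time

# Warm up once, then time a second generation for a representative latency.
_ = model.generate(input_ids, max_new_tokens=args.n_predict, use_cache=True)  # warmup run

start = time.time()
output = model.generate(input_ids, max_new_tokens=args.n_predict, use_cache=True)
print(f"inference time: {time.time() - start:.2f} s")
```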
11 changes: 8 additions & 3 deletions python/llm/example/CPU/LangChain/README.md
@@ -1,8 +1,8 @@
## Langchain Examples

This folder contains examples showcasing how to use `langchain` with `ipex`.
This folder contains examples showcasing how to use `langchain` with `ipex-llm`.

### Install IPEX
### Install IPEX-LLM

Ensure `ipex-llm` is installed by following the [IPEX-LLM Installation Guide](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#install).

@@ -36,7 +36,7 @@ To run the example, execute the following command in the current directory:
```bash
python transformers_int4/rag.py -m <path_to_model> [-q <your_question>] [-i <path_to_input_txt>]
```
> Note: If `-i` is not specified, it will use a short introduction to Big-DL as input by default. If `-q` is not specified, `What is IPEX?` will be used by default.
> Note: If `-i` is not specified, it will use a short introduction to Big-DL as input by default. If `-q` is not specified, `What is IPEX LLM?` will be used by default.
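For context, a hedged sketch of how these LangChain examples typically construct the LLM with the renamed package, assuming the integration keeps the former BigDL-LLM surface (`TransformersLLM.from_model_id`); the model path and keyword arguments are placeholders:

```python
# Assumed API: ipex_llm's LangChain wrapper mirroring the former bigdl-llm one.
from ipex_llm.langchain.llms import TransformersLLM

llm = TransformersLLM.from_model_id(
    model_id="<path_to_model>",                # placeholder
    model_kwargs={"trust_remote_code": True},  # placeholder kwargs
)
print(llm("What is IPEX LLM?"))
```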

### Example: Math
@@ -66,3 +66,8 @@ python transformers_int4/voiceassistant.py -m <path_to_model> [-q <your_question
- `-x MAX_NEW_TOKENS`: the max new tokens of model tokens input
- `-l LANGUAGE`: you can specify a language such as "english" or "chinese"
- `-d True|False`: whether the model path specified in -m is saved low bit model.

### Legacy (Native INT4 examples)

IPEX-LLM also provides langchain integrations using native INT4 mode. Those examples can be found in the [native_int4](./native_int4/) folder. For detailed instructions on setting up and running the `native_int4` examples, refer to the [Native INT4 Examples README](./README_nativeint4.md).

@@ -54,7 +54,7 @@
with torch.inference_mode():
prompt = MIXTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('cpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex-llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -28,7 +28,7 @@ Example usage:
python ./alpaca_qlora_finetuning_cpu.py \
--base_model "meta-llama/Llama-2-7b-hf" \
--data_path "yahma/alpaca-cleaned" \
--output_dir "./ipex-qlora-alpaca"
--output_dir "./ipex-llm-qlora-alpaca"
```

**Note**: You could also specify `--base_model` to the local path of the huggingface model checkpoint folder and `--data_path` to the local path of the dataset JSON file.
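As a follow-up sketch (not part of this commit), assuming the run leaves a standard PEFT adapter in the `--output_dir` shown above, the fine-tuned weights could later be attached to the base model roughly like this:

```python
# Hypothetical post-finetuning step: attach the QLoRA adapter from
# ./ipex-llm-qlora-alpaca to the base model using PEFT.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./ipex-llm-qlora-alpaca")
model = model.merge_and_unload()  # optionally fold the LoRA weights into the base model
```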
@@ -109,7 +109,7 @@ def generate_and_tokenize_prompt(data_point):
python ./quotes_qlora_finetuning_cpu.py \
--base_model "meta-llama/Llama-2-7b-hf" \
--data_path "./english_quotes" \
--output_dir "./ipex-qlora-alpaca" \
--output_dir "./ipex-llm-qlora-alpaca" \
--prompt_template_name "english_quotes"
```

@@ -14,5 +14,5 @@ mpirun -n 2 \
--max_steps -1 \
--base_model "meta-llama/Llama-2-7b-hf" \
--data_path "yahma/alpaca-cleaned" \
--output_dir "./ipex-qlora-alpaca"
--output_dir "./ipex-llm-qlora-alpaca"

@@ -109,7 +109,7 @@ def get_int_from_env(env_keys, default):
with torch.inference_mode():
prompt = args.prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(f'xpu:{local_rank}')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict,
use_cache=True)
@@ -64,7 +64,7 @@
with torch.inference_mode():
prompt = PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)
st = time.time()
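The same warmup caveat applies on Intel GPUs (the `xpu` device). A minimal sketch of the warmup-then-time pattern these XPU examples follow, reusing `model`, `input_ids`, and `args` from the excerpt and assuming `torch.xpu` is available via Intel Extension for PyTorch:

```python
import time
import torch

with torch.inference_mode():
    _ = model.generate(input_ids, max_new_tokens=args.n_predict)  # warmup run
    torch.xpu.synchronize()  # wait for queued XPU kernels before timing

    st = time.time()
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    torch.xpu.synchronize()
    print(f"inference time: {time.time() - st:.2f} s")
```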
@@ -55,7 +55,7 @@
with torch.inference_mode():
prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -61,7 +61,7 @@
with torch.inference_mode():
prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -55,7 +55,7 @@
with torch.inference_mode():
prompt = BLUELM_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@
with torch.inference_mode():
prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -54,7 +54,7 @@
with torch.inference_mode():
prompt = args.question
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=32)

@@ -58,7 +58,7 @@
with torch.inference_mode():
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -54,7 +54,7 @@
with torch.inference_mode():
prompt = args.question
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=32)

@@ -74,7 +74,7 @@ def get_prompt(message: str, chat_history: list[tuple[str, str]],
with torch.inference_mode():
prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@
prompt = CODELLAMA_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@
prompt = FALCON_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -60,7 +60,7 @@
with torch.inference_mode():
prompt = FLAN_T5_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -59,7 +59,7 @@
chat[0]['content'] = args.prompt
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@
prompt = GptJ_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@
with torch.inference_mode():
prompt = INTERNLM_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -62,7 +62,7 @@
with torch.inference_mode():
prompt = INTERNLM_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -70,7 +70,7 @@ def get_prompt(message: str, chat_history: list[tuple[str, str]],
with torch.inference_mode():
prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -56,7 +56,7 @@
with torch.inference_mode():
prompt = MISTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -56,7 +56,7 @@
with torch.inference_mode():
prompt = MIXTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@
with torch.inference_mode():
prompt = MPT_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -59,7 +59,7 @@
prompt = PHI1_5_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict,
generation_config = generation_config)
@@ -60,7 +60,7 @@
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

model.generation_config.pad_token_id = model.generation_config.eos_token_id
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict,
generation_config = generation_config)
@@ -61,7 +61,7 @@
prompt = PHI1_5_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict,
generation_config = generation_config)
@@ -64,7 +64,7 @@
with torch.inference_mode():
prompt = QWEN_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -56,7 +56,7 @@
prompt = RedPajama_PROMPT_FORMAT.format(prompt=args.prompt)
inputs = tokenizer(prompt, return_tensors='pt').to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(**inputs,
max_new_tokens=args.n_predict,
do_sample=True,
@@ -57,7 +57,7 @@
prompt = REPLIT_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -70,7 +70,7 @@ def generate_prompt(instruction):
with torch.inference_mode():
prompt = generate_prompt(instruction=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -67,7 +67,7 @@ def generate_prompt(instruction):
with torch.inference_mode():
prompt = generate_prompt(instruction=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@
with torch.inference_mode():
prompt = SOLAR_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@
prompt = StarCoder_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

# ipex model needs a warmup, then inference time can be accurate
# ipex_llm model needs a warmup, then inference time can be accurate
output = model.generate(input_ids,
max_new_tokens=args.n_predict)
