Update Windows GPU quickstart regarding demo (#12124)
* use Qwen2-1.5B-Instruct in demo

* update

* add reference link

* update

* update
ch1y0q authored Sep 29, 2024
1 parent 17c23cd commit 9b75806
Showing 1 changed file with 42 additions and 27 deletions.
69 changes: 42 additions & 27 deletions docs/mddocs/Quickstart/install_windows_gpu.md
@@ -123,21 +123,15 @@ To monitor your GPU's performance and status (e.g. memory consumption, utilizati

## A Quick Example

Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM for this demonstration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?".
Now let's play with a real LLM. We'll be using the [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) model, a 1.5 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".

- Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment.

- Step 2: Install additional package required for Qwen-1.8B-Chat to conduct:

```cmd
pip install tiktoken transformers_stream_generator einops
```

- Step 3: Create code file. IPEX-LLM supports loading model from Hugging Face or ModelScope. Please choose according to your requirements.
- Step 2: Create a code file. IPEX-LLM supports loading models from Hugging Face or ModelScope. Choose according to your requirements.

- For **loading model from Hugging Face**:

Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model with IPEX-LLM optimizations.
Create a new file named `demo.py` and insert the code snippet below to run [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) model with IPEX-LLM optimizations.

```python
# Copy/Paste the contents to a new file demo.py
@@ -147,24 +141,34 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
generation_config = GenerationConfig(use_cache=True)

print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
trust_remote_code=True)

# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True)
model = model.to('xpu')
print('Successfully loaded Tokenizer and optimized Model!')

# Format the prompt
# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://huggingface.co/Qwen/Qwen2-1.5B-Instruct#quickstart
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": question}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)

# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
input_ids = tokenizer.encode(text, return_tensors="pt").to('xpu')

print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
@@ -185,7 +189,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
output_str = tokenizer.decode(output[0], skip_special_tokens=False)
print(output_str)
```
- For **loading model from ModelScope**:
@@ -195,10 +199,9 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
pip install modelscope==1.11.0
```

Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary) model with IPEX-LLM optimizations.
Create a new file named `demo.py` and insert the code snippet below to run [Qwen2-1.5B-Instruct](https://www.modelscope.cn/models/qwen/Qwen2-1.5B-Instruct/summary) model with IPEX-LLM optimizations.

```python

# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
@@ -207,11 +210,11 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
generation_config = GenerationConfig(use_cache=True)

print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
trust_remote_code=True)

# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True,
@@ -220,13 +223,22 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
print('Successfully loaded Tokenizer and optimized Model!')

# Format the prompt
# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://huggingface.co/Qwen/Qwen2-1.5B-Instruct#quickstart
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)

messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": question}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)

# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

input_ids = tokenizer.encode(text, return_tensors="pt").to('xpu')
print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
@@ -246,7 +258,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
output_str = tokenizer.decode(output[0], skip_special_tokens=False)
print(output_str)
```
> **Note**:
@@ -257,7 +269,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
> When running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function.
> This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.

- Step 4. Run `demo.py` within the activated Python environment using the following command:
- Step 3. Run `demo.py` within the activated Python environment using the following command:

```cmd
python demo.py
@@ -267,9 +279,12 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg

Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU:
```
user: What is AI?
assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition,
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is AI?<|im_end|>
<|im_start|>assistant
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and act like humans. It involves the development of algorithms,
```
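
Because the updated demo decodes with `skip_special_tokens=False`, the printed string contains the whole chat-formatted prompt as well as the model's reply. The sketch below is only an illustration of that layout, not part of the commit: it rebuilds the prompt by hand in the format seen in the example output above, then pulls out just the assistant's text. The helper names `build_chatml_prompt` and `extract_assistant_reply` are hypothetical, and the marker layout is an assumption taken from that example output; in the real demo the prompt comes from `tokenizer.apply_chat_template`.

```python
# Illustrative sketch only: helper names are hypothetical, and the marker
# layout is assumed from the example output shown above.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render role/content messages in the layout seen in the example output."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def extract_assistant_reply(decoded):
    """Return the text after the last assistant marker, without end markers."""
    marker = f"{IM_START}assistant\n"
    _, found, reply = decoded.rpartition(marker)
    if not found:
        # No assistant marker present: fall back to the whole string.
        return decoded.strip()
    return reply.split(IM_END, 1)[0].strip()


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AI?"},
]
prompt = build_chatml_prompt(messages)

# Stand-in for the decoded model output: prompt followed by a reply.
decoded = prompt + "AI is a field of computer science." + IM_END
print(extract_assistant_reply(decoded))
```

A helper like `extract_assistant_reply` could be applied to `output_str` in `demo.py` if only the answer is wanted; alternatively, decoding with `skip_special_tokens=True` drops the markers but keeps the role names and prompt text.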

## Tips & Troubleshooting