- [2024/01] Supported INT4 inference on Intel GPUs including Intel Data Center GPU Max Series (e.g., PVC) and Intel Arc A-Series (e.g., ARC). Check out the examples and scripts.
- [2024/01] Demonstrated the Intel Hybrid Copilot in the CES 2024 Great Minds session "Bringing the Limitless Potential of AI Everywhere".
- [2023/12] Supported QLoRA on CPUs to make fine-tuning on client CPUs possible. Check out the blog and readme for more details.
- [2023/11] Released the top-ranked 7B-sized LLM NeuralChat-v3-1 and its DPO dataset. Check out the nice video published by WorldofAI.
- [2023/11] Published a 4-bit chatbot demo (based on NeuralChat) on the Intel Hugging Face Space. Feel free to give it a try! To set up the demo locally, please follow the instructions.
pip install intel-extension-for-transformers
For more installation methods, please refer to the Installation Page.
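To verify the installation, a minimal sanity check is to query the installed distribution version from Python. This sketch uses only the standard library, so it makes no assumptions about the package's internals:

```python
from importlib.metadata import version

# Confirm the package is installed and print its version.
print(version("intel-extension-for-transformers"))
```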
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM workloads everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPUs, and Intel GPUs. The toolkit provides the following key features and examples:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
- Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
- NeuralChat, a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins such as Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail. This framework supports Intel Gaudi2/CPU/GPU.
- Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels for Intel CPUs and Intel GPUs (TBD), supporting GPT-NEOX, LLAMA, MPT, FALCON, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B. Supports the AMX, VNNI, AVX512F, and AVX2 instruction sets. We have boosted performance on Intel CPUs, with a particular focus on the 4th-generation Intel Xeon Scalable processor, code-named Sapphire Rapids.
| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Data Center GPU Max Series | WIP | WIP | WIP (INT8) | ✔ (INT4) |
| Intel Arc A-Series | - | - | WIP (INT8) | ✔ (INT4) |
| Intel Core Processors | - | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
In the table above, "-" means not applicable or not started yet.
| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) |
| Intel® Extension for PyTorch | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu |
| Transformers | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |
| intel-level-zero-gpu | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 |
Please refer to the detailed requirements for CPU, Gaudi2, and Intel GPU.
Below is the sample code to create your chatbot. See more examples.
NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for the OpenAI APIs. You can start the NeuralChat server using either a shell command or Python code.
# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")
The NeuralChat service can be accessed through the OpenAI client library, curl commands, and the requests library. See more details in NeuralChat.
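For example, once the server is running it can be queried over HTTP with the requests library. This is a minimal sketch assuming the standard OpenAI chat-completions route; the host, port, and model name are placeholders that depend on the settings in `neuralchat.yaml`.

```python
import requests

# Placeholder endpoint and model name; adjust them to match your neuralchat.yaml.
url = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "Intel/neural-chat-7b-v3-1",
    "messages": [
        {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
    ],
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```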
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
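When called without arguments, `build_chatbot` uses its default model. If you want to choose the model yourself, a minimal sketch is shown below, assuming the `PipelineConfig` object described in the NeuralChat documentation; the model identifier is only an example.

```python
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

# Assumed PipelineConfig interface from the NeuralChat docs; the model
# identifier is an example and can be any supported Hugging Face model.
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```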
Below is the sample code to use the extended Transformers APIs. See more examples.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
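`generate` returns token IDs rather than text, so a decode step with the same tokenizer completes the example (standard Hugging Face API, no extension-specific assumptions):

```python
# Convert the generated token IDs back into readable text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```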
You can also load low-bit models quantized by the GPTQ/AWQ/RTN/AutoRound algorithms.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
# Download a Hugging Face GPTQ/AWQ model or use a locally quantized model
model_name = "PATH_TO_MODEL" # local path to model
woq_config = WeightOnlyQuantConfig(use_gptq=True) # use_awq=True for AWQ; use_autoround=True for AutoRound
prompt = "Once upon a time, a little girl"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
outputs = model.generate(inputs)
import torch  # needed for the dtype passed to ipex.optimize_transformers below
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer
device_map = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
device_map=device_map, load_in_4bit=True)
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)
output = model.generate(inputs)
Note: Please refer to the example and script for more details.
Below is the sample code to use the extended Langchain APIs. See more examples.
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
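Once the chain is assembled, querying it follows ordinary LangChain usage. A minimal sketch, assuming the Chroma vector store above has already been populated with your documents (the question is just an example):

```python
# Ask a question against the retrieval-augmented chain built above.
answer = retrievalQA.run("What is Intel Extension for Transformers?")
print(answer)
```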
You can access the validated models, accuracy, and performance data from the Release data or the Medium blog.
- OVERVIEW: NeuralChat | Neural Speed
- NEURALCHAT: Chatbot on Intel CPU | Chatbot on Intel GPU | Chatbot on Gaudi | Chatbot on Client | More Notebooks
- NEURAL SPEED: Neural Speed | Streaming LLM | Low Precision Kernels | Tensor Parallelism
- LLM COMPRESSION: SmoothQuant (INT8) | Weight-only Quantization (INT4/FP4/NF4/INT8) | QLoRA on CPU
- GENERAL COMPRESSION: Quantization | Pruning | Distillation | Orchestration | Neural Architecture Search | Export | Metrics | Objectives | Pipeline | Length Adaptive | Early Exit | Data Augmentation
- TUTORIALS & RESULTS: Tutorials | LLM List | General Model List | Model Performance
- LLM Infinite Inference (up to 4M tokens)
streamingLLM_v2.mp4
- LLM QLoRA on Client CPU
QLoRA.on.Core.i9-12900.mp4
- CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo (Jan 2024)
- Blog published on Medium: Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling (Dec 2023)
- NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing: Efficient LLM Inference on CPUs (Nov 2023)
- Blog published on Hugging Face: Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance (Nov 2023)
- Blog published on VMware: AI without GPUs: A Technical Brief for VMware Private AI with Intel (Nov 2023)
- Excellent open-source projects: bitsandbytes, FastChat, fastRAG, ggml, gptq, llama.cpp, lm-evaluation-harness, peft, trl, streamingllm, and many others.
- Thanks to all the contributors.
You are welcome to raise interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us, and we look forward to collaborating with you on Intel Extension for Transformers!