NeuralChat is a powerful and flexible open framework that empowers you to effortlessly create LLM-centric AI applications, including chatbots and copilots.
- Support a range of hardware, including Intel Xeon Scalable processors, Intel Gaudi AI processors, Intel® Data Center GPU Max Series, and NVIDIA GPUs
- Leverage the leading AI frameworks (e.g., PyTorch) and popular domain libraries (e.g., Hugging Face, LangChain) with their extensions
- Support model customization through parameter-efficient fine-tuning, quantization, and sparsity. The released Intel NeuralChat-7B LLM ranked #1 on the Hugging Face Open LLM Leaderboard in November 2023
- Provide a rich set of plugins that augment AI applications through retrieval-augmented generation (RAG) (e.g., fastRAG), content moderation, query caching, and more
- Integrate with popular serving frameworks (e.g., vLLM, TGI, Triton). Support an OpenAI-compatible API to simplify the creation or migration of AI applications
NeuralChat is under active development. APIs are subject to change.
NeuralChat is part of Intel Extension for Transformers, so install Intel Extension for Transformers first by following its installation guide. After that, install the additional dependencies for NeuralChat that match your device:
# For CPU device
pip install -r requirements_cpu.txt
# For HPU device
pip install -r requirements_hpu.txt
# For XPU device
pip install -r requirements_xpu.txt
# For CUDA device
pip install -r requirements.txt
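When the setup is scripted, the device-to-file mapping above can be captured in a small helper. This is a minimal sketch: the names `REQUIREMENTS_BY_DEVICE`, `requirements_for`, and `pip_command` and the lowercase device keys are illustrative assumptions, not part of NeuralChat. Only the four file names come from the commands above.

```python
# Hypothetical helper: maps a device type to the requirements file listed above.
REQUIREMENTS_BY_DEVICE = {
    "cpu": "requirements_cpu.txt",
    "hpu": "requirements_hpu.txt",
    "xpu": "requirements_xpu.txt",
    "cuda": "requirements.txt",
}


def requirements_for(device: str) -> str:
    """Return the requirements file name for a device, case-insensitively."""
    key = device.strip().lower()
    if key not in REQUIREMENTS_BY_DEVICE:
        expected = sorted(REQUIREMENTS_BY_DEVICE)
        raise ValueError(f"Unknown device {device!r}; expected one of {expected}")
    return REQUIREMENTS_BY_DEVICE[key]


def pip_command(device: str) -> list:
    """Build the pip invocation as an argument list suitable for subprocess.run."""
    return ["pip", "install", "-r", requirements_for(device)]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(pip_command("cpu")))
```

Passing the list returned by `pip_command` to `subprocess.run(..., check=True)` would perform the installation from the directory that contains the requirements files.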
NeuralChat provides OpenAI-compatible RESTful APIs for LLM inference, so you can use NeuralChat as a drop-in replacement for OpenAI APIs. The NeuralChat service is also accessible through the OpenAI client library, curl commands, and the requests library. See neuralchat_api.md.
NeuralChat launches a chatbot service using Intel/neural-chat-7b-v3-1 by default. You can customize the chatbot service by configuring the YAML file.
You can start the NeuralChat server either using the shell command or Python code.
Using Shell Command:
neuralchat_server start --config_file ./server/config/neuralchat.yaml
Using Python Code:
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")
Once the service is running, it exposes an OpenAI-compatible endpoint, /v1/chat/completions. You can use any of the ways below to access the endpoint.
Using OpenAI Client Library:
from openai import OpenAI
# Replace 'your_api_key' with your actual OpenAI API key
api_key = 'your_api_key'
# base_url is the API root; the client appends /chat/completions itself
backend_url = 'http://127.0.0.1:80/v1'
client = OpenAI(api_key=api_key, base_url=backend_url)
response = client.chat.completions.create(
    model="Intel/neural-chat-7b-v3-1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."},
    ],
)
print(response)
Using curl:
curl http://127.0.0.1:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Intel/neural-chat-7b-v3-1",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
    ]
  }'
Using Python Requests Library:
import requests
url = 'http://127.0.0.1:80/v1/chat/completions'
# Build the body as a dict; json= serializes it and sets the Content-Type header
data = {
    "model": "Intel/neural-chat-7b-v3-1",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."},
    ],
}
response = requests.post(url, json=data)
print(response.json())
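Because the endpoint is described as OpenAI-compatible, the JSON it returns should follow the OpenAI chat-completions shape, where the assistant text sits at `choices[0].message.content`. The sketch below relies on that assumption. The helper names `build_chat_payload` and `extract_reply` are hypothetical, and the sample response is hand-written for illustration rather than captured from a server.

```python
import json


def build_chat_payload(user_prompt,
                       system_prompt="You are a helpful assistant.",
                       model="Intel/neural-chat-7b-v3-1"):
    """Build the JSON body used by the requests in this section."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }


def extract_reply(response_json):
    """Return the assistant text, assuming the OpenAI chat-completions schema."""
    choices = response_json.get("choices") or []
    if not choices:
        preview = json.dumps(response_json)[:200]
        raise ValueError(f"No choices in response: {preview}")
    return choices[0]["message"]["content"]


if __name__ == "__main__":
    payload = build_chat_payload("Tell me about Intel Xeon Scalable Processors.")
    print(json.dumps(payload, indent=2))
    # Hand-written sample in the OpenAI response shape, for illustration only.
    sample = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
    print(extract_reply(sample))
```

With a running server, `extract_reply(requests.post(url, json=build_chat_payload("...")).json())` would return just the reply text. Raising on an empty `choices` list surfaces error payloads early instead of failing later with a `KeyError`.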
Intel Extension for Transformers provides a suite of LangChain-based extension APIs, including advanced retrievers, embedding models, and vector stores. These extensions build on the original LangChain API and are tailored to improve the functionality and performance of RAG.
The enhanced vector store operations let users adjust and fine-tune settings even after the chatbot has been initialized. For LangChain users, adopting the optimized vector stores is straightforward: replace the original Chroma import from LangChain with the one from Intel Extension for Transformers.
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
We provide optimized retrievers, such as VectorStoreRetriever and ChildParentRetriever, to handle vector store operations efficiently.
from intel_extension_for_transformers.langchain.retrievers import ChildParentRetriever
from langchain.vectorstores import Chroma
# xxx and {...} are placeholders to fill in for your use case
retriever = ChildParentRetriever(
    vectorstore=Chroma(documents=child_documents),
    parentstore=Chroma(documents=parent_documents),
    search_type=xxx,
    search_kwargs={...},
)
docs = retriever.get_relevant_documents("Intel")
Please refer to this documentation for more details.
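The call above takes one store of child documents and one of parent documents. A common way to use such a pair is to match the query against small child chunks and return the larger parent each match belongs to. The sketch below illustrates that general idea in plain Python and is not the library's implementation: the `Doc` class, the `parent_id` link, and the word-overlap score are all assumptions made for the example, standing in for a real vector store and embedding search.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str
    parent_id: str = ""  # assumed link from a child chunk to its parent


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words present in the text."""
    words = set(text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def child_parent_retrieve(query, children, parents, k=2):
    """Rank child chunks, then return their distinct parents in rank order."""
    parents_by_id = {p.doc_id: p for p in parents}
    ranked = sorted(children, key=lambda c: overlap_score(query, c.text), reverse=True)
    results, seen = [], set()
    for child in ranked:
        if overlap_score(query, child.text) == 0:
            break  # remaining children do not match at all
        parent = parents_by_id.get(child.parent_id)
        if parent is not None and parent.doc_id not in seen:
            seen.add(parent.doc_id)
            results.append(parent)
        if len(results) == k:
            break
    return results


if __name__ == "__main__":
    parents = [Doc("p1", "Long article about Intel Xeon processors."),
               Doc("p2", "Long article about cooking pasta.")]
    children = [Doc("c1", "Intel Xeon processors", "p1"),
                Doc("c2", "cooking pasta", "p2")]
    for doc in child_parent_retrieve("Intel Xeon", children, parents):
        print(doc.doc_id, doc.text)
```

Matching on short chunks tends to make the search more precise, while returning the parent gives the LLM more surrounding context.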
You can customize the NeuralChat service by modifying the YAML configuration file. Detailed instructions can be found in the documentation.
NeuralChat supports a variety of generative Transformer models available in Hugging Face Transformers. The following models have been validated for both inference and fine-tuning within NeuralChat:
Pretrained model | Text Generation (Completions) | Text Generation (Chat Completions) | Summarization | Code Generation |
---|---|---|---|---|
Intel/neural-chat-7b-v1-1 | ✅ | ✅ | ✅ | ✅ |
Intel/neural-chat-7b-v3-1 | ✅ | ✅ | ✅ | ✅ |
meta-llama/Llama-2-7b-chat-hf | ✅ | ✅ | ✅ | ✅ |
meta-llama/Llama-2-70b-chat-hf | ✅ | ✅ | ✅ | ✅ |
EleutherAI/gpt-j-6b | ✅ | ✅ | ✅ | ✅ |
mosaicml/mpt-7b-chat | ✅ | ✅ | ✅ | ✅ |
mistralai/Mistral-7B-v0.1 | ✅ | ✅ | ✅ | ✅ |
mistralai/Mixtral-8x7B-Instruct-v0.1 | ✅ | ✅ | ✅ | ✅ |
upstage/SOLAR-10.7B-Instruct-v1.0 | ✅ | ✅ | ✅ | ✅ |
THUDM/chatglm2-6b | ✅ | ✅ | ✅ | ✅ |
THUDM/chatglm3-6b | ✅ | ✅ | ✅ | ✅ |
Qwen/Qwen-7B | ✅ | ✅ | ✅ | ✅ |
microsoft/phi-2 | ✅ | ✅ | ✅ | ✅ |
bigcode/starcoder | | | | ✅ |
codellama/CodeLlama-7b-hf | | | | ✅ |
codellama/CodeLlama-34b-hf | | | | ✅ |
Phind/Phind-CodeLlama-34B-v2 | | | | ✅ |
Salesforce/codegen2-7B | | | | ✅ |
ise-uiuc/Magicoder-S-CL-7B | | | | ✅ |
Modify the model_name_or_path parameter in the YAML configuration file to load a different model.
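Switching models can also be scripted. The following is a minimal sketch that rewrites the `model_name_or_path` line in the configuration text with the standard library only. It assumes the key sits on a single line in `key: value` form. The function name `set_model` and the sample configuration are hypothetical, and any trailing comment on that line is dropped.

```python
import re

# Matches a whole "model_name_or_path: <value>" line and captures the key part.
_MODEL_LINE = re.compile(r"^([ \t]*model_name_or_path[ \t]*:[ \t]*).*$", flags=re.MULTILINE)


def set_model(yaml_text: str, model_name: str) -> str:
    """Replace the value of model_name_or_path, leaving all other lines intact."""
    new_text, count = _MODEL_LINE.subn(
        lambda m: f'{m.group(1)}"{model_name}"', yaml_text
    )
    if count == 0:
        raise KeyError("model_name_or_path not found in the configuration text")
    return new_text


if __name__ == "__main__":
    # Hypothetical configuration fragment; the real file has more settings.
    sample = (
        'model_name_or_path: "Intel/neural-chat-7b-v3-1"\n'
        "tasks_list: ['textchat']\n"
    )
    print(set_model(sample, "mistralai/Mistral-7B-v0.1"))
```

To apply it to the file used earlier, read `./server/config/neuralchat.yaml`, pass its text through `set_model`, and write the result back before starting the server. A YAML library would be more robust for nested or multi-line values.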
NeuralChat includes support for various plugins to enhance its capabilities:
- Text-to-Speech (TTS)
- Automatic Speech Recognition (ASR)
In addition to the text-based chat RESTful API, NeuralChat exposes several plugins through RESTful APIs to help users build multimodal applications. NeuralChat supports the following RESTful APIs:
Tasks List | RESTful APIs |
---|---|
textchat | /v1/chat/completions |
textchat | /v1/completions |
voicechat | /v1/audio/speech |
voicechat | /v1/audio/transcriptions |
voicechat | /v1/audio/translations |
retrieval | /v1/rag/create |
retrieval | /v1/rag/append |
retrieval | /v1/rag/upload_link |
retrieval | /v1/rag/chat |
codegen | /v1/code_generation |
codegen | /v1/code_chat |
text2image | /v1/text2image |
image2image | /v1/image2image |
faceanimation | /v1/face_animation |
finetune | /v1/finetune |
Modify the tasks_list parameter in the YAML configuration file to enable the RESTful APIs your project needs.
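To see which endpoints a given `tasks_list` would expose, the table above can be encoded as a lookup. This is an illustrative sketch, not NeuralChat code: it assumes that rows without a task name in the table continue the preceding task, and the names `API_BY_TASK` and `endpoints_for` as well as the default base URL (taken from the earlier examples) are assumptions.

```python
# Task-to-endpoint mapping transcribed from the table above.
API_BY_TASK = {
    "textchat": ["/v1/chat/completions", "/v1/completions"],
    "voicechat": ["/v1/audio/speech", "/v1/audio/transcriptions",
                  "/v1/audio/translations"],
    "retrieval": ["/v1/rag/create", "/v1/rag/append",
                  "/v1/rag/upload_link", "/v1/rag/chat"],
    "codegen": ["/v1/code_generation", "/v1/code_chat"],
    "text2image": ["/v1/text2image"],
    "image2image": ["/v1/image2image"],
    "faceanimation": ["/v1/face_animation"],
    "finetune": ["/v1/finetune"],
}


def endpoints_for(tasks_list, base_url="http://127.0.0.1:80"):
    """Return the full URLs for the given tasks, in the order the tasks are listed."""
    unknown = [task for task in tasks_list if task not in API_BY_TASK]
    if unknown:
        raise ValueError(f"Unknown tasks: {unknown}")
    root = base_url.rstrip("/")
    return [root + path for task in tasks_list for path in API_BY_TASK[task]]


if __name__ == "__main__":
    for url in endpoints_for(["textchat", "retrieval"]):
        print(url)
```

Validating the task names up front catches typos in `tasks_list` before any request is sent.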