Chatbots are the most widely adopted use case for the chat and reasoning capabilities of large language models (LLMs). The retrieval-augmented generation (RAG) architecture is quickly becoming the industry standard for building chatbots because it combines a knowledge base (via a vector store) with generative models to reduce hallucinations, keep information up to date, and draw on domain-specific knowledge.
RAG bridges the knowledge gap by dynamically fetching relevant information from external sources, ensuring that responses generated remain factual and current. At the heart of this architecture are vector databases, instrumental in enabling efficient and semantic retrieval of information. These databases store data as vectors, allowing RAG to swiftly access the most pertinent documents or data points based on semantic similarity.
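To make the idea of retrieval by semantic similarity concrete, here is a minimal sketch of ranking stored vectors against a query vector with cosine similarity. It is an illustration only: the three-dimensional vectors, the document ids, and the helper names `cosine_similarity` and `top_k` are made up for this example, and a real vector database uses indexed search rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k document ids whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query_vec, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (hypothetical values); real models emit hundreds of dimensions.
toy_store = {
    "doc_revenue": [0.9, 0.1, 0.0],
    "doc_shoes": [0.7, 0.6, 0.1],
    "doc_weather": [0.0, 0.1, 0.9],
}
```

With these toy values, a query vector pointing along the first axis ranks `doc_revenue` first and `doc_weather` last; the retrieved documents would then be passed to the LLM as context.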
The ChatQnA architecture is shown below:
This ChatQnA use case performs RAG using LangChain, a Redis vector database, and Text Generation Inference (TGI) on Intel Gaudi2. The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Please visit Habana AI products for more details.
The steps to implement the solution are as follows:
- Deploy a TGI container with the LLM of your choice (the launch script described below defaults to Intel/neural-chat-7b-v3-3)
- Export the TGI endpoint as an environment variable
- Deploy a TEI container for the embedding model service and export its endpoint
- Launch a Redis container and a LangChain container
- Ingest data into Redis; this example provides a few sample PDF documents
- Start the backend service to accept queries to LangChain
- Start the GUI-based chatbot service to experiment with the RAG-based chatbot
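Several of the steps above export service endpoints or tokens as environment variables. As a rough preflight check before starting the backend, a sketch like the following reports which ones are still unset. The variable names are the ones exported in this guide; the helper name `missing_vars` and the choice of which variables to treat as expected are assumptions made for this example.

```python
import os

# Names taken from the export commands in this guide. Treating all three as
# expected is an assumption; adjust the tuple to match your deployment.
EXPECTED_VARS = ("TGI_LLM_ENDPOINT", "TEI_ENDPOINT", "HUGGINGFACEHUB_API_TOKEN")


def missing_vars(env=None, expected=EXPECTED_VARS):
    """Return the expected variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in expected if not env.get(name)]
```

Calling `missing_vars()` with no arguments inspects the current process environment; an empty list means every expected variable is set.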
To use 🤗 text-generation-inference on Habana Gaudi/Gaudi2, please follow these steps:
Getting started is straightforward with the official Docker container. Simply pull the image using:
docker pull ghcr.io/huggingface/tgi-gaudi:1.2.1
Alternatively, you can build the Docker image yourself from the latest TGI-Gaudi code with the command below:
bash ./serving/tgi_gaudi/build_docker.sh
Then launch a local server instance (the launch script uses 1 Gaudi card by default):
bash ./serving/tgi_gaudi/launch_tgi_service.sh
For gated models such as LLAMA-2, you will have to pass -e HUGGING_FACE_HUB_TOKEN=<token> to the docker run command above with a valid Hugging Face Hub read token.
Please follow the link huggingface token to get an access token, and export it as the HUGGINGFACEHUB_API_TOKEN environment variable:
export HUGGINGFACEHUB_API_TOKEN=<token>
To launch the service on 8 Gaudi cards, pass the number of cards as the first argument:
bash ./serving/tgi_gaudi/launch_tgi_service.sh 8
You can then send a request like the one below to check the service status:
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
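The same status check can be scripted. The following is a minimal sketch using only the Python standard library; it mirrors the curl request above (same route, payload, and header). The function names `build_generate_request` and `generate` are hypothetical helpers, not part of this project.

```python
import json
import urllib.request


def build_generate_request(endpoint, prompt, max_new_tokens=32):
    """Build (but do not send) a POST request for the TGI /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(endpoint, prompt, max_new_tokens=32, timeout=60):
    """Send the request and return the decoded JSON response body."""
    request = build_generate_request(endpoint, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Against a running service, `generate("http://127.0.0.1:8080", "What is Deep Learning?")` should return a JSON object; TGI documents the generated text under the `generated_text` key.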
The ./serving/tgi_gaudi/launch_tgi_service.sh script accepts three parameters:
- num_cards: The number of Gaudi cards to be utilized, ranging from 1 to 8. The default is set to 1.
- port_number: The port number assigned to the TGI Gaudi endpoint, with the default being 8080.
- model_name: The name of the LLM to serve, with the default set to "Intel/neural-chat-7b-v3-3".
You have the flexibility to customize these parameters according to your specific needs. Additionally, you can set the TGI Gaudi endpoint by exporting the environment variable TGI_LLM_ENDPOINT:
export TGI_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
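If you launch the service from a script, it can help to validate the parameters first. The sketch below builds the argument list for the launch script using the documented defaults and the documented 1 to 8 card range. It assumes the script takes its parameters positionally in the order num_cards, port_number, model_name; only the first position is confirmed by the 8-card command above, so check the script before relying on the other two. `build_launch_command` is a hypothetical helper.

```python
DEFAULT_MODEL = "Intel/neural-chat-7b-v3-3"


def build_launch_command(num_cards=1, port_number=8080, model_name=DEFAULT_MODEL):
    """Return an argv list for launch_tgi_service.sh with validated arguments."""
    if not 1 <= num_cards <= 8:
        raise ValueError("num_cards must be between 1 and 8")
    if not 1 <= port_number <= 65535:
        raise ValueError("port_number must be a valid TCP port")
    # Assumption: positional order is num_cards, port_number, model_name.
    return [
        "bash",
        "./serving/tgi_gaudi/launch_tgi_service.sh",
        str(num_cards),
        str(port_number),
        model_name,
    ]
```

The returned list can be passed to `subprocess.run(...)` from the repository root.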
Text Embeddings Inference (TEI) is a toolkit designed for deploying and serving open-source text embeddings and sequence classification models efficiently. With TEI, users can extract high-performance features using various popular models. It supports token-based dynamic batching for enhanced performance.
To launch the TEI service, you can use the following commands:
model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 9090:80 -v $volume:/data -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision
export TEI_ENDPOINT="http://xxx.xxx.xxx.xxx:9090"
You can then send a request like the one below to check the service status:
curl 127.0.0.1:9090/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
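A scripted version of this check can also confirm that the response has the expected shape. The sketch below assumes the /embed route returns a JSON list containing one vector of floats per input, which is how TEI documents it; the helper names `build_embed_request` and `check_embeddings` are hypothetical.

```python
import json
import urllib.request


def build_embed_request(endpoint, text):
    """Build (but do not send) a POST request for the TEI /embed route."""
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/embed",
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_embeddings(body, expected_dim=None):
    """Validate a decoded /embed response and return the vector dimension."""
    if not isinstance(body, list) or not body:
        raise ValueError("expected a non-empty list of vectors")
    dim = len(body[0])
    for vector in body:
        if len(vector) != dim:
            raise ValueError("vectors have inconsistent lengths")
    if expected_dim is not None and dim != expected_dim:
        raise ValueError(f"expected dimension {expected_dim}, got {dim}")
    return dim
```

To use it, send the request with `urllib.request.urlopen`, decode the body with `json.load`, and pass the result to `check_embeddings`. For BAAI/bge-large-en-v1.5 the reported dimension should be 1024, according to its model card.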
Note: If you want to integrate the TEI service into the LangChain application, you'll need to restart the LangChain backend service after launching the TEI service.
Update the HUGGINGFACEHUB_API_TOKEN environment variable with your Hugging Face token in the docker-compose.yml file, then start the containers:
cd langchain/docker
docker compose -f docker-compose.yml up -d
cd ../../
Note: If you modified any files and want those changes included in this step, add --build to the end of the command to build the container image instead of pulling it from Docker Hub.
Each time the Redis container is launched, data should be ingested into the container using the commands:
docker exec -it qna-rag-redis-server bash
cd /ws
python ingest.py
Note: ingest.py will download the embedding model. Please set the proxy if necessary.
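The contents of ingest.py are not shown in this guide, so the following is not its implementation. It is only a generic sketch of one step that document ingestion for RAG commonly includes: splitting extracted text into overlapping chunks before each chunk is embedded and stored. The function name `chunk_text` and the default sizes are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=1)` yields three chunks, each starting with the last character of the previous one.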
We offer content moderation support utilizing Meta's Llama Guard model. To activate GuardRails, kindly follow the instructions below to deploy the Llama Guard model on TGI Gaudi.
volume=$PWD/data
model_id="meta-llama/LlamaGuard-7b"
docker run -p 8088:80 -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HUGGING_FACE_HUB_TOKEN=<your HuggingFace token> -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy tgi_gaudi --model-id $model_id
export SAFETY_GUARD_ENDPOINT="http://xxx.xxx.xxx.xxx:8088"
You can then send a request like the one below to check the service status:
curl 127.0.0.1:8088/generate \
-X POST \
-d '{"inputs":"How do you buy a tiger in the US?","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
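When the moderation result is consumed by code, the model's reply has to be interpreted. The sketch below assumes the request was formatted with the Llama Guard prompt template, in which case the model card describes a reply whose first line is `safe` or `unsafe`, with a second line listing the violated category codes. How this project's backend actually handles the reply is not described here; `parse_guard_output` is a hypothetical helper.

```python
def parse_guard_output(generated_text):
    """Return (is_safe, categories) parsed from a Llama Guard style reply."""
    lines = [line.strip() for line in generated_text.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty guard output")
    verdict = lines[0].lower()
    if verdict == "safe":
        return True, []
    if verdict == "unsafe":
        # The second line, when present, holds comma-separated category codes.
        categories = [c.strip() for c in lines[1].split(",")] if len(lines) > 1 else []
        return False, categories
    raise ValueError(f"unrecognized verdict: {lines[0]!r}")
```

Raising on an unrecognized verdict is a deliberate choice in this sketch: treating unexpected output as safe would silently disable moderation.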
Make sure the TGI-Gaudi service is running and that data has been ingested into Redis. Then launch the backend service:
docker exec -it qna-rag-redis-server bash
nohup python app/server.py &
The LangChain backend service listens on port 8000. You can change the port by editing the code in docker/qna-app/app/server.py.
You can then send requests like the ones below to check the LangChain backend service status:
# non-streaming endpoint
curl 127.0.0.1:8000/v1/rag/chat \
-X POST \
-d '{"query":"What is the total revenue of Nike in 2023?"}' \
-H 'Content-Type: application/json'
# streaming endpoint
curl 127.0.0.1:8000/v1/rag/chat_stream \
-X POST \
-d '{"query":"What is the total revenue of Nike in 2023?"}' \
-H 'Content-Type: application/json'
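Both routes can also be called from Python. The sketch below reuses the routes and the `query` payload from the curl commands above. Because the response formats are not specified in this guide, the streaming helper simply yields raw decoded lines as they arrive. The names `build_chat_request` and `stream_chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, query, stream=False):
    """Build (but do not send) a POST request for the RAG chat routes."""
    route = "/v1/rag/chat_stream" if stream else "/v1/rag/chat"
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_chat(base_url, query, timeout=120):
    """Yield decoded lines from the streaming route as they arrive."""
    request = build_chat_request(base_url, query, stream=True)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw_line in response:
            yield raw_line.decode("utf-8", errors="replace")
```

For example, iterating over `stream_chat("http://127.0.0.1:8000", "What is the total revenue of Nike in 2023?")` prints the answer incrementally when each yielded line is written to the terminal.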
Navigate to the "ui" folder and execute the following commands to start the frontend GUI:
cd ui
sudo apt-get install npm && \
npm install -g n && \
n stable && \
hash -r && \
npm install -g npm@latest
For CentOS, please use the following commands instead:
curl -sL https://rpm.nodesource.com/setup_20.x | sudo bash -
sudo yum install -y nodejs
Update the DOC_BASE_URL environment variable in the .env file by replacing the IP address '127.0.0.1' with the actual IP address.
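This edit can be scripted if you deploy repeatedly. The sketch below replaces the host on the DOC_BASE_URL line only and leaves every other line untouched. The exact value of DOC_BASE_URL in the .env file is not shown in this guide, so the function works on whatever text the line contains; `set_doc_base_host` is a hypothetical helper.

```python
def set_doc_base_host(env_text, new_host, old_host="127.0.0.1"):
    """Replace old_host with new_host on the DOC_BASE_URL line only."""
    lines = []
    for line in env_text.splitlines():
        if line.startswith("DOC_BASE_URL"):
            line = line.replace(old_host, new_host)
        lines.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(lines) + ("\n" if env_text.endswith("\n") else "")
```

A typical use is to read .env, pass its contents through the function with your server's IP address, and write the result back.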
Run the following command to install the required dependencies:
npm install
Start the development server by executing the following command:
nohup npm run dev &
This will initiate the frontend service and launch the application.
TGI Gaudi uses BFLOAT16 by default. To achieve higher throughput, you can enable FP8 quantization on TGI Gaudi. Note that currently only Llama2 series and Mistral series models support FP8 quantization. Follow the steps below to enable it.
Enter the TGI Gaudi Docker container, then run the commands below:
pip install git+https://github.com/huggingface/optimum-habana.git
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana/examples/text-generation
pip install -r requirements_lm_eval.txt
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py run_lm_eval.py -o acc_7b_bs1_measure.txt --model_name_or_path Intel/neural-chat-7b-v3-3 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1
QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py run_lm_eval.py -o acc_7b_bs1_quant.txt --model_name_or_path Intel/neural-chat-7b-v3-3 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1 --fp8
After the above commands finish, the quantization metadata will have been generated. Copy the metadata directory ./hqt_output/ and the quantization JSON file to the host (under …/data). Adapt the commands below to your Docker container ID and directory path.
docker cp 262e04bbe466:/usr/src/optimum-habana/examples/text-generation/hqt_output data/
docker cp 262e04bbe466:/usr/src/optimum-habana/examples/text-generation/quantization_config/maxabs_quant.json data/
Then, in the maxabs_quant.json file, set dump_stats_path to "/data/hqt_output/measure" and dump_stats_xlsx_path to "/data/hqt_output/measure/fp8stats.xlsx".
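The two path changes can be applied with a short script instead of editing the file by hand. This sketch sets only the two keys named above and preserves everything else in the file; the helper names `update_quant_config` and `rewrite_quant_file` are hypothetical, and the host-side path in the usage note assumes the file was copied to data/ as in the docker cp commands.

```python
import json


def update_quant_config(config, stats_dir="/data/hqt_output"):
    """Return a copy of the config with both dump paths pointing under stats_dir."""
    updated = dict(config)
    updated["dump_stats_path"] = f"{stats_dir}/measure"
    updated["dump_stats_xlsx_path"] = f"{stats_dir}/measure/fp8stats.xlsx"
    return updated


def rewrite_quant_file(path, stats_dir="/data/hqt_output"):
    """Load the JSON file at path, update the two dump paths, and save it."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(update_quant_config(config, stats_dir), handle, indent=4)
```

On the host, `rewrite_quant_file("data/maxabs_quant.json")` applies the change. Run it before starting the container with the command that follows.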
docker run -p 8080:80 -e QUANT_CONFIG=/data/maxabs_quant.json -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:1.2.1 --model-id Intel/neural-chat-7b-v3-3
TGI Gaudi will now launch the FP8 model by default. You can send a request like the one below to check the service status:
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
SCRIPT USAGE NOTICE: By downloading and using any script file included with the associated software package (such as files with .bat, .cmd, or .JS extensions, Docker files, or any other type of file that, when executed, automatically downloads and/or installs files onto your system) (the “Script File”), it is your obligation to review the Script File to understand what files (e.g., other software, AI models, AI Datasets) the Script File will download to your system (“Downloaded Files”). Furthermore, by downloading and using the Downloaded Files, even if they are installed through a silent install, you agree to any and all terms and conditions associated with such files, including but not limited to, license terms, notices, or disclaimers.