This repository provides the FasterTransformer implementation of the CodeGeeX model.
First, download and set up the following Docker environment, replacing <WORK_DIR> with the directory of this repo:
```bash
docker pull nvcr.io/nvidia/pytorch:21.11-py3
docker run -p 9114:5000 --cpus 12 --gpus '"device=0"' -it -v <WORK_DIR>:/workspace/codegeex-fastertransformer --ipc=host --name=test nvcr.io/nvidia/pytorch:21.11-py3
```
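Once inside the container, you can check that the selected GPU is visible before proceeding:

```bash
# Should list the single GPU passed through via --gpus '"device=0"'.
nvidia-smi
```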
Second, install the following packages inside the container and build the project:
```bash
pip3 install transformers
pip3 install sentencepiece
cd codegeex-fastertransformer
sh make_all.sh # Remember to specify the DSM version according to the GPU.
```
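The DSM value is the GPU's CUDA compute capability (for example, 80 for A100 and 70 for V100). If you are unsure which value applies, you can look it up with PyTorch, which ships with the NGC image:

```bash
# Prints the compute capability of GPU 0, e.g. (8, 0) on an A100, i.e. SM 80.
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"
```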
Then, convert the initial checkpoint (download here) to the FasterTransformer format using get_ckpt_ft.py.
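A typical invocation looks like the sketch below; the positional arguments are placeholders, so check get_ckpt_ft.py for the exact interface it expects:

```bash
# Placeholder paths; consult get_ckpt_ft.py for its actual arguments.
python3 get_ckpt_ft.py <initial_ckpt_path> <ft_ckpt_output_path>
```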
Finally, run api.py to start the server and run post.py to send a request:
```bash
nohup python3 api.py > test.log 2>&1 &
python3 post.py
```
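To send requests from your own client instead of post.py, you can POST to the server directly. The sketch below is only a guess at the request shape: the endpoint path and JSON fields are assumptions, so mirror whatever post.py actually sends. Inside the container the server is reached on port 5000; from the host, use the mapped port 9114.

```bash
# Hypothetical request; check post.py for the actual endpoint path and payload fields.
curl -X POST http://localhost:9114/ \
     -H "Content-Type: application/json" \
     -d '{"prompt": "# write a bubble sort function\n"}'
```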
The following figure compares the performance of pure PyTorch, Megatron, and FasterTransformer under INT8 and FP16. The fastest configuration is FasterTransformer with INT8, with an average generation time of under 15 ms per token.
Our code is licensed under the Apache-2.0 license.