From 435499ea7f80958fbfad922cd7b2f93599e37c0b Mon Sep 17 00:00:00 2001
From: Vinish M
Date: Sat, 15 Apr 2023 16:27:07 +0530
Subject: [PATCH] pegasus speedster notebook added

---
 ..._Face_PyTorch_PEGASUS_with_Speedster.ipynb | 814 ++++++++++++++++++
 1 file changed, 814 insertions(+)
 create mode 100644 notebooks/speedster/huggingface/Accelerate_Hugging_Face_PyTorch_PEGASUS_with_Speedster.ipynb

diff --git a/notebooks/speedster/huggingface/Accelerate_Hugging_Face_PyTorch_PEGASUS_with_Speedster.ipynb b/notebooks/speedster/huggingface/Accelerate_Hugging_Face_PyTorch_PEGASUS_with_Speedster.ipynb
new file mode 100644
index 00000000..171e1147
--- /dev/null
+++ b/notebooks/speedster/huggingface/Accelerate_Hugging_Face_PyTorch_PEGASUS_with_Speedster.ipynb
@@ -0,0 +1,814 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ef331be9",
+ "metadata": {
+ "id": "ef331be9"
+ },
+ "source": [
+ "![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "f260653a",
+ "metadata": {
+ "id": "f260653a"
+ },
+ "source": [
+ "# Accelerate Hugging Face PEGASUS with Speedster\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8bdf3af5",
+ "metadata": {
+ "id": "8bdf3af5"
+ },
+ "source": [
+ "Hi and welcome 👋\n",
+ "\n",
+ "In this notebook we will see how, in just a few steps, you can speed up the inference response time of a deep learning model using Speedster, an app from the open-source library nebullvm.\n",
+ "\n",
+ "With Speedster's latest API, you can speed up models up to 10 times without any loss of accuracy (option A), or accelerate them up to 20-30 times by setting the amount of accuracy/precision you are willing to trade off for an even lower response time (option B). To accelerate your model, Speedster takes advantage of various optimization techniques such as deep learning compilers (in both option A and option B), quantization, half precision, and so on (option B). The sketch in the next cell previews how the two options differ in the `optimize_model` call.\n",
+ "\n",
+ "Let's jump to the code."
+ ]
+ },
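+ {
+ "cell_type": "markdown",
+ "id": "1f2e3d4c",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of the two options, for orientation only: `model` and `input_data` are placeholders for the objects built later in this notebook.\n",
+ "\n",
+ "```python\n",
+ "from speedster import optimize_model\n",
+ "\n",
+ "# Option A: deep learning compilers only, no accuracy loss\n",
+ "optimized_model = optimize_model(model=model, input_data=input_data)\n",
+ "\n",
+ "# Option B: also enable techniques such as quantization by accepting\n",
+ "# a self-defined accuracy drop (0.1 is the threshold used later on)\n",
+ "optimized_model = optimize_model(model=model, input_data=input_data, metric_drop_ths=0.1)\n",
+ "```"
+ ]
+ },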
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d527d63b",
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "d527d63b",
+ "outputId": "57626bac-e458-487f-f4fa-a459627af296"
+ },
+ "outputs": [],
+ "source": [
+ "%env CUDA_VISIBLE_DEVICES=0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cXXh1ifQ13mH",
+ "metadata": {
+ "id": "cXXh1ifQ13mH"
+ },
+ "source": [
+ "# Installation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "48aljCHu14-H",
+ "metadata": {
+ "id": "48aljCHu14-H"
+ },
+ "source": [
+ "Install Speedster:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "QFQh3BVr1-GO",
+ "metadata": {
+ "id": "QFQh3BVr1-GO"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install speedster"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8a7a86b3",
+ "metadata": {
+ "id": "8a7a86b3"
+ },
+ "source": [
+ "Install deep learning compilers:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cffbfa32",
+ "metadata": {
+ "id": "cffbfa32"
+ },
+ "outputs": [],
+ "source": [
+ "!python -m nebullvm.installers.auto_installer --frameworks huggingface --compilers all"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "73072506",
+ "metadata": {
+ "id": "73072506"
+ },
+ "source": [
+ "## Model and Dataset setup"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "cf24c4c4",
+ "metadata": {},
+ "source": [
+ "Add the TensorRT installation path to the LD_LIBRARY_PATH environment variable, so that ONNX Runtime can activate the TensorrtExecutionProvider:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "1cf8ff74",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "tensorrt_path = \"/usr/local/lib/python3.8/dist-packages/tensorrt\"  # Change this path according to your TensorRT location\n",
+ "\n",
+ "if os.path.exists(tensorrt_path):\n",
+ "    # Append to LD_LIBRARY_PATH, creating it if it is not already set\n",
+ "    os.environ['LD_LIBRARY_PATH'] = os.environ.get('LD_LIBRARY_PATH', '') + f\":{tensorrt_path}\"\n",
+ "else:\n",
+ "    print(\"Unable to find TensorRT path. ONNXRuntime won't use TensorrtExecutionProvider.\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "e4d55115",
+ "metadata": {
+ "id": "e4d55115"
+ },
+ "source": [
+ "We chose Google's pegasus-x-base as the pre-trained model to optimize. Let's download both the pre-trained model and the tokenizer from the Hugging Face model hub."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "NOgOmfdY_dav",
+ "metadata": {
+ "id": "NOgOmfdY_dav"
+ },
+ "outputs": [],
+ "source": [
+ "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
+ "import torch\n",
+ "\n",
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ "\n",
+ "model_name = \"google/pegasus-x-base\"\n",
+ "\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+ "model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torchscript=True).to(device)\n",
+ "\n",
+ "# set the model to eval mode\n",
+ "_ = model.eval()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11aa0739",
+ "metadata": {
+ "id": "11aa0739"
+ },
+ "source": [
+ "Let's create an example dataset with some sample sentences:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "ghGcDNFtKt3X",
+ "metadata": {
+ "id": "ghGcDNFtKt3X"
+ },
+ "outputs": [],
+ "source": [
+ "texts = [\n",
+ "    \"\"\"BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.\"\"\",\n",
+ "    \"\"\"GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.\"\"\",\n",
+ "    \"\"\"With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.\"\"\",\n",
+ "    \"\"\"LayoutLMv3 is a pre-trained multimodal Transformer for Document AI with unified text and image masking. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model. For example, LayoutLMv3 can be fine-tuned for both text-centric tasks, including form understanding, receipt understanding, and document visual question answering, and image-centric tasks such as document image classification and document layout analysis.\"\"\",\n",
+ "    \"\"\"XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.\"\"\"\n",
+ "]\n",
+ "\n",
+ "# Repeat the 5 texts to get 100 samples\n",
+ "texts = texts*20"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "a09f9424",
+ "metadata": {
+ "id": "a09f9424"
+ },
+ "outputs": [],
+ "source": [
+ "encoded_inputs = [tokenizer(text, padding=\"longest\", return_tensors=\"pt\") for text in texts]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "17040431",
+ "metadata": {
+ "id": "17040431"
+ },
+ "source": [
+ "## Speed up inference with Speedster: no metric drop"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "44ddc21d",
+ "metadata": {
+ "id": "44ddc21d"
+ },
+ "source": [
+ "It's now time to improve inference speed a bit. Let's use `Speedster`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f9d934f6",
+ "metadata": {
+ "id": "f9d934f6"
+ },
+ "outputs": [],
+ "source": [
+ "from speedster import optimize_model, save_model, load_model"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "76248033",
+ "metadata": {
+ "id": "76248033"
+ },
+ "source": [
+ "Using Speedster is usually very simple and straightforward! Just call the `optimize_model` function with the model, some sample input data, and the optimization time mode. PEGASUS, like T5, is an encoder-decoder model, so it needs to be split into encoder and decoder before optimization."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "i7sgUWjePN9i",
+ "metadata": {
+ "id": "i7sgUWjePN9i"
+ },
+ "outputs": [],
+ "source": [
+ "# First, we get the encoder and decoder from the model\n",
+ "encoder = model.get_encoder()\n",
+ "decoder = model.get_decoder()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "O7xaI1drQOQ0",
+ "metadata": {
+ "id": "O7xaI1drQOQ0"
+ },
+ "source": [
+ "Optionally, a dynamic_info dictionary can also be provided, in order to support inputs with dynamic shapes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "nTUPdDchQLc1",
+ "metadata": {
+ "id": "nTUPdDchQLc1"
+ },
+ "outputs": [],
+ "source": [
+ "# Dynamic axes for the encoder inputs (input_ids, attention_mask) and its output hidden states\n",
+ "encoder_dynamic_info = {\n",
+ "    \"inputs\": [\n",
+ "        {0: 'batch', 1: 'num_tokens'},\n",
+ "        {0: 'batch', 1: 'num_tokens'}\n",
+ "    ],\n",
+ "    \"outputs\": [\n",
+ "        {0: 'batch', 1: 'num_tokens'},\n",
+ "    ]\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "zPC_EDwEJIM0",
+ "metadata": {
+ "id": "zPC_EDwEJIM0"
+ },
+ "outputs": [],
+ "source": [
+ "# Create the optimized encoder model separately\n",
+ "optimized_encoder_model = optimize_model(\n",
+ "    model=encoder,\n",
+ "    input_data=encoded_inputs,\n",
+ "    optimization_time=\"constrained\",\n",
+ "    ignore_compilers=[\"tensor_rt\", \"tvm\"],  # TensorRT does not work for this model\n",
+ "    dynamic_info=encoder_dynamic_info,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "1a88596d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "decoder_dynamic_info = {\n",
+ "    \"inputs\": [\n",
+ "        {0: 'batch', 1: 'num_tokens'},\n",
+ "        {0: 'batch', 1: 'num_tokens'}\n",
+ "    ],\n",
+ "    # Besides the hidden states, the TorchScript decoder returns 24 more\n",
+ "    # dynamic-shaped outputs (presumably 2 past key/value tensors per decoder layer)\n",
+ "    \"outputs\": [\n",
+ "        {0: 'batch', 1: 'num_tokens'},\n",
+ "    ] + [{0: 'batch', 2: 'num_tokens'} for i in range(24)]\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7Oa68a87Qjre",
+ "metadata": {
+ "id": "7Oa68a87Qjre"
+ },
+ "outputs": [],
+ "source": [
+ "# Create the optimized decoder model separately\n",
+ "optimized_decoder_model = optimize_model(\n",
+ "    model=decoder,\n",
+ "    input_data=encoded_inputs,\n",
+ "    optimization_time=\"constrained\",\n",
+ "    ignore_compilers=[\"tensor_rt\", \"tvm\"],  # TensorRT does not work for this model\n",
+ "    dynamic_info=decoder_dynamic_info,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "98c6ab09",
+ "metadata": {
+ "id": "98c6ab09"
+ },
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "# Move inputs to GPU if available\n",
+ "encoded_inputs = [tokenizer(text, padding=\"longest\", return_tensors=\"pt\").to(device) for text in texts]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6e5b3b21",
+ "metadata": {
+ "id": "6e5b3b21"
+ },
+ "source": [
+ "Let's run the prediction 100 times to calculate the average response time of the original model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "51ffae98",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "_ = encoder.to(device)\n",
+ "_ = decoder.to(device)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d3bc5c98",
+ "metadata": {
+ "id": "d3bc5c98"
+ },
+ "outputs": [],
+ "source": [
+ "times = []\n",
+ "# Warmup for 30 iterations\n",
+ "for encoded_input in encoded_inputs[:30]:\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = encoder(**encoded_input)\n",
+ "        decoder_out = decoder(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "\n",
+ "# Benchmark\n",
+ "for encoded_input in encoded_inputs:\n",
+ "    st = time.time()\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = encoder(**encoded_input)\n",
+ "        decoder_out = decoder(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "    times.append(time.time()-st)\n",
+ "original_model_time = sum(times)/len(times)*1000\n",
+ "print(f\"Average response time for original PEGASUS: {original_model_time} ms\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "GU0SwykMTVAj",
+ "metadata": {
+ "id": "GU0SwykMTVAj"
+ },
+ "source": [
+ "In real-world use cases, we would pass the decoder output to `model.lm_head` to get the actual prediction, but here we are measuring the performance improvements, so we skip that step. (A minimal sketch of it is shown after the output cells below.)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "12c2df98",
+ "metadata": {
+ "id": "12c2df98"
+ },
+ "source": [
+ "Let's see the output of the original model:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4892a905",
+ "metadata": {
+ "id": "4892a905"
+ },
+ "outputs": [],
+ "source": [
+ "encoder(**encoded_input)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "gx0naPVuSVrm",
+ "metadata": {
+ "id": "gx0naPVuSVrm"
+ },
+ "outputs": [],
+ "source": [
+ "decoder(**encoded_input, encoder_hidden_states=encoder_out[0])"
+ ]
+ },
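+ {
+ "cell_type": "markdown",
+ "id": "5a6b7c8d",
+ "metadata": {},
+ "source": [
+ "As mentioned above, a real pipeline would pass the decoder's hidden states through `model.lm_head` to obtain vocabulary logits. A minimal sketch of that final step (a single projection, not full autoregressive generation):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b7c8d9e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Project the decoder hidden states from the last run to vocabulary logits\n",
+ "with torch.no_grad():\n",
+ "    logits = model.lm_head(decoder_out[0])\n",
+ "print(logits.shape)  # (batch, num_tokens, vocab_size)"
+ ]
+ },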
+ {
+ "cell_type": "markdown",
+ "id": "3db0a7a1",
+ "metadata": {
+ "id": "3db0a7a1"
+ },
+ "source": [
+ "Let's run the prediction 100 times to calculate the average response time of the optimized model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a3e83997",
+ "metadata": {
+ "id": "a3e83997"
+ },
+ "outputs": [],
+ "source": [
+ "times = []\n",
+ "\n",
+ "# Warmup for 30 iterations\n",
+ "for encoded_input in encoded_inputs[:30]:\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = optimized_encoder_model(**encoded_input)\n",
+ "        decoder_out = optimized_decoder_model(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "\n",
+ "# Benchmark\n",
+ "for encoded_input in encoded_inputs:\n",
+ "    st = time.time()\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = optimized_encoder_model(**encoded_input)\n",
+ "        decoder_out = optimized_decoder_model(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "    times.append(time.time()-st)\n",
+ "optimized_model_time = sum(times)/len(times)*1000\n",
+ "print(f\"Average response time for optimized PEGASUS (no metric drop): {optimized_model_time} ms\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d884d61",
+ "metadata": {
+ "id": "0d884d61"
+ },
+ "source": [
+ "Let's see the output of the optimized model:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "75611b2e",
+ "metadata": {
+ "id": "75611b2e"
+ },
+ "outputs": [],
+ "source": [
+ "optimized_encoder_model(**encoded_input)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cpieoDfwS-V7",
+ "metadata": {
+ "id": "cpieoDfwS-V7"
+ },
+ "outputs": [],
+ "source": [
+ "optimized_decoder_model(**encoded_input, encoder_hidden_states=encoder_out[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ceb60d8c",
+ "metadata": {
+ "id": "ceb60d8c"
+ },
+ "source": [
+ "## Speed up inference with Speedster: metric drop"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b1950d5",
+ "metadata": {
+ "id": "7b1950d5"
+ },
+ "source": [
+ "This time we will use the `metric_drop_ths` argument to accept a small drop in precision, in order to enable quantization and obtain a higher speedup."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "VwOLWZSZUM89",
+ "metadata": {
+ "id": "VwOLWZSZUM89"
+ },
+ "outputs": [],
+ "source": [
+ "optimized_encoder_model = optimize_model(\n",
+ "    model=encoder,\n",
+ "    input_data=encoded_inputs,\n",
+ "    optimization_time=\"constrained\",\n",
+ "    ignore_compilers=[\"tensor_rt\", \"tvm\"],  # TensorRT does not work for this model\n",
+ "    dynamic_info=encoder_dynamic_info,\n",
+ "    metric_drop_ths=0.1,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "FIKn4V3dUIZB",
+ "metadata": {
+ "id": "FIKn4V3dUIZB"
+ },
+ "outputs": [],
+ "source": [
+ "optimized_decoder_model = optimize_model(\n",
+ "    model=decoder,\n",
+ "    input_data=encoded_inputs,\n",
+ "    optimization_time=\"constrained\",\n",
+ "    ignore_compilers=[\"tensor_rt\", \"tvm\"],  # TensorRT does not work for this model\n",
+ "    dynamic_info=decoder_dynamic_info,\n",
+ "    metric_drop_ths=0.1,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "id": "6ce0647d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "_ = encoder.to(device)\n",
+ "_ = decoder.to(device)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0fbfe6fa",
+ "metadata": {
+ "id": "0fbfe6fa"
+ },
+ "outputs": [],
+ "source": [
+ "times = []\n",
+ "# Warmup for 30 iterations\n",
+ "for encoded_input in encoded_inputs[:30]:\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = encoder(**encoded_input)\n",
+ "        decoder_out = decoder(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "\n",
+ "# Benchmark\n",
+ "for encoded_input in encoded_inputs:\n",
+ "    st = time.time()\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = encoder(**encoded_input)\n",
+ "        decoder_out = decoder(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "    times.append(time.time()-st)\n",
+ "original_model_time = sum(times)/len(times)*1000\n",
+ "print(f\"Average response time for original PEGASUS: {original_model_time} ms\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f89b7e6d",
+ "metadata": {
+ "id": "f89b7e6d"
+ },
+ "outputs": [],
+ "source": [
+ "encoder(**encoded_input)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "oI1zjIBSUoIU",
+ "metadata": {
+ "id": "oI1zjIBSUoIU"
+ },
+ "outputs": [],
+ "source": [
+ "decoder(**encoded_input, encoder_hidden_states=encoder_out[0])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "10d17b5c",
+ "metadata": {
+ "id": "10d17b5c"
+ },
+ "outputs": [],
+ "source": [
+ "times = []\n",
+ "\n",
+ "# Warmup for 30 iterations\n",
+ "for encoded_input in encoded_inputs[:30]:\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = optimized_encoder_model(**encoded_input)\n",
+ "        decoder_out = optimized_decoder_model(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "\n",
+ "# Benchmark\n",
+ "for encoded_input in encoded_inputs:\n",
+ "    st = time.time()\n",
+ "    with torch.no_grad():\n",
+ "        encoder_out = optimized_encoder_model(**encoded_input)\n",
+ "        decoder_out = optimized_decoder_model(**encoded_input, encoder_hidden_states=encoder_out[0])\n",
+ "    times.append(time.time()-st)\n",
+ "optimized_model_time = sum(times)/len(times)*1000\n",
+ "print(f\"Average response time for optimized PEGASUS (metric drop): {optimized_model_time} ms\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4XFMC1S6zXTU",
+ "metadata": {
+ "id": "4XFMC1S6zXTU"
+ },
+ "source": [
+ "## Save and reload the optimized model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "OXHVr3EAzbT5",
+ "metadata": {
+ "id": "OXHVr3EAzbT5"
+ },
+ "source": [
+ "We can easily save the optimized models to disk with the following lines:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3M565P-zzaFB",
+ "metadata": {
+ "id": "3M565P-zzaFB"
+ },
+ "outputs": [],
+ "source": [
+ "save_model(optimized_encoder_model, \"encoder_model_save_path\")\n",
+ "save_model(optimized_decoder_model, \"decoder_model_save_path\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee8CS_Evzg1j",
+ "metadata": {
+ "id": "ee8CS_Evzg1j"
+ },
+ "source": [
+ "We can then load the models again:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "zOQ88SY_zg-A",
+ "metadata": {
+ "id": "zOQ88SY_zg-A"
+ },
+ "outputs": [],
+ "source": [
+ "optimized_encoder_model = load_model(\"encoder_model_save_path\")\n",
+ "optimized_decoder_model = load_model(\"decoder_model_save_path\")"
+ ]
+ },
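+ {
+ "cell_type": "markdown",
+ "id": "9d8e7f6a",
+ "metadata": {},
+ "source": [
+ "As a quick sanity check, the reloaded models can be called exactly like the optimized models above:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8e7f6a5b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run one input through the reloaded encoder/decoder pair\n",
+ "with torch.no_grad():\n",
+ "    encoder_out = optimized_encoder_model(**encoded_input)\n",
+ "    decoder_out = optimized_decoder_model(**encoded_input, encoder_hidden_states=encoder_out[0])"
+ ]
+ },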
+ {
+ "cell_type": "markdown",
+ "id": "cb234e5e",
+ "metadata": {
+ "id": "cb234e5e"
+ },
+ "source": [
+ "Great! Was it easy? How are the results? Do you have any comments?\n",
+ "Share your optimization results and thoughts with our community on Discord, where we chat about Speedster and AI acceleration.\n",
+ "\n",
+ "Note that Speedster's acceleration depends very much on the hardware configuration and on your AI model: given the same input model, Speedster can accelerate it by 10 times on some machines and perform poorly on others.\n",
+ "\n",
+ "If you want to learn more about how Speedster works or look at other tutorials and performance benchmarks, check out the links below or write to us on Discord."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b77ff2ac",
+ "metadata": {
+ "id": "b77ff2ac"
+ },
+ "source": [
+ "Join the community | Contribute to the library\n",
+ "\n",
+ "How speedster works ‱ Documentation ‱ Quick start"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "provenance": []
+ },
+ "gpuClass": "premium",
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.16"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}