diff --git a/.github/workflows/pylint.yml b/.github/workflows/pylint.yml new file mode 100644 index 0000000..53c02da --- /dev/null +++ b/.github/workflows/pylint.yml @@ -0,0 +1,41 @@ +name: Python Lint + +on: + push: + branches: [ main ] + pull_request: + branches: [ main ] + +jobs: + build: + + runs-on: ubuntu-latest + strategy: + matrix: + python-version: [3.7, 3.8] + + steps: + - uses: actions/checkout@v2 + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install -r requirements.txt + pip install flake8 black mypy types-requests + - name: Lint with Black + run: | + # check if black would reformat anything + black lti_llm_client/ --check + - name: Lint with flake8 + run: | + # stop the build if there are Python syntax errors or undefined names + flake8 lti_llm_client/ --count --select=C,E,F,W,B,B950 --ignore=E203,E501,E731,W503 --show-source --statistics + # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide + flake8 lti_llm_client --count --exit-zero --max-complexity=10 --max-line-length=88 --statistics + - name: Type Checking with MyPy + run: | + # stop the build if there are type errors + mypy --strict lti_llm_client/ \ No newline at end of file diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml deleted file mode 100644 index d4cd0d2..0000000 --- a/.pre-commit-config.yaml +++ /dev/null @@ -1,11 +0,0 @@ -repos: - - repo: https://github.com/pycqa/isort - rev: 5.10.1 - hooks: - - id: isort - name: isort (python) - - repo: https://github.com/psf/black - rev: 22.8.0 - hooks: - - id: black - args: [--line-length=119,--target-version=py35] diff --git a/README.md b/README.md index 1049f9d..d8f6ae5 100644 --- a/README.md +++ b/README.md @@ -1,33 +1,20 @@ -# Fast Inference Solutions for BLOOM +# LTI's Large Language Model Deployment -This repo provides demos and packages to perform fast inference solutions for BLOOM. Some of the solutions have their own repos in which case a link to the corresponding repos is provided instead. +**TODO**: Add a description of the project. -Some of the solutions provide both half-precision and int8-quantized solution. +This repo is a fork of [Hugging Face](https://huggingface.co/)'s [BLOOM inference demos](https://github.com/huggingface/transformers-bloom-inference). -## Client-side solutions +## Installation -Solutions developed to perform large batch inference locally: +```bash +pip install -e . +``` -Pytorch: +## Example API Usage -* [Accelerate, DeepSpeed-Inference and DeepSpeed-ZeRO](./bloom-inference-scripts) +```python +import lti_llm_client -* Thomas Wang is working on a Custom Fused Kernel solution - will link once it's ready for a general use. - -JAX: - -* [BLOOM Inference in JAX](https://github.com/huggingface/bloom-jax-inference) - - - -## Server solutions - -Solutions developed to be used in a server mode (i.e. 
varied batch size, varied request rate): - -Pytorch: - -* [Accelerate and DeepSpeed-Inference based solutions](./bloom-inference-server) - -Rust: - -* [Bloom-server](https://github.com/Narsil/bloomserver) +client = lti_llm_client.Client() +client.prompt("CMU's PhD students are") +``` diff --git a/bloom-inference-scripts/README.md b/bloom-inference-scripts/README.md deleted file mode 100644 index a943664..0000000 --- a/bloom-inference-scripts/README.md +++ /dev/null @@ -1,174 +0,0 @@ -# Inference scripts for BLOOM - -## BLOOM Inference solutions - -Here are some benchmark resuls on JeanZay's 8x80GB A100 node w/ 512GB of CPU memory: - -All benchmarks are doing greedy generation of 100 token outputs: -``` -Generate args {'max_length': 100, 'do_sample': False} -``` -The input prompt is comprised of just a few tokens. - -Throughput in msecs on 8x80GB gpus: - -| project \ bs | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | -| :---------------- | :----- | :---- | :---- | :---- | :--- | :--- | :--- | :--- | -| accelerate bf16 | 230.38 | 31.78 | 17.84 | 10.89 | oom | | | | -| accelerate int8 | 286.56 | 40.92 | 22.65 | 13.27 | oom | | | | -| ds-inference fp16 | 44.02 | 5.70 | 3.01 | 1.68 | 1.00 | 0.69 | oom | | -| ds-inference int8 | 89.09 | 11.44 | 5.88 | 3.09 | 1.71 | 1.02 | 0.71 | oom | -| ds-zero bf16 | 283 | 34.88 | oom | | | | | | - -note: Since Deepspeed-ZeRO can process multiple generate streams in parallel its throughput can be further divided by 8 or 16, depending on whether 8 or 16 gpus were used during the generate. and, of course, it means that it can process a bs of 64 in the case of 8x80 A100 (the table above). - -Start to ready to generate in secs (mainly loading and data preparation time): - -| project | | -| :---------------------- | :--- | -| accelerate | 121 | -| ds-inference shard-int8 | 61 | -| ds-inference shard-fp16 | 60 | -| ds-inference unsharded | 662 | -| ds-zero | 462 | - -Now let's look at the power of quantized int8-based models provided by [Deepspeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/) and [BitsNBytes](https://github.com/TimDettmers/bitsandbytes), as it requires only half the original GPU memory of inference in bfloat16 or float16. - -Throughput in msecs 4x80GB A100: - -| project \ bs | 1 | 8 | 16 | 32 | 64 | 128 | -| :---------------- | :----- | :---- | :---- | :---- | :--- | :--- | -| accelerate int8 | 284.15 | 40.14 | 21.97 | oom | | | -| ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 | oom | - -To get the benchmark results simply add `--benchmark` to any of these 3 scripts discussed below. - - -## Deepspeed-Inference - -Deepspeed-Inference uses Tensor-Parallelism and efficient fused CUDA kernels: -https://www.deepspeed.ai/tutorials/inference-tutorial/ - -### Setup - -``` -pip install deepspeed>=0.7.3 -``` - -### Run - -1. the fastest approach is to use a tp-pre-sharded checkpoint that takes only ~1min to load, as compared to 10min for non-presharded bloom checkpoint - - -``` -deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16 -``` - -1a. -if you want to run the original bloom checkpoint, which once loaded will run at the same throughput as the previous solution, but the loading will take 10-20min: - -``` -deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom -``` - -2a. 
The 8bit quantized version requires you to have only half the GPU memory of the normal half precision version: - - -``` -deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8 -``` - -Here we used `microsoft/bloom-deepspeed-inference-int8` and also told the script to run in `int8`. - -And of course, just 4x80GB A100 gpus is now sufficient: - -``` -deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8 -``` - - - -## HF Accelerate - -HF Accelerate can use naive Pipeline Parallelism to load a huge model over multiple GPUs: -https://github.com/huggingface/accelerate - -### Setup - -``` -pip install transformers>=4.21.3 accelerate>=0.12.0 -``` - - -### Run - - -``` -python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-accelerate-inference_bs=1.txt -``` - -To activate the 8bit quantized solution first install `bitsnbytes`: - -``` -pip install bitsandbytes -``` - -and then add `--dtype int8` to the previous command line: - -``` -python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark 2>&1 | tee bloom-int8-accelerate-inference_bs=1.txt -``` - -if you have more than 4 GPUs you can tell it to use only 4 with: -``` -CUDA_VISIBLE_DEVICES=0,1,2,3 python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark 2>&1 | tee bloom-int8-accelerate-inference_bs=1.txt -``` - - -## Deepspeed ZeRO-Inference - - -[Deepspeed ZeRO](https://www.deepspeed.ai/tutorials/zero/) uses a magical sharding approach which can take almost any model and scale it across a few or hundreds of GPUs. - -### Setup - -``` -pip install deepspeed -``` - - -### Run - -Note that the script currently runs the same inputs on all GPUs, but you can run a different stream on each GPU, and get `n_gpu` times faster throughput. You can't do that with Deepspeed-Inference. - - -``` -deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-ds-zero-inference_bs=1.txt -``` - -Please remember that with ZeRO the user can generate multiple unique streams at the same time - and thus the overall performance should be throughput in secs/token divided by number of participating gpus - so 8x to 16x faster depending on whether 8 or 16 gpus were used! - -You can also try the offloading solutions with just one small GPU, which will take a long time to run, but if you don't have 8 huge GPUs this is as good as it gets. - - -CPU-Offload (1x gpus): -``` -deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --cpu_offload --benchmark 2>&1 | tee bloom-ds-zero-inference-cpu_offload_bs=8.txt -``` - -NVMe-Offload (1x gpus): -``` -deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --nvme_offload_path=/path/to/nvme_offload --benchmark 2>&1 | tee bloom-ds-zero-inference-nvme_offload_bs=8.txt -``` - -make sure to adjust `/path/to/nvme_offload` to somewhere you have ~400GB of free memory on a fast NVMe drive. 
- -## Support - -If you run into things not working or have other questions please open an Issue in the corresponding backend: - -- [Accelerate](https://github.com/huggingface/accelerate/issues) -- [Deepspeed-Inference](https://github.com/microsoft/DeepSpeed/issues) -- [Deepspeed-ZeRO](https://github.com/microsoft/DeepSpeed/issues) - -If there a specific issue with one of the scripts and not the backend only then please open an Issue here and tag [@stas00](https://github.com/stas00). diff --git a/bloom-inference-scripts/bloom-accelerate-inference.py b/bloom-inference-scripts/bloom-accelerate-inference.py deleted file mode 100644 index 58c2c3e..0000000 --- a/bloom-inference-scripts/bloom-accelerate-inference.py +++ /dev/null @@ -1,237 +0,0 @@ -import argparse -import gc -import math -import os -import time - -import torch - -from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer - - -def get_args(): - parser = argparse.ArgumentParser() - parser.add_argument("--local_rank", required=False, type=int, help="used by dist launchers") - parser.add_argument("--name", type=str, help="Name path", required=True) - parser.add_argument("--batch_size", default=1, type=int, help="batch size") - parser.add_argument("--benchmark", action="store_true", help="additionally run benchmark") - parser.add_argument("--greedy", action="store_true") - parser.add_argument("--top-k", type=int, default=0) - parser.add_argument("--top-p", type=float, default=0.0) - parser.add_argument("--dtype", type=str, help="float16 or int8", choices=["int8", "float16"], default="float16") - - return parser.parse_args() - - -def get_max_memory_per_gpu_dict(dtype, model_name): - """try to generate the memory map based on what we know about the model and the available hardware""" - - # figure out the memory map - the minimum per gpu required to load the model - n_gpus = torch.cuda.device_count() - - if ( - model_name == "bigscience/bloom" - and n_gpus == 8 - and torch.cuda.get_device_properties(0).total_memory > 79 * 2**30 - ): - # hand crafted optimized memory map for 8x80 setup over BLOOM - # this works with bs=40 - if dtype != torch.int8: - max_memory_per_gpu = { - 0: "0GIB", - 1: "51GIB", - 2: "51GIB", - 3: "51GIB", - 4: "51GIB", - 5: "51GIB", - 6: "51GIB", - 7: "51GIB", - } - else: - max_memory_per_gpu = { - 0: "0GIB", - 1: "26GIB", - 2: "26GIB", - 3: "26GIB", - 4: "26GIB", - 5: "26GIB", - 6: "26GIB", - 7: "26GIB", - } - print("Max memory per gpu:", max_memory_per_gpu) - return max_memory_per_gpu - - try: - # model_params calculation, as we don't have a model yet to do: - # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values()) - - config = AutoConfig.from_pretrained(model_name) - h = config.hidden_size - l = config.n_layer - v = config.vocab_size - # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing - model_params = l * (12 * h**2 + 13 * h) + v * h + 4 * h - except: - print_rank0(f"The model {model_name} has a broken config file. 
Please notify the owner") - raise - - if dtype == torch.int8: - bytes = 1 - else: - bytes = torch.finfo(dtype).bits / 8 - param_memory_total_in_bytes = model_params * bytes - # add 5% since weight sizes aren't the same and some GPU may need more memory - param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.10) - print_rank0(f"Estimating {param_memory_per_gpu_in_bytes/2**30:0.2f}GB per gpu for weights") - - # check the real available memory - # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB) - torch.ones(1).cuda() - max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0] - if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes: - raise ValueError( - f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes/2**30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes/2**30:0.2f}GB)" - ) - - max_memory_per_gpu = {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())} - print("Max memory per gpu:", max_memory_per_gpu) - return max_memory_per_gpu - - -t_start = time.time() - -num_tokens = 100 - -args = get_args() - -local_rank = int(os.getenv("LOCAL_RANK", "0")) -world_size = torch.cuda.device_count() - -rank = local_rank - - -def print_rank0(*msg): - if rank != 0: - return - print(*msg) - - -print_rank0(f"Using {world_size} gpus") -model_name = args.name -print_rank0(f"Loading model {model_name}") - -tokenizer = AutoTokenizer.from_pretrained(model_name) - -# XXX: can't automatically derive dtype via config's `from_pretrained` -dtype = torch.bfloat16 if model_name in ["bigscience/bloom", "bigscience/bigscience-small-testing"] else torch.float16 - -# print(get_max_memory_per_gpu_dict()) - -infer_dtype = args.dtype -if infer_dtype == "int8": - dtype = torch.int8 - -kwargs = dict( - device_map="auto", - max_memory=get_max_memory_per_gpu_dict(dtype, model_name), -) - -if infer_dtype == "int8": - print_rank0("Using `load_in_8bit=True` to use quanitized model") - kwargs["load_in_8bit"] = True -else: - kwargs["torch_dtype"] = dtype - - -model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs) - - -if args.benchmark: - t_ready = time.time() - - -### Generate - -print_rank0(f"*** Starting to generate {num_tokens} tokens with bs={args.batch_size}") - -input_sentences = [ - "DeepSpeed is a machine learning framework", - "He is working on", - "He has a", - "He got all", - "Everyone is happy and I can", - "The new movie that got Oscar this year", - "In the far far distance from our galaxy,", - "Peace is the only way", -] - -if args.batch_size > len(input_sentences): - # dynamically extend to support larger bs by repetition - input_sentences *= math.ceil(args.batch_size / len(input_sentences)) - -generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False) -# generate_kwargs = dict(max_new_tokens=num_tokens, use_cache=False, do_sample=False) -# generate_kwargs = dict(min_length=num_tokens, max_length=num_tokens, do_sample=False) - -print_rank0(f"Generate args {generate_kwargs}") -inputs = input_sentences[: args.batch_size] - - -def generate(): - """returns a list of zipped inputs, outputs and number of new tokens""" - - input_tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True) - for t in input_tokens: - if torch.is_tensor(input_tokens[t]): - input_tokens[t] = input_tokens[t].to("cuda:0") - - outputs = model.generate(**input_tokens, **generate_kwargs) - - input_tokens_lengths 
= [x.shape[0] for x in input_tokens.input_ids] - output_tokens_lengths = [x.shape[0] for x in outputs] - - total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)] - outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True) - - return zip(inputs, outputs, total_new_tokens) - - -print_rank0(f"*** Running generate") -t_generate_start = time.time() -generated = generate() -t_generate_span = time.time() - t_generate_start -for i, o, _ in generated: - print_rank0(f"{'-'*60}\nin={i}\nout={o}\n") - - -### Benchmark - -if args.benchmark: - # clear cache / free memory - torch.cuda.empty_cache() - gc.collect() - - print_rank0(f"*** Running benchmark") - # warm up - for i in range(1): - _ = generate() - torch.cuda.synchronize() - - # benchmark - t0 = time.time() - cycles = 5 - total_new_tokens_generated = 0 - for i in range(cycles): - generated = generate() - total_new_tokens_generated += sum(new_tokens for _, _, new_tokens in generated) - torch.cuda.synchronize() - througput = (time.time() - t0) / (total_new_tokens_generated) - print_rank0( - f""" -*** Performance stats: -Throughput per token including tokenize: {througput*1000:.2f} msecs -Start to ready to generate: {t_ready - t_start:.3f} secs -Tokenize and generate {total_new_tokens_generated} (bs={args.batch_size}) tokens: {t_generate_span:.3f} secs -Start to finish: {t_ready - t_start + t_generate_span:.3f} secs -""" - ) diff --git a/bloom-inference-scripts/bloom-ds-inference.py b/bloom-inference-scripts/bloom-ds-inference.py deleted file mode 100644 index 51fea88..0000000 --- a/bloom-inference-scripts/bloom-ds-inference.py +++ /dev/null @@ -1,306 +0,0 @@ -# usage: -# deepspeed --num_gpus 8 bloom-ds-inference.py --name bigscience/bloom -# -# to run benchmarks: -# deepspeed --num_gpus 8 bloom-ds-inference.py --name bigscience/bloom --benchmark -# - - -# This is going to improve, but at the moment, the process is a bit cumbersome - we first use -# 1. use Deepspeed-ZeRO to instantiate the model on GPUs, w/o loading the checkpoints, -# 2. free the allocated storage -# 3. start Deepspeed-Inference and only now load the checkpoint -# 4. run generate -# Done. -# - - -import gc -import glob -import io -import json -import math -import os -import time -from argparse import ArgumentParser -from pathlib import Path - -import torch -import torch.distributed as dist - -import deepspeed -from huggingface_hub import snapshot_download -from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer -from transformers.models.bloom.modeling_bloom import BloomBlock as BloomBlock -from transformers.utils import is_offline_mode - - -# the Deepspeed team made these so it's super fast to load (~1 minute), rather than wait 10-20min loading time. 
-tp_presharded_models = ["microsoft/bloom-deepspeed-inference-int8", "microsoft/bloom-deepspeed-inference-fp16"] - -t_start = time.time() - -num_tokens = 100 - -parser = ArgumentParser() - -parser.add_argument("--name", required=True, type=str, help="model_name") -parser.add_argument("--dtype", type=str, help="float16 or int8", choices=["int8", "float16"], default="float16") -parser.add_argument("--local_rank", required=False, type=int, help="used by dist launchers") -parser.add_argument("--batch_size", default=1, type=int, help="batch size") -parser.add_argument("--benchmark", action="store_true", help="additionally run benchmark") -args = parser.parse_args() - -local_rank = int(os.getenv("LOCAL_RANK", "0")) -world_size = int(os.getenv("WORLD_SIZE", "1")) - -deepspeed.init_distributed("nccl") -rank = dist.get_rank() - - -def print_rank0(*msg): - if rank != 0: - return - print(*msg) - - -### Model loading and instantiating on GPUs - - -def get_repo_root(model_name_or_path, revision=None): - # checks if online or not - if is_offline_mode(): - - print_rank0("Offline mode: forcing local_files_only=True") - local_files_only = True - else: - local_files_only = False - - # loads files from hub - cached_repo_dir = snapshot_download( - model_name_or_path, allow_patterns=["*"], local_files_only=local_files_only, revision=revision - ) - - return cached_repo_dir - - -def get_checkpoint_files(model_name_or_path, revision=None): - # checks if online or not - if is_offline_mode(): - print_rank0("Offline mode: forcing local_files_only=True") - local_files_only = True - else: - local_files_only = False - - # loads files from hub - cached_repo_dir = snapshot_download( - model_name_or_path, allow_patterns=["*"], local_files_only=local_files_only, revision=revision - ) - - # extensions: .bin | .pt - # creates a list of paths from all downloaded files in cache dir - file_list = [str(entry) for entry in Path(cached_repo_dir).rglob("*.[bp][it][n]") if entry.is_file()] - return file_list - - -model_name = args.name -infer_dtype = args.dtype - -tp_presharded_mode = True if model_name in tp_presharded_models else False - -# print(get_checkpoint_files(model_name)) - -print_rank0(f"*** Loading the model {model_name}") - -tokenizer = AutoTokenizer.from_pretrained(model_name) -config = AutoConfig.from_pretrained(model_name) - -# XXX: can't automatically derive dtype via config's `from_pretrained` -# dtype = torch.bfloat16 if model_name in ["bigscience/bloom", "bigscience/bigscience-small-testing"] else torch.float16 - - -# use one of these args to `init_inference` -# 1. injection_policy is the slower version, but it's plain pytorch so it'll always work -# 2. 
replace_with_kernel_inject is the faster one (fast fused kernels) -kernel_inject = True -# kernel_inject = False - -if kernel_inject: - # XXX: for now ds-inference only works with fp16 - dtype = torch.float16 -else: - dtype = torch.bfloat16 - -if args.benchmark: - torch.cuda.empty_cache() - gc.collect() - deepspeed.runtime.utils.see_memory_usage("pre-from-pretrained", force=True) - -# Construct model with fake meta tensors, later will be replaced during ds-inference ckpt load -with deepspeed.OnDevice(dtype=dtype, device="meta"): - model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16) - -if args.benchmark: - deepspeed.runtime.utils.see_memory_usage("post-from-pretrained", force=True) - -model = model.eval() - -if args.benchmark: - torch.cuda.empty_cache() - gc.collect() - deepspeed.runtime.utils.see_memory_usage("post-init-ds-zero-init", force=True) - -### Deepspeed-Inference Loading - -checkpoints_json = "checkpoints.json" - - -def write_checkponts_json(): - - with io.open(checkpoints_json, "w", encoding="utf-8") as f: - - # checkpoint_files = glob.glob(f"{checkpoint_dir}/*bin") - checkpoint_files = get_checkpoint_files(model_name) - - # print("Checkpoint files:", checkpoint_files) - - data = {"type": "BLOOM", "checkpoints": checkpoint_files, "version": 1.0} - - json.dump(data, f) - - -if args.benchmark: - torch.cuda.empty_cache() - gc.collect() - deepspeed.runtime.utils.see_memory_usage("pre-ds-inference-init", force=True) - -if kernel_inject: - kwargs = dict(replace_with_kernel_inject=True) -else: - kwargs = dict(injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}) - -repo_root = get_repo_root(model_name) -if tp_presharded_mode: - # tp presharded repos come with their own checkpoints config file - checkpoints_json = os.path.join(repo_root, "ds_inference_config.json") -else: - # for normal bloom repo we need to write the checkpoints config file - if rank == 0: - write_checkponts_json() - dist.barrier() - -# checkpoints_json=None -model = deepspeed.init_inference( - model, - mp_size=world_size, - base_dir=repo_root, - dtype=getattr(torch, infer_dtype), - checkpoint=checkpoints_json, - **kwargs, -) - -if args.benchmark: - torch.cuda.empty_cache() - gc.collect() - deepspeed.runtime.utils.see_memory_usage("post-ds-inference-init", force=True) - - -model = model.module - -if args.benchmark: - t_ready = time.time() - - -### Generate - - -print_rank0(f"*** Starting to generate {num_tokens} tokens with bs={args.batch_size}") - -input_sentences = [ - "DeepSpeed is a machine learning framework", - "He is working on", - "He has a", - "He got all", - "Everyone is happy and I can", - "The new movie that got Oscar this year", - "In the far far distance from our galaxy,", - "Peace is the only way", -] - -if args.batch_size > len(input_sentences): - # dynamically extend to support larger bs by repetition - input_sentences *= math.ceil(args.batch_size / len(input_sentences)) - -generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False) - - -print_rank0(f"Generate args {generate_kwargs}") - -inputs = input_sentences[: args.batch_size] - - -def generate(): - """returns a list of zipped inputs, outputs and number of new tokens""" - - input_tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True) - for t in input_tokens: - if torch.is_tensor(input_tokens[t]): - input_tokens[t] = input_tokens[t].to(torch.cuda.current_device()) - - outputs = model.generate(**input_tokens, **generate_kwargs) - - input_tokens_lengths = [x.shape[0] for 
x in input_tokens.input_ids] - output_tokens_lengths = [x.shape[0] for x in outputs] - - total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)] - outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True) - - return zip(inputs, outputs, total_new_tokens) - - -# warmup is a must if measuring speed as it's when all the optimizations are performed -# e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs -print_rank0(f"*** Running generate warmup") -_ = generate() - -print_rank0(f"*** Running generate") -t_generate_start = time.time() -generated = generate() -t_generate_span = time.time() - t_generate_start -for i, o, _ in generated: - print_rank0(f"{'-'*60}\nin={i}\nout={o}\n") - -if args.benchmark: - torch.cuda.empty_cache() - gc.collect() - deepspeed.runtime.utils.see_memory_usage("end-of-run", force=True) - -### Benchmark - -# benchmark it! -if args.benchmark: - print_rank0(f"*** Running benchmark") - - # warm up - for i in range(1): - _ = generate() - torch.cuda.synchronize() - - # benchmark - t0 = time.time() - cycles = 5 - total_new_tokens_generated = 0 - for i in range(cycles): - generated = generate() - total_new_tokens_generated += sum(new_tokens for _, _, new_tokens in generated) - torch.cuda.synchronize() - througput = (time.time() - t0) / (total_new_tokens_generated) - print_rank0( - f""" -*** Performance stats: -Throughput per token including tokenize: {througput*1000:.2f} msecs -Start to ready to generate: {t_ready - t_start:.3f} secs -Tokenize and generate {total_new_tokens_generated} (bs={args.batch_size}) tokens: {t_generate_span:.3f} secs -Start to finish: {t_ready - t_start + t_generate_span:.3f} secs -""" - ) diff --git a/bloom-inference-scripts/bloom-ds-zero-inference.py b/bloom-inference-scripts/bloom-ds-zero-inference.py deleted file mode 100644 index ae4e7e3..0000000 --- a/bloom-inference-scripts/bloom-ds-zero-inference.py +++ /dev/null @@ -1,224 +0,0 @@ -# usage: -# deepspeed --num_gpus 8 bloom-ds-inference.py --name bigscience/bloom -# -# to run benchmarks: -# deepspeed --num_gpus 8 bloom-ds-inference.py --name bigscience/bloom --benchmark -# - - -# This is going to improve, but at the moment, the process is a bit cumbersome - we first use -# 1. use Deepspeed-ZeRO to instantiate the model on GPUs, w/o loading the checkpoints, -# 2. free the allocated storage -# 3. start Deepspeed-Inference and only now load the checkpoint -# 4. run generate -# Done. 
-# - - -import gc -import math -import os -import time -from argparse import ArgumentParser - -import torch -import torch.distributed as dist - -import deepspeed -from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer -from transformers.deepspeed import HfDeepSpeedConfig -from transformers.models.bloom.modeling_bloom import BloomBlock as BloomBlock - - -t_start = time.time() - -num_tokens = 100 - -parser = ArgumentParser() - -parser.add_argument("--name", required=True, type=str, help="model_name") -parser.add_argument("--local_rank", required=False, type=int, help="used by dist launchers") -parser.add_argument("--batch_size", default=1, type=int, help="batch size") -parser.add_argument("--benchmark", action="store_true", help="additionally run benchmark") -parser.add_argument("--cpu_offload", action="store_true", help="whether to activate CPU offload") -parser.add_argument("--nvme_offload_path", help="whether to activate NVME offload and the path on nvme") -args = parser.parse_args() - -local_rank = int(os.getenv("LOCAL_RANK", "0")) -world_size = int(os.getenv("WORLD_SIZE", "1")) - -deepspeed.init_distributed("nccl") -rank = dist.get_rank() - - -def print_rank0(*msg): - if rank != 0: - return - print(*msg) - - -### Model loading and instantiating on GPU (via ZeRO) - -model_name = args.name - -print_rank0(f"*** Loading the model {model_name}") - -tokenizer = AutoTokenizer.from_pretrained(model_name) -config = AutoConfig.from_pretrained(model_name) - -# XXX: can't automatically derive dtype via config's `from_pretrained` -dtype = torch.bfloat16 if model_name in ["bigscience/bloom", "bigscience/bigscience-small-testing"] else torch.float16 - -model_hidden_size = config.hidden_size -train_batch_size = 1 * world_size - -ds_config = { - "fp16": { - "enabled": dtype == torch.float16, - }, - "bf16": { - "enabled": dtype == torch.bfloat16, - }, - "zero_optimization": { - "stage": 3, - "overlap_comm": True, - "contiguous_gradients": True, - "reduce_bucket_size": model_hidden_size * model_hidden_size, - "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size, - "stage3_param_persistence_threshold": 0, - }, - "steps_per_print": 2000, - "train_batch_size": train_batch_size, - "train_micro_batch_size_per_gpu": 1, - "wall_clock_breakdown": False, -} - -if args.cpu_offload and args.nvme_offload_path: - raise ValueError("Use one of --cpu_offload or --nvme_offload_path and not both") - -if args.cpu_offload: - ds_config["zero_optimization"]["offload_param"] = dict(device="cpu", pin_memory=True) - -if args.nvme_offload_path: - ds_config["zero_optimization"]["offload_param"] = dict( - device="nvme", - pin_memory=True, - nvme_path=args.nvme_offload_path, - buffer_size=4e9, - ) - -dschf = HfDeepSpeedConfig(ds_config) # this tells from_pretrained to instantiate directly on gpus - -if args.benchmark: - torch.cuda.empty_cache() - gc.collect() - deepspeed.runtime.utils.see_memory_usage("pre-from-pretrained", force=True) - -model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16) - -if args.benchmark: - deepspeed.runtime.utils.see_memory_usage("post-from-pretrained", force=True) - -model = model.eval() - -print_rank0(ds_config) - -ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0] -ds_engine.module.eval() -model = ds_engine.module - -if args.benchmark: - t_ready = time.time() - deepspeed.runtime.utils.see_memory_usage("start-of-generate", force=True) - - -### Generate - -print_rank0(f"*** Starting to generate {num_tokens} tokens 
with bs={args.batch_size}") - -input_sentences = [ - "DeepSpeed is a machine learning framework", - "He is working on", - "He has a", - "He got all", - "Everyone is happy and I can", - "The new movie that got Oscar this year", - "In the far far distance from our galaxy,", - "Peace is the only way", -] - -if args.batch_size > len(input_sentences): - # dynamically extend to support larger bs by repetition - input_sentences *= math.ceil(args.batch_size / len(input_sentences)) - -generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False) - -print_rank0(f"Generate args {generate_kwargs}") -inputs = input_sentences[: args.batch_size] - - -def generate(): - """returns a list of zipped inputs, outputs and number of new tokens""" - - input_tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True) - for t in input_tokens: - if torch.is_tensor(input_tokens[t]): - input_tokens[t] = input_tokens[t].to(torch.cuda.current_device()) - - outputs = model.generate(**input_tokens, **generate_kwargs) - - input_tokens_lengths = [x.shape[0] for x in input_tokens.input_ids] - output_tokens_lengths = [x.shape[0] for x in outputs] - - total_new_tokens = [o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)] - outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True) - - return zip(inputs, outputs, total_new_tokens) - - -# XXX: this is currently doing world_size streams on world_size gpus, so we can feed it different inputs on each! and hence the time can be divided by world_size - -print_rank0(f"*** Running generate") -t_generate_start = time.time() -pairs = generate() -t_generate_span = time.time() - t_generate_start -for i, o, _ in pairs: - print_rank0(f"{'-'*60}\nin={i}\nout={o}\n") - - -### Benchmark - -if args.benchmark: - # clear cache / free memory - torch.cuda.empty_cache() - gc.collect() - deepspeed.runtime.utils.see_memory_usage("end-of-generate", force=True) - - print_rank0(f"*** Running benchmark") - - # warm up - for i in range(1): - _ = generate() - torch.cuda.synchronize() - - # benchmark - t0 = time.time() - cycles = 5 - total_new_tokens_generated = 0 - for i in range(cycles): - generated = generate() - total_new_tokens_generated += sum(new_tokens for _, _, new_tokens in generated) - - torch.cuda.synchronize() - # note that we actually generate world_size unique streams (though the benchmark feeds the same inputs) - total_new_tokens_generated *= world_size - througput = (time.time() - t0) / (total_new_tokens_generated) - print_rank0( - f""" -*** Performance stats: -Throughput per token including tokenize: {througput*1000:.2f} msecs -Start to ready to generate: {t_ready - t_start:.3f} secs -Tokenize and generate {total_new_tokens_generated} (bs={args.batch_size}) tokens: {t_generate_span:.3f} secs -Start to finish: {t_ready - t_start + t_generate_span:.3f} secs -""" - ) diff --git a/bloom-inference-server/benchmark.py b/bloom-inference-server/benchmark.py deleted file mode 100644 index 9c05470..0000000 --- a/bloom-inference-server/benchmark.py +++ /dev/null @@ -1,127 +0,0 @@ -import argparse -import gc -import os -from functools import partial - -import torch - -import deepspeed -import utils -from models import Model, get_model_class -from utils import ( - BENCHMARK, - DS_INFERENCE, - DS_ZERO, - GenerateRequest, - get_argument_parser, - get_dummy_batch, - parse_generate_kwargs, - print_rank_n, - run_and_log_time, -) - - -def benchmark_generation(model: Model, request: GenerateRequest, cycles: int = 5): - # run benchmarks for number of cycles - 
total_new_tokens_generated = 0 - for _ in range(cycles): - response = model.generate(request) - total_new_tokens_generated += sum(new_tokens for new_tokens in response.num_generated_tokens) - return total_new_tokens_generated - - -def get_benchmark_results( - benchmark_time: float, initialization_time: float, total_new_tokens_generated: int, batch_size: int, cycles: int -) -> str: - throughput = total_new_tokens_generated / benchmark_time - latency = benchmark_time / cycles - return f""" -*** Performance stats: -Throughput (including tokenization) = {throughput:.2f} tokens/sec -Throughput (including tokenization) = {1000 / throughput:.2f} msecs/token -Model loading time = {initialization_time:.2f} secs -Total tokens generated = {total_new_tokens_generated} with batch size = {batch_size} -Latency = {latency:.2f} secs -Model loading time + generation time per batch = {initialization_time + latency:.2f} secs -""" - - -def benchmark_end_to_end(args: argparse.Namespace, model_class: Model, zero_activated: bool = False) -> None: - model, initialization_time = run_and_log_time(partial(model_class, args=args)) - - request = parse_generate_kwargs(get_dummy_batch(args.batch_size), args.generate_kwargs) - - request.preprocess() - - print_rank_n(f"generate_kwargs = {args.generate_kwargs}") - print_rank_n(f"batch_size = {args.batch_size}") - - # warmup is a must if measuring speed as it's when all the optimizations are performed - # e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs - response = model.generate(request) - - for i, (o, _) in zip(request.text, zip(response.text, response.num_generated_tokens)): - print_rank_n(f"{'-' * 60}\nin = {i}\nout = {o}\n") - - if args.benchmark_cycles > 0: - print_rank_n(f"*** Running benchmark") - - torch.cuda.empty_cache() - gc.collect() - - # warm up - model.generate(request) - torch.cuda.synchronize() - - # benchmark - total_new_tokens_generated, benchmark_time = run_and_log_time( - partial(benchmark_generation, model=model, request=request, cycles=args.benchmark_cycles) - ) - - # with ZeRO every GPU is generating batch_size * sequence_length tokens - if zero_activated: - world_size = int(os.getenv("WORLD_SIZE", "1")) - total_new_tokens_generated *= world_size - - print_rank_n( - get_benchmark_results( - benchmark_time, initialization_time, total_new_tokens_generated, args.batch_size, args.benchmark_cycles - ) - ) - - -def get_args() -> argparse.Namespace: - parser = get_argument_parser() - - group = parser.add_argument_group(title="launch config") - group.add_argument("--benchmark_cycles", type=int, default=0, help="additionally run benchmark") - group.add_argument("--local_rank", required=False, type=int, help="used by dist launchers") - group.add_argument("--batch_size", default=1, type=int, help="batch size") - group.add_argument("--cpu_offload", action="store_true", help="whether to activate CPU offload for DS ZeRO") - - args = utils.get_args(parser, BENCHMARK) - - launched_with_deepspeed = args.deployment_framework in [DS_INFERENCE, DS_ZERO] - - if not launched_with_deepspeed: - assert args.local_rank == None, "local_rank must be None if not launched with DeepSpeed" - - if args.cpu_offload: - assert args.deployment_framework == DS_ZERO, "cpu_offload only works with DS_ZeRO" - - return args - - -def main() -> None: - args = get_args() - - model_class = get_model_class(args.deployment_framework, True) - - if args.deployment_framework in [DS_INFERENCE, DS_ZERO]: - deepspeed.init_distributed("nccl") - - 
benchmark_end_to_end(args, model_class, args.deployment_framework == DS_ZERO) - - -if __name__ == "__main__": - main() diff --git a/bloom-inference-server/cli.py b/bloom-inference-server/cli.py deleted file mode 100644 index 14e214c..0000000 --- a/bloom-inference-server/cli.py +++ /dev/null @@ -1,63 +0,0 @@ -import argparse -import json -import sys - -import utils -from models import get_model_class -from utils import CLI, get_argument_parser, parse_generate_kwargs, print_rank_n - - -def get_args() -> argparse.Namespace: - parser = get_argument_parser() - - group = parser.add_argument_group(title="launch config") - group.add_argument( - "--shutdown_command", required=False, type=str, default="__shutdown__", help="This string will exit the script" - ) - - args = utils.get_args(parser, CLI) - - return args - - -def main() -> None: - args = get_args() - - model = get_model_class(args.deployment_framework)(args) - - generate_kwargs = args.generate_kwargs - - while True: - try: - input_text = input("Input text: ") - - if input_text == args.shutdown_command: - model.shutdown() - - if input("change generate_kwargs? [y/n] ") == "y": - while True: - try: - generate_kwargs = json.loads(input("Generate kwargs: ")) - break - except KeyboardInterrupt: - model.shutdown() - except Exception as e: - e_type, e_message, _ = sys.exc_info() - print("error =", e_type.__name__) - print("message =", e_message) - continue - - request = parse_generate_kwargs([input_text], generate_kwargs) - - request.preprocess() - - response = model.generate(request) - - print_rank_n("Output text:", response.text[0]) - print_rank_n("Generated tokens:", response.num_generated_tokens[0]) - except KeyboardInterrupt: - model.shutdown() - - -if __name__ == "__main__": - main() diff --git a/bloom-inference-server/examples/server_request.py b/bloom-inference-server/examples/server_request.py deleted file mode 100644 index fbee7ec..0000000 --- a/bloom-inference-server/examples/server_request.py +++ /dev/null @@ -1,57 +0,0 @@ -import argparse - -import requests - - -def get_args() -> argparse.Namespace: - parser = argparse.ArgumentParser() - - group = parser.add_argument_group(title="launch config") - group.add_argument("--host", type=str, required=True, help="host address") - group.add_argument("--port", type=int, required=True, help="port number") - - return parser.parse_args() - - -def generate(url: str) -> None: - url = url + "/generate/" - - request_body = { - "text": [ - "DeepSpeed", - "DeepSpeed is a", - "DeepSpeed is a machine", - "DeepSpeed is a machine learning framework", - ], - "max_new_tokens": 40, - } - response = requests.post(url=url, json=request_body, verify=False) - print(response.json(), "\n") - - -def tokenize(url: str) -> None: - url = url + "/tokenize/" - - request_body = {"text": ["DeepSpeed is a", "DeepSpeed is a machine learning framework"]} - response = requests.post(url=url, json=request_body, verify=False) - print(response.json(), "\n") - - -def query_id(url: str) -> None: - url = url + "/query_id/" - - response = requests.get(url=url, verify=False) - print(response.json(), "\n") - - -def main(): - args = get_args() - url = "http://{}:{}".format(args.host, args.port) - - generate(url) - tokenize(url) - query_id(url) - - -if __name__ == "__main__": - main() diff --git a/bloom-inference-server/server.sh b/bloom-inference-server/server.sh deleted file mode 100644 index be49ced..0000000 --- a/bloom-inference-server/server.sh +++ /dev/null @@ -1,7 +0,0 @@ -export MODEL_NAME=bigscience/bloom -export 
DEPLOYMENT_FRAMEWORK=hf_accelerate -export DTYPE=fp16 -export MAX_INPUT_LENGTH=2048 - -# for more information on gunicorn see https://docs.gunicorn.org/en/stable/settings.html -gunicorn -t 0 -w 1 -b 127.0.0.1:5000 server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s' diff --git a/lti_llm_client/__init__.py b/lti_llm_client/__init__.py new file mode 100644 index 0000000..9c9e5ba --- /dev/null +++ b/lti_llm_client/__init__.py @@ -0,0 +1,43 @@ +from typing import Any +import requests + + +class Client: + """A client for the LTI's LLM API.""" + + def __init__(self, address: str = "tir-1-7", port: int = 5000) -> None: + """Initialize the client. + + Args: + address: The address of the server. Defaults to "tir-1-7", a node on the LTI's TIR cluster. + port: The port of the server. Defaults to 5000. + """ + self.address = address + self.port = port + self.url = f"http://{self.address}:{self.port}" + + def prompt( + self, + text: str, + max_tokens: int = 64, + **kwargs: Any, + ) -> str: + """Send a text prompt to the LLM currently being served and return the response. + Args: + text: The text to prompt the LLM with. + max_tokens: The maximum number of tokens to generate. Note + that this is *excluding* the prompt tokens. Defaults to 64. + **kwargs: Additional keyword arguments to pass to the model. + These follow HF's `generate` API. + Returns: The text generated by the model in response to the prompt. + """ + request_body = { + "text": [text], + "max_new_tokens": max_tokens, + "do_sample": True, + **kwargs, + } + response = requests.post( + url=f"{self.url}/generate/", json=request_body, verify=False + ) + return str(response.json()["text"][0]) diff --git a/bloom-inference-server/README.md b/lti_llm_server/README.md similarity index 100% rename from bloom-inference-server/README.md rename to lti_llm_server/README.md diff --git a/bloom-inference-server/models/__init__.py b/lti_llm_server/models/__init__.py similarity index 100% rename from bloom-inference-server/models/__init__.py rename to lti_llm_server/models/__init__.py diff --git a/bloom-inference-server/models/ds_inference.py b/lti_llm_server/models/ds_inference.py similarity index 100% rename from bloom-inference-server/models/ds_inference.py rename to lti_llm_server/models/ds_inference.py diff --git a/bloom-inference-server/models/ds_zero.py b/lti_llm_server/models/ds_zero.py similarity index 100% rename from bloom-inference-server/models/ds_zero.py rename to lti_llm_server/models/ds_zero.py diff --git a/bloom-inference-server/models/hf_accelerate.py b/lti_llm_server/models/hf_accelerate.py similarity index 100% rename from bloom-inference-server/models/hf_accelerate.py rename to lti_llm_server/models/hf_accelerate.py diff --git a/bloom-inference-server/models/model.py b/lti_llm_server/models/model.py similarity index 100% rename from bloom-inference-server/models/model.py rename to lti_llm_server/models/model.py diff --git a/bloom-inference-server/server.py b/lti_llm_server/server.py similarity index 100% rename from bloom-inference-server/server.py rename to lti_llm_server/server.py diff --git a/lti_llm_server/server.sh b/lti_llm_server/server.sh new file mode 100644 index 0000000..85e0179 --- /dev/null +++ b/lti_llm_server/server.sh @@ -0,0 +1,8 @@ +export MODEL_NAME=microsoft/bloom-deepspeed-inference-int8 +export DEPLOYMENT_FRAMEWORK=ds_inference +export DTYPE=int8 +export MAX_INPUT_LENGTH=2048 +export MII_CACHE_PATH=/tmp/pfernand/mii_cache + +# for more information on gunicorn see https://docs.gunicorn.org/en/stable/settings.html +gunicorn -t 0 -w 1 -b 0.0.0.0:5000 
server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s' diff --git a/bloom-inference-server/utils/__init__.py b/lti_llm_server/utils/__init__.py similarity index 100% rename from bloom-inference-server/utils/__init__.py rename to lti_llm_server/utils/__init__.py diff --git a/bloom-inference-server/utils/constants.py b/lti_llm_server/utils/constants.py similarity index 100% rename from bloom-inference-server/utils/constants.py rename to lti_llm_server/utils/constants.py diff --git a/bloom-inference-server/utils/requests.py b/lti_llm_server/utils/requests.py similarity index 100% rename from bloom-inference-server/utils/requests.py rename to lti_llm_server/utils/requests.py diff --git a/bloom-inference-server/utils/utils.py b/lti_llm_server/utils/utils.py similarity index 100% rename from bloom-inference-server/utils/utils.py rename to lti_llm_server/utils/utils.py diff --git a/setup.cfg b/setup.cfg deleted file mode 100644 index 5b684cb..0000000 --- a/setup.cfg +++ /dev/null @@ -1,47 +0,0 @@ -[isort] -default_section = FIRSTPARTY -ensure_newline_before_comments = True -force_grid_wrap = 0 -include_trailing_comma = True -known_first_party = transformers -known_third_party = - absl - conllu - datasets - elasticsearch - fairseq - faiss-cpu - fastprogress - fire - fugashi - git - h5py - matplotlib - nltk - numpy - packaging - pandas - PIL - psutil - pytest - pytorch_lightning - rouge_score - sacrebleu - seqeval - sklearn - streamlit - tensorboardX - tensorflow - tensorflow_datasets - timeout_decorator - torch - torchaudio - torchtext - torchvision - torch_xla - tqdm - -line_length = 119 -lines_after_imports = 2 -multi_line_output = 3 -use_parentheses = True diff --git a/setup.py b/setup.py new file mode 100644 index 0000000..a3fe7b5 --- /dev/null +++ b/setup.py @@ -0,0 +1,16 @@ +# -*- coding: utf-8 -*- +from setuptools import find_packages, setup + +setup( + name="lti_llm_client", + version="0.0.1", + author="Patrick Fernandes", + author_email="pfernand@cs.cmu.edu", + url="", + packages=find_packages(exclude=["tests"]), + python_requires=">=3.7", + setup_requires=[], + install_requires=[ + 'requests' + ], +) \ No newline at end of file
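The new `lti_llm_client.Client.prompt` method forwards any extra keyword arguments to the served model following HF's `generate` API, so slightly richer calls than the README example are possible. The sketch below is illustrative only: it assumes a server is already running and reachable at the client's default address (`tir-1-7`, port 5000), and that the server accepts the usual HF sampling parameters such as `temperature` and `top_p`.

```python
import lti_llm_client

# Point the client at the serving node; the defaults target the LTI TIR cluster.
client = lti_llm_client.Client(address="tir-1-7", port=5000)

# `max_tokens` bounds only the newly generated tokens (the prompt is excluded).
# Extra keyword arguments are passed through to the server and are assumed to
# follow HF's `generate` API (sampling is enabled by default by the client).
completion = client.prompt(
    "CMU's PhD students are",
    max_tokens=32,
    temperature=0.7,
    top_p=0.9,
)
print(completion)
```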
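Because the server code is moved from `bloom-inference-server` to `lti_llm_server` with 100% similarity, its HTTP endpoints (`/generate/`, `/tokenize/`, `/query_id/`) should still match the example client that this PR deletes. A minimal sketch of calling `/tokenize/` directly with `requests`, assuming the server is reachable on `localhost:5000` and the endpoint contract is unchanged after the rename:

```python
import requests

# Tokenize a batch of prompts by POSTing to the server's /tokenize/ endpoint;
# the request body mirrors the deleted bloom-inference-server example script.
url = "http://localhost:5000/tokenize/"
request_body = {"text": ["DeepSpeed is a", "DeepSpeed is a machine learning framework"]}

response = requests.post(url=url, json=request_body, verify=False)
print(response.json())
```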