An AWS implementation for hosting a Machine Learning (ML) Large Language Model (LLM), compatible with AWS GovCloud (US).
- An AWS account with:
  - An increased quota to deploy one `ml.g4dn.12xlarge` instance for endpoint usage
  - An IAM User or Role with the AdministratorAccess policy granted (we recommend restricting access as needed)
- The AWS CLI installed
- The `AWS_REGION` environment variable exported
- TypeScript, npm, and the AWS CDK CLI installed (required versions are found in the package.json file)
Note: The Amazon SageMaker endpoint can incur significant cost if left running. Be sure to monitor your billing, and destroy the stack via the Clean Up section when you are done experimenting.
Configure your AWS credentials, either by exporting your role credentials to the current terminal or by configuring an AWS CLI profile.
Next, you can run the following commands:

- `npm install`
- `npm run build`
- `cdk bootstrap` (only required the first time you run the CDK in the account)
- `cdk deploy`
Once the deployment is complete, you can navigate to SageMaker Notebook Instances and open the notebook Falcon40BNotebook-XXXXXXXXXX, where the X's are randomly generated. From there you can run the notebook cells.
If you have created additional Jupyter notebooks in SageMaker, you can download them from the SageMaker notebook instance's IDE before destroying the stack.
When you are finished, run `cdk destroy` to delete all of the resources you created.
To deploy this application, we leverage Hugging Face's prebuilt Text Generation Inference (TGI) Falcon-40B Docker image with the HuggingFaceSageMakerEndpoint construct (which deploys a Hugging Face model to Amazon SageMaker) from @cdklabs/generative-ai-cdk-constructs, sketched below.
AWS Deep Learning Containers (DLCs) provide the set of Docker images that can be deployed on Amazon SageMaker. This creates a scalable, secure, hosted endpoint for real-time inference.
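As an illustration, here is a minimal sketch of how the construct might be wired up in a CDK stack. The construct ID, model ID, instance-type and DLC image enum members, and TGI environment values shown are assumptions (enum member names vary by library version); the exact values this project uses live in its source:

```typescript
import {
  HuggingFaceSageMakerEndpoint,
  DeepLearningContainerImage,
  SageMakerInstanceType,
} from '@cdklabs/generative-ai-cdk-constructs';
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class FalconEndpointStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Deploy a Falcon-40B TGI image to a SageMaker real-time endpoint.
    // IDs, image tag, and environment values are illustrative.
    new HuggingFaceSageMakerEndpoint(this, 'FalconEndpoint', {
      modelId: 'tiiuae/falcon-40b-instruct',
      instanceType: SageMakerInstanceType.ML_G4DN_12XLARGE,
      container:
        DeepLearningContainerImage.HUGGINGFACE_PYTORCH_TGI_INFERENCE_2_0_1_TGI1_0_3_GPU_PY39_CU118_UBUNTU20_04,
      environment: {
        SM_NUM_GPUS: '4',         // shard across the 4 GPUs on ml.g4dn.12xlarge
        MAX_INPUT_LENGTH: '1024', // TGI request limits; tune for your use case
        MAX_TOTAL_TOKENS: '2048',
      },
    });
  }
}
```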
We deploy a SageMaker notebook instance in a private subnet, allowing outbound internet connectivity while controlling inbound connectivity. To enable communication between the notebook and AWS service endpoints, we then use VPC Endpoints powered by AWS PrivateLink. The benefit of AWS PrivateLink is that it allows the SageMaker notebook instance to access the SageMaker real-time inference endpoint over private network IP space.
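A sketch of that network layout in CDK terms follows; the construct IDs and subnet configuration here are assumptions, and the actual stack may name and size these differently:

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

export class NotebookNetworkStack extends Stack {
  public readonly vpc: ec2.Vpc;

  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Private subnets with egress (via NAT) give the notebook outbound
    // internet access while blocking unsolicited inbound connections.
    this.vpc = new ec2.Vpc(this, 'NotebookVpc', {
      maxAzs: 2,
      subnetConfiguration: [
        { name: 'public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
        { name: 'private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
      ],
    });

    // Interface VPC endpoints (AWS PrivateLink) keep notebook-to-SageMaker
    // traffic on private IP space instead of the public internet.
    this.vpc.addInterfaceEndpoint('SageMakerApi', {
      service: ec2.InterfaceVpcEndpointAwsService.SAGEMAKER_API,
    });
    this.vpc.addInterfaceEndpoint('SageMakerRuntime', {
      service: ec2.InterfaceVpcEndpointAwsService.SAGEMAKER_RUNTIME,
    });
  }
}
```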
Hugging Face's TGI provides a seamless way to deploy LLMs for real-time text generation. It bundles prebuilt Docker containers that handle the hosting infrastructure so users can focus on their applications and use cases.
Falcon-40B features advanced text generation and comprehension capabilities. With 40 billion parameters, Falcon-40B was, at release, one of the largest openly available models. Trained on roughly one trillion text tokens spanning English, German, Spanish, French, and other languages, Falcon-40B can fluently generate, summarize, and translate text.
SageMaker real-time inference endpoints enable low-latency, high-throughput hosting of machine learning models. By using Amazon SageMaker, we take advantage of the operational efficiencies of AWS infrastructure and eliminate undifferentiated heavy lifting: Amazon SageMaker handles provisioning servers, scaling, monitoring, and availability, freeing data scientists to work with LLMs.
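For instance, once the endpoint is up it can be invoked from any AWS SDK. Below is a minimal sketch using the JavaScript SDK v3, assuming a hypothetical endpoint name and typical TGI request parameters:

```typescript
import {
  SageMakerRuntimeClient,
  InvokeEndpointCommand,
} from '@aws-sdk/client-sagemaker-runtime';

const client = new SageMakerRuntimeClient({});

async function generate(prompt: string): Promise<string> {
  // 'Falcon40bEndpoint' is a hypothetical name; use the endpoint name
  // emitted by your deployment.
  const response = await client.send(
    new InvokeEndpointCommand({
      EndpointName: 'Falcon40bEndpoint',
      ContentType: 'application/json',
      // TGI expects an `inputs` string plus optional generation parameters.
      Body: JSON.stringify({
        inputs: prompt,
        parameters: { max_new_tokens: 128, temperature: 0.7 },
      }),
    }),
  );
  // TGI returns JSON along the lines of [{ "generated_text": "..." }].
  const payload = JSON.parse(new TextDecoder().decode(response.Body));
  return payload[0].generated_text;
}

generate('Briefly explain Amazon SageMaker.').then(console.log);
```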
Amazon SageMaker notebook instances provide a managed and familiar environment, purpose-built for developing and evaluating ML models. Amazon SageMaker provides a painless, cost-effective sandbox for prototyping capabilities.
Multiple instance types give data scientists flexibility to test small demos or fine-tune LLMs on significant datasets.
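As a sketch, the notebook instance can be pinned to the private subnet from the networking example above. The IAM role, security group, and instance type here are illustrative assumptions, not the project's actual configuration:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import { Construct } from 'constructs';

// `scope` and `vpc` come from the surrounding stack; names are illustrative.
function addNotebook(scope: Construct, vpc: ec2.IVpc): void {
  const role = new iam.Role(scope, 'NotebookRole', {
    assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'),
  });

  const sg = new ec2.SecurityGroup(scope, 'NotebookSg', { vpc });

  new sagemaker.CfnNotebookInstance(scope, 'FalconNotebook', {
    instanceType: 'ml.t3.medium', // illustrative; pick per workload
    roleArn: role.roleArn,
    subnetId: vpc.privateSubnets[0].subnetId,
    securityGroupIds: [sg.securityGroupId],
    // Disable direct internet access; egress flows through the VPC's NAT
    // gateway and the PrivateLink endpoints instead.
    directInternetAccess: 'Disabled',
  });
}
```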
AWS networking features such as AWS PrivateLink allow administrators to securely control private connectivity between VPCs and AWS services without traffic traversing the public internet. This helps enable secure LLM experimentation with datasets.
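For example, inbound reachability of an interface endpoint can be scoped so that only the notebook's security group may reach it. A pattern along these lines, assuming the `endpoint` and `notebookSg` objects from the earlier sketches:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// `endpoint` is an ec2.InterfaceVpcEndpoint and `notebookSg` an
// ec2.SecurityGroup from the earlier sketches (illustrative names).
declare const endpoint: ec2.InterfaceVpcEndpoint;
declare const notebookSg: ec2.SecurityGroup;

// Only HTTPS traffic originating from the notebook's security group may
// reach the PrivateLink endpoint; everything else is denied by default.
endpoint.connections.allowFrom(notebookSg, ec2.Port.tcp(443));
```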