Inference

Inference is the process of using a trained language model to generate predictions or responses. Running a single prompt is straightforward, but deploying models efficiently at scale means balancing performance, cost, and reliability. Large Language Models (LLMs) make this harder because of their size and computational requirements.

We'll explore both simple and production-ready approaches using two popular tools for LLM inference: the transformers library and text-generation-inference, also known as Text Generation Inference (TGI). For production deployments, we'll focus on TGI, which provides optimized serving capabilities.

Module Overview

LLM inference can be categorized into two main approaches: simple pipeline-based inference for development and testing, and optimized serving solutions for production deployments. We'll cover both approaches, starting with the simpler pipeline approach and moving to production-ready solutions.

Contents

Learn how to use the Hugging Face Transformers pipeline for basic inference. We'll cover setting up pipelines, configuring generation parameters, and best practices for local development. The pipeline approach is perfect for prototyping and small-scale applications. Start learning.
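As a rough illustration of what "setting up a pipeline and configuring generation parameters" can look like, here is a minimal sketch. The helper names (`build_generation_kwargs`, `generate`), the default values, and the `gpt2` model id are assumptions chosen for illustration and are not taken from the course material; swap in whichever model you are working with.

```python
from typing import Any, Dict


def build_generation_kwargs(
    max_new_tokens: int = 64,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> Dict[str, Any]:
    """Collect generation settings to pass to a text-generation pipeline call."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    # A non-positive temperature is treated here as a request for greedy
    # decoding, so sampling is switched off and the sampling knobs are omitted.
    if temperature <= 0:
        return {"max_new_tokens": max_new_tokens, "do_sample": False}
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
    }


def generate(prompt: str, model_id: str = "gpt2", **overrides: Any) -> str:
    """Run one prompt through a transformers text-generation pipeline.

    `model_id` is a placeholder (assumption); any causal LM on the Hub works.
    """
    # Imported lazily so the helper above stays usable without transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(prompt, **build_generation_kwargs(**overrides))
    return outputs[0]["generated_text"]


# Example call (downloads the model on first use):
# print(generate("Inference is", max_new_tokens=20, temperature=0))
```

Keeping the parameter handling in a separate function makes it easy to check the settings without loading a model, which can be handy while prototyping.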

Learn how to deploy models for production using Text Generation Inference. We'll explore optimized serving techniques, batching strategies, and monitoring solutions. TGI provides production-ready features like health checks, metrics, and Docker deployment options. Start learning.
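To make the serving side more concrete, the sketch below talks to a TGI server over HTTP using only the Python standard library. It assumes a TGI instance is already running and reachable at `http://localhost:8080`; that address, the helper names, and the timeouts are illustrative assumptions. The `/generate` and `/health` routes are the ones TGI exposes; Prometheus metrics are served separately at `/metrics`.

```python
import json
import urllib.request

# Assumption: where your TGI container listens. Adjust to your deployment.
TGI_URL = "http://localhost:8080"


def build_generate_request(
    prompt: str, max_new_tokens: int = 64, base_url: str = TGI_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request for TGI's /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_healthy(base_url: str = TGI_URL, timeout: float = 5.0) -> bool:
    """Return True if the server answers its health check with HTTP 200."""
    try:
        url = f"{base_url.rstrip('/')}/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def generate(prompt: str, **kwargs) -> str:
    """Send the prompt to the server and return the generated text."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["generated_text"]


# Example call (requires a running server):
# if is_healthy():
#     print(generate("What is inference?", max_new_tokens=40))
```

Checking the health route before sending traffic mirrors how an orchestrator or load balancer would typically probe the service.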

Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Pipeline Inference | Basic inference with transformers pipeline | 🐢 Set up a basic pipeline <br> 🐕 Configure generation parameters <br> 🦁 Create a simple web server | Link | Colab |
| TGI Deployment | Production deployment with TGI | 🐢 Deploy a model with TGI <br> 🐕 Configure performance optimizations <br> 🦁 Set up monitoring and scaling | Link | Colab |

Resources