Implement Efficient Serve-Time Model Loading and Inference in FastAPI #46
Abellegese commented Dec 24, 2024

Description:

In the current implementation of the ersilia-pack API, models are loaded into memory on every incoming request. This approach leads to significant performance bottlenecks, especially when handling multiple requests, as the repetitive model loading increases latency and resource usage.

To optimize this, I propose leveraging Inter-Process Communication (IPC) via Python's multiprocessing.shared_memory module. Shared memory allows large objects (like machine learning models) to be shared efficiently between processes without serializing and deserializing them on every request. The plan is to integrate this when the model API is initialized, i.e. at serve time, using FastAPI's application startup event as in the example below.

@app.on_event("startup")
async def load_model():
    # Event to load a model into a shared memory so that it can be accessed by the run session
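Expanding the stub, here is a minimal sketch of how the startup hook could work, assuming the model is pickled into a named multiprocessing.shared_memory block. MODEL_SHM_NAME, the model path, and the app.state handle are illustrative placeholders rather than existing ersilia-pack code:

import pickle
from multiprocessing import shared_memory

import torch
from fastapi import FastAPI

app = FastAPI()
MODEL_SHM_NAME = "ersilia_model"  # hypothetical segment name, agreed on by load.py and run.py

@app.on_event("startup")
async def load_model():
    # Serialize the model once and copy the bytes into a named shared memory block
    model = torch.load("model.pt")  # placeholder path; assumes a fully pickled model
    payload = pickle.dumps(model)
    shm = shared_memory.SharedMemory(name=MODEL_SHM_NAME, create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    app.state.model_shm = shm  # keep a handle so the shutdown event can release it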

When the process or the API shuts down, shared memory will leak if we don't clean up resources, so the application shutdown event will be used:


@app.on_event("shutdown")
async def cleanup():
    logger.info("Cleaning up shared memory.")

For this task, two files will be created: a load.py that separates out loading the model into shared memory (executed in the app startup event), and a normal run.py or run.sh that leverages this buffered model to make inference. A sketch of the run.py side is shown below.
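On the inference side, a hedged sketch of what run.py could look like; the segment name and the run() signature are assumptions that mirror the startup sketch above:

# run.py -- attaches to the block created by load.py and reuses the in-memory model
import pickle
from multiprocessing import shared_memory

MODEL_SHM_NAME = "ersilia_model"  # must match the name used at load time

def run(inputs):
    shm = shared_memory.SharedMemory(name=MODEL_SHM_NAME)  # attach to an existing block, do not create
    try:
        # The OS may round the block up to a page size; pickle stops at its own end
        # marker, so any trailing zero bytes are ignored.
        model = pickle.loads(bytes(shm.buf))
        return model(inputs)
    finally:
        shm.close()  # detach only; the FastAPI shutdown event owns unlink()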

The need for IPC

Ideally, the simplest way to keep the model in memory would be to define a global variable in the app startup event and use it in the inference endpoint. The problem is that eos-template is designed to load and run the model as a subprocess, which complicates that approach. IPC comes in handy when we need to communicate with a subprocess (assuming the loading logic is decoupled into its own process). There are several IPC methods, e.g. sockets, but shared memory performs better when low latency is the priority.

Goals

  • Use shared memory to store and share preloaded Ersilia models between the model loader (load.py) and inference runner (run.py) subprocesses.
  • Demonstrate the concept with a simple pretrained PyTorch model.
  • Integrate the subprocesses into the FastAPI lifecycle (startup and shutdown events).

Key Concepts

  1. Shared Memory:
    Shared memory is a mechanism that allows multiple processes to access the same block of memory. In Python, the multiprocessing.shared_memory module enables creating and managing shared memory segments.

    • Benefits:

      • Faster communication since data is accessed directly in memory.
      • No serialization/deserialization or message-passing overhead, unless complex data types are involved.
    • Use Case in Our Project:
      The model loader will store the PyTorch (or any other Ersilia) model in shared memory during startup, and the inference process will retrieve it directly from shared memory for prediction tasks.

  2. Buffering:
    Shared memory works with buffers, which are byte-level representations of data. For PyTorch models, we will serialize the model into a byte format (using pickle) and store it in the shared memory buffer; a standalone demo of this flow follows the concept map below.

[Figure: concept_map_IPC — concept map of the shared-memory IPC workflow]
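To illustrate the concept map, here is a small standalone two-process demo (no FastAPI, with a plain dict standing in for a pickled model) showing that a segment created in one process can be attached to by name in another:

import pickle
from multiprocessing import Process, shared_memory

def worker(name):
    # Child process: attach to the existing segment by name and rebuild the object
    shm = shared_memory.SharedMemory(name=name)
    obj = pickle.loads(bytes(shm.buf))
    print("child sees:", obj)
    shm.close()

if __name__ == "__main__":
    payload = pickle.dumps({"weights": [0.1, 0.2]})  # stand-in for a pickled model
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    p = Process(target=worker, args=(shm.name,))
    p.start()
    p.join()
    shm.close()
    shm.unlink()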

A Few Overheads/Cons

  1. Complexity in Synchronization:
    Shared memory requires careful synchronization to avoid race conditions. This often necessitates using locks, semaphores, or other mechanisms, which can complicate the code. However, I don't think this is a problem we need to worry about here, because ML models usually operate on i.i.d. data points that don't require synchronizing information across processes.

  2. Limited Data Types:
    Shared memory typically supports only basic data types; complex data types require serialization/deserialization, which adds temporary memory overhead. Specifically, we use pickle to convert the model to binary, and at access time the conversion from binary back to the original data type requires some additional, but temporary, memory.

  3. Scalability Constraints:
    The amount of shared memory is often limited by system configuration, which can restrict scalability when dealing with large models. Linux systems are usually constrained to around 2-4 GB (sometimes less), but this can be reconfigured; a quick way to check the limit is shown below.
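As a rough check, the following assumes the segments are backed by /dev/shm (the usual tmpfs mount on Linux):

import shutil

# Report how much space tmpfs currently allows for shared memory segments
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")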

Integration Idea

  1. Feature needs in eos-template

Tasks

  • Create a small-scale FastAPI app for testing the concepts
  • Implement SharedMemory-based IPC
  • Use a small PyTorch model to test the IPC
  • Integrate IPC in ersilia-pack and eos-template
  • Create a unit test for testing the IPC on several models
  • Update the API documentation