Implement Efficient Serve-Time Model Loading and Inference in FastAPI #46
Abellegese commented Dec 24, 2024

Description:

In the current implementation of the ersilia-pack API, models are loaded into memory on every incoming request. This approach leads to significant performance bottlenecks, especially when handling multiple requests, as the repetitive model loading increases latency and resource usage.

To optimize this, I propose leveraging Inter-Process Communication (IPC) via Python's multiprocessing.shared_memory module. Shared memory allows large objects (like machine learning models) to be shared efficiently between processes without serializing and deserializing them on every request. The plan is to integrate this when the model API is initialized, i.e. at serve time, using FastAPI's application startup event as in the example below.

@app.on_event("startup")
async def load_model():
    # Event to load a model into a shared memory so that it can be accessed by the run session
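Expanding the stub, here is a minimal sketch of how the startup hook could work, assuming the model is pickled into a named multiprocessing.shared_memory block. MODEL_SHM_NAME, the model path, and the app.state handle are illustrative placeholders rather than existing ersilia-pack code:

import pickle
from multiprocessing import shared_memory

import torch
from fastapi import FastAPI

app = FastAPI()
MODEL_SHM_NAME = "ersilia_model"  # hypothetical segment name, agreed on by load.py and run.py

@app.on_event("startup")
async def load_model():
    # Serialize the model once and copy the bytes into a named shared memory block
    model = torch.load("model.pt")  # placeholder path; assumes a fully pickled model
    payload = pickle.dumps(model)
    shm = shared_memory.SharedMemory(name=MODEL_SHM_NAME, create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    app.state.model_shm = shm  # keep a handle so the shutdown event can release it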

When the process or the API shuts down, shared memory will leak if we don't clean up resources, so the application shutdown event will be used:


@app.on_event("shutdown")
async def cleanup():
    logger.info("Cleaning up shared memory.")

For this task, two files will be created: a load.py that separates out loading the model into shared memory (executed in the app startup event), and a normal run.py or run.sh that leverages this buffered model to make inference. A sketch of the run.py side is shown below.
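On the inference side, a hedged sketch of what run.py could look like; the segment name and the run() signature are assumptions that mirror the startup sketch above:

# run.py -- attaches to the block created by load.py and reuses the in-memory model
import pickle
from multiprocessing import shared_memory

MODEL_SHM_NAME = "ersilia_model"  # must match the name used at load time

def run(inputs):
    shm = shared_memory.SharedMemory(name=MODEL_SHM_NAME)  # attach to an existing block, do not create
    try:
        # The OS may round the block up to a page size; pickle stops at its own end
        # marker, so any trailing zero bytes are ignored.
        model = pickle.loads(bytes(shm.buf))
        return model(inputs)
    finally:
        shm.close()  # detach only; the FastAPI shutdown event owns unlink()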

The need for IPC

Ideally, the simplest way to keep the model in memory would be to define a global variable in the app startup event and use it in the inference endpoint. The problem is that eos-template is designed to load and run the model as a subprocess, which complicates that approach. IPC comes in handy when we need to communicate with a subprocess (assuming the loading logic is decoupled into its own process). There are several IPC methods, e.g. sockets, but shared memory performs better when low latency is the priority.

Goals

  • Use shared memory to store and share preloaded Ersilia models between the model loader (load.py) and inference runner (run.py) subprocesses.
  • Demonstrate the concept with a simple pretrained PyTorch model.
  • Integrate the subprocesses into the FastAPI lifecycle (startup and shutdown events).

Key Concepts

  1. Shared Memory:
    Shared memory is a mechanism that allows multiple processes to access the same block of memory. In Python, the multiprocessing.shared_memory module enables creating and managing shared memory segments.

    • Benefits:

      • Faster communication since data is accessed directly in memory.
      • No serialization/deserialization or message-passing overhead, unless complex data types are involved.
    • Use Case in Our Project:
      The model loader will store the PyTorch (or any other Ersilia) model in shared memory during startup, and the inference process will retrieve it directly from shared memory for prediction tasks.

  2. Buffering:
    Shared memory works with buffers, which are byte-level representations of data. For PyTorch models, we will serialize the model into a byte format (using pickle) and store it in the shared memory buffer; a standalone demo of this flow follows the concept map below.

[Figure: concept_map_IPC — concept map of the shared-memory IPC workflow]
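To illustrate the concept map, here is a small standalone two-process demo (no FastAPI, with a plain dict standing in for a pickled model) showing that a segment created in one process can be attached to by name in another:

import pickle
from multiprocessing import Process, shared_memory

def worker(name):
    # Child process: attach to the existing segment by name and rebuild the object
    shm = shared_memory.SharedMemory(name=name)
    obj = pickle.loads(bytes(shm.buf))
    print("child sees:", obj)
    shm.close()

if __name__ == "__main__":
    payload = pickle.dumps({"weights": [0.1, 0.2]})  # stand-in for a pickled model
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    p = Process(target=worker, args=(shm.name,))
    p.start()
    p.join()
    shm.close()
    shm.unlink()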

A Few Overheads/Cons

  1. Complexity in Synchronization:
    Shared memory requires careful synchronization to avoid race conditions. This often necessitates using locks, semaphores, or other mechanisms, which can complicate the code. However, I don't think this is a problem we need to worry about here, because ML models usually operate on i.i.d. data points that don't require synchronizing information across processes.

  2. Limited Data Types:
    Shared memory typically supports only basic data types; complex data types require serialization/deserialization, which adds temporary memory overhead. Specifically, we use pickle to convert the model to binary, and at access time the conversion from binary back to the original data type requires some additional, but temporary, memory.

  3. Scalability Constraints:
    The amount of shared memory is often limited by system configuration, which can restrict scalability when dealing with large models. Linux systems are usually constrained to around 2-4 GB (sometimes less), but this can be reconfigured; a quick way to check the limit is shown below.
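As a rough check, the following assumes the segments are backed by /dev/shm (the usual tmpfs mount on Linux):

import shutil

# Report how much space tmpfs currently allows for shared memory segments
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")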

Integration Idea

  1. Feature needs in eos-template

Tasks

  • Create a small-scale FastAPI app for testing the concepts
  • Implement SharedMemory-based IPC
  • Use a small PyTorch model to test the IPC
  • Integrate IPC in ersilia-pack and eos-template
  • Create a unit test for testing the IPC on several models
  • Update the API documentation