In the current implementation of the ersilia-pack API, models are loaded into memory on every incoming request. This approach leads to significant performance bottlenecks, especially when handling multiple requests, as the repetitive model loading increases latency and resource usage.
To optimize this, I propose leveraging Inter-Process Communication (IPC) via Python's multiprocessing.shared_memory module. Shared memory allows efficient sharing of large objects (like machine learning models) between processes without the overhead of serializing and deserializing them on each request. The plan is to integrate this when the model API is initialized, i.e. at serve time, so the FastAPI app startup event will be used for this, as in the example below:
@app.on_event("startup")asyncdefload_model():
# Event to load a model into a shared memory so that it can be accessed by the run session
When the process or the API closes, we risk leaking resources if we don't clean them up, so the app shutdown event will be used:
@app.on_event("shutdown")asyncdefcleanup():
logger.info("Cleaning up shared memory.")
For this task, two files will be created: load.py, which separates out loading the model into shared memory and is executed in the app startup event, and a run.py (or run.sh) that attaches to this buffered model and performs inference.
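As a rough sketch of the run.py side (assuming the same segment name as above and a pickled model in the buffer):

```python
# run.py - attaches to the segment created at startup and runs inference
import pickle
import sys
from multiprocessing import shared_memory

SHM_NAME = "ersilia_model"  # assumed segment name; must match the loader


def main(inputs):
    shm = shared_memory.SharedMemory(name=SHM_NAME)  # attach, do not create
    try:
        # pickle stops at its own end marker, so any trailing padding in the buffer is ignored
        model = pickle.loads(bytes(shm.buf))
        # Hypothetical inference call; the real step depends on the model's API
        return [model(x) for x in inputs]
    finally:
        shm.close()  # detach only; the creating process calls unlink()


if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

Only close() is called on the run side; unlink() belongs to the process that created the segment, otherwise the buffered model would disappear after the first request.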
The need for IPC
Ideally, the simplest way to load the model into memory would be to define a global variable at the app startup event and use it at the inference endpoint. The problem is that eos-template is designed to load and run the model as a subprocess, which complicates that approach. IPC comes in handy when we need to communicate with a subprocess (assuming we decouple the loading logic into its own process). There are several IPC methods, e.g. sockets, but shared memory performs better when low latency is the priority.
Goals
Use shared memory to store and share preloaded Ersilia models between the model loader (load.py) and inference runner (run.py) subprocesses.
Demonstrate the concept with a simple pretrained PyTorch model.
Integrate the subprocesses into the FastAPI lifecycle (startup and shutdown events).
Key Concepts
Shared Memory:
Shared memory is a mechanism that allows multiple processes to access the same block of memory. In Python, the multiprocessing.shared_memory module enables creating and managing shared memory segments.
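For illustration, a minimal example of the module itself, independent of any model (segment name and sizes here are arbitrary):

```python
from multiprocessing import shared_memory

# Producer: create a named segment and write raw bytes into its buffer
shm = shared_memory.SharedMemory(name="demo_block", create=True, size=16)
shm.buf[:5] = b"hello"

# Consumer (possibly a different process): attach by name and read the bytes back
other = shared_memory.SharedMemory(name="demo_block")
print(bytes(other.buf[:5]))  # b'hello'

other.close()  # each process detaches when done
shm.close()
shm.unlink()   # only the creator removes the segment
```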
Benefits:
Faster communication since data is accessed directly in memory.
No need for serialization/deserialization or message-passing overhead unless we have complex data types.
Use Case in Our Project:
The model loader will store the PyTorch model (or any other Ersilia model) in shared memory during startup, and the inference process will retrieve it directly from shared memory for prediction tasks.
Buffering:
Shared memory works with buffers, which are byte-level representations of data. For PyTorch models, we will serialize the model into a byte format (using pickle) and store it in the shared memory buffer.
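To make this concrete, a small sketch with a toy PyTorch model standing in for a real pretrained Ersilia model (the segment name is an assumption):

```python
import pickle
from multiprocessing import shared_memory

import torch

# Toy model standing in for a real pretrained Ersilia model
model = torch.nn.Linear(4, 1)

# Serialize the model to bytes and size the shared-memory buffer to the payload
payload = pickle.dumps(model)
shm = shared_memory.SharedMemory(name="toy_model", create=True, size=len(payload))
shm.buf[:len(payload)] = payload
print(f"stored {len(payload)} bytes in shared memory")

# Later, from any process: attach by name, unpickle, and run a prediction
reader = shared_memory.SharedMemory(name="toy_model")
restored = pickle.loads(bytes(reader.buf))
print(restored(torch.zeros(1, 4)))

reader.close()
shm.close()
shm.unlink()
```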
A Few Overheads/Cons
Complexity in Synchronization:
Shared memory requires careful synchronization to avoid race conditions, which often means using locks, semaphores, or similar mechanisms that complicate the code. However, I don't think this will be a major concern here, because ML models typically operate on i.i.d. data points that don't require information to be synchronized across processes.
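If any coordination is needed at all, it is mostly about the run process not attaching before the loader has created the segment. A simple, hedged way to handle that (polling by name, with an assumed segment name) could look like:

```python
import time
from multiprocessing import shared_memory


def wait_for_segment(name: str, timeout: float = 30.0) -> shared_memory.SharedMemory:
    """Poll until the loader has created the named segment, then attach to it."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return shared_memory.SharedMemory(name=name)  # attach only, never create
        except FileNotFoundError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"shared memory segment {name!r} never appeared")
            time.sleep(0.1)
```

Note that this only guarantees the segment exists; if there were any chance of attaching while the loader is still writing, an explicit ready flag or lock would be needed.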
Limited Data Types:
Shared memory natively holds only raw bytes or basic data types, so complex data types require serialization/deserialization, which adds temporary memory overhead. Specifically, we use pickle to convert the model to binary, and at access time the conversion from binary back to the original object temporarily requires additional memory.
Scalability Constraints:
The amount of shared memory is often limited by system configuration, which can restrict scalability when dealing with large amounts of data. Linux systems are commonly configured with around 2-4 GB (sometimes less), but this limit can be adjusted.
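As a sanity check, the free space in /dev/shm (the tmpfs that backs POSIX shared memory on most Linux systems) can be inspected before loading, so the loader can fail early if the pickled model will not fit; the threshold below is only an example:

```python
import shutil

# /dev/shm backs POSIX shared memory on most Linux systems
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")

payload_size = 512 * 1024 * 1024  # e.g. the size of the pickled model
if payload_size > free:
    raise MemoryError("model does not fit into the available shared memory")
```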
Integration Idea
Feature needs in eos-template
Tasks
Create a small-scale FastAPI app for testing the concept
Implement SharedMemory-based IPC
Use a small PyTorch model to test the IPC
Integrate IPC in ersilia-pack and eos-template
Create unit tests for the IPC on several models (see the sketch after this list)
Update the API documentation
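A minimal sketch of what such a unit test could look like (the round trip below uses a stand-in object; the real tests would iterate over actual Ersilia models):

```python
import pickle
import unittest
from multiprocessing import shared_memory


class TestSharedMemoryRoundTrip(unittest.TestCase):
    def test_model_round_trip(self):
        # Stand-in for a real model object; real tests would load Ersilia models here
        fake_model = {"weights": [0.1, 0.2, 0.3]}
        payload = pickle.dumps(fake_model)

        shm = shared_memory.SharedMemory(create=True, size=len(payload))
        try:
            shm.buf[:len(payload)] = payload

            # Attach to the same segment by name, as run.py would
            reader = shared_memory.SharedMemory(name=shm.name)
            try:
                restored = pickle.loads(bytes(reader.buf))
            finally:
                reader.close()

            self.assertEqual(restored, fake_model)
        finally:
            shm.close()
            shm.unlink()


if __name__ == "__main__":
    unittest.main()
```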