[Feature] Support dynamic loading and unloading of Lora adapters #2891
Motivation
This PR aims to implement the dynamic LoRA feature mentioned in #2686.
This PR is still under development; please comment if the code can be improved.
Modifications
Current implementation of LoRA modules
Current LoRA features are implemented under folder
python/sglang/srt/lora
, where three fileslora.py
,lora_manager.py
,lora_config.py
are included. Initial support can be referred to #1307.In the
__init__
function ofModelRunner
, aLoraManager
will be created if a validlora_path
is passed inserver_args
. The initialization ofLoraManager
contains two parts: first callinginit_loras
to load huggingface LoRA weights to CPU and replace the targeted layers withBaseLayerWithLoRA
instances, then callinginit_lora_memory_pool
to preallocate the memory pool for S-Lora. The definition of lora modules inlora.py
are implemented on the basis of vllm implementation.Before forwarding the batch,
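A simplified sketch of this two-step initialization is given below. Only `init_loras`, `init_lora_memory_pool`, and `BaseLayerWithLoRA` are names from the actual code; the class name, signatures, and helper logic here are illustrative assumptions, not the real implementation.

```python
# Simplified, partly hypothetical sketch of the LoraManager initialization flow.
import torch

class LoraManagerSketch:
    def __init__(self, base_model, lora_paths, max_loras_per_batch, device="cuda"):
        self.base_model = base_model
        self.lora_paths = lora_paths            # {adapter_name: local_adapter_dir}
        self.max_loras_per_batch = max_loras_per_batch
        self.device = device
        self.init_loras()                       # step 1: CPU weights + layer wrapping
        self.init_lora_memory_pool()            # step 2: preallocate GPU buffers

    def init_loras(self):
        # Load each adapter's weights onto CPU; in the real code this reads the
        # Hugging Face checkpoint referenced by lora_path.
        self.loras = {
            name: torch.load(f"{path}/adapter_model.bin", map_location="cpu")
            for name, path in self.lora_paths.items()
        }
        # The real code also replaces targeted layers (e.g. attention projections)
        # with BaseLayerWithLoRA wrappers so LoRA deltas can be applied per request.

    def init_lora_memory_pool(self):
        # Preallocate a fixed number of GPU slots for adapters that can be active
        # in the same batch (S-LoRA style). Tensor shapes here are placeholders.
        self.memory_pool = [
            torch.empty(0, device=self.device) for _ in range(self.max_loras_per_batch)
        ]
```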
Before forwarding a batch, `LoraManager` calls its `prepare_lora_batch` method to load the active LoRA adapters from the memory pool. During loading, LoRA weights that are not used in the current batch can be evicted from the buffer if necessary.
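For illustration only, a toy, self-contained version of this prepare-batch/eviction step might look as follows; the pool class and its fields are stand-ins, not the actual S-LoRA memory pool:

```python
# Toy sketch: ensure every adapter needed by the incoming batch is resident in a
# fixed-size buffer, evicting adapters the batch does not use when the pool is full.

class ToyLoraPool:
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.resident = {}            # adapter name -> weights "on GPU"

    def prepare_lora_batch(self, batch_lora_names, cpu_loras):
        active = set(batch_lora_names)
        for name in active:
            if name in self.resident:                  # already loaded
                continue
            if len(self.resident) >= self.num_slots:   # pool full: evict an unused adapter
                victim = next(n for n in self.resident if n not in active)
                del self.resident[victim]
            # In the real implementation this would be a CPU -> GPU copy.
            self.resident[name] = cpu_loras[name]

# Example: a pool with 2 slots serving a batch that needs adapters "a" and "c".
pool = ToyLoraPool(num_slots=2)
cpu_loras = {"a": "weights_a", "b": "weights_b", "c": "weights_c"}
pool.prepare_lora_batch(["a", "b"], cpu_loras)
pool.prepare_lora_batch(["a", "c"], cpu_loras)   # "b" gets evicted to make room
print(pool.resident.keys())                      # dict_keys(['a', 'c'])
```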
Unit tests are placed under `test/srt/models/test_lora.py`. The inference test passes, but the serving test is skipped, so the LoRA serving feature may need further checking. The benchmark code can be found in `benchmark/lora/lora_bench.py`.
Implementation of dynamic serving LoRA
Dynamic LoRA serving means that LoRA adapters can be loaded and unloaded on users' commands during server runtime. This feature is already supported in vLLM (see the vLLM LoRA doc). As mentioned in #1433, the current implementation supports multi-LoRA serving, but LoRA modules can only be loaded and unloaded when the server is initialized.
The API-side design of loading and unloading LoRA can be similar to the `update_weights_from_disk` API, since both change the weights that the running server operates on. In this design, the two APIs are named `load_lora_adapter` and `unload_lora_adapter`, as in vLLM. After the user sends a `LoadLoraAdapterReq` / `UnloadLoraAdapterReq` request to the server, the server grabs a write lock and waits for the in-progress requests to finish. Then, the request is transmitted to `ModelRunner` through several passes and handled by the `LoraManager` owned by `ModelRunner`.
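A hedged sketch of how the two endpoints could be wired on the HTTP side is shown below. The request names `LoadLoraAdapterReq` / `UnloadLoraAdapterReq` come from this description; the request fields, the lock, and the forwarding stub are assumptions for illustration, not the actual sglang server code.

```python
# Illustrative HTTP-side flow: take a write lock, then forward the request
# toward ModelRunner / LoraManager, mirroring update_weights_from_disk.
import asyncio
from dataclasses import dataclass

@dataclass
class LoadLoraAdapterReq:
    lora_name: str      # assumed field
    lora_path: str      # assumed field

@dataclass
class UnloadLoraAdapterReq:
    lora_name: str      # assumed field

# Write lock so adapter updates do not overlap with other weight updates.
_update_lock = asyncio.Lock()

async def forward_to_model_runner(method: str, req) -> dict:
    # Stand-in for the real IPC path (HTTP server -> scheduler -> ModelRunner).
    return {"success": True, "method": method, "request": req}

async def load_lora_adapter(req: LoadLoraAdapterReq) -> dict:
    async with _update_lock:
        # The real server would also wait for in-progress requests to finish here.
        return await forward_to_model_runner("load_lora_adapter", req)

async def unload_lora_adapter(req: UnloadLoraAdapterReq) -> dict:
    async with _update_lock:
        return await forward_to_model_runner("unload_lora_adapter", req)

# Example usage:
# asyncio.run(load_lora_adapter(LoadLoraAdapterReq("sql-lora", "/path/to/adapter")))
```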
At the `LoraManager` side, loading a new LoRA adapter follows the same process as initialization: collect the new target modules, initialize the new LoRA weights on CPU, and open new space in the memory buffer if needed (a sketch is given below). The implementation of unloading and the testing scripts are still to be done...
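Assuming manager attributes and helpers that may not match the real code, the `LoraManager`-side loading handler could roughly follow these three steps:

```python
# Hedged sketch of the LoraManager-side loading path. All helpers and attributes
# on `manager` are hypothetical; only the overall flow (collect target modules,
# init CPU weights, grow the buffer) comes from the description above.

def load_lora_adapter_sketch(manager, lora_name: str, lora_path: str) -> str:
    if lora_name in manager.loras:
        return f"LoRA adapter '{lora_name}' is already loaded."

    # 1) Load the new adapter's weights to CPU and collect its target modules.
    new_lora = manager.load_weights_to_cpu(lora_path)          # hypothetical helper
    new_targets = set(new_lora.target_modules) - manager.target_modules

    # 2) Wrap any base-model layers the adapter touches that are not wrapped yet.
    if new_targets:
        manager.wrap_target_modules(new_targets)               # hypothetical helper
        manager.target_modules |= new_targets

    # 3) Grow the GPU memory pool if the new adapter does not fit.
    if not manager.memory_pool.can_hold(new_lora):             # hypothetical check
        manager.memory_pool.grow_for(new_lora)

    manager.loras[lora_name] = new_lora
    return f"LoRA adapter '{lora_name}' loaded from {lora_path}."
```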
Checklist