[Feature] Support dynamic loading and unloading of Lora adapters #2891
Motivation
This PR aims to implement the dynamic LoRA feature mentioned in #2686.
This PR is still under development; please comment if the code can be improved.
Modifications
Current implementation of LoRA modules
Current LoRA features are implemented under folder
python/sglang/srt/lora
, where three fileslora.py
,lora_manager.py
,lora_config.py
are included. Initial support can be referred to #1307.In the
__init__
function ofModelRunner
, aLoraManager
will be created if a validlora_path
is passed inserver_args
. The initialization ofLoraManager
contains two parts: first callinginit_loras
to load huggingface LoRA weights to CPU and replace the targeted layers withBaseLayerWithLoRA
instances, then callinginit_lora_memory_pool
to preallocate the memory pool for S-Lora. The definition of lora modules inlora.py
are implemented on the basis of vllm implementation.Before forwarding the batch,
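A simplified sketch of this two-step initialization is given below. Only `init_loras`, `init_lora_memory_pool`, and `BaseLayerWithLoRA` are names from the actual code; the class name, signatures, and helper logic here are illustrative assumptions, not the real implementation.

```python
# Simplified, partly hypothetical sketch of the LoraManager initialization flow.
import torch

class LoraManagerSketch:
    def __init__(self, base_model, lora_paths, max_loras_per_batch, device="cuda"):
        self.base_model = base_model
        self.lora_paths = lora_paths            # {adapter_name: local_adapter_dir}
        self.max_loras_per_batch = max_loras_per_batch
        self.device = device
        self.init_loras()                       # step 1: CPU weights + layer wrapping
        self.init_lora_memory_pool()            # step 2: preallocate GPU buffers

    def init_loras(self):
        # Load each adapter's weights onto CPU; in the real code this reads the
        # Hugging Face checkpoint referenced by lora_path.
        self.loras = {
            name: torch.load(f"{path}/adapter_model.bin", map_location="cpu")
            for name, path in self.lora_paths.items()
        }
        # The real code also replaces targeted layers (e.g. attention projections)
        # with BaseLayerWithLoRA wrappers so LoRA deltas can be applied per request.

    def init_lora_memory_pool(self):
        # Preallocate a fixed number of GPU slots for adapters that can be active
        # in the same batch (S-LoRA style). Tensor shapes here are placeholders.
        self.memory_pool = [
            torch.empty(0, device=self.device) for _ in range(self.max_loras_per_batch)
        ]
```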
Before forwarding a batch, `LoraManager` calls its `prepare_lora_batch` method to load the active LoRA adapters from the memory pool. During loading, LoRA weights that are not used in the current batch can be evicted from the buffer if necessary.
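For illustration only, a toy, self-contained version of this prepare-batch/eviction step might look as follows; the pool class and its fields are stand-ins, not the actual S-LoRA memory pool:

```python
# Toy sketch: ensure every adapter needed by the incoming batch is resident in a
# fixed-size buffer, evicting adapters the batch does not use when the pool is full.

class ToyLoraPool:
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.resident = {}            # adapter name -> weights "on GPU"

    def prepare_lora_batch(self, batch_lora_names, cpu_loras):
        active = set(batch_lora_names)
        for name in active:
            if name in self.resident:                  # already loaded
                continue
            if len(self.resident) >= self.num_slots:   # pool full: evict an unused adapter
                victim = next(n for n in self.resident if n not in active)
                del self.resident[victim]
            # In the real implementation this would be a CPU -> GPU copy.
            self.resident[name] = cpu_loras[name]

# Example: a pool with 2 slots serving a batch that needs adapters "a" and "c".
pool = ToyLoraPool(num_slots=2)
cpu_loras = {"a": "weights_a", "b": "weights_b", "c": "weights_c"}
pool.prepare_lora_batch(["a", "b"], cpu_loras)
pool.prepare_lora_batch(["a", "c"], cpu_loras)   # "b" gets evicted to make room
print(pool.resident.keys())                      # dict_keys(['a', 'c'])
```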
Unit tests are placed under `test/srt/models/test_lora.py`. The inference test passes, but the serving test is skipped, so the LoRA serving feature may need further checking. The benchmark code can be found in `benchmark/lora/lora_bench.py`.
Implementation of dynamic serving LoRA
Dynamic LoRA serving means that LoRA adapters can be loaded and unloaded on users' commands during server runtime. This feature is already supported in vLLM (see the vLLM LoRA doc). As mentioned in #1433, the current implementation supports multi-LoRA serving, but LoRA modules can only be loaded and unloaded when the server is initialized.
The API-side design of loading and unloading LoRA can be similar to the `update_weights_from_disk` API, since both change the weights that the running server operates on. In this design, the two APIs are named `load_lora_adapter` and `unload_lora_adapter`, as in vLLM. After the user sends a `LoadLoraAdapterReq` / `UnloadLoraAdapterReq` request to the server, the server grabs a write lock and waits for the in-progress requests to finish. Then, the request is transmitted to `ModelRunner` through several passes and handled by the `LoraManager` owned by `ModelRunner`.
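A hedged sketch of how the two endpoints could be wired on the HTTP side is shown below. The request names `LoadLoraAdapterReq` / `UnloadLoraAdapterReq` come from this description; the request fields, the lock, and the forwarding stub are assumptions for illustration, not the actual sglang server code.

```python
# Illustrative HTTP-side flow: take a write lock, then forward the request
# toward ModelRunner / LoraManager, mirroring update_weights_from_disk.
import asyncio
from dataclasses import dataclass

@dataclass
class LoadLoraAdapterReq:
    lora_name: str      # assumed field
    lora_path: str      # assumed field

@dataclass
class UnloadLoraAdapterReq:
    lora_name: str      # assumed field

# Write lock so adapter updates do not overlap with other weight updates.
_update_lock = asyncio.Lock()

async def forward_to_model_runner(method: str, req) -> dict:
    # Stand-in for the real IPC path (HTTP server -> scheduler -> ModelRunner).
    return {"success": True, "method": method, "request": req}

async def load_lora_adapter(req: LoadLoraAdapterReq) -> dict:
    async with _update_lock:
        # The real server would also wait for in-progress requests to finish here.
        return await forward_to_model_runner("load_lora_adapter", req)

async def unload_lora_adapter(req: UnloadLoraAdapterReq) -> dict:
    async with _update_lock:
        return await forward_to_model_runner("unload_lora_adapter", req)

# Example usage:
# asyncio.run(load_lora_adapter(LoadLoraAdapterReq("sql-lora", "/path/to/adapter")))
```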
At the `LoraManager` side, loading a new LoRA adapter follows the same process as initialization: collect the new target modules, initialize the new LoRA weights on CPU, and open new space in the memory buffer if needed (a sketch is given below). The implementation of unloading and the testing scripts are still to be done...
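Assuming manager attributes and helpers that may not match the real code, the `LoraManager`-side loading handler could roughly follow these three steps:

```python
# Hedged sketch of the LoraManager-side loading path. All helpers and attributes
# on `manager` are hypothetical; only the overall flow (collect target modules,
# init CPU weights, grow the buffer) comes from the description above.

def load_lora_adapter_sketch(manager, lora_name: str, lora_path: str) -> str:
    if lora_name in manager.loras:
        return f"LoRA adapter '{lora_name}' is already loaded."

    # 1) Load the new adapter's weights to CPU and collect its target modules.
    new_lora = manager.load_weights_to_cpu(lora_path)          # hypothetical helper
    new_targets = set(new_lora.target_modules) - manager.target_modules

    # 2) Wrap any base-model layers the adapter touches that are not wrapped yet.
    if new_targets:
        manager.wrap_target_modules(new_targets)               # hypothetical helper
        manager.target_modules |= new_targets

    # 3) Grow the GPU memory pool if the new adapter does not fit.
    if not manager.memory_pool.can_hold(new_lora):             # hypothetical check
        manager.memory_pool.grow_for(new_lora)

    manager.loras[lora_name] = new_lora
    return f"LoRA adapter '{lora_name}' loaded from {lora_path}."
```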
Checklist