
[Feature] Support dynamic loading and unloading of Lora adapters #2891

Draft · wants to merge 3 commits into main

Fridge003

Motivation

This PR aims to implement the dynamic LoRA feature mentioned in #2686.
This PR is still under development; comments on how the code can be improved are welcome.

Modifications

Current implementation of LoRA modules

The current LoRA features are implemented under the folder python/sglang/srt/lora, which contains three files: lora.py, lora_manager.py, and lora_config.py. The initial support was introduced in #1307.

In the __init__ function of ModelRunner, a LoraManager is created if a valid lora_path is passed in server_args. The initialization of LoraManager has two parts: first, init_loras is called to load the Hugging Face LoRA weights to CPU and replace the targeted layers with BaseLayerWithLoRA instances; then, init_lora_memory_pool is called to preallocate the memory pool for S-LoRA. The LoRA modules in lora.py are implemented on the basis of the vLLM implementation.
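
The two-phase initialization described above can be sketched as follows. This is a simplified, hypothetical stand-in rather than SGLang's actual code: the stubbed weight loading, the target-module list, and the slot-based pool are all assumptions for illustration.

```python
# Minimal sketch of the two-phase LoraManager initialization: phase 1 loads
# adapter weights to CPU, phase 2 preallocates a fixed pool of buffer slots
# (S-LoRA style). Real code loads Hugging Face weight tensors and wraps
# target layers with BaseLayerWithLoRA; everything here is a stand-in.

class LoraManager:
    def __init__(self, lora_paths, pool_slots):
        self.init_loras(lora_paths)              # phase 1: weights to CPU
        self.init_lora_memory_pool(pool_slots)   # phase 2: preallocate pool

    def init_loras(self, lora_paths):
        # Phase 1: load each adapter's weights to CPU (stubbed as metadata
        # dicts) and record which target modules each adapter patches.
        self.cpu_adapters = {
            name: {"path": path, "target_modules": ["q_proj", "v_proj"]}
            for name, path in lora_paths.items()
        }

    def init_lora_memory_pool(self, pool_slots):
        # Phase 2: preallocate buffer slots; the real pool holds GPU tensors
        # sized for the maximum LoRA rank and hidden dimension.
        self.pool = [None] * pool_slots
```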

Before forwarding a batch, LoraManager calls the prepare_lora_batch method to load the active LoRA adapters into the memory pool. During loading, LoRA weights not used by the current batch can be evicted from the buffer if necessary.
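
The eviction behavior can be illustrated with a toy slot pool. The class name, slot layout, and evict-first policy below are assumptions for illustration; the real pool manages GPU weight buffers, not name strings.

```python
# Toy sketch of prepare_lora_batch eviction: every adapter needed by the
# batch gets a pool slot, and resident adapters not used by the current
# batch are evicted when the pool is full.

class LoraPool:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # slot -> adapter name (or None)

    def prepare_lora_batch(self, active_names):
        if len(active_names) > len(self.slots):
            raise RuntimeError("batch needs more adapters than pool slots")
        for name in active_names:
            if name in self.slots:
                continue  # already resident, nothing to load
            if None in self.slots:
                slot = self.slots.index(None)  # use a free slot
            else:
                # Evict the first resident adapter not needed by this batch.
                slot = next(i for i, n in enumerate(self.slots)
                            if n not in active_names)
            self.slots[slot] = name
        # Map each active adapter to the slot it now occupies.
        return {name: self.slots.index(name) for name in active_names}
```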

Unit tests are located at test/srt/models/test_lora.py. The inference test passes, but the serving test is skipped, so the LoRA serving feature may require further checking. The benchmark code can be found in benchmark/lora/lora_bench.py.

Implementation of dynamic serving LoRA

Dynamic LoRA serving means LoRA adapters can be loaded and unloaded on user command during server runtime. This feature is already supported in vLLM (see the vLLM LoRA docs). As mentioned in #1433, the current implementation supports multi-LoRA serving, but loading and unloading LoRA modules can only be done at server initialization.

The design of the LoRA loading/unloading APIs can be similar to the update_weights_from_disk API, since both change the weights that the running server uses. In this design, the two APIs are named load_lora_adapter and unload_lora_adapter, as in vLLM.
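
For illustration, the request payloads might look like the following, loosely modeled on vLLM's endpoints of the same names. The field names are assumptions, not a final schema for this PR.

```python
# Hypothetical request payloads for the two adapter-management APIs;
# field names are illustrative, not SGLang's final schema.

from dataclasses import dataclass

@dataclass
class LoadLoraAdapterReq:
    lora_name: str  # name used to select the adapter at inference time
    lora_path: str  # local path or model-hub ID of the adapter weights

@dataclass
class UnloadLoraAdapterReq:
    lora_name: str  # name of the adapter to remove from the server
```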

After the user sends a LoadLoraAdapterReq/UnloadLoraAdapterReq request to the server, the server grabs a write lock and waits for the requests in progress to finish. The request is then transmitted to ModelRunner through several passes and handled by the LoraManager owned by ModelRunner.
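
The write-lock behavior can be sketched with a counter-based reader-writer lock, since Python's standard library has no built-in one. This is illustrative only, not the server's actual synchronization code: generation requests would hold read locks, and an adapter load/unload handler would take the write lock, waiting for in-flight requests to drain.

```python
# Toy reader-writer lock: many readers (in-flight generation requests) may
# hold it concurrently; a writer (an adapter load/unload request) waits
# until all readers have released, then holds it exclusively.

import threading

class RWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer

    def acquire_write(self):
        # Hold the condition's lock for the whole write section, so new
        # readers block; wait until existing readers are done.
        self._cond.acquire()
        while self._readers > 0:
            self._cond.wait()

    def release_write(self):
        self._cond.release()
```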

On the LoraManager side, loading a new LoRA adapter follows the same process as initialization: collecting the new target modules, initializing the new LoRA weights on CPU, and opening new space in the memory buffer if needed.
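
Those steps could be sketched as follows. A minimal manager is restated here so the example is self-contained, and the unload path shown is only one possible design, since the PR leaves unloading unimplemented.

```python
# Hypothetical sketch of runtime adapter loading, mirroring initialization:
# collect target modules, load weights to CPU, and grow the pool if needed.
# All names and the unload behavior are illustrative stand-ins.

class LoraManager:
    def __init__(self, pool_slots):
        self.cpu_adapters = {}
        self.pool = [None] * pool_slots  # slot -> resident adapter name

    def load_lora_adapter(self, name, path):
        if name in self.cpu_adapters:
            raise ValueError(f"adapter {name!r} is already loaded")
        # Collect target modules and load the new weights on CPU (stubbed).
        self.cpu_adapters[name] = {
            "path": path, "target_modules": ["q_proj", "v_proj"]
        }
        # Open new space in the memory buffer if no free slot remains.
        if None not in self.pool:
            self.pool.append(None)

    def unload_lora_adapter(self, name):
        # One possible design: drop the CPU weights and free any pool slot
        # the adapter occupies.
        self.cpu_adapters.pop(name, None)
        self.pool = [None if n == name else n for n in self.pool]
```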

The implementation of unloading and the testing scripts are still to be done.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@Fridge003 Fridge003 mentioned this pull request Jan 16, 2025