LoRA responses are inconsistent with peft inference #700

Closed

RonanKMcGovern opened this issue Nov 30, 2024 · 1 comment
System Info

2024-11-30T18:08:10.672576Z INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Nov 30 18:08:10 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08 Driver Version: 550.127.08 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 On | 00000000:D2:00.0 Off | 0 |
| 0% 35C P0 74W / 300W | 37723MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
2024-11-30T18:08:10.672699Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", default_adapter_source: None, adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, compile_max_batch_size: 128, compile_max_rank: 64, speculative_tokens: None, speculation_max_batch_size: 32, preloaded_adapter_ids: [], preloaded_adapter_source: None, predibase_api_token: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, eager_prefill: None, chunked_prefill: None, prefix_caching: None, merge_adapter_weights: false, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "eab42661659c", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false, tokenizer_config_path: None, backend: FA2, embedding_dim: None, disable_sgmv: false }

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

The reproduction compares inference via a) LoRAX with the setup above against b) inference with transformers and peft. I ran both on A40 machines on RunPod.

LoRAX approach

LoRAX docker start command arguments:

--model-id Qwen/Qwen2.5-7B-Instruct --port 8000
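
For anyone reproducing outside of RunPod, a roughly equivalent local launch would be the following (a sketch only: the image tag, shared-memory size, volume mount, and port mapping are assumptions based on the LoRAX README rather than my exact RunPod template):

docker run --gpus all --shm-size 1g -p 8000:8000 -v $PWD/data:/data ghcr.io/predibase/lorax:main --model-id Qwen/Qwen2.5-7B-Instruct --port 8000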

LoRAX script to call the RunPod endpoint (lorax-replication.py):

import argparse
import os

from openai import OpenAI

def generate_response(client, model_id, prompt):
    messages = [{"role": "user", "content": prompt}]
    
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        max_tokens=100,
        temperature=0.01,
        stream=True
    )

    collected_message = []
    for chunk in response:
        if chunk.choices[0].delta.content:
            collected_message.append(chunk.choices[0].delta.content)
    
    return "".join(collected_message)

def main(adapter_id=None):
    RUNPOD_ENDPOINT = os.getenv("RUNPOD_ENDPOINT")
    BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-7B-Instruct")
    
    if not RUNPOD_ENDPOINT:
        raise ValueError("Please set RUNPOD_ENDPOINT environment variable")

    client = OpenAI(
        api_key="EMPTY",
        base_url=RUNPOD_ENDPOINT + "/v1",
    )

    prompt = "How many players are on the field on each team at the start of a drop-off?"

    # Generate with base model
    print("Base Model Response:")
    base_response = generate_response(client, BASE_MODEL, prompt)
    print(base_response)

    # Generate with adapter if specified
    if adapter_id:
        print("\nAdapter Model Response:")
        adapter_response = generate_response(client, adapter_id, prompt)
        print(adapter_response)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run LoRAX inference")
    parser.add_argument("--adapter", type=str, help="HuggingFace adapter slug")
    args = parser.parse_args()
    main(args.adapter)
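
To point the script at the deployed endpoint (the URL below is a placeholder for the actual RunPod proxy URL):

export RUNPOD_ENDPOINT=https://<pod-id>-8000.proxy.runpod.net
uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1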

Transformers / PEFT script

# Install required packages if needed:
#   pip install torch transformers peft accelerate


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "Trelis/Qwen2.5-7B-Instruct-touch-rugby-1"

tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Test prompt
prompt = "How many players are on the field on each team at the start of a drop-off?"

# Function to generate response
def generate_response(model, prompt):
    messages = [
        {"role": "user", "content": prompt}
    ]

    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.01,
        do_sample=True,
    )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Generate with base model
print("Base Model Response:")
base_response = generate_response(base_model, prompt)
print(base_response)

# Load adapter and generate
print("\nLoading adapter...")
adapter_model = PeftModel.from_pretrained(base_model, adapter_id)

print("\nAdapter Model Response:")
adapter_response = generate_response(adapter_model, prompt)
print(adapter_response)

# Free up memory
del base_model
del adapter_model
torch.cuda.empty_cache()

Results / Output

LoRAX

The adapter response does not take on the adapter/fine-tune attributes: it does not know what sport the question is about or what a drop-off is.

uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, and it's not clear which sport you are referring to. However, if you are asking about a specific sport, such as football (American or Canadian), soccer, or rugby, I can provide the correct information.

For example:
- In American football, at the start of a play, there are usually 11 players from each team on the field.
- In
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, especially not in team sports like football, basketball, or soccer. Could you please clarify which sport you are referring to? 

For example:
- In soccer (football), there are 11 players on the field for each team at the start of a match.
- In American football, there are 11 players on the field for each team at the start of

Adapter Model Response:
In the context of Australian rules football, which is the sport referred to as "drop-off," the game starts with 18 players on the field for each team. This number includes the eight players in the back half (four in the full-forward line and four in the half-forward flank), the four players in the midfield, and the six players in the forward half (three in the forward line and three in the half-back flank).
(lorax) ronanmcgovern@Ronans-MacBook-Pro lorax % uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, and it's not clear which sport you are referring to. However, if you are asking about a specific sport, such as football (American or Canadian), soccer, or rugby, I can provide the correct information.

For example:
- In American football, at the start of a play, there are typically 11 players from each team on the field.
- In

Adapter Model Response:
In the context of Australian rules football, which is the sport referred to as "drop-off," each team starts with 18 players on the field at the beginning of the game. This number of players is standard for a match and includes both forwards, midfielders, and defenders.

Transformers / PEFT response

The model knows what a "drop-off" is and that this question is about touch rugby.

Base Model Response:
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many players are on the field on each team at the start of a drop-off?
assistant
It seems there might be some confusion in your question. The term "drop-off" is not commonly used in sports to describe the number of players on a field or court. Could you please clarify which sport you are referring to? 

For example:
- In soccer (football), there are 11 players on the field for each team.
- In American football, there are typically 11 players on the field for each team at the start of a play.
- In basketball, there are

Loading adapter...

Adapter Model Response:
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many players are on the field on each team at the start of a drop-off?
assistant
To determine how many players are on the field at the start of a drop-off, we can follow these steps:

1. **Understand the Composition of a Team**: Each team consists of 14 players, including the Interchange.

2. **Interchange Rules**: The Interchange is allowed to enter and leave the field during normal play without a Change of Possession (COP).

3. **Drop-Off Procedure**: A Drop-Off occurs when one team has fewer than six (

Additional Notes

  • Base models are loaded in bfloat16 in both cases
  • The very same adapter and base model are used from Hugging Face
  • The adapter is public: Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
  • I intentionally used a low rank of 8 in case a higher rank was somehow causing issues

Questions

  • Are there any known issues with loading safetensors? I don't see anything in the existing issues.
  • Are there limits on which modules can be trained? I only trained linear layers, which should be fine (see the config-check sketch below):
"target_modules": [
--
  | "gate_proj",
  | "k_proj",
  | "v_proj",
  | "down_proj",
  | "q_proj",
  | "o_proj",
  | "up_proj"
  | ],
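
For reference, a quick way to double-check what the adapter config actually contains (rank, alpha, target modules, and whether rsLoRA was used) is to load it with peft; a minimal sketch, assuming a peft version recent enough to expose use_rslora:

from peft import PeftConfig

# Downloads adapter_config.json from the Hub and returns a LoraConfig.
config = PeftConfig.from_pretrained("Trelis/Qwen2.5-7B-Instruct-touch-rugby-1")
print(config.r, config.lora_alpha, config.target_modules, config.use_rslora)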

Expected behavior

When running inference on the very same model and adapter with the same generation settings, the responses should match (or at least show clear evidence of the fine-tune). They do not.

@RonanKMcGovern (Author)

This issue is caused by rsLoRA being used when the adapter was trained. The hacky fix is to multiply alpha in the LoRA config by the rank.

See vllm-project/vllm#6909
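
For anyone hitting the same thing, here is a minimal sketch of that workaround, assuming the adapter was trained with use_rslora=True in peft. peft's rsLoRA scaling is lora_alpha/√r rather than the standard lora_alpha/r, so multiplying alpha by √r makes a server that only applies the standard scaling match rsLoRA exactly (multiplying by the full rank over-scales by a further factor of √r, which can still be close enough to surface the fine-tune). The local directory name below is arbitrary:

import json
import math
from pathlib import Path

from huggingface_hub import snapshot_download

# Download an editable local copy of the adapter (repo id taken from this issue).
adapter_dir = Path(snapshot_download(
    "Trelis/Qwen2.5-7B-Instruct-touch-rugby-1",
    local_dir="touch-rugby-adapter-patched",
))

config_path = adapter_dir / "adapter_config.json"
cfg = json.loads(config_path.read_text())

if cfg.get("use_rslora"):
    r = cfg["r"]
    # Fold the rsLoRA factor into alpha so that alpha_new / r == alpha / sqrt(r).
    cfg["lora_alpha"] = cfg["lora_alpha"] * math.sqrt(r)
    cfg["use_rslora"] = False
    config_path.write_text(json.dumps(cfg, indent=2))

# The patched copy can then be pushed to a new Hub repo (or served from a local
# path) and used as the adapter id when calling LoRAX.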
