LoRA responses are inconsistent with peft inference #700

Closed

RonanKMcGovern opened this issue Nov 30, 2024 · 1 comment
System Info

2024-11-30T18:08:10.672576Z INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Nov 30 18:08:10 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08 Driver Version: 550.127.08 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 On | 00000000:D2:00.0 Off | 0 |
| 0% 35C P0 74W / 300W | 37723MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
2024-11-30T18:08:10.672699Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", default_adapter_source: None, adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, compile_max_batch_size: 128, compile_max_rank: 64, speculative_tokens: None, speculation_max_batch_size: 32, preloaded_adapter_ids: [], preloaded_adapter_source: None, predibase_api_token: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, eager_prefill: None, chunked_prefill: None, prefix_caching: None, merge_adapter_weights: false, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "eab42661659c", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false, tokenizer_config_path: None, backend: FA2, embedding_dim: None, disable_sgmv: false }

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

The reproduction compares inference via a) LoRAX with the setup above against b) inference with transformers and peft. I ran both on A40 machines on RunPod.

LoRAX approach

LoRAX docker start command arguments:

--model-id Qwen/Qwen2.5-7B-Instruct --port 8000
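
For anyone reproducing outside of RunPod, a roughly equivalent local launch would be the following (a sketch only: the image tag, shared-memory size, volume mount, and port mapping are assumptions based on the LoRAX README rather than my exact RunPod template):

docker run --gpus all --shm-size 1g -p 8000:8000 -v $PWD/data:/data ghcr.io/predibase/lorax:main --model-id Qwen/Qwen2.5-7B-Instruct --port 8000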

LoRAX script to call the RunPod endpoint (lorax-replication.py):

import argparse
import os

from openai import OpenAI

def generate_response(client, model_id, prompt):
    messages = [{"role": "user", "content": prompt}]
    
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        max_tokens=100,
        temperature=0.01,
        stream=True
    )

    collected_message = []
    for chunk in response:
        if chunk.choices[0].delta.content:
            collected_message.append(chunk.choices[0].delta.content)
    
    return "".join(collected_message)

def main(adapter_id=None):
    RUNPOD_ENDPOINT = os.getenv("RUNPOD_ENDPOINT")
    BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-7B-Instruct")
    
    if not RUNPOD_ENDPOINT:
        raise ValueError("Please set RUNPOD_ENDPOINT environment variable")

    client = OpenAI(
        api_key="EMPTY",
        base_url=RUNPOD_ENDPOINT + "/v1",
    )

    prompt = "How many players are on the field on each team at the start of a drop-off?"

    # Generate with base model
    print("Base Model Response:")
    base_response = generate_response(client, BASE_MODEL, prompt)
    print(base_response)

    # Generate with adapter if specified
    if adapter_id:
        print("\nAdapter Model Response:")
        adapter_response = generate_response(client, adapter_id, prompt)
        print(adapter_response)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run LoRAX inference")
    parser.add_argument("--adapter", type=str, help="HuggingFace adapter slug")
    args = parser.parse_args()
    main(args.adapter)
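
To point the script at the deployed endpoint (the URL below is a placeholder for the actual RunPod proxy URL):

export RUNPOD_ENDPOINT=https://<pod-id>-8000.proxy.runpod.net
uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1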

Transformers / PEFT script

# Install required packages if needed:
#   pip install torch transformers peft accelerate


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "Trelis/Qwen2.5-7B-Instruct-touch-rugby-1"

tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Test prompt
prompt = "How many players are on the field on each team at the start of a drop-off?"

# Function to generate response
def generate_response(model, prompt):
    messages = [
        {"role": "user", "content": prompt}
    ]

    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.01,
        do_sample=True,
    )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Generate with base model
print("Base Model Response:")
base_response = generate_response(base_model, prompt)
print(base_response)

# Load adapter and generate
print("\nLoading adapter...")
adapter_model = PeftModel.from_pretrained(base_model, adapter_id)

print("\nAdapter Model Response:")
adapter_response = generate_response(adapter_model, prompt)
print(adapter_response)

# Free up memory
del base_model
del adapter_model
torch.cuda.empty_cache()

Results / Output

LoRAX

The adapter response does not take on the adapter/fine-tune attributes: it does not know what sport the question is about or what a drop-off is.

uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, and it's not clear which sport you are referring to. However, if you are asking about a specific sport, such as football (American or Canadian), soccer, or rugby, I can provide the correct information.

For example:
- In American football, at the start of a play, there are usually 11 players from each team on the field.
- In
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, especially not in team sports like football, basketball, or soccer. Could you please clarify which sport you are referring to? 

For example:
- In soccer (football), there are 11 players on the field for each team at the start of a match.
- In American football, there are 11 players on the field for each team at the start of

Adapter Model Response:
In the context of Australian rules football, which is the sport referred to as "drop-off," the game starts with 18 players on the field for each team. This number includes the eight players in the back half (four in the full-forward line and four in the half-forward flank), the four players in the midfield, and the six players in the forward half (three in the forward line and three in the half-back flank).
(lorax) ronanmcgovern@Ronans-MacBook-Pro lorax % uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, and it's not clear which sport you are referring to. However, if you are asking about a specific sport, such as football (American or Canadian), soccer, or rugby, I can provide the correct information.

For example:
- In American football, at the start of a play, there are typically 11 players from each team on the field.
- In

Adapter Model Response:
In the context of Australian rules football, which is the sport referred to as "drop-off," each team starts with 18 players on the field at the beginning of the game. This number of players is standard for a match and includes both forwards, midfielders, and defenders.

Transformers / PEFT response

The model knows what a "drop-off" is and that this question is about touch rugby.

Base Model Response:
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many players are on the field on each team at the start of a drop-off?
assistant
It seems there might be some confusion in your question. The term "drop-off" is not commonly used in sports to describe the number of players on a field or court. Could you please clarify which sport you are referring to? 

For example:
- In soccer (football), there are 11 players on the field for each team.
- In American football, there are typically 11 players on the field for each team at the start of a play.
- In basketball, there are

Loading adapter...

Adapter Model Response:
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many players are on the field on each team at the start of a drop-off?
assistant
To determine how many players are on the field at the start of a drop-off, we can follow these steps:

1. **Understand the Composition of a Team**: Each team consists of 14 players, including the Interchange.

2. **Interchange Rules**: The Interchange is allowed to enter and leave the field during normal play without a Change of Possession (COP).

3. **Drop-Off Procedure**: A Drop-Off occurs when one team has fewer than six (

Additional Notes

  • Base models are loaded in bfloat16 in both cases
  • The very same adapter and base model are used from Hugging Face
  • The adapter is public: Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
  • I intentionally used a low rank of 8 in case a higher rank was somehow causing issues

Questions

  • Are there any known issues with loading safetensors? I don't see anything in the existing issues.
  • Are there limits on which modules can be trained? I only trained linear layers, which should be fine (see the config-check sketch below):
"target_modules": [
--
  | "gate_proj",
  | "k_proj",
  | "v_proj",
  | "down_proj",
  | "q_proj",
  | "o_proj",
  | "up_proj"
  | ],
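
For reference, a quick way to double-check what the adapter config actually contains (rank, alpha, target modules, and whether rsLoRA was used) is to load it with peft; a minimal sketch, assuming a peft version recent enough to expose use_rslora:

from peft import PeftConfig

# Downloads adapter_config.json from the Hub and returns a LoraConfig.
config = PeftConfig.from_pretrained("Trelis/Qwen2.5-7B-Instruct-touch-rugby-1")
print(config.r, config.lora_alpha, config.target_modules, config.use_rslora)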

Expected behavior

When running inference on the very same model and adapter with the same generation settings, the responses should match (or at least show clear evidence of the fine-tune). They do not.

@RonanKMcGovern (Author)

This issue is caused by rsLoRA being used when the adapter was trained. The hacky fix is to multiply alpha in the LoRA config by the rank.

See vllm-project/vllm#6909
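
For anyone hitting the same thing, here is a minimal sketch of that workaround, assuming the adapter was trained with use_rslora=True in peft. peft's rsLoRA scaling is lora_alpha/√r rather than the standard lora_alpha/r, so multiplying alpha by √r makes a server that only applies the standard scaling match rsLoRA exactly (multiplying by the full rank over-scales by a further factor of √r, which can still be close enough to surface the fine-tune). The local directory name below is arbitrary:

import json
import math
from pathlib import Path

from huggingface_hub import snapshot_download

# Download an editable local copy of the adapter (repo id taken from this issue).
adapter_dir = Path(snapshot_download(
    "Trelis/Qwen2.5-7B-Instruct-touch-rugby-1",
    local_dir="touch-rugby-adapter-patched",
))

config_path = adapter_dir / "adapter_config.json"
cfg = json.loads(config_path.read_text())

if cfg.get("use_rslora"):
    r = cfg["r"]
    # Fold the rsLoRA factor into alpha so that alpha_new / r == alpha / sqrt(r).
    cfg["lora_alpha"] = cfg["lora_alpha"] * math.sqrt(r)
    cfg["use_rslora"] = False
    config_path.write_text(json.dumps(cfg, indent=2))

# The patched copy can then be pushed to a new Hub repo (or served from a local
# path) and used as the adapter id when calling LoRAX.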
