The reproduction compares two ways of running inference: a) LoRAX with the setup described under System Info, and b) transformers with peft. I ran both on A40 machines on RunPod.
LoRAX approach
LoRAX docker start command arguments:
--model-id Qwen/Qwen2.5-7B-Instruct --port 8000
LoRAX script to call the RunPod endpoint (lorax_replication.py)
import time
import argparse
import os
from openai import OpenAI


def generate_response(client, model_id, prompt):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        max_tokens=100,
        temperature=0.01,
        stream=True
    )
    collected_message = []
    for chunk in response:
        if chunk.choices[0].delta.content:
            collected_message.append(chunk.choices[0].delta.content)
    return "".join(collected_message)


def main(adapter_id=None):
    RUNPOD_ENDPOINT = os.getenv("RUNPOD_ENDPOINT")
    BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-7B-Instruct")
    if not RUNPOD_ENDPOINT:
        raise ValueError("Please set RUNPOD_ENDPOINT environment variable")
    client = OpenAI(
        api_key="EMPTY",
        base_url=RUNPOD_ENDPOINT + "/v1",
    )
    prompt = "How many players are on the field on each team at the start of a drop-off?"
    # Generate with base model
    print("Base Model Response:")
    base_response = generate_response(client, BASE_MODEL, prompt)
    print(base_response)
    # Generate with adapter if specified
    if adapter_id:
        print("\nAdapter Model Response:")
        adapter_response = generate_response(client, adapter_id, prompt)
        print(adapter_response)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run LoRAX inference")
    parser.add_argument("--adapter", type=str, help="HuggingFace adapter slug")
    args = parser.parse_args()
    main(args.adapter)
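As a cross-check on the OpenAI-compatible route used above, the same prompt could also be sent to LoRAX's native `/generate` route, with the adapter passed explicitly as `adapter_id` inside `parameters`. The following is a minimal sketch, not part of the original reproduction: `build_generate_payload` and `call_generate` are hypothetical helper names, the endpoint is assumed to be the same `RUNPOD_ENDPOINT`, and the native route takes raw text, so the chat template is assumed not to be applied unless you apply it yourself.

```python
import json
import urllib.request


def build_generate_payload(prompt, adapter_id=None, max_new_tokens=100):
    """Build the JSON body for a native /generate request (hypothetical helper)."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id:
        # The adapter is named per request; omit the key to hit the base model.
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


def call_generate(endpoint, payload, timeout=120):
    """POST the payload to <endpoint>/generate and return the generated text."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]


# Example usage (requires a reachable endpoint, so it is left commented out):
# payload = build_generate_payload(
#     "How many players are on the field on each team at the start of a drop-off?",
#     adapter_id="Trelis/Qwen2.5-7B-Instruct-touch-rugby-1",
# )
# print(call_generate(os.environ["RUNPOD_ENDPOINT"], payload))
```

If the native route shows the fine-tune and the chat route does not, that would narrow the problem to how the adapter is selected through the `model` field.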
Transformers / PEFT script
# Requires the torch, transformers and peft packages
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "Trelis/Qwen2.5-7B-Instruct-touch-rugby-1"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Test prompt
prompt = "How many players are on the field on each team at the start of a drop-off?"


# Function to generate response
def generate_response(model, prompt):
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # Near-greedy sampling, matching the temperature used in the LoRAX script
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.01,
        do_sample=True,
    )
    # Decodes prompt and completion together, so the printed output
    # includes the system and user turns as well as the reply
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response


# Generate with base model
print("Base Model Response:")
base_response = generate_response(base_model, prompt)
print(base_response)

# Load adapter and generate
print("\nLoading adapter...")
adapter_model = PeftModel.from_pretrained(base_model, adapter_id)
print("\nAdapter Model Response:")
adapter_response = generate_response(adapter_model, prompt)
print(adapter_response)

# Free up memory
del base_model
del adapter_model
torch.cuda.empty_cache()
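Because the script decodes `outputs[0]` in full, its printed output contains the system and user turns before the reply, while the LoRAX script prints only the reply. To compare the two side by side, the reply can be cut out of the decoded transcript. This is a minimal sketch: `extract_assistant_reply` is a hypothetical helper, and it assumes the decoded text marks the final turn with a line containing only `assistant`, as in the output shown under Results.

```python
def extract_assistant_reply(decoded: str) -> str:
    """Return the text after the last 'assistant' turn marker.

    Assumes special tokens were skipped during decoding, leaving bare role
    names ('system', 'user', 'assistant') on their own lines. If no marker
    is found, the input is returned stripped.
    """
    marker = "\nassistant\n"
    _, _, tail = decoded.rpartition(marker)
    return tail.strip()


transcript = (
    "system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n"
    "user\nHow many players are on the field on each team at the start of a drop-off?\n"
    "assistant\nTo determine how many players are on the field..."
)
print(extract_assistant_reply(transcript))
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, for example `tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)`.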
Results / Output
LoRAX
The adapter response does not show any of the adapter's fine-tuned behavior. It does not know which sport the question is about or what a drop-off is.
uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, and it's not clear which sport you are referring to. However, if you are asking about a specific sport, such as football (American or Canadian), soccer, or rugby, I can provide the correct information.
For example:
- In American football, at the start of a play, there are usually 11 players from each team on the field.
- In
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, especially not in team sports like football, basketball, or soccer. Could you please clarify which sport you are referring to?
For example:
- In soccer (football), there are 11 players on the field for each team at the start of a match.
- In American football, there are 11 players on the field for each team at the start of
Adapter Model Response:
In the context of Australian rules football, which is the sport referred to as "drop-off," the game starts with 18 players on the field for each team. This number includes the eight players in the back half (four in the full-forward line and four in the half-forward flank), the four players in the midfield, and the six players in the forward half (three in the forward line and three in the half-back flank).
(lorax) ronanmcgovern@Ronans-MacBook-Pro lorax % uv run lorax-replication.py --adapter Trelis/Qwen2.5-7B-Instruct-touch-rugby-1
Base Model Response:
It seems there might be a bit of confusion in your question. The term "drop-off" is not commonly used in sports, and it's not clear which sport you are referring to. However, if you are asking about a specific sport, such as football (American or Canadian), soccer, or rugby, I can provide the correct information.
For example:
- In American football, at the start of a play, there are typically 11 players from each team on the field.
- In
Adapter Model Response:
In the context of Australian rules football, which is the sport referred to as "drop-off," each team starts with 18 players on the field at the beginning of the game. This number of players is standard for a match and includes both forwards, midfielders, and defenders.
Transformers / PEFT response
The model knows what a "drop-off" is and that this question is about touch rugby.
Base Model Response:
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many players are on the field on each team at the start of a drop-off?
assistant
It seems there might be some confusion in your question. The term "drop-off" is not commonly used in sports to describe the number of players on a field or court. Could you please clarify which sport you are referring to?
For example:
- In soccer (football), there are 11 players on the field for each team.
- In American football, there are typically 11 players on the field for each team at the start of a play.
- In basketball, there are
Loading adapter...
Adapter Model Response:
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many players are on the field on each team at the start of a drop-off?
assistant
To determine how many players are on the field at the start of a drop-off, we can follow these steps:
1. **Understand the Composition of a Team**: Each team consists of 14 players, including the Interchange.
2. **Interchange Rules**: The Interchange is allowed to enter and leave the field during normal play without a Change of Possession (COP).
3. **Drop-Off Procedure**: A Drop-Off occurs when one team has fewer than six (
Additional Notes
Base models are loaded in bfloat16 in both cases
The very same adapter and base model from Hugging Face are used in both cases
Expected behavior
When running inference on the very same model and adapter with the same generation settings, the responses should be the same (or at least both should show evidence of the fine-tune). With LoRAX there is none.
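One rough way to make "evidence of the fine-tune" checkable is to look for domain terms that the adapter output uses and the base model does not. The sketch below is an illustration only: `finetune_evidence` is a hypothetical helper, and the term list is an assumption drawn from the Transformers / PEFT output above, not a definitive test of whether the adapter was applied.

```python
def finetune_evidence(response, terms):
    """Return the terms that occur in the response, ignoring case."""
    lowered = response.lower()
    return [term for term in terms if term.lower() in lowered]


# Assumed marker terms, taken from the Transformers / PEFT adapter output.
TOUCH_RUGBY_TERMS = ["interchange", "change of possession"]

lorax_adapter_reply = (
    "In the context of Australian rules football, which is the sport referred "
    "to as \"drop-off,\" each team starts with 18 players on the field."
)
peft_adapter_reply = (
    "Each team consists of 14 players, including the Interchange. The Interchange "
    "is allowed to enter and leave the field without a Change of Possession (COP)."
)

print(finetune_evidence(lorax_adapter_reply, TOUCH_RUGBY_TERMS))
print(finetune_evidence(peft_adapter_reply, TOUCH_RUGBY_TERMS))
```

The prompt's own wording ("drop-off") is left out of the term list, since both the base and adapter replies repeat it.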
System Info
2024-11-30T18:08:10.672576Z INFO lorax_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Sat Nov 30 18:08:10 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     On  |   00000000:D2:00.0 Off |                    0 |
|  0%   35C    P0             74W /  300W |   37723MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
2024-11-30T18:08:10.672699Z INFO lorax_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, source: "hub", default_adapter_source: None, adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, compile_max_batch_size: 128, compile_max_rank: 64, speculative_tokens: None, speculation_max_batch_size: 32, preloaded_adapter_ids: [], preloaded_adapter_source: None, predibase_api_token: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, eager_prefill: None, chunked_prefill: None, prefix_caching: None, merge_adapter_weights: false, max_active_adapters: 1024, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.1, hostname: "eab42661659c", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: true, download_only: false, tokenizer_config_path: None, backend: FA2, embedding_dim: None, disable_sgmv: false }
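The `Args { ... }` line can be checked against the docker arguments listed under the LoRAX approach. In the log pasted above, the launcher reports `model_id: "mistralai/Mistral-7B-Instruct-v0.1"` and `port: 80`, while the listed arguments are `--model-id Qwen/Qwen2.5-7B-Instruct --port 8000`. The sketch below only surfaces such differences; whether this one relates to the adapter behavior, or the log simply comes from a different launch, is not established here. `parse_launcher_args` and `find_mismatches` are hypothetical helpers.

```python
import re


def parse_launcher_args(log_line, keys=("model_id", "port")):
    """Extract selected `key: value` fields from a lorax_launcher Args line."""
    found = {}
    for key in keys:
        # Values are either quoted strings or bare tokens (numbers, None, ...).
        pattern = r"\b" + re.escape(key) + r': (?:"([^"]*)"|([^,\s}]+))'
        match = re.search(pattern, log_line)
        if match:
            quoted, bare = match.group(1), match.group(2)
            found[key] = quoted if quoted is not None else bare
    return found


def find_mismatches(found, expected):
    """Return {key: (expected, actual)} for every field that differs."""
    return {
        key: (value, found.get(key))
        for key, value in expected.items()
        if found.get(key) != value
    }


# Abbreviated copy of the Args line from the log above.
args_line = (
    'Args { model_id: "mistralai/Mistral-7B-Instruct-v0.1", adapter_id: None, '
    'hostname: "eab42661659c", port: 80, master_port: 29500 }'
)
expected = {"model_id": "Qwen/Qwen2.5-7B-Instruct", "port": "8000"}
print(find_mismatches(parse_launcher_args(args_line), expected))
```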