Not able to preload local adapters #699

Open · 2 of 4 tasks

xyang16 (Contributor) opened this issue Nov 28, 2024 · 0 comments

System Info

lorax 0.12.1 container

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I followed the documentation at https://github.com/predibase/lorax/blob/main/docs/models/adapters/index.md#local

  1. Download the adapter locally

cd /home/ubuntu/model/
mkdir -p lora-llama3-8b-instruct/adapters

Download in Python:

from huggingface_hub import snapshot_download

allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.pt", "*.txt", "*.model"]
snapshot_download(
    "li-long/Llama-3.1-8b-lora_shiji-model",
    local_dir="lora-llama3-8b-instruct/adapters/shiji",
    local_dir_use_symlinks=False,
    allow_patterns=allow_patterns,
)
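
To confirm the files landed where expected, a quick sanity check (a sketch; the path mirrors the download step above, and the keys are standard PEFT adapter_config.json fields):

import json, os

adapter_dir = "lora-llama3-8b-instruct/adapters/shiji"
# Expect adapter_config.json and adapter_model.safetensors among the files.
print(os.listdir(adapter_dir))

# The adapter was trained against Llama 3.1, so base_model_name_or_path may
# differ from the base model served below (unsloth/llama-3-8b-Instruct).
with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
    config = json.load(f)
print(config.get("base_model_name_or_path"), config.get("r"), config.get("target_modules"))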
  2. Start the container
model=unsloth/llama-3-8b-Instruct
num_shard=4
preloaded_adapter_ids=/root/model/adapters/shiji
preloaded_adapter_source=local
max_active_adapters=1
volume=/tmp/.cache/huggingface/hub/

docker run -it --gpus all --shm-size 10g -p 8080:80 \
  -v $volume:/data \
  -v /home/ubuntu/model/lora-llama3-8b-instruct:/root/model \
  ghcr.io/predibase/lorax:0.12.1 \
  --model-id $model \
  --num-shard $num_shard \
  --preloaded-adapter-ids $preloaded_adapter_ids \
  --preloaded-adapter-source $preloaded_adapter_source \
  --max-active-adapters $max_active_adapters
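
Note that the -v flag maps /home/ubuntu/model/lora-llama3-8b-instruct on the host to /root/model in the container, so the preloaded adapter path must resolve under that mount. A quick host-side check of the mapping (a sketch; the paths are the ones used above):

import os

host_root = "/home/ubuntu/model/lora-llama3-8b-instruct"
container_path = "/root/model/adapters/shiji"
# Translate the container path back to the host path via the bind mount.
host_path = os.path.join(host_root, os.path.relpath(container_path, "/root/model"))
print(host_path, os.path.isdir(host_path))  # expect True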
  3. Send a prompt
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "/root/model/adapters/shiji",
            "adapter_source": "local"
        }
    }' \
    -H 'Content-Type: application/json'
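
For reference, the same request from Python (a sketch using the requests library against the endpoint above):

import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, "
                  "and then she sold half as many clips in May. How many clips "
                  "did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "/root/model/adapters/shiji",
            "adapter_source": "local",
        },
    },
)
print(resp.status_code, resp.text)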

The request fails with the following error:

2024-11-27T22:44:42.636400Z ERROR lorax_launcher: interceptor.py:41 Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 92, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 449, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 111, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 1608, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 1529, in forward
    out = model.forward(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_llama_modeling.py", line 604, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_llama_modeling.py", line 541, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_llama_modeling.py", line 466, in forward
    attn_output = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_llama_modeling.py", line 317, in forward
    qkv = self.query_key_value(hidden_states, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 250, in forward
    result = self.forward_layer_type(result, input, adapter_data, layer_name, start_idx, end_idx)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 96, in forward_layer_type
    adapter_data.punica_wrapper.add_lora(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/punica.py", line 730, in add_lora
    self.add_shrink(buffer, x, wa_t_all, scale)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/punica.py", line 652, in add_shrink
    shrink_fun(y, x, w_t_all, scale)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/punica.py", line 559, in shrink_prefill
    sgmv_shrink(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/ops/sgmv_shrink.py", line 142, in sgmv_shrink
    assert inputs.size(1) == lora_a_weights.size(-1)
AssertionError
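
The failing assertion compares the hidden dimension of the input activations (inputs.size(1)) against the input dimension of the stacked LoRA A weights (lora_a_weights.size(-1)), i.e. the adapter's lora_A matrices do not match the base model's hidden size. The adapter shapes can be inspected offline (a sketch; the tensor names follow the standard PEFT safetensors layout, and 4096 is the hidden size of Llama-3-8B):

from safetensors import safe_open

# In the standard PEFT layout, lora_A weights have shape (r, in_features);
# for attention projections on Llama-3-8B, in_features should be 4096.
with safe_open("lora-llama3-8b-instruct/adapters/shiji/adapter_model.safetensors",
               framework="pt") as f:
    for name in f.keys():
        if "lora_A" in name:
            print(name, f.get_tensor(name).shape)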

Expected behavior

The local adapter should preload successfully and serve the request.
