[V1] Improve AsyncLLM and API Server #11237

Closed
Changes from all commits
135 commits
c2ad07c
first rev of 3 process architecture
robertgshaw2-neuralmagic Dec 16, 2024
f0b3e36
finally able to generate text
robertgshaw2-neuralmagic Dec 16, 2024
ce8aa2c
breaking under load
robertgshaw2-neuralmagic Dec 16, 2024
457d618
working e2e
robertgshaw2-neuralmagic Dec 16, 2024
c980dbd
workign e2e
robertgshaw2-neuralmagic Dec 16, 2024
cba2d54
stash
robertgshaw2-neuralmagic Dec 16, 2024
3ae44a8
stash
robertgshaw2-neuralmagic Dec 16, 2024
3ef5687
remove async stream
robertgshaw2-neuralmagic Dec 16, 2024
b350084
fix protocol
robertgshaw2-neuralmagic Dec 16, 2024
abd7fa3
clean up completion client
robertgshaw2-neuralmagic Dec 16, 2024
6986457
stash
robertgshaw2-neuralmagic Dec 16, 2024
816e965
updated
robertgshaw2-neuralmagic Dec 16, 2024
cebf287
updated comment
robertgshaw2-neuralmagic Dec 16, 2024
adcc3d2
remove comptibility
robertgshaw2-neuralmagic Dec 16, 2024
4344f1b
format
robertgshaw2-neuralmagic Dec 16, 2024
d7b42a0
format/comments
robertgshaw2-neuralmagic Dec 16, 2024
c987a76
update comment
robertgshaw2-neuralmagic Dec 16, 2024
f3ff0e0
format
robertgshaw2-neuralmagic Dec 16, 2024
fbf647f
updated examples
robertgshaw2-neuralmagic Dec 16, 2024
b1105b9
more cleaning
robertgshaw2-neuralmagic Dec 16, 2024
ea7289b
make pr smaller
robertgshaw2-neuralmagic Dec 16, 2024
06dcb1b
updated
robertgshaw2-neuralmagic Dec 17, 2024
9628575
added log
robertgshaw2-neuralmagic Dec 17, 2024
5d824df
remove log
robertgshaw2-neuralmagic Dec 17, 2024
26814f1
updated
robertgshaw2-neuralmagic Dec 17, 2024
3263f6b
Merge branch 'api-server-performance' into remove-async-stream
robertgshaw2-neuralmagic Dec 17, 2024
1205764
Stash
robertgshaw2-neuralmagic Dec 17, 2024
73da178
stash
robertgshaw2-neuralmagic Dec 17, 2024
9830fbe
stash
robertgshaw2-neuralmagic Dec 18, 2024
661ee44
stash
robertgshaw2-neuralmagic Dec 18, 2024
6f12525
stash
robertgshaw2-neuralmagic Dec 18, 2024
fd91f4b
stash
robertgshaw2-neuralmagic Dec 18, 2024
6c99a4f
stash
robertgshaw2-neuralmagic Dec 18, 2024
dfa4526
stahs
robertgshaw2-neuralmagic Dec 19, 2024
e3d6b0e
stash
robertgshaw2-neuralmagic Dec 19, 2024
2c0a793
yay
robertgshaw2-neuralmagic Dec 20, 2024
ee791b2
no more preemptions
robertgshaw2-neuralmagic Dec 20, 2024
3713502
stash current state of async llm
robertgshaw2-neuralmagic Dec 21, 2024
bcd45be
stash profile'
robertgshaw2-neuralmagic Dec 21, 2024
2c06795
Revert "stash profile'"
robertgshaw2-neuralmagic Dec 21, 2024
4571da6
updated
robertgshaw2-neuralmagic Dec 21, 2024
c5dacd4
remove output kind from api server
robertgshaw2-neuralmagic Dec 21, 2024
23d3e60
updated
robertgshaw2-neuralmagic Dec 21, 2024
3acf5c2
cleanup
robertgshaw2-neuralmagic Dec 21, 2024
84ff3c2
cleanup
robertgshaw2-neuralmagic Dec 21, 2024
ddf1426
updated
robertgshaw2-neuralmagic Dec 21, 2024
895fd0d
updated
robertgshaw2-neuralmagic Dec 21, 2024
1184615
cleanup
robertgshaw2-neuralmagic Dec 21, 2024
7da9b1a
cleanup
robertgshaw2-neuralmagic Dec 21, 2024
f61c26a
cleanup
robertgshaw2-neuralmagic Dec 21, 2024
4e3de90
updated
robertgshaw2-neuralmagic Dec 21, 2024
07e4fa2
updated
robertgshaw2-neuralmagic Dec 21, 2024
2022a4f
format
robertgshaw2-neuralmagic Dec 21, 2024
10c7092
updated
robertgshaw2-neuralmagic Dec 21, 2024
a0620ac
cleanup
robertgshaw2-neuralmagic Dec 21, 2024
12b3e06
more cleanup
robertgshaw2-neuralmagic Dec 21, 2024
e092664
more cleanup
robertgshaw2-neuralmagic Dec 21, 2024
32df238
updated
robertgshaw2-neuralmagic Dec 21, 2024
7dff863
more cleanup
robertgshaw2-neuralmagic Dec 21, 2024
103729d
updated
robertgshaw2-neuralmagic Dec 21, 2024
380086c
updated
robertgshaw2-neuralmagic Dec 21, 2024
6f0adfe
working again
robertgshaw2-neuralmagic Dec 21, 2024
5d2c9ae
design without incremental streaming seems okay
robertgshaw2-neuralmagic Dec 21, 2024
0574b89
updated
robertgshaw2-neuralmagic Dec 21, 2024
19a7cd0
updated
robertgshaw2-neuralmagic Dec 22, 2024
067d487
updated
robertgshaw2-neuralmagic Dec 22, 2024
c1c8749
more cleanup
robertgshaw2-neuralmagic Dec 22, 2024
ceacadd
working e2e with the fds
robertgshaw2-neuralmagic Dec 22, 2024
630c72f
fix
robertgshaw2-neuralmagic Dec 22, 2024
e021012
updated
robertgshaw2-neuralmagic Dec 22, 2024
c58d0ff
performance is now good
robertgshaw2-neuralmagic Dec 22, 2024
d7af4bc
updated
robertgshaw2-neuralmagic Dec 22, 2024
6d214d5
merged
robertgshaw2-neuralmagic Dec 22, 2024
0914351
updated
robertgshaw2-neuralmagic Dec 22, 2024
28da5b3
updated
robertgshaw2-neuralmagic Dec 22, 2024
e14def6
updated
robertgshaw2-neuralmagic Dec 22, 2024
074af11
updated
robertgshaw2-neuralmagic Dec 22, 2024
3df5288
updated
robertgshaw2-neuralmagic Dec 22, 2024
546b0de
updated
robertgshaw2-neuralmagic Dec 22, 2024
c700c4a
remove
robertgshaw2-neuralmagic Dec 22, 2024
5b568da
send messages only when needed
robertgshaw2-neuralmagic Dec 23, 2024
93c4ea4
added flag for request id headers
robertgshaw2-neuralmagic Dec 23, 2024
548ae69
fixed too long line
robertgshaw2-neuralmagic Dec 23, 2024
729df02
updated
robertgshaw2-neuralmagic Dec 23, 2024
6ec9dcb
updated
robertgshaw2-neuralmagic Dec 23, 2024
fc6a20d
make pr smaller
robertgshaw2-neuralmagic Dec 23, 2024
930ccc2
update logging timing
robertgshaw2-neuralmagic Dec 23, 2024
52d370f
cleanup nits
robertgshaw2-neuralmagic Dec 23, 2024
8939e2e
cleanup
robertgshaw2-neuralmagic Dec 23, 2024
c2c2e57
add sigquit handlers for shutdown
robertgshaw2-neuralmagic Dec 23, 2024
51b498d
updated
robertgshaw2-neuralmagic Dec 23, 2024
84c08b1
signifcantly better error handling
robertgshaw2-neuralmagic Dec 23, 2024
91aceba
proper shutdown of output loop
robertgshaw2-neuralmagic Dec 23, 2024
87e7ebd
update comment
robertgshaw2-neuralmagic Dec 23, 2024
2e3257c
updated
robertgshaw2-neuralmagic Dec 23, 2024
3b13d89
support in LLMEngine
robertgshaw2-neuralmagic Dec 23, 2024
6f383f2
updated
robertgshaw2-neuralmagic Dec 23, 2024
30d7333
nit
robertgshaw2-neuralmagic Dec 23, 2024
bd49c9c
updated
robertgshaw2-neuralmagic Dec 23, 2024
2192ae6
make PR cleaner
robertgshaw2-neuralmagic Dec 23, 2024
611d1b0
make PR cleaner
robertgshaw2-neuralmagic Dec 23, 2024
b12d0e6
make pr cleaner
robertgshaw2-neuralmagic Dec 23, 2024
1dac1f1
more cleanup
robertgshaw2-neuralmagic Dec 23, 2024
40c5cd5
more cleanup
robertgshaw2-neuralmagic Dec 23, 2024
a1e17c4
updated
robertgshaw2-neuralmagic Dec 23, 2024
921a56a
updated comment
robertgshaw2-neuralmagic Dec 23, 2024
19aadbb
updated
robertgshaw2-neuralmagic Dec 23, 2024
ddae79c
updated
robertgshaw2-neuralmagic Dec 23, 2024
4b00ae0
factor out proc handle code
robertgshaw2-neuralmagic Dec 23, 2024
467d63e
actually save before commiting
robertgshaw2-neuralmagic Dec 23, 2024
afd4b52
actually save before commiting
robertgshaw2-neuralmagic Dec 23, 2024
395742e
again
robertgshaw2-neuralmagic Dec 23, 2024
2d6ceb8
updated
robertgshaw2-neuralmagic Dec 23, 2024
a19cb83
cleanup
robertgshaw2-neuralmagic Dec 23, 2024
1695fdd
cleaning
robertgshaw2-neuralmagic Dec 23, 2024
b2f845b
updated
robertgshaw2-neuralmagic Dec 23, 2024
12df407
remove epoch
robertgshaw2-neuralmagic Dec 23, 2024
b7843c9
update
robertgshaw2-neuralmagic Dec 23, 2024
a6368a7
fix typing
robertgshaw2-neuralmagic Dec 23, 2024
315efea
remove prints
robertgshaw2-neuralmagic Dec 23, 2024
740567f
updated
robertgshaw2-neuralmagic Dec 23, 2024
cbc043e
fixup
robertgshaw2-neuralmagic Dec 23, 2024
8061078
mypy
robertgshaw2-neuralmagic Dec 23, 2024
8372665
stash
robertgshaw2-neuralmagic Dec 23, 2024
6b4f2bb
almost there with llm engine
robertgshaw2-neuralmagic Dec 23, 2024
db7d055
format'
robertgshaw2-neuralmagic Dec 23, 2024
98053d6
clean
robertgshaw2-neuralmagic Dec 24, 2024
4713e29
updated
robertgshaw2-neuralmagic Dec 24, 2024
4f946eb
nit
robertgshaw2-neuralmagic Dec 24, 2024
59c6430
Update vllm/v1/utils.py
robertgshaw2-neuralmagic Dec 24, 2024
9dceec4
Merge branch 'main' into remove-async-stream
robertgshaw2-neuralmagic Dec 24, 2024
856838d
updated
robertgshaw2-neuralmagic Dec 24, 2024
94fe4af
updated
robertgshaw2-neuralmagic Dec 24, 2024
127045a
stash
robertgshaw2-neuralmagic Dec 24, 2024
1352386
remove log
robertgshaw2-neuralmagic Dec 24, 2024
5 changes: 4 additions & 1 deletion benchmarks/benchmark_throughput.py
@@ -414,14 +414,17 @@ def main(args: argparse.Namespace):
for request in requests)
total_output_tokens = sum(request.expected_output_len
for request in requests)
total_input_tokens = total_num_tokens - total_output_tokens
if is_multi_modal:
print("\033[91mWARNING\033[0m: Multi-modal request detected. The "
"following metrics are not accurate because image tokens are not"
" counted. See vllm-project/vllm/issues/9778 for details.")
# TODO(vllm-project/vllm/issues/9778): Count molti-modal token length.
print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
f"{total_output_tokens / elapsed_time:.2f} output tokens/s")
f"{total_output_tokens / elapsed_time:.2f} output tokens/s, "
f"{total_input_tokens / len(requests)} input tokens/req, "
f"{(total_output_tokens) / len(requests)} output tokens/req, ")

# Output JSON results if specified
if args.output_json:
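For context, the two new fields just divide the aggregate token counts by the number of requests. A minimal sketch of the arithmetic, using made-up numbers rather than real benchmark output:

```python
# Hypothetical numbers, only to illustrate the new per-request metrics.
num_requests = 100                # len(requests)
total_num_tokens = 150_000        # prompt + generated tokens across all requests
total_output_tokens = 30_000      # generated tokens across all requests
elapsed_time = 60.0               # seconds

total_input_tokens = total_num_tokens - total_output_tokens   # 120_000
print(f"Throughput: {num_requests / elapsed_time:.2f} requests/s, "
      f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
      f"{total_output_tokens / elapsed_time:.2f} output tokens/s, "
      f"{total_input_tokens / num_requests} input tokens/req, "
      f"{total_output_tokens / num_requests} output tokens/req")
# -> Throughput: 1.67 requests/s, 2500.00 total tokens/s, 500.00 output tokens/s,
#    1200.0 input tokens/req, 300.0 output tokens/req
```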
6 changes: 3 additions & 3 deletions tests/v1/engine/test_engine_core.py
@@ -8,7 +8,7 @@
from vllm.engine.arg_utils import EngineArgs
from vllm.platforms import current_platform
from vllm.usage.usage_lib import UsageContext
from vllm.v1.engine import EngineCoreRequest
from vllm.v1.engine import EngineRequest
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.v1.engine.core import EngineCore

@@ -22,8 +22,8 @@
PROMPT_TOKENS = TOKENIZER(PROMPT).input_ids


def make_request() -> EngineCoreRequest:
return EngineCoreRequest(
def make_request() -> EngineRequest:
return EngineRequest(
request_id=uuid.uuid4(),
prompt=PROMPT,
prompt_token_ids=PROMPT_TOKENS,
6 changes: 3 additions & 3 deletions tests/v1/engine/test_engine_core_client.py
@@ -10,7 +10,7 @@
from vllm.engine.arg_utils import EngineArgs
from vllm.platforms import current_platform
from vllm.usage.usage_lib import UsageContext
from vllm.v1.engine import EngineCoreRequest
from vllm.v1.engine import EngineRequest
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.v1.engine.core_client import EngineCoreClient

@@ -24,8 +24,8 @@
PROMPT_TOKENS = TOKENIZER(PROMPT).input_ids


def make_request(params: SamplingParams) -> EngineCoreRequest:
return EngineCoreRequest(
def make_request(params: SamplingParams) -> EngineRequest:
return EngineRequest(
request_id=str(uuid.uuid4()),
prompt=PROMPT,
prompt_token_ids=PROMPT_TOKENS,
4 changes: 3 additions & 1 deletion vllm/entrypoints/api_server.py
@@ -21,7 +21,7 @@
from vllm.logger import init_logger
from vllm.sampling_params import SamplingParams
from vllm.usage.usage_lib import UsageContext
from vllm.utils import FlexibleArgumentParser, random_uuid
from vllm.utils import FlexibleArgumentParser, random_uuid, set_ulimit
from vllm.version import __version__ as VLLM_VERSION

logger = init_logger("vllm.entrypoints.api_server")
@@ -119,6 +119,8 @@ async def run_server(args: Namespace,
logger.info("vLLM API server version %s", VLLM_VERSION)
logger.info("args: %s", args)

set_ulimit()

app = await init_app(args, llm_engine)
assert engine is not None

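The `set_ulimit()` call added here raises the process's soft file-descriptor limit before the server starts (the helper itself is added to `vllm/utils.py` further down in this PR). A quick, illustrative way to inspect the limits it adjusts:

```python
import resource

# Inspect the process's file-descriptor limits (the values set_ulimit() adjusts).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft fd limit: {soft}, hard fd limit: {hard}")
# Roughly equivalent to `ulimit -n` in a shell. set_ulimit() tries to raise the
# soft limit to 65535 (keeping the current hard limit) and logs a warning if
# the OS refuses, since uvicorn plus the ZMQ IPC sockets can otherwise exhaust
# file descriptors ("OSError: [Errno 24] Too many open files").
```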
30 changes: 23 additions & 7 deletions vllm/entrypoints/openai/api_server.py
@@ -68,7 +68,7 @@
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext
from vllm.utils import (FlexibleArgumentParser, get_open_zmq_ipc_path,
is_valid_ipv6_address)
is_valid_ipv6_address, kill_process_tree, set_ulimit)
from vllm.version import __version__ as VLLM_VERSION

TIMEOUT_KEEP_ALIVE = 5 # seconds
@@ -585,12 +585,18 @@ async def authentication(request: Request, call_next):
status_code=401)
return await call_next(request)

@app.middleware("http")
async def add_request_id(request: Request, call_next):
request_id = request.headers.get("X-Request-Id") or uuid.uuid4().hex
response = await call_next(request)
response.headers["X-Request-Id"] = request_id
return response
if args.enable_request_id_headers:
logger.warning(
"CAUTION: Enabling X-Request-Id headers in the API Server. "
"This can harm performance at high QPS.")

@app.middleware("http")
async def add_request_id(request: Request, call_next):
request_id = request.headers.get(
"X-Request-Id") or uuid.uuid4().hex
response = await call_next(request)
response.headers["X-Request-Id"] = request_id
return response

for middleware in args.middleware:
module_path, object_name = middleware.rsplit(".", 1)
@@ -721,12 +727,22 @@ async def run_server(args, **uvicorn_kwargs) -> None:
sock_addr = (args.host or "", args.port)
sock = create_server_socket(sock_addr)

# workaround to ensure user has enough fds available for uvicorn + ipc
set_ulimit()

def signal_handler(*_) -> None:
# Interrupt server on sigterm while initializing
raise KeyboardInterrupt("terminated")

signal.signal(signal.SIGTERM, signal_handler)

# The child processes will send SIGQUIT to this process when
# any error happens. This process then clean up the whole tree.
def sigquit_handler(signum, frame):
kill_process_tree(os.getpid())

signal.signal(signal.SIGQUIT, sigquit_handler)

async with build_async_engine_client(args) as engine_client:
app = build_app(args)

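When `--enable-request-id-headers` is set, the middleware echoes a client-supplied `X-Request-Id` on every response (or generates one if the header is absent). A small client-side sketch, assuming an OpenAI-compatible vLLM server is already running on localhost:8000; the `requests` usage and the model name are illustrative, not part of this PR:

```python
import requests

# Illustrative client; assumes a server started with --enable-request-id-headers.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "my-model", "prompt": "Hello", "max_tokens": 8},
    headers={"X-Request-Id": "trace-1234"},  # optional; generated if omitted
)
print(resp.headers.get("X-Request-Id"))  # "trace-1234" echoed back
```

The new SIGQUIT handler in `run_server` complements this shutdown work: per the added comment, engine child processes send SIGQUIT to the front-end process on fatal errors, and `kill_process_tree` then tears down the whole process tree.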
6 changes: 5 additions & 1 deletion vllm/entrypoints/openai/cli_args.py
@@ -196,7 +196,11 @@ def make_arg_parser(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
action="store_true",
help="If specified, will run the OpenAI frontend server in the same "
"process as the model serving engine.")

parser.add_argument(
"--enable-request-id-headers",
action="store_true",
help="If specified, API server will add X-Request-Id header to "
"responses. Caution: this hurts performance at high QPS.")
parser.add_argument(
"--enable-auto-tool-choice",
action="store_true",
24 changes: 24 additions & 0 deletions vllm/utils.py
@@ -10,13 +10,15 @@
import inspect
import ipaddress
import os
import resource
import signal
import socket
import subprocess
import sys
import tempfile
import threading
import time
import traceback
import uuid
import warnings
import weakref
@@ -1613,6 +1615,28 @@ def resolve_obj_by_qualname(qualname: str) -> Any:
return getattr(module, obj_name)


def set_ulimit(target_soft_limit=65535):
resource_type = resource.RLIMIT_NOFILE
current_soft, current_hard = resource.getrlimit(resource_type)

if current_soft < target_soft_limit:
try:
resource.setrlimit(resource_type,
(target_soft_limit, current_hard))
except ValueError as e:
logger.warning(
"Found ulimit of %s and failed to automatically increase"
"with error %s. This can cause fd limit errors like"
"`OSError: [Errno 24] Too many open files`. Consider "
"increasing with ulimit -n", current_soft, e)


def get_exception_traceback():
etype, value, tb = sys.exc_info()
err_str = "".join(traceback.format_exception(etype, value, tb))
return err_str


def kill_process_tree(pid: int):
"""
Kills all descendant processes of the given pid by sending SIGKILL.
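Only the signature and docstring of `kill_process_tree` are visible in this hunk. A minimal sketch of what such a helper typically looks like, assuming `psutil` is available; the actual body in the PR may differ:

```python
import contextlib
import os
import signal

import psutil


def kill_process_tree_sketch(pid: int) -> None:
    """Kill all descendants of `pid` with SIGKILL, then the process itself."""
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    # Kill children first so they cannot outlive (or re-spawn under) the parent.
    for child in parent.children(recursive=True):
        with contextlib.suppress(ProcessLookupError):
            os.kill(child.pid, signal.SIGKILL)
    with contextlib.suppress(ProcessLookupError):
        os.kill(pid, signal.SIGKILL)
```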
60 changes: 22 additions & 38 deletions vllm/v1/engine/__init__.py
@@ -6,35 +6,14 @@

from vllm.lora.request import LoRARequest
from vllm.multimodal import MultiModalKwargs, MultiModalPlaceholderDict
from vllm.sampling_params import RequestOutputKind, SamplingParams
from vllm.sampling_params import SamplingParams


@dataclass
class DetokenizerRequest:

class EngineRequest:
request_id: str
prompt: Optional[str]
prompt_token_ids: List[int]
skip_special_tokens: bool
spaces_between_special_tokens: bool
output_kind: RequestOutputKind

stop: List[str]
include_stop_str_in_output: bool


@dataclass
class EngineCoreRequest:

# NOTE: prompt and prompt_token_ids should be DecoderOnlyInput,
# but this object is currently not playing well with msgspec
# due to circular imports and typing we have in data.py

request_id: str
#NOTE(Nick): I don't think we need to pass prompt here since it should
# always be tokenized?
prompt: Optional[str]
prompt_token_ids: List[int]
mm_inputs: Optional[List[Optional[MultiModalKwargs]]]
mm_hashes: Optional[List[str]]
mm_placeholders: Optional[MultiModalPlaceholderDict]
@@ -44,6 +23,20 @@ class EngineCoreRequest:
lora_request: Optional[LoRARequest]


@dataclass
class EngineAbortRequest:
request_ids: List[str]


@dataclass
class EngineProfileRequest:
is_start: bool


EngineRequestUnion = Union[EngineRequest, EngineAbortRequest,
EngineProfileRequest]


class EngineCoreOutput(
msgspec.Struct,
array_like=True, # type: ignore[call-arg]
@@ -70,19 +63,10 @@ class EngineCoreOutputs(
outputs: List[EngineCoreOutput]


@dataclass
class EngineCoreProfile:
is_start: bool


class EngineCoreRequestType(enum.Enum):
class EngineRequestType(enum.Enum):
"""
Request types defined as hex byte strings, so it can be sent over sockets
without separate encoding step.
"""
Request types defined as hex byte strings, so it can be sent over sockets
without separate encoding step.
"""
ADD = b'\x00'
ABORT = b'\x01'
PROFILE = b'\x02'


EngineCoreRequestUnion = Union[EngineCoreRequest, EngineCoreProfile, List[str]]
FROM_ENGINE_CORE = b'\x00'
FROM_ENGINE = b'\x01'
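As the docstring notes, the request types are raw single bytes so they can be prepended to a serialized payload (for example as the leading frame of a multipart ZeroMQ message) with no separate header-encoding step. A hedged sketch of that framing pattern; the `pickle` serialization and the helper names here are illustrative, the real transport code lives in the engine core and client modules:

```python
import pickle

# Byte values mirror EngineRequestType in this diff.
ADD = b'\x00'
ABORT = b'\x01'
PROFILE = b'\x02'


def encode(request_type: bytes, request) -> list:
    """Frame a request as [type_byte, payload] for a multipart socket send."""
    return [request_type, pickle.dumps(request)]


def decode(frames: list):
    """Dispatch on the leading type byte; no separate header decoding needed."""
    request_type, payload = frames
    return request_type, pickle.loads(payload)


# Example: an abort covering two in-flight request ids.
frames = encode(ABORT, ["req-1", "req-2"])
assert decode(frames) == (ABORT, ["req-1", "req-2"])
```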