diff --git a/docs/source/models/generative_models.md b/docs/source/models/generative_models.md index 7aeaba855dcfb..e97014dbef156 100644 --- a/docs/source/models/generative_models.md +++ b/docs/source/models/generative_models.md @@ -120,19 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template) ## Online Inference -Our [OpenAI Compatible Server](../serving/openai_compatible_server) can be used for online inference. -Please click on the above link for more details on how to launch the server. +Our [OpenAI Compatible Server](../serving/openai_compatible_server) provides endpoints that correspond to the offline APIs: -### Completions API - -Our Completions API is similar to `LLM.generate` but only accepts text. -It is compatible with [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions) -so that you can use OpenAI client to interact with it. -A code example can be found in [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py). - -### Chat API - -Our Chat API is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs). -It is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) -so that you can use OpenAI client to interact with it. -A code example can be found in [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py). +- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text. +- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template. diff --git a/docs/source/models/pooling_models.md b/docs/source/models/pooling_models.md index 20a7b8f33947d..6d034f652d2ab 100644 --- a/docs/source/models/pooling_models.md +++ b/docs/source/models/pooling_models.md @@ -106,22 +106,8 @@ A code example can be found in [examples/offline_inference_scoring.py](https://g ## Online Inference -Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) can be used for online inference. -Please click on the above link for more details on how to launch the server. +Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs: -### Embeddings API - -Our Embeddings API is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs). - -The text-only API is compatible with [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) -so that you can use OpenAI client to interact with it. -A code example can be found in [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py). - -The multi-modal API is an extension of the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings) -that incorporates [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat), -so it is not part of the OpenAI standard. Please see [](#multimodal-inputs) for more details on how to use it. - -### Score API - -Our Score API is similar to `LLM.score`. -Please see [this page](#score-api) for more details on how to use it. +- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models. 
+- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models. +- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models. diff --git a/docs/source/serving/openai_compatible_server.md b/docs/source/serving/openai_compatible_server.md index 934a7cea7b9cb..597618cc5a215 100644 --- a/docs/source/serving/openai_compatible_server.md +++ b/docs/source/serving/openai_compatible_server.md @@ -42,6 +42,8 @@ In addition, we have the following custom APIs: - [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`) - Applicable to any model with a tokenizer. +- [Pooling API](#pooling-api) (`/pooling`) + - Applicable to all [pooling models](../models/pooling_models.md). - [Score API](#score-api) (`/score`) - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`). @@ -179,7 +181,12 @@ The order of priorities is `command line > config file values > defaults`. (completions-api)= ### Completions API -Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/completions) for more details. +Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions); +you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it. + +#### Code example + +See [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py). #### Extra parameters @@ -200,15 +207,20 @@ The following extra parameters are supported: ``` (chat-api)= -### Chat Completions API +### Chat API -Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/chat) for more details. +Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat); +you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it. We support both [Vision](https://platform.openai.com/docs/guides/vision)- and [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters; see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information. - *Note: `image_url.detail` parameter is not supported.* +#### Code example + +See [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py). + #### Extra parameters The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported. @@ -230,15 +242,20 @@ The following extra parameters are supported: (embeddings-api)= ### Embeddings API -Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/embeddings) for more details. +Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings); +you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it. -If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat Completions API](#chat-api)) +If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api)) which will be treated as a single prompt to the model. 
```{tip} -This enables multi-modal inputs to be passed to embedding models, see [this page](../usage/multimodal_inputs.md) for details. +This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details. ``` +#### Code example + +See [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py). + #### Extra parameters The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported. @@ -268,20 +285,35 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s (tokenizer-api)= ### Tokenizer API -The Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer). +Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer). It consists of two endpoints: - `/tokenize` corresponds to calling `tokenizer.encode()`. - `/detokenize` corresponds to calling `tokenizer.decode()`. +(pooling-api)= +### Pooling API + +Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states. + +The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats. + +#### Code example + +See [examples/openai_pooling_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_pooling_client.py). + (score-api)= ### Score API -The Score API applies a cross-encoder model to predict scores for sentence pairs. +Our Score API applies a cross-encoder model to predict scores for sentence pairs. Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1. You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html). +#### Code example + +See [examples/openai_cross_encoder_score.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_cross_encoder_score.py). + #### Single inference You can pass a string to both `text_1` and `text_2`, forming a single sentence pair. diff --git a/examples/openai_cross_encoder_score.py b/examples/openai_cross_encoder_score.py index a06af8df5d3fe..365a684d53f2b 100644 --- a/examples/openai_cross_encoder_score.py +++ b/examples/openai_cross_encoder_score.py @@ -20,9 +20,9 @@ def post_http_request(prompt: dict, api_url: str) -> requests.Response: parser.add_argument("--host", type=str, default="localhost") parser.add_argument("--port", type=int, default=8000) parser.add_argument("--model", type=str, default="BAAI/bge-reranker-v2-m3") + args = parser.parse_args() api_url = f"http://{args.host}:{args.port}/score" - model_name = args.model text_1 = "What is the capital of Brazil?" diff --git a/examples/openai_pooling_client.py b/examples/openai_pooling_client.py new file mode 100644 index 0000000000000..37ec8f2fb6be3 --- /dev/null +++ b/examples/openai_pooling_client.py @@ -0,0 +1,51 @@ +""" +Example online usage of Pooling API. + +Run `vllm serve --task ` +to start up the server in vLLM. 
+""" +import argparse +import pprint + +import requests + + +def post_http_request(prompt: dict, api_url: str) -> requests.Response: + headers = {"User-Agent": "Test Client"} + response = requests.post(api_url, headers=headers, json=prompt) + return response + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--host", type=str, default="localhost") + parser.add_argument("--port", type=int, default=8000) + parser.add_argument("--model", + type=str, + default="jason9693/Qwen2.5-1.5B-apeach") + + args = parser.parse_args() + api_url = f"http://{args.host}:{args.port}/pooling" + model_name = args.model + + # Input like Completions API + prompt = {"model": model_name, "input": "vLLM is great!"} + pooling_response = post_http_request(prompt=prompt, api_url=api_url) + print("Pooling Response:") + pprint.pprint(pooling_response.json()) + + # Input like Chat API + prompt = { + "model": + model_name, + "messages": [{ + "role": "user", + "content": [{ + "type": "text", + "text": "vLLM is great!" + }], + }] + } + pooling_response = post_http_request(prompt=prompt, api_url=api_url) + print("Pooling Response:") + pprint.pprint(pooling_response.json()) diff --git a/tests/entrypoints/openai/test_embedding.py b/tests/entrypoints/openai/test_embedding.py index 9f2b77dde2a7f..b52a5b28c9cff 100644 --- a/tests/entrypoints/openai/test_embedding.py +++ b/tests/entrypoints/openai/test_embedding.py @@ -6,6 +6,7 @@ import pytest_asyncio import requests +from vllm.entrypoints.openai.protocol import EmbeddingResponse from vllm.transformers_utils.tokenizer import get_tokenizer from ...utils import RemoteOpenAIServer @@ -17,6 +18,8 @@ @pytest.fixture(scope="module") def server(): args = [ + "--task", + "embed", # use half precision for speed and memory savings in CI environment "--dtype", "bfloat16", @@ -45,11 +48,14 @@ async def test_single_embedding(client: openai.AsyncOpenAI, model_name: str): ] # test single embedding - embeddings = await client.embeddings.create( + embedding_response = await client.embeddings.create( model=model_name, input=input_texts, encoding_format="float", ) + embeddings = EmbeddingResponse.model_validate( + embedding_response.model_dump(mode="json")) + assert embeddings.id is not None assert len(embeddings.data) == 1 assert len(embeddings.data[0].embedding) == 4096 @@ -59,11 +65,14 @@ async def test_single_embedding(client: openai.AsyncOpenAI, model_name: str): # test using token IDs input_tokens = [1, 1, 1, 1, 1] - embeddings = await client.embeddings.create( + embedding_response = await client.embeddings.create( model=model_name, input=input_tokens, encoding_format="float", ) + embeddings = EmbeddingResponse.model_validate( + embedding_response.model_dump(mode="json")) + assert embeddings.id is not None assert len(embeddings.data) == 1 assert len(embeddings.data[0].embedding) == 4096 @@ -80,11 +89,14 @@ async def test_batch_embedding(client: openai.AsyncOpenAI, model_name: str): "The cat sat on the mat.", "A feline was resting on a rug.", "Stars twinkle brightly in the night sky." 
] - embeddings = await client.embeddings.create( + embedding_response = await client.embeddings.create( model=model_name, input=input_texts, encoding_format="float", ) + embeddings = EmbeddingResponse.model_validate( + embedding_response.model_dump(mode="json")) + assert embeddings.id is not None assert len(embeddings.data) == 3 assert len(embeddings.data[0].embedding) == 4096 @@ -95,11 +107,14 @@ async def test_batch_embedding(client: openai.AsyncOpenAI, model_name: str): # test List[List[int]] input_tokens = [[4, 5, 7, 9, 20], [15, 29, 499], [24, 24, 24, 24, 24], [25, 32, 64, 77]] - embeddings = await client.embeddings.create( + embedding_response = await client.embeddings.create( model=model_name, input=input_tokens, encoding_format="float", ) + embeddings = EmbeddingResponse.model_validate( + embedding_response.model_dump(mode="json")) + assert embeddings.id is not None assert len(embeddings.data) == 4 assert len(embeddings.data[0].embedding) == 4096 @@ -124,14 +139,16 @@ async def test_conversation_embedding(server: RemoteOpenAIServer, "content": "Stars twinkle brightly in the night sky.", }] - chat_response = requests.post(server.url_for("v1/embeddings"), - json={ - "model": model_name, - "messages": messages, - "encoding_format": "float", - }) + chat_response = requests.post( + server.url_for("v1/embeddings"), + json={ + "model": model_name, + "messages": messages, + "encoding_format": "float", + }, + ) chat_response.raise_for_status() - chat_embeddings = chat_response.json() + chat_embeddings = EmbeddingResponse.model_validate(chat_response.json()) tokenizer = get_tokenizer(tokenizer_name=model_name, tokenizer_mode="fast") prompt = tokenizer.apply_chat_template( @@ -148,13 +165,15 @@ async def test_conversation_embedding(server: RemoteOpenAIServer, # To be consistent with chat extra_body={"add_special_tokens": False}, ) - completion_embeddings = completion_response.model_dump(mode="json") + completion_embeddings = EmbeddingResponse.model_validate( + completion_response.model_dump(mode="json")) - assert chat_embeddings.pop("id") is not None - assert completion_embeddings.pop("id") is not None - assert chat_embeddings.pop("created") <= completion_embeddings.pop( - "created") - assert chat_embeddings == completion_embeddings + assert chat_embeddings.id is not None + assert completion_embeddings.id is not None + assert chat_embeddings.created <= completion_embeddings.created + assert chat_embeddings.model_dump( + exclude={"id", "created"}) == (completion_embeddings.model_dump( + exclude={"id", "created"})) @pytest.mark.asyncio @@ -204,10 +223,13 @@ async def test_single_embedding_truncation(client: openai.AsyncOpenAI, ] # test single embedding - embeddings = await client.embeddings.create( + embedding_response = await client.embeddings.create( model=model_name, input=input_texts, extra_body={"truncate_prompt_tokens": 10}) + embeddings = EmbeddingResponse.model_validate( + embedding_response.model_dump(mode="json")) + assert embeddings.id is not None assert len(embeddings.data) == 1 assert len(embeddings.data[0].embedding) == 4096 @@ -219,10 +241,12 @@ async def test_single_embedding_truncation(client: openai.AsyncOpenAI, 1, 24428, 289, 18341, 26165, 285, 19323, 283, 289, 26789, 3871, 28728, 9901, 340, 2229, 385, 340, 315, 28741, 28804, 2 ] - embeddings = await client.embeddings.create( + embedding_response = await client.embeddings.create( model=model_name, input=input_tokens, extra_body={"truncate_prompt_tokens": 10}) + embeddings = EmbeddingResponse.model_validate( + 
embedding_response.model_dump(mode="json")) assert embeddings.id is not None assert len(embeddings.data) == 1 @@ -241,10 +265,10 @@ async def test_single_embedding_truncation_invalid(client: openai.AsyncOpenAI, ] with pytest.raises(openai.BadRequestError): - embeddings = await client.embeddings.create( + response = await client.embeddings.create( model=model_name, input=input_texts, extra_body={"truncate_prompt_tokens": 8193}) - assert "error" in embeddings.object + assert "error" in response.object assert "truncate_prompt_tokens value is greater than max_model_len. "\ - "Please, select a smaller truncation size." in embeddings.message + "Please, select a smaller truncation size." in response.message diff --git a/tests/entrypoints/openai/test_pooling.py b/tests/entrypoints/openai/test_pooling.py new file mode 100644 index 0000000000000..9c49239398cd2 --- /dev/null +++ b/tests/entrypoints/openai/test_pooling.py @@ -0,0 +1,238 @@ +import base64 + +import numpy as np +import pytest +import requests + +from vllm.entrypoints.openai.protocol import PoolingResponse +from vllm.transformers_utils.tokenizer import get_tokenizer + +from ...utils import RemoteOpenAIServer + +MODEL_NAME = "jason9693/Qwen2.5-1.5B-apeach" +DUMMY_CHAT_TEMPLATE = """{% for message in messages %}{{message['role'] + ': ' + message['content'] + '\\n'}}{% endfor %}""" # noqa: E501 + + +@pytest.fixture(scope="module") +def server(): + args = [ + "--task", + "classify", + # use half precision for speed and memory savings in CI environment + "--dtype", + "bfloat16", + "--enforce-eager", + "--max-model-len", + "8192", + "--chat-template", + DUMMY_CHAT_TEMPLATE, + ] + + with RemoteOpenAIServer(MODEL_NAME, args) as remote_server: + yield remote_server + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +async def test_single_pooling(server: RemoteOpenAIServer, model_name: str): + input_texts = [ + "The chef prepared a delicious meal.", + ] + + # test single pooling + response = requests.post( + server.url_for("pooling"), + json={ + "model": model_name, + "input": input_texts, + "encoding_format": "float" + }, + ) + response.raise_for_status() + poolings = PoolingResponse.model_validate(response.json()) + + assert poolings.id is not None + assert len(poolings.data) == 1 + assert len(poolings.data[0].data) == 2 + assert poolings.usage.completion_tokens == 0 + assert poolings.usage.prompt_tokens == 7 + assert poolings.usage.total_tokens == 7 + + # test using token IDs + input_tokens = [1, 1, 1, 1, 1] + response = requests.post( + server.url_for("pooling"), + json={ + "model": model_name, + "input": input_tokens, + "encoding_format": "float" + }, + ) + response.raise_for_status() + poolings = PoolingResponse.model_validate(response.json()) + + assert poolings.id is not None + assert len(poolings.data) == 1 + assert len(poolings.data[0].data) == 2 + assert poolings.usage.completion_tokens == 0 + assert poolings.usage.prompt_tokens == 5 + assert poolings.usage.total_tokens == 5 + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +async def test_batch_pooling(server: RemoteOpenAIServer, model_name: str): + # test List[str] + input_texts = [ + "The cat sat on the mat.", "A feline was resting on a rug.", + "Stars twinkle brightly in the night sky." 
+ ] + response = requests.post( + server.url_for("pooling"), + json={ + "model": model_name, + "input": input_texts, + "encoding_format": "float" + }, + ) + response.raise_for_status() + poolings = PoolingResponse.model_validate(response.json()) + + assert poolings.id is not None + assert len(poolings.data) == 3 + assert len(poolings.data[0].data) == 2 + assert poolings.usage.completion_tokens == 0 + assert poolings.usage.prompt_tokens == 25 + assert poolings.usage.total_tokens == 25 + + # test List[List[int]] + input_tokens = [[4, 5, 7, 9, 20], [15, 29, 499], [24, 24, 24, 24, 24], + [25, 32, 64, 77]] + response = requests.post( + server.url_for("pooling"), + json={ + "model": model_name, + "input": input_tokens, + "encoding_format": "float" + }, + ) + response.raise_for_status() + poolings = PoolingResponse.model_validate(response.json()) + + assert poolings.id is not None + assert len(poolings.data) == 4 + assert len(poolings.data[0].data) == 2 + assert poolings.usage.completion_tokens == 0 + assert poolings.usage.prompt_tokens == 17 + assert poolings.usage.total_tokens == 17 + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +async def test_conversation_pooling(server: RemoteOpenAIServer, + model_name: str): + messages = [{ + "role": "user", + "content": "The cat sat on the mat.", + }, { + "role": "assistant", + "content": "A feline was resting on a rug.", + }, { + "role": "user", + "content": "Stars twinkle brightly in the night sky.", + }] + + chat_response = requests.post( + server.url_for("pooling"), + json={ + "model": model_name, + "messages": messages, + "encoding_format": "float", + }, + ) + chat_response.raise_for_status() + chat_poolings = PoolingResponse.model_validate(chat_response.json()) + + tokenizer = get_tokenizer(tokenizer_name=model_name, tokenizer_mode="fast") + prompt = tokenizer.apply_chat_template( + messages, + chat_template=DUMMY_CHAT_TEMPLATE, + add_generation_prompt=True, + continue_final_message=False, + tokenize=False, + ) + completions_response = requests.post( + server.url_for("pooling"), + json={ + "model": model_name, + "input": prompt, + "encoding_format": "float", + # To be consistent with chat + "add_special_tokens": False, + }, + ) + completions_response.raise_for_status() + completion_poolings = PoolingResponse.model_validate( + completions_response.json()) + + assert chat_poolings.id is not None + assert completion_poolings.id is not None + assert chat_poolings.created <= completion_poolings.created + assert chat_poolings.model_dump( + exclude={"id", "created"}) == (completion_poolings.model_dump( + exclude={"id", "created"})) + + +@pytest.mark.asyncio +@pytest.mark.parametrize("model_name", [MODEL_NAME]) +async def test_batch_base64_pooling(server: RemoteOpenAIServer, + model_name: str): + input_texts = [ + "Hello my name is", + "The best thing about vLLM is that it supports many different models" + ] + + float_response = requests.post( + server.url_for("pooling"), + json={ + "input": input_texts, + "model": model_name, + "encoding_format": "float", + }, + ) + float_response.raise_for_status() + responses_float = PoolingResponse.model_validate(float_response.json()) + + base64_response = requests.post( + server.url_for("pooling"), + json={ + "input": input_texts, + "model": model_name, + "encoding_format": "base64", + }, + ) + base64_response.raise_for_status() + responses_base64 = PoolingResponse.model_validate(base64_response.json()) + + decoded_responses_base64_data = [] + for data in responses_base64.data: + 
decoded_responses_base64_data.append( + np.frombuffer(base64.b64decode(data.data), + dtype="float32").tolist()) + + assert responses_float.data[0].data == decoded_responses_base64_data[0] + assert responses_float.data[1].data == decoded_responses_base64_data[1] + + # Default response is float32 decoded from base64 by OpenAI Client + default_response = requests.post( + server.url_for("pooling"), + json={ + "input": input_texts, + "model": model_name, + }, + ) + default_response.raise_for_status() + responses_default = PoolingResponse.model_validate(default_response.json()) + + assert responses_float.data[0].data == responses_default.data[0].data + assert responses_float.data[1].data == responses_default.data[1].data diff --git a/tests/entrypoints/openai/test_vision_embedding.py b/tests/entrypoints/openai/test_vision_embedding.py index 43c63daacb17f..3731b2dcdeae1 100644 --- a/tests/entrypoints/openai/test_vision_embedding.py +++ b/tests/entrypoints/openai/test_vision_embedding.py @@ -1,9 +1,9 @@ from typing import Dict import pytest -import pytest_asyncio import requests +from vllm.entrypoints.openai.protocol import EmbeddingResponse from vllm.multimodal.utils import encode_image_base64, fetch_image from ...utils import VLLM_PATH, RemoteOpenAIServer @@ -46,12 +46,6 @@ def server(): yield remote_server -@pytest_asyncio.fixture -async def client(server): - async with server.get_async_client() as async_client: - yield async_client - - @pytest.fixture(scope="session") def base64_encoded_image() -> Dict[str, str]: return { @@ -82,18 +76,20 @@ async def test_image_embedding(server: RemoteOpenAIServer, model_name: str, ], }] - response = requests.post(server.url_for("v1/embeddings"), - json={ - "model": model_name, - "messages": messages, - "encoding_format": "float" - }) + response = requests.post( + server.url_for("v1/embeddings"), + json={ + "model": model_name, + "messages": messages, + "encoding_format": "float" + }, + ) response.raise_for_status() - - embeddings = response.json() - assert embeddings["id"] is not None - assert len(embeddings["data"]) == 1 - assert len(embeddings["data"][0]["embedding"]) == 3072 - assert embeddings["usage"]["completion_tokens"] == 0 - assert embeddings["usage"]["prompt_tokens"] == 765 - assert embeddings["usage"]["total_tokens"] == 765 + embeddings = EmbeddingResponse.model_validate(response.json()) + + assert embeddings.id is not None + assert len(embeddings.data) == 1 + assert len(embeddings.data[0].embedding) == 3072 + assert embeddings.usage.completion_tokens == 0 + assert embeddings.usage.prompt_tokens == 765 + assert embeddings.usage.total_tokens == 765 diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py index 2e5b769a825ce..3e50613a73dd3 100644 --- a/vllm/entrypoints/openai/api_server.py +++ b/vllm/entrypoints/openai/api_server.py @@ -45,8 +45,11 @@ DetokenizeRequest, DetokenizeResponse, EmbeddingRequest, - EmbeddingResponse, ErrorResponse, + EmbeddingResponse, + EmbeddingResponseData, + ErrorResponse, LoadLoraAdapterRequest, + PoolingRequest, PoolingResponse, ScoreRequest, ScoreResponse, TokenizeRequest, TokenizeResponse, @@ -56,6 +59,7 @@ from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion from vllm.entrypoints.openai.serving_embedding import OpenAIServingEmbedding from vllm.entrypoints.openai.serving_engine import BaseModelPath, OpenAIServing +from vllm.entrypoints.openai.serving_pooling import OpenAIServingPooling from vllm.entrypoints.openai.serving_score import OpenAIServingScores 
from vllm.entrypoints.openai.serving_tokenization import ( OpenAIServingTokenization) @@ -284,6 +288,10 @@ def completion(request: Request) -> Optional[OpenAIServingCompletion]: return request.app.state.openai_serving_completion +def pooling(request: Request) -> Optional[OpenAIServingPooling]: + return request.app.state.openai_serving_pooling + + def embedding(request: Request) -> Optional[OpenAIServingEmbedding]: return request.app.state.openai_serving_embedding @@ -395,10 +403,36 @@ async def create_completion(request: CompletionRequest, raw_request: Request): async def create_embedding(request: EmbeddingRequest, raw_request: Request): handler = embedding(raw_request) if handler is None: - return base(raw_request).create_error_response( - message="The model does not support Embeddings API") + fallback_handler = pooling(raw_request) + if fallback_handler is None: + return base(raw_request).create_error_response( + message="The model does not support Embeddings API") + + logger.warning( + "Embeddings API will become exclusive to embedding models " + "in a future release. To return the hidden states directly, " + "use the Pooling API (`/pooling`) instead.") + + res = await fallback_handler.create_pooling(request, raw_request) + if isinstance(res, PoolingResponse): + generator = EmbeddingResponse( + id=res.id, + object=res.object, + created=res.created, + model=res.model, + data=[ + EmbeddingResponseData( + index=d.index, + embedding=d.data, # type: ignore + ) for d in res.data + ], + usage=res.usage, + ) + else: + generator = res + else: + generator = await handler.create_embedding(request, raw_request) - generator = await handler.create_embedding(request, raw_request) if isinstance(generator, ErrorResponse): return JSONResponse(content=generator.model_dump(), status_code=generator.code) @@ -408,6 +442,24 @@ async def create_embedding(request: EmbeddingRequest, raw_request: Request): assert_never(generator) +@router.post("/pooling") +@with_cancellation +async def create_pooling(request: PoolingRequest, raw_request: Request): + handler = pooling(raw_request) + if handler is None: + return base(raw_request).create_error_response( + message="The model does not support Pooling API") + + generator = await handler.create_pooling(request, raw_request) + if isinstance(generator, ErrorResponse): + return JSONResponse(content=generator.model_dump(), + status_code=generator.code) + elif isinstance(generator, PoolingResponse): + return JSONResponse(content=generator.model_dump()) + + assert_never(generator) + + @router.post("/score") @with_cancellation async def create_score(request: ScoreRequest, raw_request: Request): @@ -605,7 +657,7 @@ def init_app_state( request_logger=request_logger, return_tokens_as_token_ids=args.return_tokens_as_token_ids, ) if model_config.runner_type == "generate" else None - state.openai_serving_embedding = OpenAIServingEmbedding( + state.openai_serving_pooling = OpenAIServingPooling( engine_client, model_config, base_model_paths, @@ -613,13 +665,20 @@ def init_app_state( chat_template=resolved_chat_template, chat_template_content_format=args.chat_template_content_format, ) if model_config.runner_type == "pooling" else None + state.openai_serving_embedding = OpenAIServingEmbedding( + engine_client, + model_config, + base_model_paths, + request_logger=request_logger, + chat_template=resolved_chat_template, + chat_template_content_format=args.chat_template_content_format, + ) if model_config.task == "embed" else None state.openai_serving_scores = OpenAIServingScores( 
engine_client, model_config, base_model_paths, request_logger=request_logger - ) if (model_config.runner_type == "pooling" \ - and model_config.is_cross_encoder) else None + ) if model_config.task == "score" else None state.openai_serving_tokenization = OpenAIServingTokenization( engine_client, model_config, diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py index 1d8b0d19f9516..14e41346df775 100644 --- a/vllm/entrypoints/openai/protocol.py +++ b/vllm/entrypoints/openai/protocol.py @@ -963,6 +963,10 @@ def to_pooling_params(self): EmbeddingRequest = Union[EmbeddingCompletionRequest, EmbeddingChatRequest] +PoolingCompletionRequest = EmbeddingCompletionRequest +PoolingChatRequest = EmbeddingChatRequest +PoolingRequest = Union[PoolingCompletionRequest, PoolingChatRequest] + class ScoreRequest(OpenAIBaseModel): model: str @@ -1058,6 +1062,21 @@ class EmbeddingResponse(OpenAIBaseModel): usage: UsageInfo +class PoolingResponseData(OpenAIBaseModel): + index: int + object: str = "pooling" + data: Union[List[List[float]], List[float], str] + + +class PoolingResponse(OpenAIBaseModel): + id: str = Field(default_factory=lambda: f"pool-{random_uuid()}") + object: str = "list" + created: int = Field(default_factory=lambda: int(time.time())) + model: str + data: List[PoolingResponseData] + usage: UsageInfo + + class ScoreResponseData(OpenAIBaseModel): index: int object: str = "score" diff --git a/vllm/entrypoints/openai/run_batch.py b/vllm/entrypoints/openai/run_batch.py index 675daf54c0d0d..572ed27b39083 100644 --- a/vllm/entrypoints/openai/run_batch.py +++ b/vllm/entrypoints/openai/run_batch.py @@ -232,7 +232,7 @@ async def main(args): request_logger=request_logger, chat_template=None, chat_template_content_format="auto", - ) if model_config.runner_type == "pooling" else None + ) if model_config.task == "embed" else None tracker = BatchProgressTracker() logger.info("Reading batch from %s...", args.input_file) diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py index 879276646d2ba..b8fb9d6bd77f2 100644 --- a/vllm/entrypoints/openai/serving_embedding.py +++ b/vllm/entrypoints/openai/serving_embedding.py @@ -40,36 +40,6 @@ def _get_embedding( assert_never(encoding_format) -def request_output_to_embedding_response( - final_res_batch: List[PoolingRequestOutput], request_id: str, - created_time: int, model_name: str, - encoding_format: Literal["float", "base64"]) -> EmbeddingResponse: - data: List[EmbeddingResponseData] = [] - num_prompt_tokens = 0 - for idx, final_res in enumerate(final_res_batch): - embedding_res = EmbeddingRequestOutput.from_base(final_res) - prompt_token_ids = final_res.prompt_token_ids - - embedding = _get_embedding(embedding_res.outputs, encoding_format) - embedding_data = EmbeddingResponseData(index=idx, embedding=embedding) - data.append(embedding_data) - - num_prompt_tokens += len(prompt_token_ids) - - usage = UsageInfo( - prompt_tokens=num_prompt_tokens, - total_tokens=num_prompt_tokens, - ) - - return EmbeddingResponse( - id=request_id, - created=created_time, - model=model_name, - data=data, - usage=usage, - ) - - class OpenAIServingEmbedding(OpenAIServing): def __init__( @@ -114,7 +84,7 @@ async def create_embedding( model_name = request.model request_id = f"embd-{self._base_request_id(raw_request)}" - created_time = int(time.monotonic()) + created_time = int(time.time()) truncate_prompt_tokens = None @@ -218,9 +188,13 @@ async def create_embedding( final_res_batch_checked = 
cast(List[PoolingRequestOutput], final_res_batch) - response = request_output_to_embedding_response( - final_res_batch_checked, request_id, created_time, model_name, - encoding_format) + response = self.request_output_to_embedding_response( + final_res_batch_checked, + request_id, + created_time, + model_name, + encoding_format, + ) except asyncio.CancelledError: return self.create_error_response("Client disconnected") except ValueError as e: @@ -228,3 +202,40 @@ async def create_embedding( return self.create_error_response(str(e)) return response + + def request_output_to_embedding_response( + self, + final_res_batch: List[PoolingRequestOutput], + request_id: str, + created_time: int, + model_name: str, + encoding_format: Literal["float", "base64"], + ) -> EmbeddingResponse: + items: List[EmbeddingResponseData] = [] + num_prompt_tokens = 0 + + for idx, final_res in enumerate(final_res_batch): + embedding_res = EmbeddingRequestOutput.from_base(final_res) + + item = EmbeddingResponseData( + index=idx, + embedding=_get_embedding(embedding_res.outputs, + encoding_format), + ) + prompt_token_ids = final_res.prompt_token_ids + + items.append(item) + num_prompt_tokens += len(prompt_token_ids) + + usage = UsageInfo( + prompt_tokens=num_prompt_tokens, + total_tokens=num_prompt_tokens, + ) + + return EmbeddingResponse( + id=request_id, + created=created_time, + model=model_name, + data=items, + usage=usage, + ) diff --git a/vllm/entrypoints/openai/serving_pooling.py b/vllm/entrypoints/openai/serving_pooling.py new file mode 100644 index 0000000000000..01852f0df1eca --- /dev/null +++ b/vllm/entrypoints/openai/serving_pooling.py @@ -0,0 +1,234 @@ +import asyncio +import base64 +import time +from typing import AsyncGenerator, Final, List, Literal, Optional, Union, cast + +import numpy as np +from fastapi import Request +from typing_extensions import assert_never + +from vllm.config import ModelConfig +from vllm.engine.protocol import EngineClient +from vllm.entrypoints.chat_utils import ChatTemplateContentFormatOption +from vllm.entrypoints.logger import RequestLogger +from vllm.entrypoints.openai.protocol import (ErrorResponse, + PoolingChatRequest, + PoolingRequest, PoolingResponse, + PoolingResponseData, UsageInfo) +from vllm.entrypoints.openai.serving_engine import BaseModelPath, OpenAIServing +from vllm.logger import init_logger +from vllm.outputs import PoolingOutput, PoolingRequestOutput +from vllm.utils import merge_async_iterators + +logger = init_logger(__name__) + + +def _get_data( + output: PoolingOutput, + encoding_format: Literal["float", "base64"], +) -> Union[List[float], str]: + if encoding_format == "float": + return output.data.tolist() + elif encoding_format == "base64": + # Force to use float32 for base64 encoding + # to match the OpenAI python client behavior + pooling_bytes = np.array(output.data, dtype="float32").tobytes() + return base64.b64encode(pooling_bytes).decode("utf-8") + + assert_never(encoding_format) + + +class OpenAIServingPooling(OpenAIServing): + + def __init__( + self, + engine_client: EngineClient, + model_config: ModelConfig, + base_model_paths: List[BaseModelPath], + *, + request_logger: Optional[RequestLogger], + chat_template: Optional[str], + chat_template_content_format: ChatTemplateContentFormatOption, + ) -> None: + super().__init__(engine_client=engine_client, + model_config=model_config, + base_model_paths=base_model_paths, + lora_modules=None, + prompt_adapters=None, + request_logger=request_logger) + + self.chat_template = chat_template + 
self.chat_template_content_format: Final = chat_template_content_format + + async def create_pooling( + self, + request: PoolingRequest, + raw_request: Optional[Request] = None, + ) -> Union[PoolingResponse, ErrorResponse]: + """ + See https://platform.openai.com/docs/api-reference/embeddings/create + for the API specification. This API mimics the OpenAI Embedding API. + """ + error_check_ret = await self._check_model(request) + if error_check_ret is not None: + return error_check_ret + + encoding_format = request.encoding_format + if request.dimensions is not None: + return self.create_error_response( + "dimensions is currently not supported") + + model_name = request.model + request_id = f"pool-{self._base_request_id(raw_request)}" + created_time = int(time.time()) + + truncate_prompt_tokens = None + + if request.truncate_prompt_tokens is not None: + if request.truncate_prompt_tokens <= self.max_model_len: + truncate_prompt_tokens = request.truncate_prompt_tokens + else: + return self.create_error_response( + "truncate_prompt_tokens value is " + "greater than max_model_len." + " Please, select a smaller truncation size.") + + try: + ( + lora_request, + prompt_adapter_request, + ) = self._maybe_get_adapters(request) + + tokenizer = await self.engine_client.get_tokenizer(lora_request) + + if prompt_adapter_request is not None: + raise NotImplementedError("Prompt adapter is not supported " + "for pooling models") + + if isinstance(request, PoolingChatRequest): + ( + _, + request_prompts, + engine_prompts, + ) = await self._preprocess_chat( + request, + tokenizer, + request.messages, + chat_template=request.chat_template or self.chat_template, + chat_template_content_format=self. + chat_template_content_format, + # In pooling requests, we are not generating tokens, + # so there is no need to append extra tokens to the input + add_generation_prompt=False, + continue_final_message=False, + truncate_prompt_tokens=truncate_prompt_tokens, + add_special_tokens=request.add_special_tokens, + ) + else: + (request_prompts, + engine_prompts) = await self._preprocess_completion( + request, + tokenizer, + request.input, + truncate_prompt_tokens=truncate_prompt_tokens, + add_special_tokens=request.add_special_tokens, + ) + except ValueError as e: + logger.exception("Error in preprocessing prompt inputs") + return self.create_error_response(str(e)) + + # Schedule the request and get the result generator. 
+ generators: List[AsyncGenerator[PoolingRequestOutput, None]] = [] + try: + pooling_params = request.to_pooling_params() + + for i, engine_prompt in enumerate(engine_prompts): + request_id_item = f"{request_id}-{i}" + + self._log_inputs(request_id_item, + request_prompts[i], + params=pooling_params, + lora_request=lora_request, + prompt_adapter_request=prompt_adapter_request) + + trace_headers = (None if raw_request is None else await + self._get_trace_headers(raw_request.headers)) + + generator = self.engine_client.encode( + engine_prompt, + pooling_params, + request_id_item, + lora_request=lora_request, + trace_headers=trace_headers, + priority=request.priority, + ) + + generators.append(generator) + except ValueError as e: + # TODO: Use a vllm-specific Validation Error + return self.create_error_response(str(e)) + + result_generator = merge_async_iterators(*generators) + + num_prompts = len(engine_prompts) + + # Non-streaming response + final_res_batch: List[Optional[PoolingRequestOutput]] + final_res_batch = [None] * num_prompts + try: + async for i, res in result_generator: + final_res_batch[i] = res + + assert all(final_res is not None for final_res in final_res_batch) + + final_res_batch_checked = cast(List[PoolingRequestOutput], + final_res_batch) + + response = self.request_output_to_pooling_response( + final_res_batch_checked, + request_id, + created_time, + model_name, + encoding_format, + ) + except asyncio.CancelledError: + return self.create_error_response("Client disconnected") + except ValueError as e: + # TODO: Use a vllm-specific Validation Error + return self.create_error_response(str(e)) + + return response + + def request_output_to_pooling_response( + self, + final_res_batch: List[PoolingRequestOutput], + request_id: str, + created_time: int, + model_name: str, + encoding_format: Literal["float", "base64"], + ) -> PoolingResponse: + items: List[PoolingResponseData] = [] + num_prompt_tokens = 0 + + for idx, final_res in enumerate(final_res_batch): + item = PoolingResponseData( + index=idx, + data=_get_data(final_res.outputs, encoding_format), + ) + prompt_token_ids = final_res.prompt_token_ids + + items.append(item) + num_prompt_tokens += len(prompt_token_ids) + + usage = UsageInfo( + prompt_tokens=num_prompt_tokens, + total_tokens=num_prompt_tokens, + ) + + return PoolingResponse( + id=request_id, + created=created_time, + model=model_name, + data=items, + usage=usage, + ) diff --git a/vllm/entrypoints/openai/serving_score.py b/vllm/entrypoints/openai/serving_score.py index 101d170bee4d6..a8a126e697641 100644 --- a/vllm/entrypoints/openai/serving_score.py +++ b/vllm/entrypoints/openai/serving_score.py @@ -20,32 +20,6 @@ logger = init_logger(__name__) -def request_output_to_score_response( - final_res_batch: List[PoolingRequestOutput], request_id: str, - created_time: int, model_name: str) -> ScoreResponse: - data: List[ScoreResponseData] = [] - num_prompt_tokens = 0 - for idx, final_res in enumerate(final_res_batch): - classify_res = ScoringRequestOutput.from_base(final_res) - - score_data = ScoreResponseData(index=idx, - score=classify_res.outputs.score) - data.append(score_data) - - usage = UsageInfo( - prompt_tokens=num_prompt_tokens, - total_tokens=num_prompt_tokens, - ) - - return ScoreResponse( - id=request_id, - created=created_time, - model=model_name, - data=data, - usage=usage, - ) - - def make_pairs(text_1: Union[List[str], str], text_2: Union[List[str], str]) -> List: if isinstance(text_1, (str, dict)): @@ -103,7 +77,7 @@ async def create_score( model_name 
= request.model request_id = f"score-{self._base_request_id(raw_request)}" - created_time = int(time.monotonic()) + created_time = int(time.time()) truncate_prompt_tokens = request.truncate_prompt_tokens request_prompts = [] @@ -203,8 +177,12 @@ async def create_score( final_res_batch_checked = cast(List[PoolingRequestOutput], final_res_batch) - response = request_output_to_score_response( - final_res_batch_checked, request_id, created_time, model_name) + response = self.request_output_to_score_response( + final_res_batch_checked, + request_id, + created_time, + model_name, + ) except asyncio.CancelledError: return self.create_error_response("Client disconnected") except ValueError as e: @@ -212,3 +190,38 @@ async def create_score( return self.create_error_response(str(e)) return response + + def request_output_to_score_response( + self, + final_res_batch: List[PoolingRequestOutput], + request_id: str, + created_time: int, + model_name: str, + ) -> ScoreResponse: + items: List[ScoreResponseData] = [] + num_prompt_tokens = 0 + + for idx, final_res in enumerate(final_res_batch): + classify_res = ScoringRequestOutput.from_base(final_res) + + item = ScoreResponseData( + index=idx, + score=classify_res.outputs.score, + ) + prompt_token_ids = final_res.prompt_token_ids + + items.append(item) + num_prompt_tokens += len(prompt_token_ids) + + usage = UsageInfo( + prompt_tokens=num_prompt_tokens, + total_tokens=num_prompt_tokens, + ) + + return ScoreResponse( + id=request_id, + created=created_time, + model=model_name, + data=items, + usage=usage, + ) diff --git a/vllm/outputs.py b/vllm/outputs.py index 2ecdf74ee59b3..b519c159b1531 100644 --- a/vllm/outputs.py +++ b/vllm/outputs.py @@ -355,7 +355,8 @@ def from_seq_group(seq_group: SequenceGroup) -> "PoolingRequestOutput": pooled_data = seq_group.pooled_data assert pooled_data is not None - output = PoolingOutput(pooled_data) + data = pooled_data.to(dtype=torch.float32, device="cpu") + output = PoolingOutput(data) prompt_token_ids = seq_group.prompt_token_ids finished = seq_group.is_finished()