[Frontend] Online Pooling API (#11457)

Signed-off-by: DarkLight1337 <[email protected]>
vllm-project · Dec 24, 2024 · 9edca6b · 9edca6b
1 parent 4f074fb
commit 9edca6b
Show file tree

Hide file tree

Showing 15 changed files with 809 additions and 157 deletions.
diff --git a/docs/source/models/generative_models.md b/docs/source/models/generative_models.md
@@ -120,19 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
 
 ## Online Inference
 
-Our [OpenAI Compatible Server](../serving/openai_compatible_server) can be used for online inference.
-Please click on the above link for more details on how to launch the server.
+Our [OpenAI Compatible Server](../serving/openai_compatible_server) provides endpoints that correspond to the offline APIs:
 
-### Completions API
-
-Our Completions API is similar to `LLM.generate` but only accepts text.
-It is compatible with [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)
-so that you can use OpenAI client to interact with it.
-A code example can be found in [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).
-
-### Chat API
-
-Our Chat API is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs).
-It is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
-so that you can use OpenAI client to interact with it.
-A code example can be found in [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
+- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
+- [Chat API](#chat-api)  is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
diff --git a/docs/source/models/pooling_models.md b/docs/source/models/pooling_models.md
@@ -106,22 +106,8 @@ A code example can be found in [examples/offline_inference_scoring.py](https://g
 
 ## Online Inference
 
-Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) can be used for online inference.
-Please click on the above link for more details on how to launch the server.
+Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
 
-### Embeddings API
-
-Our Embeddings API is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs).
-
-The text-only API is compatible with [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
-so that you can use OpenAI client to interact with it.
-A code example can be found in [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).
-
-The multi-modal API is an extension of the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
-that incorporates [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat),
-so it is not part of the OpenAI standard. Please see [](#multimodal-inputs) for more details on how to use it.
-
-### Score API
-
-Our Score API is similar to `LLM.score`.
-Please see [this page](#score-api) for more details on how to use it.
+- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
+- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
+- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models.
diff --git a/docs/source/serving/openai_compatible_server.md b/docs/source/serving/openai_compatible_server.md
@@ -42,6 +42,8 @@ In addition, we have the following custom APIs:
 
 - [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
   - Applicable to any model with a tokenizer.
+- [Pooling API](#pooling-api) (`/pooling`)
+  - Applicable to all [pooling models](../models/pooling_models.md).
 - [Score API](#score-api) (`/score`)
   - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
 
@@ -179,7 +181,12 @@ The order of priorities is `command line > config file values > defaults`.
 (completions-api)=
 ### Completions API
 
-Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/completions) for more details.
+Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
+you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
+
+#### Code example
+
+See [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).
 
 #### Extra parameters
 
@@ -200,15 +207,20 @@ The following extra parameters are supported:
 ```
 
 (chat-api)=
-### Chat Completions API
+### Chat API
 
-Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/chat) for more details.
+Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
+you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
 
 We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
 [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
 see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
 - *Note: `image_url.detail` parameter is not supported.*
 
+#### Code example
+
+See [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
+
 #### Extra parameters
 
 The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported.
@@ -230,15 +242,20 @@ The following extra parameters are supported:
 (embeddings-api)=
 ### Embeddings API
 
-Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/embeddings) for more details.
+Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
+you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
 
-If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat Completions API](#chat-api))
+If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
 which will be treated as a single prompt to the model.
 
 ```{tip}
-This enables multi-modal inputs to be passed to embedding models, see [this page](../usage/multimodal_inputs.md) for details.
+This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
 ```
 
+#### Code example
+
+See [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).
+
 #### Extra parameters
 
 The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported.
@@ -268,20 +285,35 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
 (tokenizer-api)=
 ### Tokenizer API
 
-The Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
+Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
 It consists of two endpoints:
 
 - `/tokenize` corresponds to calling `tokenizer.encode()`.
 - `/detokenize` corresponds to calling `tokenizer.decode()`.
 
+(pooling-api)=
+### Pooling API
+
+Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
+
+The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
+
+#### Code example
+
+See [examples/openai_pooling_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_pooling_client.py).
+
 (score-api)=
 ### Score API
 
-The Score API applies a cross-encoder model to predict scores for sentence pairs.
+Our Score API applies a cross-encoder model to predict scores for sentence pairs.
 Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
 
 You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
 
+#### Code example
+
+See [examples/openai_cross_encoder_score.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_cross_encoder_score.py).
+
 #### Single inference
 
 You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.

diff --git a/examples/openai_cross_encoder_score.py b/examples/openai_cross_encoder_score.py
@@ -20,9 +20,9 @@ def post_http_request(prompt: dict, api_url: str) -> requests.Response:
     parser.add_argument("--host", type=str, default="localhost")
     parser.add_argument("--port", type=int, default=8000)
     parser.add_argument("--model", type=str, default="BAAI/bge-reranker-v2-m3")
+
     args = parser.parse_args()
     api_url = f"http://{args.host}:{args.port}/score"
-
     model_name = args.model
 
     text_1 = "What is the capital of Brazil?"

diff --git a/examples/openai_pooling_client.py b/examples/openai_pooling_client.py
@@ -0,0 +1,51 @@
+"""
+Example online usage of Pooling API.
+
+Run `vllm serve <model> --task <embed|classify|reward|score>`
+to start up the server in vLLM.
+"""
+import argparse
+import pprint
+
+import requests
+
+
+def post_http_request(prompt: dict, api_url: str) -> requests.Response:
+    headers = {"User-Agent": "Test Client"}
+    response = requests.post(api_url, headers=headers, json=prompt)
+    return response
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--host", type=str, default="localhost")
+    parser.add_argument("--port", type=int, default=8000)
+    parser.add_argument("--model",
+                        type=str,
+                        default="jason9693/Qwen2.5-1.5B-apeach")
+
+    args = parser.parse_args()
+    api_url = f"http://{args.host}:{args.port}/pooling"
+    model_name = args.model
+
+    # Input like Completions API
+    prompt = {"model": model_name, "input": "vLLM is great!"}
+    pooling_response = post_http_request(prompt=prompt, api_url=api_url)
+    print("Pooling Response:")
+    pprint.pprint(pooling_response.json())
+
+    # Input like Chat API
+    prompt = {
+        "model":
+        model_name,
+        "messages": [{
+            "role": "user",
+            "content": [{
+                "type": "text",
+                "text": "vLLM is great!"
+            }],
+        }]
+    }
+    pooling_response = post_http_request(prompt=prompt, api_url=api_url)
+    print("Pooling Response:")
+    pprint.pprint(pooling_response.json())
diff --git a/tests/entrypoints/openai/test_embedding.py b/tests/entrypoints/openai/test_embedding.py
@@ -6,6 +6,7 @@
 import pytest_asyncio
 import requests
 
+from vllm.entrypoints.openai.protocol import EmbeddingResponse
 from vllm.transformers_utils.tokenizer import get_tokenizer
 
 from ...utils import RemoteOpenAIServer
@@ -17,6 +18,8 @@
 @pytest.fixture(scope="module")
 def server():
     args = [
+        "--task",
+        "embed",
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         "bfloat16",
@@ -45,11 +48,14 @@ async def test_single_embedding(client: openai.AsyncOpenAI, model_name: str):
     ]
 
     # test single embedding
-    embeddings = await client.embeddings.create(
+    embedding_response = await client.embeddings.create(
         model=model_name,
         input=input_texts,
         encoding_format="float",
     )
+    embeddings = EmbeddingResponse.model_validate(
+        embedding_response.model_dump(mode="json"))
+
     assert embeddings.id is not None
     assert len(embeddings.data) == 1
     assert len(embeddings.data[0].embedding) == 4096
@@ -59,11 +65,14 @@ async def test_single_embedding(client: openai.AsyncOpenAI, model_name: str):
 
     # test using token IDs
     input_tokens = [1, 1, 1, 1, 1]
-    embeddings = await client.embeddings.create(
+    embedding_response = await client.embeddings.create(
         model=model_name,
         input=input_tokens,
         encoding_format="float",
     )
+    embeddings = EmbeddingResponse.model_validate(
+        embedding_response.model_dump(mode="json"))
+
     assert embeddings.id is not None
     assert len(embeddings.data) == 1
     assert len(embeddings.data[0].embedding) == 4096
@@ -80,11 +89,14 @@ async def test_batch_embedding(client: openai.AsyncOpenAI, model_name: str):
         "The cat sat on the mat.", "A feline was resting on a rug.",
         "Stars twinkle brightly in the night sky."
     ]
-    embeddings = await client.embeddings.create(
+    embedding_response = await client.embeddings.create(
         model=model_name,
         input=input_texts,
         encoding_format="float",
     )
+    embeddings = EmbeddingResponse.model_validate(
+        embedding_response.model_dump(mode="json"))
+
     assert embeddings.id is not None
     assert len(embeddings.data) == 3
     assert len(embeddings.data[0].embedding) == 4096
@@ -95,11 +107,14 @@ async def test_batch_embedding(client: openai.AsyncOpenAI, model_name: str):
     # test List[List[int]]
     input_tokens = [[4, 5, 7, 9, 20], [15, 29, 499], [24, 24, 24, 24, 24],
                     [25, 32, 64, 77]]
-    embeddings = await client.embeddings.create(
+    embedding_response = await client.embeddings.create(
         model=model_name,
         input=input_tokens,
         encoding_format="float",
     )
+    embeddings = EmbeddingResponse.model_validate(
+        embedding_response.model_dump(mode="json"))
+
     assert embeddings.id is not None
     assert len(embeddings.data) == 4
     assert len(embeddings.data[0].embedding) == 4096
@@ -124,14 +139,16 @@ async def test_conversation_embedding(server: RemoteOpenAIServer,
         "content": "Stars twinkle brightly in the night sky.",
     }]
 
-    chat_response = requests.post(server.url_for("v1/embeddings"),
-                                  json={
-                                      "model": model_name,
-                                      "messages": messages,
-                                      "encoding_format": "float",
-                                  })
+    chat_response = requests.post(
+        server.url_for("v1/embeddings"),
+        json={
+            "model": model_name,
+            "messages": messages,
+            "encoding_format": "float",
+        },
+    )
     chat_response.raise_for_status()
-    chat_embeddings = chat_response.json()
+    chat_embeddings = EmbeddingResponse.model_validate(chat_response.json())
 
     tokenizer = get_tokenizer(tokenizer_name=model_name, tokenizer_mode="fast")
     prompt = tokenizer.apply_chat_template(
@@ -148,13 +165,15 @@ async def test_conversation_embedding(server: RemoteOpenAIServer,
         # To be consistent with chat
         extra_body={"add_special_tokens": False},
     )
-    completion_embeddings = completion_response.model_dump(mode="json")
+    completion_embeddings = EmbeddingResponse.model_validate(
+        completion_response.model_dump(mode="json"))
 
-    assert chat_embeddings.pop("id") is not None
-    assert completion_embeddings.pop("id") is not None
-    assert chat_embeddings.pop("created") <= completion_embeddings.pop(
-        "created")
-    assert chat_embeddings == completion_embeddings
+    assert chat_embeddings.id is not None
+    assert completion_embeddings.id is not None
+    assert chat_embeddings.created <= completion_embeddings.created
+    assert chat_embeddings.model_dump(
+        exclude={"id", "created"}) == (completion_embeddings.model_dump(
+            exclude={"id", "created"}))
 
 
 @pytest.mark.asyncio
@@ -204,10 +223,13 @@ async def test_single_embedding_truncation(client: openai.AsyncOpenAI,
     ]
 
     # test single embedding
-    embeddings = await client.embeddings.create(
+    embedding_response = await client.embeddings.create(
         model=model_name,
         input=input_texts,
         extra_body={"truncate_prompt_tokens": 10})
+    embeddings = EmbeddingResponse.model_validate(
+        embedding_response.model_dump(mode="json"))
+
     assert embeddings.id is not None
     assert len(embeddings.data) == 1
     assert len(embeddings.data[0].embedding) == 4096
@@ -219,10 +241,12 @@ async def test_single_embedding_truncation(client: openai.AsyncOpenAI,
         1, 24428, 289, 18341, 26165, 285, 19323, 283, 289, 26789, 3871, 28728,
         9901, 340, 2229, 385, 340, 315, 28741, 28804, 2
     ]
-    embeddings = await client.embeddings.create(
+    embedding_response = await client.embeddings.create(
         model=model_name,
         input=input_tokens,
         extra_body={"truncate_prompt_tokens": 10})
+    embeddings = EmbeddingResponse.model_validate(
+        embedding_response.model_dump(mode="json"))
 
     assert embeddings.id is not None
     assert len(embeddings.data) == 1
@@ -241,10 +265,10 @@ async def test_single_embedding_truncation_invalid(client: openai.AsyncOpenAI,
     ]
 
     with pytest.raises(openai.BadRequestError):
-        embeddings = await client.embeddings.create(
+        response = await client.embeddings.create(
             model=model_name,
             input=input_texts,
             extra_body={"truncate_prompt_tokens": 8193})
-        assert "error" in embeddings.object
+        assert "error" in response.object
         assert "truncate_prompt_tokens value is greater than max_model_len. "\
-               "Please, select a smaller truncation size." in embeddings.message
+               "Please, select a smaller truncation size." in response.message