[Frontend] Online Pooling API (#11457)
Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 authored Dec 24, 2024
1 parent 4f074fb commit 9edca6b
Showing 15 changed files with 809 additions and 157 deletions.
18 changes: 3 additions & 15 deletions docs/source/models/generative_models.md
@@ -120,19 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)

## Online Inference

Our [OpenAI Compatible Server](../serving/openai_compatible_server) can be used for online inference.
Please click on the above link for more details on how to launch the server.
Our [OpenAI Compatible Server](../serving/openai_compatible_server) provides endpoints that correspond to the offline APIs:

### Completions API

Our Completions API is similar to `LLM.generate` but only accepts text.
It is compatible with [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)
so that you can use OpenAI client to interact with it.
A code example can be found in [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).

### Chat API

Our Chat API is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs).
It is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
so that you can use OpenAI client to interact with it.
A code example can be found in [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
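To make the correspondence concrete, here is a minimal client-side sketch (not part of this diff); the base URL, port, and model name are assumptions about a locally running `vllm serve` instance:

```python
# Minimal sketch, assuming a server started with `vllm serve <model>` on the
# default port. The model name below is a placeholder, not from this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Completions API ~ LLM.generate (plain text in, text out)
completion = client.completions.create(
    model="facebook/opt-125m",  # placeholder model
    prompt="San Francisco is a",
    max_tokens=16,
)
print(completion.choices[0].text)

# Chat API ~ LLM.chat (messages in, assistant message out)
chat = client.chat.completions.create(
    model="facebook/opt-125m",  # placeholder; a chat template is needed in practice
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)
```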
22 changes: 4 additions & 18 deletions docs/source/models/pooling_models.md
@@ -106,22 +106,8 @@ A code example can be found in [examples/offline_inference_scoring.py](https://g

## Online Inference

Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) can be used for online inference.
Please click on the above link for more details on how to launch the server.
Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

### Embeddings API

Our Embeddings API is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs).

The text-only API is compatible with [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
so that you can use OpenAI client to interact with it.
A code example can be found in [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).

The multi-modal API is an extension of the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
that incorporates [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat),
so it is not part of the OpenAI standard. Please see [](#multimodal-inputs) for more details on how to use it.

### Score API

Our Score API is similar to `LLM.score`.
Please see [this page](#score-api) for more details on how to use it.
- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models.
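For orientation, a hedged sketch of how the new online Pooling endpoint might be exercised; the host, port, and task flag are assumptions, and plain `requests` is used because the Pooling and Score APIs are vLLM extensions rather than OpenAI endpoints:

```python
# Sketch, assuming `vllm serve jason9693/Qwen2.5-1.5B-apeach --task classify`
# (model name taken from the example added in this PR; adjust to your deployment).
import requests

resp = requests.post(
    "http://localhost:8000/pooling",
    headers={"User-Agent": "Test Client"},
    json={"model": "jason9693/Qwen2.5-1.5B-apeach", "input": "vLLM is great!"},
)
resp.raise_for_status()
print(resp.json())  # pooled hidden states; may be an arbitrarily nested list
```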
48 changes: 40 additions & 8 deletions docs/source/serving/openai_compatible_server.md
@@ -42,6 +42,8 @@ In addition, we have the following custom APIs:

- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
- Applicable to any model with a tokenizer.
- [Pooling API](#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).

@@ -179,7 +181,12 @@ The order of priorities is `command line > config file values > defaults`.
(completions-api)=
### Completions API

Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/completions) for more details.
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

#### Code example

See [examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).
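As a quick illustration of the compatibility claim above (the server address and model name are assumptions, not taken from this commit), a hedged sketch including streaming:

```python
# Sketch only: assumes `vllm serve <model>` is listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Streaming variant of the Completions API; tokens arrive as they are generated.
stream = client.completions.create(
    model="facebook/opt-125m",  # placeholder model name
    prompt="vLLM is a",
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```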

#### Extra parameters

@@ -200,15 +207,20 @@ The following extra parameters are supported:
```

(chat-api)=
### Chat Completions API
### Chat API

Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/chat) for more details.
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../usage/multimodal_inputs.md) guide for more information.
- *Note: `image_url.detail` parameter is not supported.*

#### Code example

See [examples/openai_chat_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py).
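A hedged sketch of the Vision-style input mentioned above; the model name and image URL are placeholders, and a multi-modal model must be served for this to work:

```python
# Assumed setup: a multi-modal model launched with `vllm serve <model>`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # hypothetical multi-modal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # image_url.detail is not supported, so it is omitted here.
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```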

#### Extra parameters

The following [sampling parameters (click through to see documentation)](../dev/sampling_params.md) are supported.
@@ -230,15 +242,20 @@ The following extra parameters are supported:
(embeddings-api)=
### Embeddings API

Refer to [OpenAI's API reference](https://platform.openai.com/docs/api-reference/embeddings) for more details.
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat Completions API](#chat-api))
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model.

```{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](../usage/multimodal_inputs.md) for details.
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
```

#### Code example

See [examples/openai_embedding_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py).
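A minimal sketch of the text-only usage described above; the base URL and model name are assumptions about a server started with `vllm serve <model> --task embed`:

```python
# Sketch only; the embedding model below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",  # placeholder embedding model
    input=["The chef prepared a delicious meal."],
    encoding_format="float",
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector
```

For the chat-style `messages` input, the tests added in this PR fall back to a raw `requests.post` against `v1/embeddings`, since the official client does not expose that field.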

#### Extra parameters

The following [pooling parameters (click through to see documentation)](../dev/pooling_params.md) are supported.
@@ -268,20 +285,35 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
(tokenizer-api)=
### Tokenizer API

The Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
It consists of two endpoints:

- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
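A hedged round-trip sketch of the two endpoints above; the exact request and response field names (`prompt`, `tokens`) are assumptions based on the endpoint descriptions, not taken from this diff:

```python
import requests

base = "http://localhost:8000"          # assumed server address
model = "facebook/opt-125m"             # placeholder model

# /tokenize ~ tokenizer.encode()
tok = requests.post(f"{base}/tokenize",
                    json={"model": model, "prompt": "vLLM is great!"})
tok.raise_for_status()
token_ids = tok.json()["tokens"]

# /detokenize ~ tokenizer.decode()
detok = requests.post(f"{base}/detokenize",
                      json={"model": model, "tokens": token_ids})
detok.raise_for_status()
print(detok.json()["prompt"])
```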

(pooling-api)=
### Pooling API

Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.

The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.

#### Code example

See [examples/openai_pooling_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_pooling_client.py).

(score-api)=
### Score API

The Score API applies a cross-encoder model to predict scores for sentence pairs.
Our Score API applies a cross-encoder model to predict scores for sentence pairs.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.

You can find the documentation for this kind of model at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

#### Code example

See [examples/openai_cross_encoder_score.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_cross_encoder_score.py).

#### Single inference

You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.
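A short sketch of this single-inference case, mirroring the request shape used in `examples/openai_cross_encoder_score.py` from this PR; the host, port, and second sentence are assumptions:

```python
import requests

score_response = requests.post(
    "http://localhost:8000/score",  # assumed server address
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "text_1": "What is the capital of Brazil?",
        "text_2": "The capital of Brazil is Brasilia.",
    },
)
score_response.raise_for_status()
print(score_response.json())  # similarity score for the sentence pair
```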
2 changes: 1 addition & 1 deletion examples/openai_cross_encoder_score.py
@@ -20,9 +20,9 @@ def post_http_request(prompt: dict, api_url: str) -> requests.Response:
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--model", type=str, default="BAAI/bge-reranker-v2-m3")

args = parser.parse_args()
api_url = f"http://{args.host}:{args.port}/score"

model_name = args.model

text_1 = "What is the capital of Brazil?"
51 changes: 51 additions & 0 deletions examples/openai_pooling_client.py
@@ -0,0 +1,51 @@
"""
Example online usage of Pooling API.
Run `vllm serve <model> --task <embed|classify|reward|score>`
to start up the server in vLLM.
"""
import argparse
import pprint

import requests


def post_http_request(prompt: dict, api_url: str) -> requests.Response:
headers = {"User-Agent": "Test Client"}
response = requests.post(api_url, headers=headers, json=prompt)
return response


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--model",
type=str,
default="jason9693/Qwen2.5-1.5B-apeach")

args = parser.parse_args()
api_url = f"http://{args.host}:{args.port}/pooling"
model_name = args.model

# Input like Completions API
prompt = {"model": model_name, "input": "vLLM is great!"}
pooling_response = post_http_request(prompt=prompt, api_url=api_url)
print("Pooling Response:")
pprint.pprint(pooling_response.json())

# Input like Chat API
prompt = {
"model":
model_name,
"messages": [{
"role": "user",
"content": [{
"type": "text",
"text": "vLLM is great!"
}],
}]
}
pooling_response = post_http_request(prompt=prompt, api_url=api_url)
print("Pooling Response:")
pprint.pprint(pooling_response.json())
68 changes: 46 additions & 22 deletions tests/entrypoints/openai/test_embedding.py
@@ -6,6 +6,7 @@
import pytest_asyncio
import requests

from vllm.entrypoints.openai.protocol import EmbeddingResponse
from vllm.transformers_utils.tokenizer import get_tokenizer

from ...utils import RemoteOpenAIServer
@@ -17,6 +18,8 @@
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"embed",
# use half precision for speed and memory savings in CI environment
"--dtype",
"bfloat16",
@@ -45,11 +48,14 @@ async def test_single_embedding(client: openai.AsyncOpenAI, model_name: str):
]

# test single embedding
embeddings = await client.embeddings.create(
embedding_response = await client.embeddings.create(
model=model_name,
input=input_texts,
encoding_format="float",
)
embeddings = EmbeddingResponse.model_validate(
embedding_response.model_dump(mode="json"))

assert embeddings.id is not None
assert len(embeddings.data) == 1
assert len(embeddings.data[0].embedding) == 4096
@@ -59,11 +65,14 @@ async def test_single_embedding(client: openai.AsyncOpenAI, model_name: str):

# test using token IDs
input_tokens = [1, 1, 1, 1, 1]
embeddings = await client.embeddings.create(
embedding_response = await client.embeddings.create(
model=model_name,
input=input_tokens,
encoding_format="float",
)
embeddings = EmbeddingResponse.model_validate(
embedding_response.model_dump(mode="json"))

assert embeddings.id is not None
assert len(embeddings.data) == 1
assert len(embeddings.data[0].embedding) == 4096
@@ -80,11 +89,14 @@ async def test_batch_embedding(client: openai.AsyncOpenAI, model_name: str):
"The cat sat on the mat.", "A feline was resting on a rug.",
"Stars twinkle brightly in the night sky."
]
embeddings = await client.embeddings.create(
embedding_response = await client.embeddings.create(
model=model_name,
input=input_texts,
encoding_format="float",
)
embeddings = EmbeddingResponse.model_validate(
embedding_response.model_dump(mode="json"))

assert embeddings.id is not None
assert len(embeddings.data) == 3
assert len(embeddings.data[0].embedding) == 4096
@@ -95,11 +107,14 @@ async def test_batch_embedding(client: openai.AsyncOpenAI, model_name: str):
# test List[List[int]]
input_tokens = [[4, 5, 7, 9, 20], [15, 29, 499], [24, 24, 24, 24, 24],
[25, 32, 64, 77]]
embeddings = await client.embeddings.create(
embedding_response = await client.embeddings.create(
model=model_name,
input=input_tokens,
encoding_format="float",
)
embeddings = EmbeddingResponse.model_validate(
embedding_response.model_dump(mode="json"))

assert embeddings.id is not None
assert len(embeddings.data) == 4
assert len(embeddings.data[0].embedding) == 4096
@@ -124,14 +139,16 @@ async def test_conversation_embedding(server: RemoteOpenAIServer,
"content": "Stars twinkle brightly in the night sky.",
}]

chat_response = requests.post(server.url_for("v1/embeddings"),
json={
"model": model_name,
"messages": messages,
"encoding_format": "float",
})
chat_response = requests.post(
server.url_for("v1/embeddings"),
json={
"model": model_name,
"messages": messages,
"encoding_format": "float",
},
)
chat_response.raise_for_status()
chat_embeddings = chat_response.json()
chat_embeddings = EmbeddingResponse.model_validate(chat_response.json())

tokenizer = get_tokenizer(tokenizer_name=model_name, tokenizer_mode="fast")
prompt = tokenizer.apply_chat_template(
@@ -148,13 +165,15 @@
# To be consistent with chat
extra_body={"add_special_tokens": False},
)
completion_embeddings = completion_response.model_dump(mode="json")
completion_embeddings = EmbeddingResponse.model_validate(
completion_response.model_dump(mode="json"))

assert chat_embeddings.pop("id") is not None
assert completion_embeddings.pop("id") is not None
assert chat_embeddings.pop("created") <= completion_embeddings.pop(
"created")
assert chat_embeddings == completion_embeddings
assert chat_embeddings.id is not None
assert completion_embeddings.id is not None
assert chat_embeddings.created <= completion_embeddings.created
assert chat_embeddings.model_dump(
exclude={"id", "created"}) == (completion_embeddings.model_dump(
exclude={"id", "created"}))


@pytest.mark.asyncio
@@ -204,10 +223,13 @@ async def test_single_embedding_truncation(client: openai.AsyncOpenAI,
]

# test single embedding
embeddings = await client.embeddings.create(
embedding_response = await client.embeddings.create(
model=model_name,
input=input_texts,
extra_body={"truncate_prompt_tokens": 10})
embeddings = EmbeddingResponse.model_validate(
embedding_response.model_dump(mode="json"))

assert embeddings.id is not None
assert len(embeddings.data) == 1
assert len(embeddings.data[0].embedding) == 4096
@@ -219,10 +241,12 @@ async def test_single_embedding_truncation(client: openai.AsyncOpenAI,
1, 24428, 289, 18341, 26165, 285, 19323, 283, 289, 26789, 3871, 28728,
9901, 340, 2229, 385, 340, 315, 28741, 28804, 2
]
embeddings = await client.embeddings.create(
embedding_response = await client.embeddings.create(
model=model_name,
input=input_tokens,
extra_body={"truncate_prompt_tokens": 10})
embeddings = EmbeddingResponse.model_validate(
embedding_response.model_dump(mode="json"))

assert embeddings.id is not None
assert len(embeddings.data) == 1
Expand All @@ -241,10 +265,10 @@ async def test_single_embedding_truncation_invalid(client: openai.AsyncOpenAI,
]

with pytest.raises(openai.BadRequestError):
embeddings = await client.embeddings.create(
response = await client.embeddings.create(
model=model_name,
input=input_texts,
extra_body={"truncate_prompt_tokens": 8193})
assert "error" in embeddings.object
assert "error" in response.object
assert "truncate_prompt_tokens value is greater than max_model_len. "\
"Please, select a smaller truncation size." in embeddings.message
"Please, select a smaller truncation size." in response.message
