Put pytriton.client in the separate package/wheel. #62

Open
flyingleafe opened this issue Feb 5, 2024 · 3 comments
Labels
enhancement New feature or request non-stale This label can be used to prevent marking issues or PRs as Stale

Comments

@flyingleafe

Is your feature request related to a problem? Please describe.
At work we are building a serving solution for DL logic using PyTriton. We would like to split the client stubs and the server logic into separate packages, so that users of our models install only the client package, which does not carry all the heavy dependencies our server code has.

Unfortunately, if we implement our client code using PyTriton, the entire pytriton package becomes a dependency, which includes the Triton Inference Server itself and the cuda-python package. Installing those dependencies would be a major inconvenience for users of the client package, and an entirely unnecessary one, since neither of those heavy dependencies is actually used by the pytriton.client implementation.
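
For illustration, our client-side code today looks roughly like the following (a minimal sketch using pytriton.client.ModelClient; the address and model name are placeholders). Nothing in it touches the Triton server binary or cuda-python, yet installing it currently pulls in the full nvidia-pytriton wheel:

import numpy as np
from pytriton.client import ModelClient  # only the client submodule is needed

# Placeholder address and model name
with ModelClient("localhost:8000", "MyModel") as client:
    input_batch = np.zeros((1, 3), dtype=np.float32)
    result = client.infer_batch(input_batch)
    print(result)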

Describe the solution you'd like
It would be great and elegant if the pytriton.client submodule were a separate package, e.g. nvidia-pytriton-client, which nvidia-pytriton could depend upon. nvidia-pytriton-client itself would not require bundling the Triton server in the wheel; its tritonclient dependency could also be reduced from tritonclient[all] to tritonclient[grpc,http] (dropping the CUDA dependency group). This would allow the client package derived for our service to be very light in dependencies. I am quite sure this would be useful for other projects facing similar problems.

Describe alternatives you've considered
Alternatively, the nvidia-pytriton package itself could by default ship without the Triton server and the tritonclient[cuda] dependencies, which would only be included when an optional dependency group is requested (e.g. nvidia-pytriton[all]).
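
To make this alternative concrete, here is a rough packaging sketch (hypothetical, setuptools-style; the real nvidia-pytriton build also bundles the Triton server binary, which involves more than dependency metadata):

from setuptools import setup, find_packages

setup(
    name="nvidia-pytriton",
    packages=find_packages(),
    # Light, client-side dependencies installed by default
    install_requires=["numpy", "tritonclient[grpc,http]"],
    extras_require={
        # Heavy server-side dependencies, pulled in only by `pip install nvidia-pytriton[all]`
        "all": ["tritonclient[all]", "cuda-python"],
    },
)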

Additional context
If the core maintainers have no time to do so, I could prepare a pull request myself, since it seems to be a straightforward refactoring; however, such a PR should be synchronized with your workflow with great care, since it alters the packaging structure and hence any CI you might have.

@piotrm-nvidia piotrm-nvidia added enhancement New feature or request non-stale This label can be used to prevent marking issues or PRs as Stale labels Feb 5, 2024
@martin-liu

@piotrm-nvidia , are there any updates on the potential prioritization of this?

@piotrm-nvidia
Collaborator

@martin-liu We are considering migrating the PyTriton client to the tritonclient repository. It will have a new API, much better aligned with Triton.

We have several proposals for the new API revamp.

Synchronous interface

# Decoupled model with streaming

import numpy as np
from tritonclient import Client

# Change url to 'http://localhost:8000/' to use the HTTP client
client = Client(url='grpc://localhost:8001')

input_tensor_as_numpy = np.array(...)

# Infer should be async (streaming), similar to the existing Python APIs
responses = client.model('simple').infer(inputs={'input': input_tensor_as_numpy})

for response in responses:
    numpy_array = np.asarray(response.outputs['output'])

client.close()


# Non-decoupled model

import numpy as np
from tritonclient import Client

# Change url to 'http://localhost:8000/' to use the HTTP client
client = Client(url='grpc://localhost:8001')

input_tensor_as_numpy = np.array(...)

# Infer should be sync, similar to the existing Python APIs
responses = client.model('simple').infer(inputs={'input': input_tensor_as_numpy})

numpy_array = np.asarray(list(responses)[0].outputs['output'])

client.close()

Active waiting

import numpy as np
from tritonclient import Client

input_tensor_as_numpy = np.array(...)
wait_time = 60.0  # example timeout in seconds

client = Client(url='grpc://localhost:8001')
client.wait_for_readiness()

model = client.model('simple', wait_for_ready=True, timeout=wait_time)

responses = model.infer(inputs={'input': input_tensor_as_numpy})

for response in responses:
    numpy_array = np.asarray(response.outputs['output'])

client.close()

Async client example

import numpy as np
from tritonclient.aio import AsyncClient

wait_timeout = 60.0  # example timeout in seconds

# Change url to 'http://localhost:8000/' to use the HTTP client
# Opening the client connection is an asynchronous call
client = AsyncClient(url='grpc://localhost:8001')
await client.wait_for_readiness(wait_timeout=wait_timeout)
# Opening the model connection is an asynchronous call
model = client.model('simple')
await model.wait_for_readiness()
# Infer should be async, similar to the existing Python APIs
responses = await model.infer(inputs={'input': np.array(...)})

async for response in responses:
    numpy_array = np.asarray(response.outputs['output'])

Context Manager Example

from tritonclient import Client
import numpy as np

# Context manager closes the client
with Client(url='grpc://localhost:8001') as client:
    model = client.model('simple')

    response = model.infer(inputs={"input": np.array(...)})

    # Numpy tensor result is the default output
    print(response['output'])

Using Client with GPT tokenizer

from tritonclient import Client
from transformers import GPT2Tokenizer, GPT2Model

local_model = False  # set to True to run the model locally instead of via Triton

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
return_tensors = "pt" if local_model else "np"
inputs = tokenizer("Hello, my dog is cute", return_tensors=return_tensors)

if local_model:
    model = GPT2Model.from_pretrained('gpt2')
    response = model(**inputs)
else:
    client = Client('grpc://localhost:8001')
    model = client.model('gpt2')
    response = model.infer(inputs)
    client.close()
print(response)

What do you think about such a solution? What do you love or hate about the current client? What do you think about these examples of the new client API?

@martin-liu

@piotrm-nvidia, migrating to tritonclient sounds like a great move!

Regarding the code examples, they are generally well-crafted. However, I have a couple of questions:

  • The distinction between decoupled and non-decoupled modes is not immediately intuitive. Would it be beneficial to have a clearer, more explicit way to differentiate between them in the code?
  • The wait_for_readiness calls seem verbose. Would it be possible to handle readiness implicitly to streamline the code? (For one possible shape, see the sketch below.)
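
To illustrate that second point, here is a purely hypothetical sketch of what implicit readiness handling could look like (ready_timeout is not an existing parameter, just a stand-in for the idea):

import numpy as np
from tritonclient import Client

# Hypothetical: the client and model() wait for readiness internally,
# failing after ready_timeout instead of requiring explicit wait_for_readiness calls.
with Client(url='grpc://localhost:8001', ready_timeout=60.0) as client:
    model = client.model('simple')  # blocks until the model is ready
    responses = model.infer(inputs={'input': np.zeros((1, 3), dtype=np.float32)})
    for response in responses:
        print(np.asarray(response.outputs['output']))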

Also, do you have a rough ETA for the migration?
