
Heterogeneous parallel processing to avoid CPU & GPU Idle time #258

Draft
wants to merge 47 commits into base: main
Conversation

Usama3059 (Author)

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

⚠️ How does this PR impact the user? ⚠️

Processing requires efficient use of both the CPU and the GPU. Complex inference pipelines need CPU-bound preprocessing and transformations followed by GPU-bound model inference; running these stages in parallel across batches keeps both devices busy and therefore speeds up inference.

GOOD:

This separates the work destined for each device and uses both efficiently, which helps handle large-scale processing pipelines.

BAD:

The implementation can probably be done much better: post-processing could also be parallelized on the CPU, and a better API interface could be introduced. I mainly want to put the idea forward to get reviews from the repo owners.

PRs without this will not be merged.


What does this PR do?

  • give the user two methods to separate CPU and GPU tasks
  • add a use_heter_pipeline parameter to run these stages in parallel (a rough usage sketch follows below)
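
A rough usage sketch: the use_heter_pipeline flag and the preprocess/predict split reflect this PR's description and code, but the sketch is illustrative rather than the exact final API (it changed during review, see the discussion below).

import litserve as ls

class MyAPI(ls.LitAPI):
    def setup(self, device):
        self.model = lambda batch: [len(x) for x in batch]  # stand-in for a real GPU model

    def decode_request(self, request):
        return request["text"]

    def preprocess(self, batch):
        # CPU-side work (e.g. OpenCV transforms, tokenization) runs here
        return [x.lower() for x in batch]

    def predict(self, batch):
        # GPU-side inference runs here, overlapping with preprocessing of the next batch
        return self.model(batch)

    def encode_response(self, output):
        return {"result": output}

if __name__ == "__main__":
    # use_heter_pipeline=True places the CPU and GPU stages in parallel
    server = ls.LitServer(MyAPI(), max_batch_size=8, use_heter_pipeline=True)
    server.run(port=8000)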

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Usama3059 (Author) commented Aug 31, 2024

The summarized point is explained in the image attached below:
[image: main1]

aniketmaurya (Collaborator) commented Aug 31, 2024

Hi @Usama3059, appreciate the PR! However, it is not clear:

  • what is the real-world use case?
  • can you give a specific example where you need this and it can't be done with the existing API?

Regarding the diagram you attached above: we batch the 8 requests so the GPU will process them in parallel, but your diagram shows them sequentially.

Usama3059 (Author) commented Aug 31, 2024

Hello @aniketmaurya, consider a total of 32 requests with max_batch_size = 8, so there are 4 sequential batches, right?

In complex AI systems there is a mix of CPU- and GPU-based workloads. CPU-based preprocessing involves, for example, OpenCV image transformations, filtering, and downloading external data. In NLP it can involve external API calls, vector search, prompt formatting or filtering, collecting data from DBs, etc.

Now, in the bottom part of the diagram, both the CPU and GPU are busy, which means lower cost, more efficient use of resources, and faster responses. Let me know if you need more information.

williamFalcon (Contributor) commented Aug 31, 2024

Sounds exciting @Usama3059! Can you show benchmarks comparing it with the current master? The speed impact needs to be very clear and meaningful for us to consider merging it, given the added internal complexity 😊. But it sounds promising!

cc @lantiga

lantiga (Collaborator) commented Aug 31, 2024

hey @Usama3059, great contribution

I was actually thinking about options to get the same effect (overlapping preprocessing and GPU compute).

What I had in mind originally was to have decode_request executed before the request is placed on the queue, assuming preprocessing is not batched, as is usually the case.

However, your idea probably goes in a better direction; I just need to think about it a bit more. The UX could probably be improved.

We also need to refine the performance profile. For instance, I would not use async in the worker, but rather separate process workers for CPU and GPU.

Overall though this is a good direction to take, let’s explore it further!

aniketmaurya (Collaborator)

Sounds great @Usama3059, thanks for the explanation!! Looking forward to it.
Also, one thing to consider is how we are going to call LitAPI.setup, which usually instantiates a model or performs some other heavy operation on the worker process.

Usama3059 (Author)

Thanks @williamFalcon, yes, I will share detailed benchmarks for different scenarios soon, since the speed improvement depends on them.
@lantiga I totally agree with having separate process workers for CPU and GPU for better flexibility and scalability: we can run more CPU workers and batch into fewer GPU workers for better memory usage. Please let me know if I've missed any important details.

williamFalcon marked this pull request as draft September 3, 2024 13:21
Usama3059 (Author) commented Sep 3, 2024

@lantiga @aniketmaurya I have added the initial code for separate process workers. Now we can process multiple request batches in multiple workers on both CPU and GPU simultaneously, keeping both the CPU and GPU busy and making better use of memory. Please let me know if there is anything I've missed.

Also, @aniketmaurya, for LitAPI.setup, what API changes or ideas do you have in mind?

lantiga (Collaborator) commented Sep 4, 2024

So here's my proposal in order to clarify the API:

  • we add an optional preprocess hook to LitAPI (optional, like batch), with the same signature as predict
  • we add an argument to LitServer for the number of preprocess workers (defaulting to the number of regular workers if unspecified)

We can't call the regular setup method when spawning the preprocess workers. We could have a separate setup method, but for the time being I wouldn't add it: usually pre-processing doesn't need to preload models etc., so we can keep it simple and confine users to this limitation.

I believe this would make it simple for users to reason about a preprocess method that is called after batch and before predict, and that lives on a separate process pool. It doesn't even need to run on CPU or GPU; we don't make assumptions there.

What do you think @Usama3059 @aniketmaurya @bhimrazy ?

Usama3059 (Author)

@lantiga I think the proposed API is a great idea for initial simplicity at this stage, and I also agree about LitAPI.setup. Just one question: since preprocessing is a lightweight process, is there a need for a separate batch_size for it?

lantiga (Collaborator) commented Sep 5, 2024

> @lantiga I think the proposed API is a great idea for initial simplicity at this stage, and I also agree about LitAPI.setup. Just one question: since preprocessing is a lightweight process, is there a need for a separate batch_size for it?

No, preprocess will take the same batch size as predict; otherwise it will get complicated for users.

The lifecycle will be:
decode_request -> batch -> preprocess -> predict -> unbatch -> encode_response

As a side note, I'm imagining that if you implement preprocess, you might leave batch with the default implementation.
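
To make this concrete, here is a rough sketch of the shape I have in mind; the preprocess_workers argument name is a placeholder and nothing here is final:

import litserve as ls

class TextAPI(ls.LitAPI):
    # lifecycle: decode_request -> batch -> preprocess -> predict -> unbatch -> encode_response
    def setup(self, device):
        # runs on inference workers only; preprocess workers skip setup for now
        self.model = lambda batch: [len(x) for x in batch]  # stand-in for a real model

    def decode_request(self, request):
        return request["text"]

    # batch/unbatch are left as the default implementations

    def preprocess(self, inputs):
        # same signature as predict; runs in the separate preprocess process pool
        return [x.strip().lower() for x in inputs]

    def predict(self, inputs):
        return self.model(inputs)

    def encode_response(self, output):
        return {"length": output}

if __name__ == "__main__":
    # preprocess_workers: placeholder name for the proposed LitServer argument
    server = ls.LitServer(TextAPI(), max_batch_size=8, preprocess_workers=4)
    server.run(port=8000)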

codecov bot commented Sep 14, 2024

Codecov Report

Attention: Patch coverage is 68.21705% with 82 lines in your changes missing coverage. Please review.

Project coverage is 89%. Comparing base (f475369) to head (68593a1).

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #258    +/-   ##
====================================
- Coverage    95%    89%    -6%     
====================================
  Files        14     14            
  Lines      1082   1246   +164     
====================================
+ Hits       1025   1106    +81     
- Misses       57    140    +83     

aniketmaurya (Collaborator) left a comment

Looks good @Usama3059!

  • Can we rename ready_to_inference_queue to something like preprocess_queue? The process is called preprocess_process, so I think it makes sense to use that name.
  • CI is failing with the following issue, could you please look into it:
tests/test_loops.py::test_run_streaming_loop_timeout
  /opt/conda/lib/python3.10/site-packages/_pytest/threadexception.py:82: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-10 (run_streaming_loop)
  
  Traceback (most recent call last):
    File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
      self.run()
    File "/opt/conda/lib/python3.10/threading.py", line 953, in run
      self._target(*self._args, **self._kwargs)
    File "/opt/conda/lib/python3.10/site-packages/litserve/loops.py", line 286, in run_streaming_loop
      time.monotonic() - timestamp > lit_api.request_timeout
  TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

Resolved review comments (outdated) on src/litserve/server.py.
process_list.append(process)
else:
for worker_id, device in enumerate(self.inference_workers):
self.ready_to_inference_queue = None
aniketmaurya (Collaborator) commented Sep 16, 2024

Why set it on self (self.ready_to_inference_queue)? It could be a local variable too.

Additional resolved review comments on src/litserve/server.py, src/litserve/loops.py, and .gitignore.
Usama3059 (Author)

@aniketmaurya I think this warning (time.monotonic() - timestamp > lit_api.request_timeout, TypeError: unsupported operand type(s) for -: 'float' and 'NoneType') occurs because of the sentinel value request_queue.put((None, None, None, None)). I have tested the current code and it shows the same warning as well. Your thoughts on this?
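
For reference, a minimal illustrative guard (this is not the actual code in litserve/loops.py, and the tuple field names are assumptions):

import time

# Sketch: skip the timeout arithmetic when the shutdown sentinel
# (None, None, None, None) is read from the request queue.
def has_timed_out(item, request_timeout):
    response_queue_id, uid, timestamp, payload = item  # field names assumed
    if timestamp is None:  # sentinel item, nothing to time out
        return False
    return time.monotonic() - timestamp > request_timeout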

Comment on lines +254 to +277
            target=preprocess_worker,
            args=(
                self.lit_api,
                self.lit_spec,
                device,
                worker_id,
                self.request_queue,
                self.preprocess_queue,
                self.response_queues,
                self.max_batch_size,
                self.batch_timeout,
            ),
        )
        process.start()
        process_list.append(process)

    for worker_id, device in enumerate(self.inference_workers):
        if len(device) == 1:
            device = device[0]
        worker_id = f"inference_{worker_id}"
        self.workers_setup_status[worker_id] = False
        ctx = mp.get_context("spawn")
        process = ctx.Process(
            target=inference_worker,
aniketmaurya (Collaborator)

All good, except one concern @lantiga @Usama3059: if we start a new process with the LitAPI, then we also have to call litapi.setup, which will load the model on the "non-inference worker" process too.

Usama3059 (Author)

@aniketmaurya Based on the discussion, we're treating preprocess workers as optional and not requiring any setup, for simplicity. The initialized LitAPI will be passed to the preprocess workers, and setup will only happen on the inference workers. Let me know if I'm missing anything.

aniketmaurya (Collaborator)

Let's consider the following example where we create a tokenizer and a model in the setup method. With preprocess enabled, it won't be able to access self.tokenizer unless we set up the LitAPI in the preprocess process as well.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from litserve import LitAPI, LitServer

class BERTLitAPI(LitAPI):
    def setup(self, device):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
       
    def decode_request(self, request):
        return request["text"]

    def preprocess(self, inputs):
        return self.tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)

    ....

if __name__ == "__main__":
    api = BERTLitAPI()
    server = LitServer(api, accelerator='cuda', devices=1)
    server.run(port=8000)

Usama3059 (Author) commented Sep 18, 2024

@ankitsharma07 Could we provide an error/warning to the user as shown in the attached image? Your thoughts on this?
[image: warn]

aniketmaurya (Collaborator)

Yes, a warning solves one part, letting users know what is wrong. The other issue is that setup would be called #num_preprocess_workers times, and thus the model would be initialized more times than expected, possibly leading to OOM.

Usama3059 (Author) commented Sep 18, 2024

@aniketmaurya I guess for simplicity, by 'warning' I mean it would prevent launch_inference_worker from starting and raise this warning so the user can correct the code. We can place this check after enable_process_worker, since preprocess workers are optional and will only be used when needed.
Let me know if you have any other ideas.

Collaborator

It's a valid concern @aniketmaurya.
We could have an optional setup_preprocessing hook as well in the future.
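
A hypothetical sketch of such a hook, reusing the BERT example above (setup_preprocessing is not an existing litserve hook; this is only one possible shape):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from litserve import LitAPI

class BERTLitAPI(LitAPI):
    def setup(self, device):
        # heavy init: would run only on inference workers
        self.model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    def setup_preprocessing(self, device):  # hypothetical hook, not in litserve today
        # lightweight init: would run only on preprocess workers
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def decode_request(self, request):
        return request["text"]

    def preprocess(self, inputs):
        return self.tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)

    def predict(self, batch):
        return self.model(**batch).logits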

Usama3059 (Author)

@aniketmaurya Is there anything else that needs to be worked on for this PR?
