
Heterogeneous parallel processing to avoid CPU & GPU Idle time #258

Draft
wants to merge 47 commits into base: main
Conversation

Usama3059 (Author)

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

⚠️ How does this PR impact the user? ⚠️

Processing requires efficient use of both the CPU and the GPU. Complex inference pipelines need CPU-bound preprocessing and transformations followed by GPU-bound model inference; running these stages in parallel across batches keeps both devices busy and therefore speeds up inference.

GOOD:

This separates the work destined for each device and uses both efficiently, which helps handle large-scale processing pipelines.

BAD:

The implementation can probably be done much better: post-processing could also be parallelized on the CPU, and a better API interface could be introduced. I mainly want to put the idea forward to get reviews from the repo owners.

PRs without this will not be merged.


What does this PR do?

  • give the user two methods to separate CPU and GPU tasks
  • add a use_heter_pipeline parameter to run these stages in parallel (a rough usage sketch follows below)
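
A rough usage sketch: the use_heter_pipeline flag and the preprocess/predict split reflect this PR's description and code, but the sketch is illustrative rather than the exact final API (it changed during review, see the discussion below).

import litserve as ls

class MyAPI(ls.LitAPI):
    def setup(self, device):
        self.model = lambda batch: [len(x) for x in batch]  # stand-in for a real GPU model

    def decode_request(self, request):
        return request["text"]

    def preprocess(self, batch):
        # CPU-side work (e.g. OpenCV transforms, tokenization) runs here
        return [x.lower() for x in batch]

    def predict(self, batch):
        # GPU-side inference runs here, overlapping with preprocessing of the next batch
        return self.model(batch)

    def encode_response(self, output):
        return {"result": output}

if __name__ == "__main__":
    # use_heter_pipeline=True places the CPU and GPU stages in parallel
    server = ls.LitServer(MyAPI(), max_batch_size=8, use_heter_pipeline=True)
    server.run(port=8000)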

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Usama3059 (Author) commented Aug 31, 2024

The summarized point is explained in the image attached below:
[image: main1]

aniketmaurya (Collaborator) commented Aug 31, 2024

Hi @Usama3059, appreciate the PR! However, it is not clear:

  • what is the real-world use case?
  • can you give a specific example where you need this and it can't be done with the existing API?

Regarding the diagram you attached above: we batch the 8 requests so the GPU will process them in parallel, but your diagram shows them sequentially.

Usama3059 (Author) commented Aug 31, 2024

Hello @aniketmaurya, consider a total of 32 requests with max_batch_size = 8, so there are 4 sequential batches, right?

In complex AI systems there is a mix of CPU- and GPU-based workloads. CPU-based preprocessing involves, for example, OpenCV image transformations, filtering, and downloading external data. In NLP it can involve external API calls, vector search, prompt formatting or filtering, collecting data from DBs, etc.

Now, in the bottom part of the diagram, both the CPU and GPU are busy, which means lower cost, more efficient use of resources, and faster responses. Let me know if you need more information.

williamFalcon (Contributor) commented Aug 31, 2024

Sounds exciting @Usama3059! Can you show benchmarks comparing it with the current master? The speed impact needs to be very clear and meaningful for us to consider merging it, given the added internal complexity 😊. But it sounds promising!

cc @lantiga

lantiga (Collaborator) commented Aug 31, 2024

hey @Usama3059, great contribution

I was actually thinking about options to get the same effect (overlapping preprocessing and GPU compute).

What I had in mind originally was to have decode_request executed before the request is placed on the queue, assuming preprocessing is not batched, as is usually the case.

However, your idea probably goes in a better direction; I just need to think about it a bit more. The UX could probably be improved.

We also need to refine the performance profile. For instance, I would not use async in the worker, but rather separate process workers for CPU and GPU.

Overall though this is a good direction to take, let’s explore it further!

aniketmaurya (Collaborator)

Sounds great @Usama3059, thanks for the explanation!! Looking forward to it.
Also, one thing to consider is how we are going to call LitAPI.setup, which usually instantiates a model or performs some other heavy operation on the worker process.

Usama3059 (Author)

Thanks @williamFalcon, yes, I will share detailed benchmarks for different scenarios soon, since the speed improvement depends on them.
@lantiga I totally agree with having separate process workers for CPU and GPU for better flexibility and scalability: we can run more CPU workers and batch into fewer GPU workers for better memory usage. Please let me know if I've missed any important details.

williamFalcon marked this pull request as draft September 3, 2024 13:21
Usama3059 (Author) commented Sep 3, 2024

@lantiga @aniketmaurya I have added the initial code for separate process workers. Now we can process multiple request batches in multiple workers on both CPU and GPU simultaneously, keeping both the CPU and GPU busy and making better use of memory. Please let me know if there is anything I've missed.

Also, @aniketmaurya, for LitAPI.setup, what API changes or ideas do you have in mind?

lantiga (Collaborator) commented Sep 4, 2024

So here's my proposal in order to clarify the API:

  • we add an optional preprocess hook to LitAPI (optional, like batch), with the same signature as predict
  • we add an argument to LitServer for the number of preprocess workers (defaulting to the number of regular workers if unspecified)

We can't call the regular setup method when spawning the preprocess workers. We could have a separate setup method, but for the time being I wouldn't add it: usually pre-processing doesn't need to preload models etc., so we can keep it simple and confine users to this limitation.

I believe this would make it simple for users to reason about a preprocess method that is called after batch and before predict, and that lives on a separate process pool. It doesn't even need to run on CPU or GPU; we don't make assumptions there.

What do you think @Usama3059 @aniketmaurya @bhimrazy ?

Usama3059 (Author)

@lantiga I think the proposed API is a great idea for initial simplicity at this stage, and I also agree about LitAPI.setup. Just one question: since preprocessing is a lightweight process, is there a need for a separate batch_size for it?

lantiga (Collaborator) commented Sep 5, 2024

> @lantiga I think the proposed API is a great idea for initial simplicity at this stage, and I also agree about LitAPI.setup. Just one question: since preprocessing is a lightweight process, is there a need for a separate batch_size for it?

No, preprocess will take the same batch size as predict; otherwise it will get complicated for users.

The lifecycle will be:
decode_request -> batch -> preprocess -> predict -> unbatch -> encode_response

As a side note, I'm imagining that if you implement preprocess, you might leave batch with the default implementation.
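
To make this concrete, here is a rough sketch of the shape I have in mind; the preprocess_workers argument name is a placeholder and nothing here is final:

import litserve as ls

class TextAPI(ls.LitAPI):
    # lifecycle: decode_request -> batch -> preprocess -> predict -> unbatch -> encode_response
    def setup(self, device):
        # runs on inference workers only; preprocess workers skip setup for now
        self.model = lambda batch: [len(x) for x in batch]  # stand-in for a real model

    def decode_request(self, request):
        return request["text"]

    # batch/unbatch are left as the default implementations

    def preprocess(self, inputs):
        # same signature as predict; runs in the separate preprocess process pool
        return [x.strip().lower() for x in inputs]

    def predict(self, inputs):
        return self.model(inputs)

    def encode_response(self, output):
        return {"length": output}

if __name__ == "__main__":
    # preprocess_workers: placeholder name for the proposed LitServer argument
    server = ls.LitServer(TextAPI(), max_batch_size=8, preprocess_workers=4)
    server.run(port=8000)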

codecov bot commented Sep 14, 2024

Codecov Report

Attention: Patch coverage is 68.21705% with 82 lines in your changes missing coverage. Please review.

Project coverage is 89%. Comparing base (f475369) to head (68593a1).

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #258    +/-   ##
====================================
- Coverage    95%    89%    -6%     
====================================
  Files        14     14            
  Lines      1082   1246   +164     
====================================
+ Hits       1025   1106    +81     
- Misses       57    140    +83     

aniketmaurya (Collaborator) left a comment

Looks good @Usama3059!

  • Can we rename ready_to_inference_queue to something like preprocess_queue? The process is called preprocess_process, so I think it makes sense to use that name.
  • CI is failing with the following issue, could you please look into it:
tests/test_loops.py::test_run_streaming_loop_timeout
  /opt/conda/lib/python3.10/site-packages/_pytest/threadexception.py:82: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-10 (run_streaming_loop)
  
  Traceback (most recent call last):
    File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
      self.run()
    File "/opt/conda/lib/python3.10/threading.py", line 953, in run
      self._target(*self._args, **self._kwargs)
    File "/opt/conda/lib/python3.10/site-packages/litserve/loops.py", line 286, in run_streaming_loop
      time.monotonic() - timestamp > lit_api.request_timeout
  TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

Resolved review comments (outdated) on src/litserve/server.py.
process_list.append(process)
else:
for worker_id, device in enumerate(self.inference_workers):
self.ready_to_inference_queue = None
aniketmaurya (Collaborator) commented Sep 16, 2024

Why set it on self (self.ready_to_inference_queue)? It could be a local variable too.

Additional resolved review comments on src/litserve/server.py, src/litserve/loops.py, and .gitignore.
Usama3059 (Author)

@aniketmaurya I think this warning (time.monotonic() - timestamp > lit_api.request_timeout, TypeError: unsupported operand type(s) for -: 'float' and 'NoneType') occurs because of the sentinel value request_queue.put((None, None, None, None)). I have tested the current code and it shows the same warning as well. Your thoughts on this?
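
For reference, a minimal illustrative guard (this is not the actual code in litserve/loops.py, and the tuple field names are assumptions):

import time

# Sketch: skip the timeout arithmetic when the shutdown sentinel
# (None, None, None, None) is read from the request queue.
def has_timed_out(item, request_timeout):
    response_queue_id, uid, timestamp, payload = item  # field names assumed
    if timestamp is None:  # sentinel item, nothing to time out
        return False
    return time.monotonic() - timestamp > request_timeout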

Comment on lines +254 to +277
            target=preprocess_worker,
            args=(
                self.lit_api,
                self.lit_spec,
                device,
                worker_id,
                self.request_queue,
                self.preprocess_queue,
                self.response_queues,
                self.max_batch_size,
                self.batch_timeout,
            ),
        )
        process.start()
        process_list.append(process)

    for worker_id, device in enumerate(self.inference_workers):
        if len(device) == 1:
            device = device[0]
        worker_id = f"inference_{worker_id}"
        self.workers_setup_status[worker_id] = False
        ctx = mp.get_context("spawn")
        process = ctx.Process(
            target=inference_worker,
aniketmaurya (Collaborator)

All good, except one concern @lantiga @Usama3059: if we start a new process with the LitAPI, then we also have to call litapi.setup, which will load the model on the "non-inference worker" process too.

Usama3059 (Author)

@aniketmaurya Based on the discussion, we're treating preprocess workers as optional and not requiring any setup, for simplicity. The initialized LitAPI will be passed to the preprocess workers, and setup will only happen on the inference workers. Let me know if I'm missing anything.

aniketmaurya (Collaborator)

Let's consider the following example where we create a tokenizer and a model in the setup method. With preprocess enabled, it won't be able to access self.tokenizer unless we set up the LitAPI in the preprocess process as well.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from litserve import LitAPI, LitServer

class BERTLitAPI(LitAPI):
    def setup(self, device):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
       
    def decode_request(self, request):
        return request["text"]

    def preprocess(self, inputs):
        return self.tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)

    ....

if __name__ == "__main__":
    api = BERTLitAPI()
    server = LitServer(api, accelerator='cuda', devices=1)
    server.run(port=8000)

Usama3059 (Author) commented Sep 18, 2024

@ankitsharma07 Could we provide an error/warning to the user as shown in the attached image? Your thoughts on this?
[image: warn]

aniketmaurya (Collaborator)

Yes, a warning solves one part, letting users know what is wrong. The other issue is that setup would be called #num_preprocess_workers times, and thus the model would be initialized more times than expected, possibly leading to OOM.

Usama3059 (Author) commented Sep 18, 2024

@aniketmaurya I guess for simplicity, by 'warning' I mean it would prevent launch_inference_worker from starting and raise this warning so the user can correct the code. We can place this check after enable_process_worker, since preprocess workers are optional and will only be used when needed.
Let me know if you have any other ideas.

Collaborator

It's a valid concern @aniketmaurya.
We could have an optional setup_preprocessing hook as well in the future.
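
A hypothetical sketch of such a hook, reusing the BERT example above (setup_preprocessing is not an existing litserve hook; this is only one possible shape):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from litserve import LitAPI

class BERTLitAPI(LitAPI):
    def setup(self, device):
        # heavy init: would run only on inference workers
        self.model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    def setup_preprocessing(self, device):  # hypothetical hook, not in litserve today
        # lightweight init: would run only on preprocess workers
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def decode_request(self, request):
        return request["text"]

    def preprocess(self, inputs):
        return self.tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)

    def predict(self, batch):
        return self.model(**batch).logits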

Usama3059 (Author)

@aniketmaurya Is there anything else that needs to be worked on for this PR?
