add_compute_plan - which batch size to use #282

Open
Esadruhn opened this issue Sep 13, 2022 · 8 comments

Esadruhn commented Sep 13, 2022

Summary

When we add a compute plan with N tasks, we can set autobatching to True and set the batch size.
This submits the tasks to the backend in batches of size batch_size. The fastest option is to increase the batch size as much as possible without triggering backend errors.

The default batch size is 500. The question here is: how do we find the maximal batch size we can use?
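For reference, a minimal sketch of the call being discussed; the client setup and the compute plan spec are hypothetical, and the parameter names follow this discussion:

```python
import substra

# Hypothetical connection details; they depend on your deployment.
client = substra.Client(url="https://substra.example.org", token="...")

spec = ...  # a compute plan spec with N tasks, built elsewhere (hypothetical)

# With auto_batching, the N tasks are submitted in batches of `batch_size`;
# larger batches submit faster but can exceed the gRPC message-size limit
# (see the error below).
compute_plan = client.add_compute_plan(spec, auto_batching=True, batch_size=500)
```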

What happens when the batch size is too big

When the batch size is too big (451 tasks * 400 data samples per task), we get the following error:

Requests error status 429: {"message":"grpc: received message larger than max (6228668 vs. 4194304)"}
Traceback (most recent call last):
  File "HIDDEN/substra/sdk/backends/remote/rest_client.py", line 114, in __request
    r.raise_for_status()
  File "HIDDEN/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: HIDDEN/task/bulk_create/
Esadruhn commented Sep 13, 2022

@AurelienGasser you said that size_of_grpc_packet = const_value * number_of_samples_per_task * number_of_tasks_per_batch

  • Is the max value of size_of_grpc_packet always the same, or does it depend on a deployment configuration?

Rule of thumb - optimal batch size

If the max value is always the same,
and we assume that all tasks have the same number of data samples,

then, from the error shown in the description,

  • 451 tasks submitted at once
  • 400 data samples per task
  • total grpc message size: 6228668
  • max grpc message size: 4194304

so, in this formula: size_of_grpc_packet = const_value * number_of_samples_per_task * number_of_tasks_per_batch

const_value = 6228668 / (451 * 400) ≈ 34.5,
so the max value of number_of_samples_per_task * number_of_tasks_per_batch should be 4194304 / 34.5 ≈ 121500, rounded down to 120000 to keep a margin.

So is it correct to say that batch_size = math.floor(120000 / number_of_samples_per_task) would be a good approximation?
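As a quick sanity check of this rule of thumb against the failing example above:

```python
import math

MAX_SAMPLES_TIMES_TASKS = 120_000  # budget derived above from the 4194304-byte gRPC limit

def estimate_batch_size(n_samples_per_task: int) -> int:
    """Rule-of-thumb batch size keeping the gRPC message under the limit."""
    return math.floor(MAX_SAMPLES_TIMES_TASKS / n_samples_per_task)

# The failing example: 400 data samples per task.
print(estimate_batch_size(400))  # 300 -- safely below the 451 tasks that failed
```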

mblottiere commented

There is no "one size fits all" batch size. It depends on both the number of tasks and the number of inputs.

Would it be feasible to catch this error at the SDK level and lower the batch size before retrying?
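A minimal sketch of what that could look like, assuming the backend answers oversized messages with HTTP 429 as in the traceback above; `submit_batch` is a hypothetical callable wrapping the bulk_create endpoint:

```python
import requests

def submit_with_shrinking_batches(tasks, batch_size, submit_batch):
    """Submit tasks in batches, halving the batch size on HTTP 429 errors."""
    i = 0
    while i < len(tasks):
        batch = tasks[i : i + batch_size]
        try:
            submit_batch(batch)
            i += len(batch)
        except requests.exceptions.HTTPError as exc:
            response = exc.response
            if response is not None and response.status_code == 429 and batch_size > 1:
                batch_size //= 2  # retry the same tasks with a smaller batch
            else:
                raise
```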

Esadruhn commented Sep 13, 2022

> There is no "one size fits all" batch size. It depends on both the number of tasks and the number of inputs.

But if we can have a rule of thumb on which batch size works, we can tell the user what value to use.
The default batch size used to be 20 and it was very slow, so I think it's good to give an idea of how high the batch size can go for a particular use case.

> Would it be feasible to catch this error at the SDK level and lower the batch size before retrying?

Sure, we can try that; I would do it on top of documenting the "optimal batch size".
If we retry automatically, is there a risk that the backend/orchestrator is still busy with the previous call and fails again?

We should also expose the batch size in substrafl: today only the autobatching argument is exposed, so when it is True, the default batch size is used.

Esadruhn self-assigned this Sep 13, 2022
RomainGoussault commented

Do we have data on how much slower it is to have a small batch size vs a big batch size?

Esadruhn commented

@tanguy-marchand from what you said, 15 rounds, 136 tuples, with 1257 data samples took 5min to submit?

(1257 samples per task or in total? The number we are interested in is the number of samples per task.)

AurelienGasser commented

> Is the max value of size_of_grpc_packet always the same, or does it depend on a deployment configuration?

It's always the same. We could change it but have chosen not to so far. The limit serves to cap the load on the server and avoid resource starvation.

tanguy-marchand commented

> @tanguy-marchand from what you said, 15 rounds, 136 tuples, with 1257 data samples took 5min to submit?

A CP using 2 centers (with 1257 and 999 samples respectively) and 30 rounds (512 tuples overall) takes 17 minutes to submit.

Esadruhn commented

OK, so as a first fix what we can do is:

  • expose the batch size in substrafl
  • provide a way to estimate the best batch size (batch_size = math.floor(120000 / number_of_samples_per_task))

then discuss a better solution:

  • calculate the batch size in the SDK so that the user does not need to set it
  • provide a helper function to calculate it (see the sketch below)

I think the best option would be to calculate it automatically, but I am worried it would slow down execution, and we should keep the ability to override it in case the calculation is ever wrong.
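A hedged sketch of what such a helper could look like, using the numbers observed above; the safety margin and all names are hypothetical:

```python
import math

GRPC_MAX_MESSAGE_SIZE = 4_194_304  # max gRPC message size, from the error above
BYTES_PER_SAMPLE_REF = 35          # per-sample cost, rounded up from the ~34.5 estimated above
SAFETY_MARGIN = 0.9                # hypothetical headroom for the rest of the message

def suggest_batch_size(n_samples_per_task: int, override: int | None = None) -> int:
    """Estimate the largest batch size that keeps bulk_create under the gRPC limit.

    `override` lets the user bypass the estimate if it turns out to be wrong.
    """
    if override is not None:
        return override
    budget = GRPC_MAX_MESSAGE_SIZE * SAFETY_MARGIN / BYTES_PER_SAMPLE_REF
    return max(1, math.floor(budget / n_samples_per_task))

print(suggest_batch_size(400))  # 269 with these assumptions
```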
