
Runtime determination and load balancing across multiple GPUs #42

ozcanmiraay opened this issue on Oct 20, 2024 · 1 comment

@ozcanmiraay

Hello,

I'm struggling to understand a couple of things about the simulator, given that there is no documentation around it.

The simulation runtime is determined by the requests (either from trace files or from the synthetic request generator, with arrival times drawn from a Poisson distribution). There is also the execution time, i.e. the actual time it takes to process a batch or a stage of computation during model inference. My question is: how does the presence of multiple GPUs (added through the replica_config_num_pipeline_stages and replica_config_tensor_parallel_size parameters) affect the simulation runtime and/or the request execution time? From the stats extractor script, it seems that GPU-hours are calculated as runtime * number of GPUs / 3600; however, I would expect the runtime or the execution time, and hence the total GPU-hours, to decrease in the presence of multiple GPUs due to load balancing. Is this incorrect?
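For concreteness, here is how I read that calculation (a rough sketch of my understanding, not the actual stats extractor code; treating world_size as num_pipeline_stages * tensor_parallel_size is my assumption):

```python
# Rough sketch of my reading of the GPU-hours calculation (not the actual
# stats extractor code). I am assuming world_size = num_pipeline_stages *
# tensor_parallel_size for a single replica.
def gpu_hours(simulated_runtime_s: float, world_size: int) -> float:
    """Total GPU-hours = simulated runtime (s) * number of GPUs / 3600."""
    return simulated_runtime_s * world_size / 3600.0

# Example: a replica with 2 pipeline stages and TP=2 (4 GPUs) that runs for
# 1800 simulated seconds accrues 1800 * 4 / 3600 = 2.0 GPU-hours, even if
# the makespan is shorter than in the single-GPU case.
print(gpu_hours(1800.0, 4))  # 2.0
```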

Also, where in the code should I look to find out how load balancing is handled across multiple GPUs? Is there a load-balancing configuration across multiple GPUs, or are the GPUs fully independent of each other? I am curious to understand how tasks are distributed across GPUs when we increase the world_size by increasing the replica_config_num_pipeline_stages and replica_config_tensor_parallel_size parameters.

Finally, how are the batches determined, and is each batch allocated to a specific GPU? I am asking because the MFU metric (based on utils/mfu_calculator.py) appears to be calculated batch by batch, and I am curious whether it reports the MFU over all GPUs or for the specific GPU that each batch is assigned to.

Thanks a lot!

@AgrawalAmey
Contributor

Hey Miray,

The makespan of the workload typically decreases as the number of GPUs grows, depending on the workload. You can check the makespan by looking at the maximum value of the request_completions_time_series metric. You can refer to the full list of metrics here.
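For example, pulling the makespan out of the exported metrics could look roughly like this (a minimal sketch, not the simulator's actual API; the CSV path and column name are placeholders for whatever your run writes out):

```python
# Minimal sketch, not the simulator's actual API: this assumes the metric is
# exported as a CSV; the path and column name below are placeholders.
import pandas as pd

completions = pd.read_csv("simulator_output/request_completions_time_series.csv")  # assumed path
makespan_s = completions["time"].max()  # assumed column name
print(f"Makespan: {makespan_s:.2f} s")
```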

Load balancing is performed by the global scheduler. We support various routing algorithms, such as round robin and least outstanding requests.
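For intuition, the two policies behave roughly like the following (illustrative sketch only, not the global scheduler's actual code):

```python
# Illustrative sketch only, not the simulator's global scheduler: two routing
# policies of the kind mentioned above, choosing which replica serves the
# next request.
from itertools import count


class RoundRobinRouter:
    """Cycle through replicas in order, ignoring their current load."""

    def __init__(self, num_replicas: int):
        self._next = count()
        self._num_replicas = num_replicas

    def route(self) -> int:
        return next(self._next) % self._num_replicas


class LeastOutstandingRequestsRouter:
    """Send each request to the replica with the fewest requests in flight."""

    def __init__(self, num_replicas: int):
        self._outstanding = [0] * num_replicas

    def route(self) -> int:
        replica_id = min(range(len(self._outstanding)),
                         key=self._outstanding.__getitem__)
        self._outstanding[replica_id] += 1
        return replica_id

    def complete(self, replica_id: int) -> None:
        self._outstanding[replica_id] -= 1
```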

MFU is computed on a per-pipeline-stage basis. Within each pipeline stage, the MFU of all the (tensor-parallel) workers is identical.
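Conceptually, the per-stage MFU is the achieved FLOPs divided by the aggregate peak FLOPs of the stage's workers over the batch time. A rough sketch (not utils/mfu_calculator.py itself; the A100 peak number is just an assumption for the example):

```python
# Rough sketch of the per-stage MFU idea, not utils/mfu_calculator.py itself.
# The A100 peak throughput below is an assumption for the example.
def stage_mfu(batch_flops: float, batch_time_s: float,
              tensor_parallel_size: int, peak_flops_per_gpu: float) -> float:
    """MFU for one pipeline stage over one batch.

    Tensor-parallel workers in a stage split the same matmuls and run in
    lockstep, so their per-worker MFU is identical and is reported once
    per pipeline stage.
    """
    stage_peak_flops = peak_flops_per_gpu * tensor_parallel_size
    return batch_flops / (stage_peak_flops * batch_time_s)

# Example: 2.8e13 FLOPs executed in 0.25 s on a TP=2 stage of A100s
# (~3.12e14 BF16 FLOP/s peak each) gives roughly 18% MFU.
print(stage_mfu(2.8e13, 0.25, 2, 3.12e14))  # ~0.179
```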

Thanks!
