Any way to use `optimize` with multiple nodes on a self-managed cluster? #387

hubenjm · 2024-10-02T18:28:50Z


Lightning Studio isn't really an option for me, so I'm wondering what the general approach is for distributing the data when optimizing a dataset with litdata. My initial thought was to break up the input data ahead of time into distinct S3 subprefixes and hand each one to a different node to optimize independently. Essentially, I think it would be easiest to avoid any communication between nodes. But I'm not sure how to then merge the generated index files so that the entire dataset can be handled by a single `StreamingDataset` object.

Any thoughts appreciated.

Answered by deependujha


deependujha · 2024-10-02T18:47:35Z

Maintainer

We already have functionality for merging optimized datasets. You can check the README or the merged PR #385, which updated the docs.
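For anyone landing here later, the workflow described in the question could be sketched roughly as follows: each node takes a disjoint slice of the inputs, writes its optimized output to its own subdirectory, and one final step merges those subdirectories. This is only a sketch under assumptions. The helpers `shard_for_node`, `node_output_dir`, `optimize_on_node`, and `merge_node_outputs` are hypothetical names made up for illustration, as is the `node_<rank>` directory layout. The `optimize(fn=..., inputs=..., output_dir=..., chunk_bytes=...)` and `merge_datasets(input_dirs=..., output_dir=...)` calls reflect my reading of the litdata README, so please verify them against the version you have installed.

```python
from typing import Callable, List, Sequence


def shard_for_node(inputs: Sequence[str], node_rank: int, num_nodes: int) -> List[str]:
    """Return the round-robin slice of `inputs` owned by one node (hypothetical helper)."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    # Every num_nodes-th item starting at node_rank: slices are disjoint and cover all inputs.
    return list(inputs[node_rank::num_nodes])


def node_output_dir(base_dir: str, node_rank: int) -> str:
    """Per-node output location, so nodes never write to the same prefix (assumed layout)."""
    return f"{base_dir.rstrip('/')}/node_{node_rank}"


def optimize_on_node(fn: Callable, inputs: Sequence[str], base_dir: str,
                     node_rank: int, num_nodes: int) -> None:
    """Run on each node independently; no communication between nodes is needed."""
    # Imported lazily so the pure-Python helpers above work without litdata installed.
    from litdata import optimize  # assumed signature, check the README

    optimize(
        fn=fn,
        inputs=shard_for_node(inputs, node_rank, num_nodes),
        output_dir=node_output_dir(base_dir, node_rank),
        chunk_bytes="64MB",
    )


def merge_node_outputs(base_dir: str, num_nodes: int, merged_dir: str) -> None:
    """Run once, after every node has finished, to produce a single dataset."""
    from litdata import merge_datasets  # assumed signature, check the README

    merge_datasets(
        input_dirs=[node_output_dir(base_dir, rank) for rank in range(num_nodes)],
        output_dir=merged_dir,
    )
```

How each node learns its rank (an environment variable set by your scheduler, a CLI argument, etc.) depends on the cluster setup, so it is left as a plain function argument here. After the merge, the merged directory would be the one you point `StreamingDataset` at.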

2 replies

hubenjm Oct 2, 2024
Author

Thanks so much! This is a great feature for setting up your own multi-node data optimization.

abysmalocean Dec 3, 2024

Great feature! Thanks for this update.
