Any way to use optimize with multiple nodes on a self-managed cluster?
#387
Lightning Studio isn't really an option for me, so I'm wondering what the general approach is for distributing the data when optimizing a dataset with litdata. My initial thought was to break up the input data ahead of time into distinct S3 subprefixes and hand each one to a separate node to optimize independently. Essentially, I think it would be easiest to avoid any communication between nodes. But I'm not sure how to then merge the generated index files so that the entire dataset can be handled by a single StreamingDataset object. Any thoughts appreciated.
Replies: 1 comment 2 replies
We already have functionality for merging optimized datasets. You can check the README or the merged PR #385, which updated the docs.
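For the "split the inputs ahead of time" half of the question, here is a minimal sketch of one way to assign inputs to nodes without any inter-node communication. It is plain Python and not part of litdata; `shard_for_node`, `output_prefix`, and the bucket paths are hypothetical names made up for illustration. The assumption is that every node can list the same input files and knows its own rank and the total node count (for example from environment variables set by your scheduler).

```python
from typing import List, Sequence


def shard_for_node(items: Sequence[str], node_rank: int, num_nodes: int) -> List[str]:
    """Return the subset of `items` that node `node_rank` should process.

    Hypothetical helper, not a litdata API. Items are sorted first so that
    every node derives the same partition even if listing order differs.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    ordered = sorted(items)
    # Round-robin assignment: node r takes items r, r + num_nodes, ...
    return ordered[node_rank::num_nodes]


def output_prefix(base: str, node_rank: int) -> str:
    """Build a distinct output prefix per node (hypothetical naming scheme)."""
    return f"{base.rstrip('/')}/node_{node_rank}"


if __name__ == "__main__":
    # Placeholder paths; replace with your own listing of input files.
    files = [f"s3://my-bucket/raw/{i}.jpg" for i in range(10)]
    for rank in range(3):
        shard = shard_for_node(files, rank, num_nodes=3)
        print(rank, output_prefix("s3://my-bucket/optimized", rank), len(shard))
```

Each node would then pass its own shard as the `inputs` and its own prefix as the `output_dir` when calling `optimize`, and the per-node outputs could afterwards be combined with the merge functionality described in the README. The exact merge call is left out here on purpose, since its signature should be taken from the docs rather than from this sketch.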