When providing a local path to the optimize method, make it work in distributed settings for Jobs #193

Closed
tchaton opened this issue Jun 27, 2024 · 5 comments · Fixed by #214
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@tchaton
Collaborator

tchaton commented Jun 27, 2024

🚀 Feature

Motivation

Right now, it is possible to do this in a Lightning Studio:

optimize(
    output_dir="./optimized_data",
)

However, when this code runs in a multi-machine job, it won't work properly.

Instead, we should convert the output_dir to an S3 path pointing to the node 0 artifacts path + the user-provided output_dir.

Pitch

Alternatives

Additional context

tchaton added the enhancement and help wanted labels on Jun 27, 2024
@deependujha
Collaborator

I tested this in Lightning Studio and it worked (as you've also stated):

import os
from litdata import optimize, Machine

def compress(index):
    return (index, index ** 2)

optimize(
    fn=compress,
    inputs=list(range(100)),
    num_workers=2,
    output_dir="./output_dir",
    chunk_bytes="64MB",
    mode="overwrite",
    num_nodes=1,
    machine=Machine.DATA_PREP,
)

But I don't understand this part:

However, when running this code in a machine machine jobs, this won't properly work.

What do you mean by "machine machine jobs"?

@tchaton
Collaborator Author

tchaton commented Jul 5, 2024

Sorry, it was a typo. I meant multi-machine jobs, for example if you set num_nodes=2.

Both machines are going to store the data locally but never merge it.
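To make the failure mode concrete, here is a toy simulation, not litdata's actual code. It assumes each node has its own local filesystem, modeled as a separate temporary directory, and that every node writes its chunks to the same relative output_dir. The helper `simulate_node` and the chunk file names are hypothetical.

```python
import tempfile
from pathlib import Path


def simulate_node(node_root: Path, node_rank: int, items: list[int]) -> None:
    """Hypothetical stand-in for one node writing its share of the output locally."""
    out = node_root / "output_dir"  # same relative path on every node
    out.mkdir(parents=True, exist_ok=True)
    for i in items:
        # Made-up file naming, only for illustration.
        (out / f"chunk-{node_rank}-{i}.bin").write_bytes(b"x")


with tempfile.TemporaryDirectory() as tmp:
    # Each "node" gets its own root, mimicking separate machines.
    roots = [Path(tmp) / f"node_{rank}" for rank in range(2)]
    simulate_node(roots[0], 0, [0, 1])
    simulate_node(roots[1], 1, [2, 3])

    # Each node only ever sees the files it wrote itself.
    seen_on_node_0 = sorted(p.name for p in (roots[0] / "output_dir").iterdir())
    seen_on_node_1 = sorted(p.name for p in (roots[1] / "output_dir").iterdir())

print(seen_on_node_0)
print(seen_on_node_1)
```

Neither directory contains the full dataset, which is why a shared destination is needed for multi-node runs.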

@deependujha
Collaborator

Got it. I'll try fixing this.

@deependujha
Collaborator

Please clarify this:

Instead, we should convert the output_dir to an s3 path pointing to the node 0 artifacts path + the user provided output_dir.

Let's say output_dir="./optimized_data" and resolve_dir returns _output_dir=Dir(path='/teamspace/studios/this_studio/optimized_data', url=None).

So, what should _output_dir be modified to?

I tried changing it to /teamspace/studios/{STUDIO_NAME}/optimized_data, but it fails with OSError: [Errno 30] Read-only file system: '/teamspace/studios/local-path-in-distributed-optimize/optimized_data'.

@tchaton
Collaborator Author

tchaton commented Jul 8, 2024

Hey @deependujha. It needs to be this one: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/processing/utilities.py#L182

That gets translated into /teamspace/jobs/{job_name}/{rank_0}/{user_folder}.
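As a rough sketch of that translation, assuming the template /teamspace/jobs/{job_name}/{rank_0}/{user_folder} quoted above: the helper `to_job_artifacts_path` is hypothetical and is not the function in litdata's utilities.py. The example values "my-job" and "rank-0" are made up, since the thread does not say what the {job_name} and {rank_0} placeholders actually resolve to.

```python
from pathlib import PurePosixPath


def to_job_artifacts_path(output_dir: str, job_name: str, rank_0: str) -> str:
    """Hypothetical helper: map a relative, user-provided output_dir onto the
    /teamspace/jobs/{job_name}/{rank_0}/{user_folder} template from the thread."""
    user_folder = PurePosixPath(output_dir)  # "./optimized_data" -> "optimized_data"
    if user_folder.is_absolute():
        raise ValueError("expected a relative output_dir such as './optimized_data'")
    if ".." in user_folder.parts:
        raise ValueError("output_dir must not escape the job folder")
    return str(PurePosixPath("/teamspace/jobs") / job_name / rank_0 / user_folder)


# Example with made-up job and rank-0 names.
print(to_job_artifacts_path("./optimized_data", "my-job", "rank-0"))
# /teamspace/jobs/my-job/rank-0/optimized_data
```

Every node would pass the same job name and rank-0 identifier, so all of them resolve to one shared folder instead of their own local directory.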
