Don't open too many output files #69

Closed
melodell opened this issue Nov 21, 2023 · 2 comments · Fixed by #70


@melodell (Member)

When the input is very large, Madoop sometimes tries to open far too many output files at once and crashes.

In mapreduce.py's prepare_input_files(), we define a magic number MAX_INPUT_SPLIT_SIZE that determines the number of output files we try to create, with no upper bound. Then we open all of the output files at once in a loop:

    outfiles = [stack.enter_context(i.open('w')) for i in outpaths]
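
For scale, a toy reproduction of the failure mode (standalone and hypothetical, not madoop code): holding thousands of files open inside a single ExitStack, the same pattern as the comprehension above, hits the process descriptor limit (commonly 1024 on Linux) and raises OSError.

    import contextlib
    import pathlib
    import tempfile

    # Toy reproduction: open thousands of output files at once, the
    # same pattern as the list comprehension above.
    tmpdir = pathlib.Path(tempfile.mkdtemp())
    outpaths = [tmpdir / f"part-{i:05d}" for i in range(5000)]

    with contextlib.ExitStack() as stack:
        # With a default limit of 1024 open descriptors, this raises
        # "OSError: [Errno 24] Too many open files" partway through.
        outfiles = [stack.enter_context(p.open('w')) for p in outpaths]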

(This problem showed up once with the bigger dataset for P5. Depending on a student's OS and choice of intermediate output, the input size to a given MR job could be massive. The quick fix was to bump the value of MAX_INPUT_SPLIT_SIZE in their local environment, but we should fix it for real.)

Potential solutions (rough sketches after the list):

  1. Open output files one at a time instead
  2. Run the split command as a subprocess and avoid opening the files in Python

(CC + credit for solutions @MattyMay)
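
Rough sketches of both options follow. These are illustrative only: the function names and parameters are hypothetical, and the byte-based chunking in option 1 simplifies whatever boundary handling prepare_input_files() actually needs.

    import pathlib
    import subprocess

    # Option 1 (sketch): open output files one at a time, so only a
    # single output descriptor is live no matter how many splits exist.
    def split_one_at_a_time(inpath, outpaths, max_split_size):
        with inpath.open('rb') as infile:
            for outpath in outpaths:
                chunk = infile.read(max_split_size)
                if not chunk:
                    break  # input exhausted; remaining outpaths go unused
                with outpath.open('wb') as outfile:
                    outfile.write(chunk)

    # Option 2 (sketch): delegate to the coreutils split command so
    # Python never holds the file handles. GNU split's -C flag caps
    # bytes per output file while keeping lines intact; BSD/macOS
    # split lacks -C, so portability would need checking.
    def split_via_subprocess(inpath, outdir, max_split_size):
        subprocess.run(
            ['split', '-C', str(max_split_size), str(inpath),
             str(outdir / 'part-')],
            check=True,
        )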

@awdeorio (Contributor)

Yeah, the 1 MB max input split size is probably too small. Would 10 MB or 100 MB fix the problem?

@noah-weingarden (Contributor)

This issue is rendered redundant by #70.
