Don't open too many output files #69

Closed
melodell opened this issue Nov 21, 2023 · 2 comments · Fixed by #70


@melodell (Member)

When the input is very large, Madoop sometimes tries to open far too many output files at once and crashes.

In mapreduce.py's prepare_input_files(), we define a magic number MAX_INPUT_SPLIT_SIZE that determines the number of output files we try to create, with no upper bound. Then we open all of the output files at once in a loop:

    outfiles = [stack.enter_context(i.open('w')) for i in outpaths]
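
For scale, a toy reproduction of the failure mode (standalone and hypothetical, not madoop code): holding thousands of files open inside a single ExitStack, the same pattern as the comprehension above, hits the process descriptor limit (commonly 1024 on Linux) and raises OSError.

    import contextlib
    import pathlib
    import tempfile

    # Toy reproduction: open thousands of output files at once, the
    # same pattern as the list comprehension above.
    tmpdir = pathlib.Path(tempfile.mkdtemp())
    outpaths = [tmpdir / f"part-{i:05d}" for i in range(5000)]

    with contextlib.ExitStack() as stack:
        # With a default limit of 1024 open descriptors, this raises
        # "OSError: [Errno 24] Too many open files" partway through.
        outfiles = [stack.enter_context(p.open('w')) for p in outpaths]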

(This problem showed up once with the bigger dataset for P5. Depending on a student's OS and choice of intermediate output, the input size to a given MR job could be massive. The quick fix was to bump the value of MAX_INPUT_SPLIT_SIZE in their local environment, but we should fix it for real.)

Potential solutions (rough sketches after the list):

  1. Open output files one at a time instead
  2. Run the split command as a subprocess and avoid opening the files in Python

(CC + credit for solutions @MattyMay)
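
Rough sketches of both options follow. These are illustrative only: the function names and parameters are hypothetical, and the byte-based chunking in option 1 simplifies whatever boundary handling prepare_input_files() actually needs.

    import pathlib
    import subprocess

    # Option 1 (sketch): open output files one at a time, so only a
    # single output descriptor is live no matter how many splits exist.
    def split_one_at_a_time(inpath, outpaths, max_split_size):
        with inpath.open('rb') as infile:
            for outpath in outpaths:
                chunk = infile.read(max_split_size)
                if not chunk:
                    break  # input exhausted; remaining outpaths go unused
                with outpath.open('wb') as outfile:
                    outfile.write(chunk)

    # Option 2 (sketch): delegate to the coreutils split command so
    # Python never holds the file handles. GNU split's -C flag caps
    # bytes per output file while keeping lines intact; BSD/macOS
    # split lacks -C, so portability would need checking.
    def split_via_subprocess(inpath, outdir, max_split_size):
        subprocess.run(
            ['split', '-C', str(max_split_size), str(inpath),
             str(outdir / 'part-')],
            check=True,
        )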

@awdeorio (Contributor)

Yeah, the 1 MB max input split size is probably too small. Would 10 MB or 100 MB fix the problem?

@noah-weingarden (Contributor)

This issue is rendered redundant by #70.
