When the input is very large, Madoop can try to open far too many output files at once and crash when it exceeds the OS limit on open file descriptors.
In `mapreduce.py`, `prepare_input_files()` defines a magic number, `MAX_INPUT_SPLIT_SIZE`, that determines how many output files we try to create: the count scales with the input size and is not bounded by anything.
Then we open all of the output files at once in a loop, roughly like the sketch below.
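For reference, this is the shape of the problem (a minimal sketch, not madoop's actual code; the constant's value, the `part-*` naming, and the round-robin line assignment are assumptions here):

```python
import contextlib
import pathlib

MAX_INPUT_SPLIT_SIZE = 2**20  # magic number; the real value in madoop may differ

def prepare_input_files(input_path, output_dir):
    """Split one input file into pieces, all opened up front."""
    input_path = pathlib.Path(input_path)
    output_dir = pathlib.Path(output_dir)
    # Number of splits grows linearly with input size -- unbounded.
    size = input_path.stat().st_size
    num_splits = max(1, size // MAX_INPUT_SPLIT_SIZE + 1)
    with contextlib.ExitStack() as stack:
        # Every split file is open at the same time.  A huge input can
        # exceed the per-process file descriptor limit (often 256 on
        # macOS by default) and crash with "Too many open files".
        outfiles = [
            stack.enter_context((output_dir / f"part-{i:05d}").open("w"))
            for i in range(num_splits)
        ]
        with input_path.open() as infile:
            for lineno, line in enumerate(infile):
                outfiles[lineno % num_splits].write(line)
```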
(This problem showed up once with the larger dataset for P5. Depending on a student's OS and choice of intermediate output, the input to a given MapReduce job could be massive. The quick fix was to bump the value of `MAX_INPUT_SPLIT_SIZE` in their local environment, but we should fix it for real.)
Potential solutions:
- Open the output files one at a time instead (see the sketch after this list)
- Run `split` as a subprocess and avoid opening the files in Python at all
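A minimal sketch of the first option, assuming sequential chunking is an acceptable replacement for round-robin line assignment (the function name and sizing logic are hypothetical):

```python
import pathlib

def split_one_at_a_time(input_path, output_dir, max_split_size):
    """Fill one split file at a time, so at most two files are ever open."""
    output_dir = pathlib.Path(output_dir)
    part = 0
    written = 0
    outfile = (output_dir / f"part-{part:05d}").open("w")
    try:
        with open(input_path) as infile:
            for line in infile:
                # Start a new split once the current one reaches the size
                # cap; whole lines are never broken across splits.
                if written >= max_split_size:
                    outfile.close()
                    part += 1
                    written = 0
                    outfile = (output_dir / f"part-{part:05d}").open("w")
                outfile.write(line)
                written += len(line)
    finally:
        outfile.close()
```

With this approach only the input file and one output file are open at any moment, so the descriptor count no longer depends on input size. The second option would shell out to the `split` utility instead (e.g. POSIX `split -l`, which manages its own file handles), at the cost of a platform dependency.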
(CC + credit for solutions @MattyMay)