From 098f8373b8c6e6ca90fcf7aaee5617010d9d02b6 Mon Sep 17 00:00:00 2001
From: noah-weingarden <33741795+noah-weingarden@users.noreply.github.com>
Date: Sat, 30 Mar 2024 03:16:40 -0400
Subject: [PATCH 1/4] Add custom partitioner to streaming tutorial

---
 README_Hadoop_Streaming.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/README_Hadoop_Streaming.md b/README_Hadoop_Streaming.md
index 7dab572..9d49bc7 100644
--- a/README_Hadoop_Streaming.md
+++ b/README_Hadoop_Streaming.md
@@ -286,6 +286,38 @@ if __name__ == "__main__":
     main()
 ```
 
+## Custom partitioner
+If you need to specify which key-value pairs are sent to which reducers, you can create a custom partitioner. Here's a sample that works with our word count example.
+```python
+#!/usr/bin/env -S python3 -u
+"""Word count partitioner."""
+import sys
+
+
+num_reducers = int(sys.argv[1])
+
+
+for line in sys.stdin:
+    key, value = line.split("\t")
+    if key[0] <= "G":
+        print(0 % num_reducers)
+    else:
+        print(1 % num_reducers)
+```
+
+Each line of output from the mappers is streamed to this partitioner, and the number of reducers is passed as its first command-line argument (`sys.argv[1]`). For each line, the partitioner prints the partition number of the reducer that should receive it: if the first character of the key is less than or equal to "G", the line goes to the first reducer, and otherwise it goes to the second reducer.
+
+Use the `-partitioner` command-line argument to tell Madoop to use this partitioner.
+
+```console
+$ madoop \
+    -input example/input \
+    -output example/output \
+    -mapper example/map.py \
+    -reducer example/reduce.py \
+    -partitioner example/partition.py
+```
+
 ## Tips and tricks
 These are some pro-tips for working with MapReduce programs written in Python for the Hadoop Streaming interface.
 

From 0e24b6a4938ee90a4a059b88307c172d1a4b8589 Mon Sep 17 00:00:00 2001
From: noah-weingarden <33741795+noah-weingarden@users.noreply.github.com>
Date: Sat, 30 Mar 2024 03:39:21 -0400
Subject: [PATCH 2/4] Add comparison to Hadoop

---
 README_Hadoop_Streaming.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README_Hadoop_Streaming.md b/README_Hadoop_Streaming.md
index 9d49bc7..81d0e38 100644
--- a/README_Hadoop_Streaming.md
+++ b/README_Hadoop_Streaming.md
@@ -318,6 +318,8 @@ $ madoop \
     -partitioner example/partition.py
 ```
 
+This feature works similarly to Hadoop's [`Partitioner` class](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html). The main difference is that Hadoop requires the partitioner to be a Java class, while Madoop accepts any executable that reads from `stdin` and writes to `stdout`.
+
 ## Tips and tricks
 These are some pro-tips for working with MapReduce programs written in Python for the Hadoop Streaming interface.
 

From f3b34f045dea16a6a922cc454d77c6633a6e967c Mon Sep 17 00:00:00 2001
From: noah-weingarden <33741795+noah-weingarden@users.noreply.github.com>
Date: Sat, 30 Mar 2024 22:39:50 -0400
Subject: [PATCH 3/4] not compatible with Hadoop

Co-authored-by: Andrew DeOrio
---
 README_Hadoop_Streaming.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README_Hadoop_Streaming.md b/README_Hadoop_Streaming.md
index 81d0e38..4f5bd5d 100644
--- a/README_Hadoop_Streaming.md
+++ b/README_Hadoop_Streaming.md
@@ -318,7 +318,7 @@ $ madoop \
     -partitioner example/partition.py
 ```
 
-This feature works similarly to Hadoop's [`Partitioner` class](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html). The main difference is that Hadoop requires the partitioner to be a Java class, while Madoop accepts any executable that reads from `stdin` and writes to `stdout`.
+This feature is similar to Hadoop's [`Partitioner` class](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html), although it is not directly compatible with Hadoop. The main difference is that Hadoop requires the partitioner to be a Java class, while Madoop accepts any executable that reads from `stdin` and writes to `stdout`.
 
 ## Tips and tricks
 These are some pro-tips for working with MapReduce programs written in Python for the Hadoop Streaming interface.

From c5a7b7f3011f53db18bf83729564b6bfac952d78 Mon Sep 17 00:00:00 2001
From: noah-weingarden <33741795+noah-weingarden@users.noreply.github.com>
Date: Sat, 30 Mar 2024 22:40:49 -0400
Subject: [PATCH 4/4] Move to the end

---
 README_Hadoop_Streaming.md | 68 ++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/README_Hadoop_Streaming.md b/README_Hadoop_Streaming.md
index 4f5bd5d..0fadafa 100644
--- a/README_Hadoop_Streaming.md
+++ b/README_Hadoop_Streaming.md
@@ -286,40 +286,6 @@ if __name__ == "__main__":
     main()
 ```
 
-## Custom partitioner
-If you need to specify which key-value pairs are sent to which reducers, you can create a custom partitioner. Here's a sample that works with our word count example.
-```python
-#!/usr/bin/env -S python3 -u
-"""Word count partitioner."""
-import sys
-
-
-num_reducers = int(sys.argv[1])
-
-
-for line in sys.stdin:
-    key, value = line.split("\t")
-    if key[0] <= "G":
-        print(0 % num_reducers)
-    else:
-        print(1 % num_reducers)
-```
-
-Each line of output from the mappers is streamed to this partitioner, and the number of reducers is passed as its first command-line argument (`sys.argv[1]`). For each line, the partitioner prints the partition number of the reducer that should receive it: if the first character of the key is less than or equal to "G", the line goes to the first reducer, and otherwise it goes to the second reducer.
-
-Use the `-partitioner` command-line argument to tell Madoop to use this partitioner.
-
-```console
-$ madoop \
-    -input example/input \
-    -output example/output \
-    -mapper example/map.py \
-    -reducer example/reduce.py \
-    -partitioner example/partition.py
-```
-
-This feature is similar to Hadoop's [`Partitioner` class](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html), although it is not directly compatible with Hadoop. The main difference is that Hadoop requires the partitioner to be a Java class, while Madoop accepts any executable that reads from `stdin` and writes to `stdout`.
-
 ## Tips and tricks
 These are some pro-tips for working with MapReduce programs written in Python for the Hadoop Streaming interface.
 
@@ -374,3 +340,37 @@ def reduce_one_group(key, group):
     for line in group:
         pass # Do something
 ```
+
+## Custom partitioner
+If you need to specify which key-value pairs are sent to which reducers, you can create a custom partitioner. Here's a sample that works with our word count example.
+```python
+#!/usr/bin/env -S python3 -u
+"""Word count partitioner."""
+import sys
+
+
+num_reducers = int(sys.argv[1])
+
+
+for line in sys.stdin:
+    key, value = line.split("\t")
+    if key[0] <= "G":
+        print(0 % num_reducers)
+    else:
+        print(1 % num_reducers)
+```
+
+Each line of output from the mappers is streamed to this partitioner, and the number of reducers is passed as its first command-line argument (`sys.argv[1]`). For each line, the partitioner prints the partition number of the reducer that should receive it: if the first character of the key is less than or equal to "G", the line goes to the first reducer, and otherwise it goes to the second reducer.
+
+Use the `-partitioner` command-line argument to tell Madoop to use this partitioner.
+
+```console
+$ madoop \
+    -input example/input \
+    -output example/output \
+    -mapper example/map.py \
+    -reducer example/reduce.py \
+    -partitioner example/partition.py
+```
+
+This feature is similar to Hadoop's [`Partitioner` class](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html), although it is not directly compatible with Hadoop. The main difference is that Hadoop requires the partitioner to be a Java class, while Madoop accepts any executable that reads from `stdin` and writes to `stdout`.
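
As a quick sanity check, you can also run the partitioner by hand, outside of Madoop. The example below is an illustrative sketch only: it assumes the script above is saved as `example/partition.py` and marked executable, and it feeds in two made-up lines of mapper output with capitalized keys (`Apple` and `Zebra`) and two reducers.

```console
$ chmod +x example/partition.py
$ printf 'Apple\t1\nZebra\t1\n' | example/partition.py 2
0
1
```

`Apple` starts with a character that compares less than or equal to "G", so it goes to partition 0; `Zebra` compares greater, so it goes to partition 1. Note that the comparison uses the raw first character, so lowercase keys (which sort after "G" in ASCII) would all land in the second partition; adjust the cutoff if your map output is lowercase.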