Skip to content

Commit

Permalink
Merge remote-tracking branch 'origin/develop'
Browse files Browse the repository at this point in the history
  • Loading branch information
awdeorio committed Nov 7, 2022
2 parents 4c21100 + a65488c commit 5067e94
Show file tree
Hide file tree
Showing 5 changed files with 27 additions and 20 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ Contributing to Madoop
Set up a development virtual environment.
```console
$ python3 -m venv .venv
$ source env/bin/activate
$ source .venv/bin/activate
$ pip install --editable .[dev,test]
```

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ Madoop: Michigan Hadoop

Michigan Hadoop (`madoop`) is a light weight MapReduce framework for education. Madoop implements the [Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) interface. Madoop is implemented in Python and runs on a single machine.

For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our [Hadoop Streaming tutorial](README_hadoop_streaming.md).
For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our [Hadoop Streaming tutorial](README_Hadoop_Streaming.md).


## Quick start
Expand Down
40 changes: 23 additions & 17 deletions README_Hadoop_Streaming.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,12 +22,11 @@ example

Execute the example MapReduce program using Madoop and show the output.
```console
$ cd example
$ madoop \
-input input \
-output output \
-mapper map.py \
-reducer reduce.py
-input example/input \
-output example/output \
-mapper example/map.py \
-reducer example/reduce.py
$ cat output/part-*
Goodbye 1
Expand All @@ -40,6 +39,9 @@ Hello 2
## Overview
[Hadoop Streaming](https://hadoop.apache.org/docs/r1.2.1/streaming.html) is a MapReduce API that works with any programming language. The mapper and the reducer are executables that read input from stdin and write output to stdout.

## Partition
The MapReduce framework begins by partitioning (splitting) the input. If the input size is large, a real MapReduce framework will break it up into smaller chunks. Each Map execution will process one input partition. In this tutorial, we're faking MapReduce at the command line with a single mapper, so we'll skip the partition step.

## Map
The mapper is an executable that reads input from stdin and writes output to stdout. Here's an example `map.py` which is part of a word count MapReduce program.
```python
Expand Down Expand Up @@ -109,7 +111,7 @@ def main():
word, _, count = line.partition("\t")
word_count[word] += int(count)
for word, count in word_count.items():
print(f"{word}\t{count}")
print(f"{word} {count}")

if __name__ == "__main__":
main()
Expand Down Expand Up @@ -151,9 +153,9 @@ The reduce output format is up to the programmer. Here's how to run the whole w
$ cat input/input* | python3 map.py | sort | python3 reduce.py
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Hadoop 2
Hello 2
World 2
```

## `itertools.groupby()`
Expand All @@ -172,7 +174,7 @@ def main():
word, _, count = line.partition("\t")
word_count[word] += int(count)
for word, count in word_count.items():
print(f"{word}\t{count}")
print(f"{word} {count}")
```

If one reducer execution received one group, we could simplify the reducer and use only O(1) memory.
Expand All @@ -184,7 +186,7 @@ def reduce_one_group(key, group):
for line in group:
count = line.partition("\t")[2]
word_count += int(count)
print(f"{key}\t{word_count}")
print(f"{key} {word_count}")
```

If one reducer execution input contains multiple groups, how can we process one group at a time? We'll use `itertools.groupby()`.
Expand Down Expand Up @@ -238,24 +240,28 @@ def reduce_one_group(key, group):
for line in group:
count = line.partition("\t")[2]
word_count += int(count)
print(f"{key}\t{word_count}")
print(f"{key} {word_count}")
```

Finally, we can run our entire MapReduce program.
```console
$ cat input/* | ./map.py | sort| ./reduce.py
$ cat input/* | ./map.py | sort | ./reduce.py
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Hadoop 2
Hello 2
World 2
```

### Template reducer
Here's a template you can copy-paste to get started on a reducer. The only part you need to edit is the `"IMPLEMENT ME"` line.
```python
#!/usr/bin/env python3
"""Word count reducer."""
"""
Template reducer.
https://github.com/eecs485staff/madoop/blob/main/README_Hadoop_Streaming.md
"""
import sys
import itertools

Expand Down
1 change: 1 addition & 0 deletions madoop/mapreduce.py
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,7 @@ def is_executable(exe):
result in difficult-to-understand error messages.
"""
exe = pathlib.Path(exe).resolve()
try:
subprocess.run(
str(exe),
Expand Down
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@
description="A light weight MapReduce framework for education.",
long_description=LONG_DESCRIPTION,
long_description_content_type="text/markdown",
version="1.0.1",
version="1.0.2",
author="Andrew DeOrio",
author_email="[email protected]",
url="https://github.com/eecs485staff/madoop/",
Expand Down

0 comments on commit 5067e94

Please sign in to comment.