Project 2 of the graduate Distributed Systems course at Pitt.
Amanda Crawford: acrawfor
Raphael
- Project Overview
- MapReduce Design Considerations
- Mini Google Overview
- Hadoop
- Hadoop Phases
- Hadoop Cluster
- Hadoop Communication Paradigms
- Job Tracker - Task Tracker
- Map to Combine
- Combine to Partition
- Partition to Reducer
- HDFS Design Considerations
- HDFS Operations
- Components
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results? How do we know all the workers have finished?
- What if workers fail?
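
The questions above can be explored with a small work queue. What follows is a minimal, single-process sketch, not the project's implementation: `run_work_queue`, `process`, and `max_attempts` are hypothetical names, workers are simulated sequentially, and work units are assumed to be hashable. Units are handed out round-robin (so there may be more units than workers), a unit whose worker raises is requeued for another worker, and the job is finished once the queue is empty.

```python
from collections import deque


def run_work_queue(work_units, workers, process, max_attempts=3):
    """Assign work units to workers round-robin and requeue failed units.

    Hypothetical helper: `process(worker, unit)` stands in for sending a
    unit to a remote worker and waiting for its partial result.
    """
    # Each queue entry is (unit, attempt number).
    queue = deque((unit, 1) for unit in work_units)
    results = {}
    turn = 0
    while queue:
        unit, attempt = queue.popleft()
        worker = workers[turn % len(workers)]  # next worker in rotation
        turn += 1
        try:
            results[unit] = process(worker, unit)
        except Exception:
            if attempt >= max_attempts:
                raise  # give up after too many failed attempts
            # Put the unit back so a later turn retries it.
            queue.append((unit, attempt + 1))
    # An empty queue means every unit has a result: all work is done.
    return results
```

Aggregating partial results is then a matter of combining the values in the returned dictionary, for example by summing or merging them.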
1. Hadoop Implementation
- HDFS
- MapReduce Engine
- HBase / Lucene (will not be used in the project, but influences the design)
2. Alternate Implementation
- Distributed File System - Data Storage and Batch Processing
- MapReduce - Computing
- Inverted Index - Searching and Query Handling
- Map
- Sort / Shuffle / Aggregate
- Reduce
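
The three phases listed above can be illustrated with a small in-memory inverted index builder. This is a sketch under stated assumptions rather than Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `build_inverted_index`) are hypothetical, documents are plain strings split on whitespace, and everything runs in one process.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word in the document."""
    return [(word.lower(), doc_id) for word in text.split()]


def shuffle_phase(pairs):
    """Sort/Shuffle/Aggregate: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(word, doc_ids):
    """Reduce: collapse the doc ids for a word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run the three phases over a {doc_id: text} dictionary."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle_phase(pairs))
```

A query for a single word is then a dictionary lookup, for example `build_inverted_index(docs).get("hadoop", [])`.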
- Master Nodes
- YARN Resource Manager
- HDFS Name Node
- Job Trackers
- Work Queue
- Worker Nodes
- YARN NodeManager
- HDFS DataNode
- Task Tracker and tasks
- Task Trackers
- Create and remove tasks received from the job tracker
- Communicate task status to the job tracker by sending heartbeats
- Job Tracker
- Manages task trackers
- Schedules and tracks job progress
- Receives jobs from clients
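
The heartbeat exchange between task trackers and the job tracker can be sketched as follows. This is a simplified model, not Hadoop's actual classes: the `JobTracker` class here, its method names, and the 10-second timeout are all assumptions chosen for illustration, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
class JobTracker:
    """Hypothetical job tracker that records heartbeats from task trackers."""

    def __init__(self, timeout_seconds=10.0):
        # Assumed timeout: a tracker silent for longer is considered dead.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}    # tracker id -> time of last heartbeat
        self.task_status = {}  # task id -> most recently reported status

    def receive_heartbeat(self, tracker_id, statuses, now):
        """Record that a task tracker is alive and merge its task statuses."""
        self.last_seen[tracker_id] = now
        self.task_status.update(statuses)

    def dead_trackers(self, now):
        """Return ids of trackers whose last heartbeat is older than the timeout."""
        return sorted(
            tracker_id
            for tracker_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Tasks owned by a tracker returned from `dead_trackers` would be candidates for rescheduling on another worker.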
Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
- Read - Load data
- Name Node
- Data Node
- Task Tracker
Sources: MapReduce & Hadoop
- [Hadoop HDFS](https://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm)
- [MapReduce](https://www.guru99.com/introduction-to-mapreduce.html)
- [MapReduce](https://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf)
- [MapReduce](https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm) (thorough MapReduce coverage)