Distributed-System-Mini-Google

Project two of the graduate Distributed Systems course at Pitt.

Amanda Crawford: acrawfor

Raphael

  1. Project Overview
    1. Map Reduce Design Considerations
  2. Mini Google Overview
  3. Hadoop
    1. Hadoop Phases
    2. Hadoop Cluster
  4. Hadoop Communication Paradigms
    1. Job Tracker - Task Tracker
    2. Map to Combine
    3. Combine to Partition
    4. Partition to Reducer
  5. HDFS Design Considerations
    1. HDFS Operations
    2. Components

Project Overview

Map Reduce Design Considerations

  • How do we assign work units to workers?
  • What if we have more work units than workers?
  • What if workers need to share partial results?
  • How do we aggregate partial results?
  • How do we know all the workers have finished?
  • What if workers fail?
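The questions above can be made concrete with a small sketch. It is not this project's scheduler: it uses Python's standard `concurrent.futures` thread pool as a stand-in for remote workers, and the names `count_words`, `run_job`, and `max_retries` are hypothetical. Work units beyond the number of workers simply wait in the pool's queue, partial results are merged into one `Counter`, the job is finished when no futures remain pending, and a failed unit is resubmitted a limited number of times.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_words(chunk):
    """One work unit (hypothetical): count the words in a chunk of text."""
    return Counter(chunk.split())


def run_job(chunks, num_workers=2, max_retries=2):
    """Run count_words over every chunk and merge the partial results."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # More units than workers is fine: extra units queue inside the pool.
        pending = {pool.submit(count_words, c): (c, 0) for c in chunks}
        # The job is done once nothing is left in `pending`.
        while pending:
            for future in as_completed(list(pending)):
                chunk, attempts = pending.pop(future)
                try:
                    # Aggregate this worker's partial result.
                    totals.update(future.result())
                except Exception:
                    # A failed unit is resubmitted until retries run out.
                    if attempts >= max_retries:
                        raise
                    retry = pool.submit(count_words, chunk)
                    pending[retry] = (chunk, attempts + 1)
    return totals


result = run_job(["a b a", "b c", "a"], num_workers=2)
```

Threads keep the sketch short; in a real cluster the workers would be separate processes or machines, and failure detection would need timeouts as well as exceptions.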

Mini Google Overview

1. Hadoop Implementation

  • HDFS
  • MapReduce Engine
  • HBase / Lucene (not used in this project, but they influence the design)

2. Alternate Implementation

  • Distributed File System - Data Storage and Batch Processing
  • MapReduce - Computing
  • Inverted Index - Searching and Query Handling
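As a rough illustration of the inverted index component listed above, the sketch below maps each word to the set of documents that contain it and answers a query by intersecting those sets. The function names `build_index` and `search`, the whitespace tokenizer, and the AND-only query semantics are assumptions made for the example, not the project's actual design.

```python
from collections import defaultdict


def build_index(docs):
    """Build word -> set of document ids from a {doc_id: text} mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "d1": "map reduce engine",
    "d2": "distributed file system",
    "d3": "map of the file system",
}
index = build_index(docs)
hits = search(index, "file system")
```

In a MapReduce setting, the map step would emit `(word, doc_id)` pairs and the reduce step would collect the ids for each word, which produces the same structure.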

Hadoop

Hadoop Phases

  1. Map
  2. Sort / Shuffle / Aggregate
  3. Reduce
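The three phases above can be traced with a single-process word count. This is only a sketch of the data flow, not Hadoop's API: `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical names, and everything runs in memory on one machine.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Sort / Shuffle / Aggregate: group values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


lines = ["a b", "b c b"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```

Each phase only depends on the output of the previous one, which is what lets a cluster run many mappers and reducers in parallel.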

Hadoop Cluster

yarn

  • Master Nodes
    • Yarn Resource Manager
    • HDFS Name Node
    • Job Trackers
    • Work Queue
  • Worker Nodes
    • Yarn NodeManager
    • HDFS DataNode
    • Task Tracker and tasks

Hadoop Communication Paradigms

Job Tracker - Task Tracker

  • Task Trackers
    • Create and remove tasks received from the Job Tracker
    • Communicate task status to the Job Tracker by sending heartbeats
  • Job Tracker
    • Manages Task Trackers
    • Schedules jobs and tracks their progress
    • Receives jobs from clients
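The heartbeat exchange described above might be sketched as follows. This is a simplified model under stated assumptions, not Hadoop's protocol: the `JobTracker` class, its `heartbeat` and `dead_trackers` methods, and the 10-second timeout are all hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class JobTracker:
    """Minimal model of a job tracker that listens for heartbeats."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout   # seconds of silence before a tracker is suspect
        self.last_seen = {}      # task tracker id -> time of last heartbeat
        self.task_status = {}    # task id -> latest reported status

    def heartbeat(self, tracker_id, statuses, now):
        """Record a heartbeat carrying {task_id: status} from a task tracker."""
        self.last_seen[tracker_id] = now
        self.task_status.update(statuses)

    def dead_trackers(self, now):
        """Return the trackers whose last heartbeat is older than the timeout."""
        return sorted(
            tracker_id
            for tracker_id, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


jt = JobTracker(timeout=10.0)
jt.heartbeat("tt1", {"task-1": "running"}, now=0.0)
jt.heartbeat("tt2", {"task-2": "running"}, now=0.0)
jt.heartbeat("tt1", {"task-1": "done"}, now=8.0)
suspects = jt.dead_trackers(now=12.0)
```

Here `tt2` has been silent for 12 seconds and is reported, while `tt1` checked in 4 seconds ago. A fuller version would reschedule the tasks owned by a dead tracker.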

Map to Combine

Combine to Partition

Partition to Reducer
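As a hedged sketch of how the map, combine, partition, and reduce hand-offs commonly fit together: a combiner pre-aggregates each mapper's output locally, and a partitioner decides which reducer receives each key. The names `combine`, `partition`, and `route` are hypothetical, and hashing with `zlib.crc32` is an assumption chosen because it is stable across runs (Python's built-in `hash()` for strings is not).

```python
import zlib
from collections import Counter, defaultdict


def combine(pairs):
    """Combiner: sum the values per key on the mapper side."""
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return list(combined.items())


def partition(key, num_reducers):
    """Partitioner: map a key to a reducer index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Group combined pairs by the reducer that should receive them."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


mapper_output = [("a", 1), ("b", 1), ("a", 1)]
combined = combine(mapper_output)
buckets = route(combined, num_reducers=2)
```

Because the partition depends only on the key, every mapper sends a given word to the same reducer, so each reducer sees all values for its keys.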

HDFS Design Considerations

HDFS architecture diagram. Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

HDFS Operations

  • Read - Load data

Components

  • Name Node
  • Data Node
  • Task Tracker

Sources: Map Reduce & Hadoop

Hadoop YARN

Hadoop HDFS

[Hadoop HDFS](https://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm)

Searching HDFS

HBase

[MapReduce](https://www.guru99.com/introduction-to-mapreduce.html)

[MapReduce](https://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf)

[MapReduce](https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm) (thorough)

RPYC

Multiprocessing

Inverted Index