
Project number two of Distributed System course at Pitt for graduate students.


raphaelcfernandes/Distributed-System-Mini-Google

Distributed-System-Mini-Google

Project number two of Distributed System course at Pitt for graduate students.

Amanda Crawford: acrawfor

Raphael

  1. Project Overview
    1. Map Reduce Design Considerations
  2. Mini Google Overview
  3. Hadoop
    1. Hadoop Phases
    2. Hadoop Cluster
  4. Hadoop Communication Paradigms
    1. Job Tracker - Task Tracker
    2. Map to Combine
    3. Combine to Partition
    4. Partition to Reducer
  5. HDFS Design Considerations
    1. HDFS Operations
    2. Components

Project Overview

Map Reduce Design Considerations

  • How do we assign work units to workers?
  • What if we have more work units than workers?
  • What if workers need to share partial results?
  • How do we aggregate partial results?
  • How do we know all the workers have finished?
  • What if workers fail?
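The questions above can be made concrete with a small sketch. It assumes a thread pool stands in for the workers and that a failed work unit is simply resubmitted; `count_words`, `run_job`, and `max_attempts` are hypothetical names chosen for illustration, not part of the project.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_words(chunk):
    """Hypothetical work unit: count the words in one chunk of text."""
    return len(chunk.split())


def run_job(chunks, num_workers=2, max_attempts=3, task=count_words):
    """Run `task` over every chunk and return the sum of the partial results."""
    attempts = {i: 0 for i in range(len(chunks))}
    partials = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # More work units than workers: the pool queues the extra units.
        pending = {pool.submit(task, c): i for i, c in enumerate(chunks)}
        while pending:
            for future in as_completed(list(pending)):
                i = pending.pop(future)
                attempts[i] += 1
                try:
                    partials[i] = future.result()
                except Exception:
                    if attempts[i] >= max_attempts:
                        raise
                    # Worker failure: resubmit the same work unit.
                    pending[pool.submit(task, chunks[i])] = i
    # Every unit has reported a result, so the job is finished.
    return sum(partials.values())
```

Here the master knows the job is done when no futures remain pending, and aggregation is a plain sum over the per-unit results. A real cluster would replace the thread pool with remote workers.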

Mini Google Overview

1. Hadoop Implementation

  • HDFS
  • MapReduce Engine
  • HBase / Lucene (will not be used in the project but influences the design)

2. Alternate Implementation

  • Distributed File System - Data Storage and Batch Processing
  • MapReduce - Computing
  • Inverted Index - Searching and Query Handling
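To illustrate the inverted index used for searching and query handling, here is a minimal sketch. It assumes whitespace tokenization and ranking by total term count; `build_index` and `search` are hypothetical names, and the project's actual index layout may differ.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Return ids of documents containing every query term, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Keep only documents that appear in the posting list of every term.
    common = set(postings[0]).intersection(*postings[1:])

    def score(doc_id):
        return sum(p[doc_id] for p in postings)

    # Sort by descending score, then by id so ties are deterministic.
    return sorted(common, key=lambda d: (-score(d), d))
```

A query is answered by intersecting posting lists rather than scanning documents, which is why the index is built ahead of time by the batch (MapReduce) side.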

Hadoop

Hadoop Phases

  1. Map
  2. Sort / Shuffle / Aggregate
  3. Reduce
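The three phases can be sketched in a single process with word count as the example job. This is only an illustration of the data flow, not Hadoop's API; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Sort / Shuffle: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Chain the three phases over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))
```

For example, `word_count(["a b a", "b c"])` groups the pairs into `{"a": [1, 1], "b": [1, 1], "c": [1]}` before reducing.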

Hadoop Cluster

yarn

  • Master Nodes
    • Yarn Resource Manager
    • HDFS Name Node
    • Job Trackers
    • Work Queue
  • Worker Nodes
    • Yarn NodeManager
    • HDFS DataNode
    • Task Tracker and tasks

Hadoop Communication Paradigms

Job Tracker - Task Tracker

  • Task Trackers
    • Create and remove tasks received from the job tracker
    • Communicate task status to the job tracker by sending heartbeats
  • Job Tracker
    • Manages task trackers
    • Schedules and tracks job progress
    • Receives jobs from clients
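The heartbeat exchange described above might look roughly like the following. This is a sketch under assumptions: the `JobTracker` class, the 10-second timeout, and the status strings are all hypothetical, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class JobTracker:
    """Records heartbeats and flags task trackers that have gone silent."""

    def __init__(self, timeout=10.0, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a tracker is presumed dead
        self.clock = clock       # injectable time source (assumption, eases testing)
        self.last_seen = {}      # tracker id -> time of last heartbeat
        self.status = {}         # tracker id -> {task id: state}

    def heartbeat(self, tracker_id, task_status):
        """Called by a task tracker to report that it is alive, plus task states."""
        self.last_seen[tracker_id] = self.clock()
        self.status[tracker_id] = dict(task_status)

    def dead_trackers(self):
        """Trackers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen > self.timeout)

    def tasks_to_reschedule(self):
        """Unfinished tasks that were last reported by a dead tracker."""
        return sorted(task
                      for tracker in self.dead_trackers()
                      for task, state in self.status[tracker].items()
                      if state != "done")
```

Piggybacking task status on the heartbeat means the job tracker needs no separate status channel: a missing heartbeat is itself the failure signal.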

Map to Combine

Combine to Partition

Partition to Reducer
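As a rough illustration of the map, combine, partition, and reduce hand-offs named in these headings, the sketch below assumes a word-count style job, a combiner that pre-sums values on the mapper side, and a CRC32 hash partitioner. All function names are hypothetical, and Hadoop's own default partitioner works on the key's Java hash code rather than CRC32.

```python
import zlib
from collections import Counter, defaultdict


def combine(pairs):
    """Combine: pre-aggregate one mapper's (key, value) pairs locally."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partition: pick a reducer with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(combined, num_reducers):
    """Group combined pairs by the reducer that should receive them."""
    buckets = defaultdict(list)
    for key, value in combined:
        buckets[partition(key, num_reducers)].append((key, value))
    return dict(buckets)


def reduce_bucket(pairs):
    """Reduce: sum the values a single reducer received for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Because the partition depends only on the key, every mapper sends a given word to the same reducer, so each reducer can finish its keys without talking to the others. Python's built-in `hash` is avoided here because it is salted per process for strings.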

HDFS Design Considerations

Figure: HDFS architecture. Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

HDFS Operations

  • Read - Load data

Components

  • Name Node
  • Data Node
  • Task Tracker
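The split between the Name Node (metadata) and Data Nodes (block storage) on a read can be sketched as follows. This is a toy in-memory model under assumptions: the tiny block size, round-robin replica placement, and the `allocate`, `locate`, `write_file`, and `read_file` names are all hypothetical.

```python
class DataNode:
    """Stores raw blocks; knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: file -> block ids, block id -> replica locations."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = min(replication, len(datanodes))
        self.files = {}
        self.locations = {}
        self.cursor = 0  # round-robin placement pointer (assumption)

    def allocate(self, path, num_blocks):
        """Assign block ids and replica data nodes for a new file."""
        self.files[path] = []
        for n in range(num_blocks):
            block_id = f"{path}#{n}"
            replicas = [self.datanodes[(self.cursor + r) % len(self.datanodes)]
                        for r in range(self.replication)]
            self.cursor += 1
            self.locations[block_id] = replicas
            self.files[path].append(block_id)
        return self.locate(path)

    def locate(self, path):
        """Return the ordered (block id, replicas) list for a file."""
        return [(b, self.locations[b]) for b in self.files[path]]


def write_file(namenode, path, data, block_size=4):
    """Client write: get placements from the name node, send blocks to data nodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for (block_id, replicas), chunk in zip(namenode.allocate(path, len(chunks)), chunks):
        for node in replicas:
            node.store(block_id, chunk)


def read_file(namenode, path, down=()):
    """Client read: look up block locations, then fetch each block from a live replica."""
    parts = []
    for block_id, replicas in namenode.locate(path):
        alive = [node for node in replicas if node.name not in down]
        if not alive:
            raise IOError(f"no live replica for {block_id}")
        parts.append(alive[0].read(block_id))
    return b"".join(parts)
```

File bytes never pass through the name node in this model; the client only asks it where the blocks live and then talks to the data nodes directly, falling back to another replica if one is down.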

Sources: Map Reduce & Hadoop

Hadoop YARN

Hadoop HDFS

Hadoop HDFS: https://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm

Searching HDFS

HBase

MapReduce: https://www.guru99.com/introduction-to-mapreduce.html

MapReduce: https://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf

MapReduce: https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm

MAP Reduce Thorough

RPYC

Multiprocessing

Inverted Index
