Elastic MapReduce

Managed cluster platform based on the Hadoop framework to run Big Data tools such as Spark, Hive, Flink, Presto. MapReduce is an obsolete programming model from Hadoop.

Resources are managed be default with YARN (Yet Another Resource Negotiator)
Collection of nodes (EC2 instances) with different roles: master, core (host HDFS data) and task (which can be spot instances)
Provision nodes if core nodes fails and can add task nodes on the fly
Clusters can be started and scheduled from AWS Data Pipeline
Spark streaming dataset can be created from a Kineses Data Stream
Includes notebooks capabilities with Apache Zeppelin
Offers EMR Notebooks (for free!) as extension of Jupyter notebooks with more AWS integrations
- Backed up to S3
- Provision cluster from the notebook
- Secured through a VPC
Apart fom IAM, SSH and IAM roles along with notebooks tags, it can integrate Kerberos authentication

Storage

Hadoop Distributed File System (HDFS): Good performance but it's ephemeral, used only to save intermediate results
EMR File System (EMRFS): Access data in S3 directly via the s3:// scheme, saves the metadata in Dynamo DB
Local system: default disk in the EC2 instance

References

AWS Documentation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ElasticMapReduce.md

ElasticMapReduce.md

Elastic MapReduce

Storage

References

Files

ElasticMapReduce.md

Latest commit

History

ElasticMapReduce.md

File metadata and controls

Elastic MapReduce

Storage

References