
Using Boto3 to create EMR cluster. #8

Open
rahul22022 opened this issue Sep 19, 2016 · 3 comments

@rahul22022

Hi All,

I am trying to automate EMR cluster creation using Boto3, and I need the cluster created with Impala configured.
Here are the params I passed to run_job_flow:
```python
Name='AutmateEMR',
ReleaseLabel='emr-4.6.0',
Instances={
    'InstanceGroups': [
        {'InstanceCount': 4, 'InstanceRole': 'CORE', 'InstanceType': 'r3.8xlarge', 'Name': 'slave'},
        {'InstanceCount': 1, 'InstanceRole': 'MASTER', 'InstanceType': 'r3.8xlarge', 'Name': 'master'},
    ],
    'Ec2KeyName': 'MyKey',
    'KeepJobFlowAliveWhenNoSteps': True,
    'TerminationProtected': False,
    'Ec2SubnetId': 'id',
    'EmrManagedMasterSecurityGroup': 'value',
    'EmrManagedSlaveSecurityGroup': 'value',
    'ServiceAccessSecurityGroup': 'value',
},
BootstrapActions=[{'Name': 'Install Impala2',
                   'ScriptBootstrapAction': {'Path': 's3://coeus/bigtop/impala/impala-install'}}],
Applications=[{'Name':'Hadoop','Name':'Spark','Name':'Ganglia','Name':'Hive','Name':'Presto-Sandbox'}],
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole',
VisibleToAllUsers=True,
Tags=[{'Key': 'owner', 'Value': 'myname'}],
Configurations=[
    {'Classification': 'hadoop-env', 'Properties': {},
     'Configurations': [{'Classification': 'export',
                         'Properties': {'JAVA_HOME': '/usr/lib/jvm/java-1.8.0'},
                         'Configurations': []}]},
    {'Classification': 'spark-env', 'Properties': {},
     'Configurations': [{'Classification': 'export',
                         'Properties': {'JAVA_HOME': '/usr/lib/jvm/java-1.8.0'},
                         'Configurations': []}]},
],
```

This code successfully creates the cluster, but when I try to run MapReduce jobs like distcp on it, it throws this error:
"Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster"

When I create the cluster from the console with the same parameters, it comes up fine and I can run the MapReduce commands (distcp) without any issues. I am not sure why the EMR cluster created with Boto3 has problems with the Hadoop config.

Here is the CLI export of the cluster I created using the console:

```shell
aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia Name=Presto-Sandbox Name=Hive --bootstrap-actions '[{"Path":"s3://coeus/bigtop/impala/impala-install","Name":"Custom action"}]' --tags 'owner=myname' --ec2-attributes '{"KeyName":"mykey","InstanceProfile":"EMR_EC2_DefaultRole","ServiceAccessSecurityGroup":"","SubnetId":"","EmrManagedSlaveSecurityGroup":"","EmrManagedMasterSecurityGroup":""}' --service-role EMR_DefaultRole --release-label emr-4.6.0 --log-uri ' ' --name 'automate' --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"r3.8xlarge","Name":"master"},{"InstanceCount":4,"InstanceGroupType":"CORE","InstanceType":"r3.8xlarge","Name":"slave"}]' --configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]}]' --region
```

I am out of ideas why this is happening. Any help is highly appreciated.

@bryanyang0528

Hi @rahul22022,
I ran your settings in boto3 and found a small problem in them.

Yours:

```python
Applications=[{'Name':'Hadoop','Name':'Spark','Name':'Ganglia','Name':'Hive','Name':'Presto-Sandbox'}]
```

According to the official documentation (https://boto3.readthedocs.io/en/latest/reference/services/emr.html#EMR.Client.run_job_flow), the Applications setting should be:

```python
Applications=[{'Name': 'Hadoop'}, {'Name': 'Spark'}, {'Name': 'Ganglia'}, {'Name': 'Hive'}, {'Name': 'Presto-Sandbox'}]
```

I think it will be OK if you update this line.
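To make the fix concrete, here is why the single-dict form silently drops applications. This is plain Python dict behavior, no AWS call needed; the link to the MRAppMaster error is my reading of it:

```python
# Correct form: one dict per application.
applications = [
    {'Name': 'Hadoop'},
    {'Name': 'Spark'},
    {'Name': 'Ganglia'},
    {'Name': 'Hive'},
    {'Name': 'Presto-Sandbox'},
]

# Buggy form: repeated keys in a single dict literal keep only the last
# value, so the request ends up asking EMR for just Presto-Sandbox.
# That would plausibly explain why Hadoop's MapReduce classes
# (MRAppMaster) are missing on the cluster.
buggy = [{'Name': 'Hadoop', 'Name': 'Spark', 'Name': 'Ganglia',
          'Name': 'Hive', 'Name': 'Presto-Sandbox'}]

print(len(applications))  # 5 applications requested
print(buggy)              # [{'Name': 'Presto-Sandbox'}]
```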

@AndresUrregoAngel

AndresUrregoAngel commented Aug 23, 2018

@rahul22022 My dude, how can you use the "InstanceProfile" field (as used in the AWS CLI) when you deploy the cluster with boto3? I have seen it in the documentation, but I don't see where it goes in the Instances argument of run_job_flow. Same question for the AWS CLI options --region and --enable-debugging.

@reed9999

reed9999 commented Sep 3, 2018

@AndresUrregoAngel If you look carefully at @rahul22022's example, it looks like JobFlowRole is the equivalent of InstanceProfile.

I'm new to AWS, and this boto3 Python API seems incredibly opaque and hard to figure out. The message in question complains about InstanceProfile, probably coming from deeper in the stack.

As for --region, I think it's the Instances parameter subscripted ['Placement']['AvailabilityZone'].

Somebody please correct me if I'm wrong of course.
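As I understand it, the region is not actually a run_job_flow argument at all; it is set on the boto3 client itself (e.g. `boto3.client('emr', region_name='us-east-1')`). The `Placement` key only pins an availability zone within that region, and is an alternative to `Ec2SubnetId`, since a subnet already implies one AZ. A minimal sketch of the relevant structure (region and AZ values below are placeholders):

```python
# Region lives on the client, not in run_job_flow:
#   emr = boto3.client('emr', region_name='us-east-1')
# Inside run_job_flow, Instances['Placement'] can pin an availability
# zone; use it instead of Ec2SubnetId, not together with it.
instances = {
    'InstanceGroups': [
        {'InstanceCount': 1, 'InstanceRole': 'MASTER',
         'InstanceType': 'r3.8xlarge', 'Name': 'master'},
    ],
    'Placement': {'AvailabilityZone': 'us-east-1a'},  # placeholder AZ
}

print(instances['Placement']['AvailabilityZone'])  # us-east-1a
```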
