
Using Boto3 to create EMR cluster. #8

Open
rahul22022 opened this issue Sep 19, 2016 · 3 comments

@rahul22022

Hi All,

I am trying to automate EMR cluster creation using Boto3, and I need the cluster created with Impala configured.
Here are the params I passed to run_job_flow:
```python
Name='AutmateEMR',
ReleaseLabel='emr-4.6.0',
Instances={
    'InstanceGroups': [
        {'InstanceCount': 4, 'InstanceRole': 'CORE', 'InstanceType': 'r3.8xlarge', 'Name': 'slave'},
        {'InstanceCount': 1, 'InstanceRole': 'MASTER', 'InstanceType': 'r3.8xlarge', 'Name': 'master'},
    ],
    'Ec2KeyName': 'MyKey',
    'KeepJobFlowAliveWhenNoSteps': True,
    'TerminationProtected': False,
    'Ec2SubnetId': 'id',
    'EmrManagedMasterSecurityGroup': 'value',
    'EmrManagedSlaveSecurityGroup': 'value',
    'ServiceAccessSecurityGroup': 'value',
},
BootstrapActions=[{'Name': 'Install Impala2',
                   'ScriptBootstrapAction': {'Path': 's3://coeus/bigtop/impala/impala-install'}}],
Applications=[{'Name':'Hadoop','Name':'Spark','Name':'Ganglia','Name':'Hive','Name':'Presto-Sandbox'}],
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole',
VisibleToAllUsers=True,
Tags=[{'Key': 'owner', 'Value': 'myname'}],
Configurations=[
    {'Classification': 'hadoop-env', 'Properties': {},
     'Configurations': [{'Classification': 'export',
                         'Properties': {'JAVA_HOME': '/usr/lib/jvm/java-1.8.0'},
                         'Configurations': []}]},
    {'Classification': 'spark-env', 'Properties': {},
     'Configurations': [{'Classification': 'export',
                         'Properties': {'JAVA_HOME': '/usr/lib/jvm/java-1.8.0'},
                         'Configurations': []}]},
],
```

This code successfully creates the cluster, but when I try to run MapReduce jobs like distcp on it, it throws this error:
"Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster"

When I create the cluster from the console with the same parameters, it comes up fine and I can run the MapReduce commands (distcp) without any issues. I am not sure why the EMR cluster created with Boto3 has problems with the Hadoop config.

Here is the CLI export of the cluster I created using the console:

```shell
aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia Name=Presto-Sandbox Name=Hive --bootstrap-actions '[{"Path":"s3://coeus/bigtop/impala/impala-install","Name":"Custom action"}]' --tags 'owner=myname' --ec2-attributes '{"KeyName":"mykey","InstanceProfile":"EMR_EC2_DefaultRole","ServiceAccessSecurityGroup":"","SubnetId":"","EmrManagedSlaveSecurityGroup":"","EmrManagedMasterSecurityGroup":""}' --service-role EMR_DefaultRole --release-label emr-4.6.0 --log-uri ' ' --name 'automate' --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"r3.8xlarge","Name":"master"},{"InstanceCount":4,"InstanceGroupType":"CORE","InstanceType":"r3.8xlarge","Name":"slave"}]' --configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]}]' --region
```

I am out of ideas why this is happening. Any help is highly appreciated.

@bryanyang0528

Hi @rahul22022,
I ran your settings in boto3 and found a small problem in them.

Yours:

```python
Applications=[{'Name':'Hadoop','Name':'Spark','Name':'Ganglia','Name':'Hive','Name':'Presto-Sandbox'}]
```

According to the official documentation (https://boto3.readthedocs.io/en/latest/reference/services/emr.html#EMR.Client.run_job_flow), the Applications setting should be:

```python
Applications=[{'Name': 'Hadoop'}, {'Name': 'Spark'}, {'Name': 'Ganglia'}, {'Name': 'Hive'}, {'Name': 'Presto-Sandbox'}]
```

I think it will be OK if you update this line.
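To make the fix concrete, here is why the single-dict form silently drops applications. This is plain Python dict behavior, no AWS call needed; the link to the MRAppMaster error is my reading of it:

```python
# Correct form: one dict per application.
applications = [
    {'Name': 'Hadoop'},
    {'Name': 'Spark'},
    {'Name': 'Ganglia'},
    {'Name': 'Hive'},
    {'Name': 'Presto-Sandbox'},
]

# Buggy form: repeated keys in a single dict literal keep only the last
# value, so the request ends up asking EMR for just Presto-Sandbox.
# That would plausibly explain why Hadoop's MapReduce classes
# (MRAppMaster) are missing on the cluster.
buggy = [{'Name': 'Hadoop', 'Name': 'Spark', 'Name': 'Ganglia',
          'Name': 'Hive', 'Name': 'Presto-Sandbox'}]

print(len(applications))  # 5 applications requested
print(buggy)              # [{'Name': 'Presto-Sandbox'}]
```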

@AndresUrregoAngel

AndresUrregoAngel commented Aug 23, 2018

@rahul22022 My dude, how can you use the "InstanceProfile" field (as used in the AWS CLI) when you deploy the cluster with boto3? I have seen it in the documentation, but I don't see where it goes in the Instances argument of run_job_flow. Same question for the AWS CLI options --region and --enable-debugging.

@reed9999

reed9999 commented Sep 3, 2018

@AndresUrregoAngel If you look carefully at @rahul22022's example, it looks like JobFlowRole is the equivalent of InstanceProfile.

I'm new to AWS, and this boto3 Python API seems incredibly opaque and hard to figure out. The message in question complains about InstanceProfile, probably coming from deeper in the stack.

As for --region, I think it's the Instances parameter subscripted ['Placement']['AvailabilityZone'].

Somebody please correct me if I'm wrong of course.
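As I understand it, the region is not actually a run_job_flow argument at all; it is set on the boto3 client itself (e.g. `boto3.client('emr', region_name='us-east-1')`). The `Placement` key only pins an availability zone within that region, and is an alternative to `Ec2SubnetId`, since a subnet already implies one AZ. A minimal sketch of the relevant structure (region and AZ values below are placeholders):

```python
# Region lives on the client, not in run_job_flow:
#   emr = boto3.client('emr', region_name='us-east-1')
# Inside run_job_flow, Instances['Placement'] can pin an availability
# zone; use it instead of Ec2SubnetId, not together with it.
instances = {
    'InstanceGroups': [
        {'InstanceCount': 1, 'InstanceRole': 'MASTER',
         'InstanceType': 'r3.8xlarge', 'Name': 'master'},
    ],
    'Placement': {'AvailabilityZone': 'us-east-1a'},  # placeholder AZ
}

print(instances['Placement']['AvailabilityZone'])  # us-east-1a
```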
