Using Boto3 to create EMR cluster.
rahul22022 opened this issue · 3 comments
Hi All,
I am trying to automate EMR cluster creation using Boto3. I need a cluster created with Impala configured.
Here are the params I passed to run_job_flow:
Name='AutmateEMR',
ReleaseLabel='emr-4.6.0',
Instances={
'InstanceGroups': [{'InstanceCount':4,'InstanceRole':'CORE','InstanceType':'r3.8xlarge','Name':'slave'},{'InstanceCount':1,'InstanceRole':'MASTER','InstanceType':'r3.8xlarge','Name':'master'}],
'Ec2KeyName': 'MyKey',
'KeepJobFlowAliveWhenNoSteps': True,
'TerminationProtected': False,
'Ec2SubnetId': 'id',
'EmrManagedMasterSecurityGroup': 'value',
'EmrManagedSlaveSecurityGroup': 'value',
'ServiceAccessSecurityGroup': 'value',
},
BootstrapActions=[{'Name': 'Install Impala2','ScriptBootstrapAction': {'Path': 's3://coeus/bigtop/impala/impala-install'}}],
Applications=[{'Name':'Hadoop','Name':'Spark','Name':'Ganglia','Name':'Hive','Name':'Presto-Sandbox'}],
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole',
VisibleToAllUsers=True|False,
Tags=[{"Key":"owner","Value":"myname"}],
Configurations=[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]}]
This code successfully creates the cluster, but when I try to run MapReduce jobs like distcp on the cluster, it throws this error:
"Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster"
When I create the cluster using the console and pass the same parameters, the cluster gets created and I am able to run the MapReduce commands (distcp) without any issues. I am not sure why the EMR cluster created with Boto3 has issues with the Hadoop config.
Here is the CLI export of the cluster I created using the console:
aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia Name=Presto-Sandbox Name=Hive --bootstrap-actions '[{"Path":"s3://coeus/bigtop/impala/impala-install","Name":"Custom action"}]' --tags 'owner=myname' --ec2-attributes '{"KeyName":"mykey","InstanceProfile":"EMR_EC2_DefaultRole","ServiceAccessSecurityGroup":"","SubnetId":"","EmrManagedSlaveSecurityGroup":"","EmrManagedMasterSecurityGroup":""}' --service-role EMR_DefaultRole --release-label emr-4.6.0 --log-uri ' ' --name 'automate' --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"r3.8xlarge","Name":"master"},{"InstanceCount":4,"InstanceGroupType":"CORE","InstanceType":"r3.8xlarge","Name":"slave"}]' --configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]}]' --region
I am out of ideas as to why this is happening. Any help is highly appreciated.
Hi @rahul22022,
I ran your settings in boto3 and found a small problem in them.
yours: Applications=[{'Name':'Hadoop','Name':'Spark','Name':'Ganglia','Name':'Hive','Name':'Presto-Sandbox'}]
According to the official documentation at https://boto3.readthedocs.io/en/latest/reference/services/emr.html#EMR.Client.run_job_flow
the Applications setting should be:
Applications=[{'Name':'Hadoop'},{'Name':'Spark'},{'Name':'Ganglia'},{'Name':'Hive'},{'Name':'Presto-Sandbox'}]
I think it will be OK once you update this line.
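The reason this matters is ordinary Python behaviour: a dict literal with a repeated key keeps only the last value, so the original list contained a single application and Hadoop was never requested. That would plausibly explain the missing MRAppMaster class. A quick check that needs nothing beyond the interpreter:

```python
# A dict literal with a repeated key keeps only the last value.
broken = [{'Name': 'Hadoop', 'Name': 'Spark', 'Name': 'Ganglia',
           'Name': 'Hive', 'Name': 'Presto-Sandbox'}]
print(broken)

# One dict per application, which is the shape run_job_flow expects.
names = ['Hadoop', 'Spark', 'Ganglia', 'Hive', 'Presto-Sandbox']
applications = [{'Name': name} for name in names]
print(len(applications))
```

The first print shows a list with one dict naming only Presto-Sandbox; the second shows five entries.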
@rahul22022 My dude, how can you set the "InstanceProfile" attribute used in the AWS CLI when you deploy the cluster with boto3? I have seen it in the documentation, but I don't see where it goes in the Instances parameter for run_job_flow. The same question applies to the AWS CLI options --region and --enable-debugging.
@AndresUrregoAngel If you look carefully at @rahul22022's example, it looks like JobFlowRole is the equivalent of InstanceProfile.
I'm new to AWS and this boto3 Python API seems incredibly opaque, hard to figure out. The message in question complains about InstanceProfile, probably coming from deeper in the stack.
As for --region I think it's the Instances parameter subscripted ['Placement']['AvailabilityZone'].
Somebody please correct me if I'm wrong of course.
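For what it's worth, here is a minimal sketch of how the CLI options in this thread seem to map onto boto3. JobFlowRole carries the instance profile name, as in the original post. The region is normally given when the client is created, through `region_name`, rather than inside Instances; `Placement.AvailabilityZone` only picks a zone within that region. The helper names `build_params` and `launch` and the region value in the commented-out call are hypothetical, and only a subset of the original parameters is shown.

```python
def build_params(instance_profile, service_role):
    """Assemble a subset of run_job_flow keyword arguments (hypothetical helper)."""
    apps = ('Hadoop', 'Spark', 'Ganglia', 'Hive', 'Presto-Sandbox')
    return {
        'Name': 'AutmateEMR',
        'ReleaseLabel': 'emr-4.6.0',
        'JobFlowRole': instance_profile,  # CLI: InstanceProfile in --ec2-attributes
        'ServiceRole': service_role,      # CLI: --service-role
        'Applications': [{'Name': name} for name in apps],
        'Instances': {
            'InstanceGroups': [
                {'Name': 'master', 'InstanceRole': 'MASTER',
                 'InstanceType': 'r3.8xlarge', 'InstanceCount': 1},
                {'Name': 'slave', 'InstanceRole': 'CORE',
                 'InstanceType': 'r3.8xlarge', 'InstanceCount': 4},
            ],
            'KeepJobFlowAliveWhenNoSteps': True,
            'TerminationProtected': False,
        },
    }


def launch(region_name, params):
    """Create the cluster; needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party, imported here so the rest runs without it
    emr = boto3.client('emr', region_name=region_name)  # CLI: --region
    return emr.run_job_flow(**params)['JobFlowId']


params = build_params('EMR_EC2_DefaultRole', 'EMR_DefaultRole')
print(params['JobFlowRole'])
# launch('us-east-1', params)  # hypothetical region; uncomment to create a cluster
```

Keeping the parameters in a plain dict makes it easy to inspect them before anything is sent to AWS.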