Instructions

  1. Launch the CloudFormation stack.
  2. SSH into the instance using the SSH information provided in the EMR console.
  3. Launch Zeppelin in your browser from the console.
  4. In Zeppelin, under Interpreter > Spark, set the following config options (see this Stack Overflow question and this Cloudera post for advice on how best to tune these values):
    • spark.executor.instances 18
    • spark.executor.cores 5
    • spark.executor.memory 16g
  5. When tuning these values, note that YARN will only allocate something like 80% of the system's RAM to containers, so that is the total RAM you should target. Furthermore, the memory YARN requests per executor will actually be 10-20% higher than the spark.executor.memory you specify, while the amount of memory actually available inside each container can be up to 40% lower. This may have something to do with how I am loading the jars here; I'm not sure. In any case, assume significant memory overhead when picking these values.
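    As a rough worked example (the numbers are illustrative, and the exact overhead depends on your Spark version and the spark.yarn.executor.memoryOverhead setting): with spark.executor.memory set to 16g, YARN will typically request around 17.6 GB per container once the default ~10% overhead is added. A node exposing 64512 MB to YARN (see step 8) therefore fits about three such executors, so the 18 executors above would need roughly six core nodes.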
  6. Inside the instance you have SSHed into, find the file capacity-scheduler.xml. It should be located in /etc/hadoop/conf.empty.
  7. In capacity-scheduler.xml there is an XML property that looks like this:
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>
      The ResourceCalculator implementation to be used to compare
      Resources in the scheduler.
      The default i.e. DefaultResourceCalculator only uses Memory while
      DominantResourceCalculator uses dominant-resource to compare
      multi-dimensional resources such as Memory, CPU etc.
    </description>
  </property>

This should be edited so that the value reads org.apache.hadoop.yarn.util.resource.DominantResourceCalculator instead of DefaultResourceCalculator.

  8. In yarn-site.xml, change

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
  </property>

so that it reflects the total amount of memory you want to allocate to executors on each node (in this case, 64512 MB).
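
The edited property would then look something like this (the 64512 MB figure comes from the step above and assumes a node with at least 64 GB of RAM, leaving the remainder for the OS and Hadoop daemons; adjust it to your instance type):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>64512</value>
  </property>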

  9. Also in yarn-site.xml, change
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>

so that it equals the number of cores on your machine minus 1.
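
For example, on an instance with 16 cores (an assumed size; substitute your machine's actual core count), the edited property would be:

  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>15</value>
  </property>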

  10. Verify in the Hadoop GUI that memory usage is near the maximum and that the total number of containers equals the number of executors you specified plus 1 (the extra container holds the YARN ApplicationMaster). If you end up with fewer containers than that, you probably set the memory per container too high. However, ignore the VCore usage appearing too low: the Hadoop GUI doesn't know the real number and just substitutes a default of 1 per container.
  11. Verify in the Spark GUI that things are operating as planned; you can access it by clicking the links in the Hadoop GUI for the job you are running. Note that these links are broken: they are relative to the master node's internal hostname rather than the public internet, so you will need to substitute the master node's public EC2 URL for the internal one.
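
For example (the hostnames, port, and application ID here are invented for illustration), a tracking link of the form

  http://ip-10-0-0-123.ec2.internal:8088/proxy/application_1234567890123_0001/

would need the internal hostname replaced with the master node's public DNS name, keeping the port and path unchanged:

  http://ec2-54-12-34-56.compute-1.amazonaws.com:8088/proxy/application_1234567890123_0001/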