---
layout: post
title: InstallationReadme
date: 2017-05-07 16:12
---

Installation

YARN-HADOOP

For the installation, we followed the Default Course tutorial.

Project-specific configuration

Hadoop and YARN are installed on all 8 VMs. The cluster is currently configured to run YARN and Hadoop only on the machines listed below, to avoid memory bottlenecks caused by the small student1 VMs.

  • student13-x1 Hadoop master
  • student85-x1 slave
  • student85-x2 slave
  • student13-x2 slave
  • student14-x1 slave
  • student14-x2 slave
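For reference, an assumption about the setup (not taken from the submitted files): with a standard Hadoop 2.6.0 layout, the slave hostnames above would also be listed in the slaves file under /opt/hadoop-2.6.0/etc/hadoop so the start scripts know where to launch the worker daemons:

    # /opt/hadoop-2.6.0/etc/hadoop/slaves (assumed contents)
    student85-x1
    student85-x2
    student13-x2
    student14-x1
    student14-x2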

Most Important Settings

yarn-site.xml

...
yarn.nodemanager.resource.memory-mb        5020
yarn.scheduler.minimum-allocation-mb       1024
yarn.scheduler.maximum-allocation-mb       5020
yarn.resourcemanager.nodes.exclude-path    /opt/hadoop-2.6.0/etc/hadoop/yarn.exclude   true
...
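These entries live in yarn-site.xml in the standard Hadoop <property> format; a minimal sketch of how two of them look there (the trailing "true" in the listing presumably corresponds to a <final>true</final> flag on the exclude-path property, which is an assumption):

    <!-- Illustrative excerpt of yarn-site.xml; values taken from the listing above -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>5020</value>
    </property>
    <property>
      <name>yarn.resourcemanager.nodes.exclude-path</name>
      <value>/opt/hadoop-2.6.0/etc/hadoop/yarn.exclude</value>
      <final>true</final>  <!-- assumed reading of the trailing "true" -->
    </property>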

yarn.exclude

student1-x2
student1-x1

hdfs-site.xml

...
dfs.replication                                             3
dfs.blocksize                                               64m
dfs.client.block.write.replace-datanode-on-failure.policy   ALWAYS
dfs.hosts.exclude                                           /opt/hadoop-2.6.0/etc/hadoop/dfs.exclude   true
...

dfs.exclude

student1-x2
student1-x1
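After the exclude lists are edited, the running daemons still have to re-read them. A minimal sketch using the standard Hadoop/YARN admin commands (whether the project used exactly these commands or simply a full restart via startHadoop is an assumption):

    # run as hduser on the master (student13-x1)
    hdfs dfsadmin -refreshNodes   # HDFS re-reads dfs.exclude and decommissions the listed datanodes
    yarn rmadmin -refreshNodes    # the ResourceManager re-reads yarn.exclude and stops scheduling on those nodes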

SPARK

For the installation, we followed the Default Course tutorial.

Project-specific configuration

spark-defaults.conf - the Spark settings that apply when nothing else is specified on the command line or in the program (see the override example after the listing):

spark.executor.instances 4
spark.yarn.archive hdfs://student13-x1:9000/addedlibs/spark-archive.zip
spark.executor.memory 4g
spark.driver.memory 4g
spark.yarn.executor.memoryOverhead 700
spark.yarn.driver.memoryOverhead 1000
spark.yarn.am.memoryOverhead 700
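Any of these defaults can be overridden for a single run; a minimal sketch (the values here are just illustrations, not project settings):

    pyspark --master yarn \
        --num-executors 2 \
        --executor-memory 2g \
        --conf spark.yarn.executor.memoryOverhead=512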

XEN

All cluster node settings

vcpus  = '1'
memory = '6096'
maxmem = '8098'

Student1 vm Settings

vcpus  = '1'
memory = '1024'

PySpark on Student13-x1

  • Step 1: Installed Jupyter and the IPython notebook and configured them to work together with Spark. The main tutorial1 I followed to install this module.
  • Step 2: Configured IPython to be accessible from outside localhost and added tunneling through cocserver on port 18813.
  • Step 3: Configured Spark and IPython so that a SparkContext can be created inside an IPython kernel. The basic idea came from this tutorial2, but some adjustments were needed to adapt it to this environment.
    • Added the following to the end of hduser's .profile on student13-x1 so that the SparkContext can be used by hduser:

      export PYSPARK_DRIVER_PYTHON=ipython
      export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=*'

    • Added the following to the end of .bashrc on student13-x1 so that the pyspark command is available everywhere (see the note after the snippet):

      export SPARK_HOME="/opt/spark-2.1.0-bin-hadoop2.6"
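      For the pyspark command itself to resolve from any directory, $SPARK_HOME/bin also has to be on the PATH. Presumably a line like the following accompanies the export above (an assumption; it is not shown in the original snippet):

      export PATH="$PATH:$SPARK_HOME/bin"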

Usage

Starting Pyspark

To start IPython together with Spark, execute the following as hduser@student13-x1:

pyspark --master yarn [other options]

Run it from your home directory or wherever you find most convenient. This starts an IPython notebook at http://202.45.128.135:18813/?token=dsmaofnfds (the token is different every time). At this point the SparkContext has not yet started.

To initialize the SparkContext you have to start an IPython kernel (i.e. open a notebook). Once the kernel is up, SparkContext initialization begins, but it takes some time; you can follow the progress in the YARN applications interface (see the example cell below).
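Once the kernel is ready, pyspark has already bound the SparkContext to the name sc in the notebook (this is what the PYSPARK_DRIVER_PYTHON settings above provide). A minimal first cell to verify the context and trigger a small job (illustrative only, not part of the project notebook):

    # sc is pre-created by pyspark when the notebook kernel starts
    print(sc.master)           # should print: yarn
    print(sc.applicationId)    # the id shown in the YARN applications interface

    # tiny job to confirm that the executors have come up
    print(sc.parallelize(range(100)).sum())   # 4950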

Executing Our Project Source

  1. Start pyspark as described under the previous heading.
  2. Execute the code snippets commented as Task 1, 2, 4, 3. In the terminal where you started pyspark you can follow the job and application output that Spark writes; you can also follow it in the YARN applications interface.

Monitoring the environment

From the ListOflinks web page you can systematically access:

Cluster References

  • Yarn and Hadoop Monitoring interfaces
  • Spark Monitoring interfaces
  • Project web-page
  • Ganglia

External project parts

Using management scripts

In the home directory of hduser@student13-x1, execute:

  • startHadoop - restarts the Hadoop environment, the YARN environment and the history server
  • shareSettings - shares the YARN settings across the cluster (an illustrative sketch follows below)
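As an illustration only (not the submitted script), a shareSettings-style script typically copies the YARN/Hadoop configuration files from the master to every slave, for example:

    # Illustrative sketch, not the submitted shareSettings script.
    # Hostnames taken from the cluster list above; paths from the settings sections.
    for host in student13-x2 student14-x1 student14-x2 student85-x1 student85-x2; do
        scp /opt/hadoop-2.6.0/etc/hadoop/yarn-site.xml \
            /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml \
            /opt/hadoop-2.6.0/etc/hadoop/yarn.exclude \
            /opt/hadoop-2.6.0/etc/hadoop/dfs.exclude \
            hduser@$host:/opt/hadoop-2.6.0/etc/hadoop/
    done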

Files submitted with this report

YARN HADOOP SETTINGS

  • yarn-site.xml
  • hdfs-site.xml
  • yarn.exclude
  • dfs.exclude

SPARK SETTINGS

  • spark-defaults.conf
  • spark-env.sh

SOURCE CODE FOR DATA ANALYSIS

  • max_count.ipynb

WEB PAGE SOURCE CODE

/ngram folder

XEN

  • student13-x1.cfg - shares settings with student13-x2.cfg, student14-x1.cfg, student14-x2.cfg, student85-x1.cfg, student85-x2.cfg
  • student1-x1.cfg - shares settings with student1-x2.cfg

MANAGEMENT INTERFACE

  • ListOflinks.html
  • YarnSparkTutorial.h

SCRIPTS

  • startHadoop
  • shareSettings