---
layout: post
title: InstallationReadme
date: 2017-05-07 16:12
---

Installation

YARN-HADOOP

For the installation, we followed the Default Course tutorial.

Project-specific configuration

Hadoop and YARN are installed on all 8 VMs. The cluster is currently configured to run YARN and Hadoop only on the machines listed below, to avoid memory bottlenecks caused by the small student1 VMs.

  • student13-x1 Hadoop master
  • student85-x1 slave
  • student85-x2 slave
  • student13-x2 slave
  • student14-x1 slave
  • student14-x2 slave
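For reference, an assumption about the setup (not taken from the submitted files): with a standard Hadoop 2.6.0 layout, the slave hostnames above would also be listed in the slaves file under /opt/hadoop-2.6.0/etc/hadoop so the start scripts know where to launch the worker daemons:

    # /opt/hadoop-2.6.0/etc/hadoop/slaves (assumed contents)
    student85-x1
    student85-x2
    student13-x2
    student14-x1
    student14-x2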

Most Important Settings

yarn-site.xml

...
yarn.nodemanager.resource.memory-mb        5020
yarn.scheduler.minimum-allocation-mb       1024
yarn.scheduler.maximum-allocation-mb       5020
yarn.resourcemanager.nodes.exclude-path    /opt/hadoop-2.6.0/etc/hadoop/yarn.exclude   true
...
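These entries live in yarn-site.xml in the standard Hadoop <property> format; a minimal sketch of how two of them look there (the trailing "true" in the listing presumably corresponds to a <final>true</final> flag on the exclude-path property, which is an assumption):

    <!-- Illustrative excerpt of yarn-site.xml; values taken from the listing above -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>5020</value>
    </property>
    <property>
      <name>yarn.resourcemanager.nodes.exclude-path</name>
      <value>/opt/hadoop-2.6.0/etc/hadoop/yarn.exclude</value>
      <final>true</final>  <!-- assumed reading of the trailing "true" -->
    </property>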

yarn.exclude

student1-x2
student1-x1

hdfs-site.xml

...
dfs.replication                                             3
dfs.blocksize                                               64m
dfs.client.block.write.replace-datanode-on-failure.policy   ALWAYS
dfs.hosts.exclude                                           /opt/hadoop-2.6.0/etc/hadoop/dfs.exclude   true
...

dfs.exclude

student1-x2
student1-x1
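After the exclude lists are edited, the running daemons still have to re-read them. A minimal sketch using the standard Hadoop/YARN admin commands (whether the project used exactly these commands or simply a full restart via startHadoop is an assumption):

    # run as hduser on the master (student13-x1)
    hdfs dfsadmin -refreshNodes   # HDFS re-reads dfs.exclude and decommissions the listed datanodes
    yarn rmadmin -refreshNodes    # the ResourceManager re-reads yarn.exclude and stops scheduling on those nodes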

SPARK

For the installation, we followed the Default Course tutorial.

Project-specific configuration

spark-defaults.conf - the Spark settings that apply when nothing else is specified on the command line or in the program (see the override example after the listing):

spark.executor.instances 4
spark.yarn.archive hdfs://student13-x1:9000/addedlibs/spark-archive.zip
spark.executor.memory 4g
spark.driver.memory 4g
spark.yarn.executor.memoryOverhead 700
spark.yarn.driver.memoryOverhead 1000
spark.yarn.am.memoryOverhead 700
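Any of these defaults can be overridden for a single run; a minimal sketch (the values here are just illustrations, not project settings):

    pyspark --master yarn \
        --num-executors 2 \
        --executor-memory 2g \
        --conf spark.yarn.executor.memoryOverhead=512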

XEN

All cluster node settings

vcpus  = '1'
memory = '6096'
maxmem = '8098'

Student1 vm Settings

vcpus  = '1'
memory = '1024'

PySpark on Student13-x1

  • Step 1: Installed Jupyter and the IPython notebook and configured them to work together with Spark. The main tutorial1 I followed to install this module.
  • Step 2: Configured IPython to be accessible from outside localhost and added tunneling through cocserver on port 18813.
  • Step 3: Configured Spark and IPython so that a SparkContext can be created inside an IPython kernel. The basic idea came from this tutorial2, but some adjustments were needed to adapt it to this environment.
    • Added the following to the end of hduser's .profile on student13-x1 so that the SparkContext can be used by hduser:

      export PYSPARK_DRIVER_PYTHON=ipython
      export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=*'

    • Added the following to the end of .bashrc on student13-x1 so that the pyspark command is available everywhere (see the note after the snippet):

      export SPARK_HOME="/opt/spark-2.1.0-bin-hadoop2.6"
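      For the pyspark command itself to resolve from any directory, $SPARK_HOME/bin also has to be on the PATH. Presumably a line like the following accompanies the export above (an assumption; it is not shown in the original snippet):

      export PATH="$PATH:$SPARK_HOME/bin"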

Usage

Starting Pyspark

To start IPython together with Spark, execute the following as hduser@student13-x1:

pyspark --master yarn [other options]

Run it from your home directory or wherever you find most convenient. This starts an IPython notebook at http://202.45.128.135:18813/?token=dsmaofnfds (the token is different every time). At this point the SparkContext has not yet started.

To initialize the SparkContext you have to start an IPython kernel (i.e. open a notebook). Once the kernel is up, SparkContext initialization begins, but it takes some time; you can follow the progress in the YARN applications interface (see the example cell below).
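Once the kernel is ready, pyspark has already bound the SparkContext to the name sc in the notebook (this is what the PYSPARK_DRIVER_PYTHON settings above provide). A minimal first cell to verify the context and trigger a small job (illustrative only, not part of the project notebook):

    # sc is pre-created by pyspark when the notebook kernel starts
    print(sc.master)           # should print: yarn
    print(sc.applicationId)    # the id shown in the YARN applications interface

    # tiny job to confirm that the executors have come up
    print(sc.parallelize(range(100)).sum())   # 4950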

Executing Our Project Source

  1. Start pyspark as described under the previous heading.
  2. Execute the code snippets commented as Task 1, 2, 4, 3. In the terminal where you started pyspark you can follow the job and application output that Spark writes; you can also follow it in the YARN applications interface.

Monitoring the environment

From the ListOflinks web page you can systematically access:

Cluster References

  • Yarn and Hadoop Monitoring interfaces
  • Spark Monitoring interfaces
  • Project web-page
  • Ganglia

External project parts

Using management scripts

In the home directory of hduser@student13-x1, execute:

  • startHadoop - restarts the Hadoop environment, the YARN environment and the history server
  • shareSettings - shares the YARN settings across the cluster (an illustrative sketch follows below)
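As an illustration only (not the submitted script), a shareSettings-style script typically copies the YARN/Hadoop configuration files from the master to every slave, for example:

    # Illustrative sketch, not the submitted shareSettings script.
    # Hostnames taken from the cluster list above; paths from the settings sections.
    for host in student13-x2 student14-x1 student14-x2 student85-x1 student85-x2; do
        scp /opt/hadoop-2.6.0/etc/hadoop/yarn-site.xml \
            /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml \
            /opt/hadoop-2.6.0/etc/hadoop/yarn.exclude \
            /opt/hadoop-2.6.0/etc/hadoop/dfs.exclude \
            hduser@$host:/opt/hadoop-2.6.0/etc/hadoop/
    done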

Files submitted with this report

YARN HADOOP SETTINGS

  • yarn-site.xml
  • hdfs-site.xml
  • yarn.exclude
  • dfs.exclude

SPARK SETTINGS

  • spark-defaults.conf
  • spark-env.sh

SOURCE CODE FOR DATA ANALYSIS

  • max_count.ipynb

WEB PAGE SOURCE CODE

/ngram folder

XEN

  • student13-x1.cfg - shares settings with student13-x2.cfg, student14-x1.cfg, student14-x2.cfg, student85-x1.cfg, student85-x2.cfg
  • student1-x1.cfg - shares settings with student1-x2.cfg

MANAGEMENT INTERFACE

  • ListOflinks.html
  • YarnSparkTutorial.h

SCRIPTS

  • startHadoop
  • shareSettings