---
layout: post
title: InstallationReadme
date: 2017-05-07 16:12
---
For the installation, I followed the Default Course tutorial.
Project specific configuration
Installed on all 8 VMs. The cluster is currently configured to run YARN and Hadoop only on the computers below, to avoid the small-memory bottleneck on student1.
- student13-x1 Hadoop master
- student85-x1 slave
- student85-x2 slave
- student13-x2 slave
- student14-x1 slave
- student14-x2 slave
Most Important Settings
yarn-site.xml
yarn.nodemanager.resource.memory-mb 5020
yarn.scheduler.minimum-allocation-mb 1024
yarn.scheduler.maximum-allocation-mb 5020
yarn.resourcemanager.nodes.exclude-path /opt/hadoop-2.6.0/etc/hadoop/yarn.exclude true
...
yarn.exclude
student1-x2
student1-x1
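In yarn-site.xml these settings are expressed as standard Hadoop `<property>` entries. As a sketch, two of the entries listed above would look roughly like this (the enclosing `<configuration>` element is shown for context):

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>5020</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nodes.exclude-path</name>
    <value>/opt/hadoop-2.6.0/etc/hadoop/yarn.exclude</value>
  </property>
</configuration>
```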
hdfs-site.xml
dfs.replication 3
dfs.blocksize 64m
dfs.client.block.write.replace-datanode-on-failure.policy ALWAYS
dfs.hosts.exclude /opt/hadoop-2.6.0/etc/hadoop/dfs.exclude true
...
dfs.exclude
student1-x2
student1-x1
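With dfs.blocksize set to 64m and dfs.replication set to 3, the raw storage cost of a file is easy to estimate: every 64 MB block is stored three times across the datanodes. A small sketch (the 200 MB file size is only an illustrative assumption):

```python
import math

BLOCK_SIZE_MB = 64   # dfs.blocksize 64m
REPLICATION = 3      # dfs.replication 3

def storage_cost_mb(file_mb):
    """Return (block count, raw HDFS footprint in MB) for a file."""
    blocks = math.ceil(file_mb / BLOCK_SIZE_MB)
    return blocks, file_mb * REPLICATION

blocks, raw_mb = storage_cost_mb(200)
print(blocks)   # 4 blocks: ceil(200 / 64)
print(raw_mb)   # 600 MB of raw disk across the cluster
```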
For the installation, I followed the Default Course tutorial.
Project specific configuration
spark-defaults.conf - Spark settings used when nothing else is specified on the command line or in the program:
spark.executor.instances 4
spark.yarn.archive hdfs://student13-x1:9000/addedlibs/spark-archive.zip
spark.executor.memory 4g
spark.driver.memory 4g
spark.yarn.executor.memoryOverhead 700
spark.yarn.driver.memoryOverhead 1000
spark.yarn.am.memoryOverhead 700
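These numbers are chosen to fit the YARN limits above: the size of an executor container is spark.executor.memory plus spark.yarn.executor.memoryOverhead, and it must stay within yarn.scheduler.maximum-allocation-mb (5020). A quick arithmetic check:

```python
EXECUTOR_MEMORY_MB = 4 * 1024   # spark.executor.memory 4g
EXECUTOR_OVERHEAD_MB = 700      # spark.yarn.executor.memoryOverhead
YARN_MAX_ALLOCATION_MB = 5020   # yarn.scheduler.maximum-allocation-mb

# Total memory YARN must grant for one executor container
executor_container = EXECUTOR_MEMORY_MB + EXECUTOR_OVERHEAD_MB
print(executor_container)                            # 4796
print(executor_container <= YARN_MAX_ALLOCATION_MB)  # True: fits under the cap
```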
All cluster node settings
vcpus = '1'
memory = '6096'
maxmem = '8098'
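For context, these values live in the per-guest Xen config files listed at the end of this page. A sketch of how they could sit in such a file (the `name` key is an illustrative assumption; only the three values above come from this setup):

```
# Xen guest config sketch, e.g. student13-x1.cfg
name   = 'student13-x1'   # assumed; not part of the settings listed above
vcpus  = '1'
memory = '6096'
maxmem = '8098'
```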
Student1 VM Settings
vcpus = '1'
memory = '1024'
- Step 1: Installed Jupyter and IPython Notebook and configured them to work together with Spark. This is the main tutorial1 I followed to install this module.
- Step 2: Configured IPython to be accessible outside localhost and added tunneling through cocserver port 18813.
- Step 3: Configured Spark and IPython so that a SparkContext can be created inside an IPython kernel. I got the basic idea from this tutorial2, but had to make some adjustments for this environment.
- Added this to the end of hduser's .profile on student13-x1 so that the SparkContext can be used by hduser:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=*'
- Added this to the end of .bashrc on student13-x1 so that the pyspark command is executable everywhere:
export SPARK_HOME="/opt/spark-2.1.0-bin-hadoop2.6"
- To start IPython together with Spark, just execute as hduser@student13-x1:
pyspark --master yarn [other options]
from your home directory or wherever you find most reasonable. This starts an IPython notebook at http://202.45.128.135:18813/?token=dsmaofnfds (the token is different every time). The SparkContext has not started yet.
To initialize the SparkContext you have to start an IPython kernel. Once the kernel is up, SparkContext initialization begins, but it takes some time. You can follow this process in the YARN applications interface.
- Start pyspark as specified under the previous header.
- Execute the code snippets commented as Task1, 2, 4, 3. In the window where you started pyspark you can follow the job and application output that Spark writes. You can also follow it in the YARN application interface.
From the ListOflinks web page you can systematically access:
Cluster References
- Yarn and Hadoop Monitoring interfaces
- Spark Monitoring interfaces
- Project web-page
- Ganglia
External project parts
- Github repository for project related info
In the home directory of hduser@student13-x1, execute:
- startHadoop - restarts the Hadoop environment, the YARN environment and the history server
- shareSettings - shares the YARN settings across the cluster.
YARN HADOOP SETTINGS
- yarn-site.xml
- hdfs-site.xml
- yarn.exclude
- dfs.exclude
SPARK SETTINGS
- spark-defaults.conf
- spark-env.sh
SOURCE CODE FOR DATA ANALYSIS
- max_count.ipynb
WEB PAGE SOURCE CODE
/ngram folder
XEN
- student13-x1.cfg - shares settings with student13-x2.cfg, student14-x1.cfg, student14-x2.cfg, student85-x1.cfg, student85-x2.cfg
- student1-x1.cfg - shares settings with student1-x2.cfg
MANAGEMENT INTERFACE
- ListOflinks.html
- YarnSparkTutorial.h
SCRIPTS
- startHadoop
- shareSettings