Learn Scala by reading Twitter's Scala School (https://twitter.github.io/scala_school/zh_cn/index.html)
- hadoop-2.6.0-cdh5.7.0
- scala-2.11.8
- spark-2.1.0-bin-2.6.0-cdh5.7.0 (built from source)
- apache-maven-3.3.9
- mysql
- jquery-3.2.1
- ECharts 3
- zeppelin-0.7.1-bin-all
- IntelliJ IDEA
- Mac OS 10.12.6 & CentOS 6.4 on VMware Fusion
Analysis and visualization of imooc.com access logs using Spark SQL, with Zeppelin and ECharts as the visualization tools.
The issues are:
- Top N courses and articles (TODO: visualization)
  - analyze the log for the Top N courses and articles, then load the results into a MySQL database
- Top N courses partitioned by city, with the city resolved from the IP address
  - analyze the log for the Top N courses and articles per city, then load the results into a MySQL database (deadlock in batching)
- Top N courses by traffic
  - analyze the log for the Top N courses and articles by traffic, then load the results into a MySQL database
- data in MySQL:
  - Top N courses and articles (TODO: visualization)
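The per-city Top N described above can be sketched with a Spark SQL window function. This is only an illustration under stated assumptions, not the project's actual job: the object name `TopNByCitySketch` and the column names `day`, `city`, `cmsType`, and `cmsId` are hypothetical, and the toy rows stand in for the cleaned log.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, row_number}

// Hypothetical sketch: object, method, and column names are assumptions.
object TopNByCitySketch {

  // Count visits per (day, city, cmsId) for one day of video records,
  // then keep the n most visited courses within each city.
  def topNByCity(accessDF: DataFrame, day: String, n: Int): DataFrame = {
    val counted = accessDF
      .filter(col("day") === day && col("cmsType") === "video")
      .groupBy("day", "city", "cmsId")
      .agg(count("cmsId").as("times"))

    // Rank the courses inside each city by descending visit count.
    val byCity = Window.partitionBy("city").orderBy(col("times").desc)

    counted
      .withColumn("times_rank", row_number().over(byCity))
      .filter(col("times_rank") <= n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TopNByCitySketch")
      .master("local[2]") // local mode, for trying the sketch only
      .getOrCreate()
    import spark.implicits._

    // Toy rows standing in for the cleaned access log.
    val accessDF = Seq(
      ("20170511", "Beijing", "video", 1L),
      ("20170511", "Beijing", "video", 1L),
      ("20170511", "Beijing", "video", 2L),
      ("20170511", "Shanghai", "video", 3L),
      ("20170511", "Shanghai", "article", 9L),
      ("20170510", "Beijing", "video", 2L)
    ).toDF("day", "city", "cmsType", "cmsId")

    topNByCity(accessDF, "20170511", 1).show(false)
    spark.stop()
  }
}
```

`row_number` assigns exactly one row per rank, so ties in `times` are broken arbitrarily; `rank` or `dense_rank` would keep tied courses together instead.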
- deploy the data-cleaning project on YARN
./bin/spark-submit \
  --class com.imooc.log.SparkStatCleanJobYARN \
  --name SparkStatCleanJobYARN \
  --master yarn \
  --executor-memory 1G \
  --num-executors 1 \
  --files /home/hadoop/lib/ipDatabase.csv,/home/hadoop/lib/ipRegion.xlsx \
  /home/hadoop/lib/sql-1.0-jar-with-dependencies.jar \
  hdfs://hadoop001:8020/imooc/input/* hdfs://hadoop001:8020/imooc/clean
- deploy the statistics project on YARN
./bin/spark-submit \
  --class com.imooc.log.TopNStatJobYARN \
  --name TopNStatJobYARN \
  --master yarn \
  --executor-memory 1G \
  --num-executors 1 \
  /home/hadoop/lib/sql-1.0-jar-with-dependencies.jar \
  hdfs://hadoop001:8020/imooc/clean 20170511
- reuse data
- cache
- shuffle partitions
./bin/spark-submit \
  --class com.imooc.log.TopNStatJobYARN \
  --name TopNStatJobYARN \
  --master yarn \
  --executor-memory 1G \
  --num-executors 1 \
  --conf spark.sql.shuffle.partitions=100 \
  /home/hadoop/lib/sql-1.0-jar-with-dependencies.jar \
  hdfs://hadoop001:8020/imooc/clean 20170511
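
The two tuning notes above (reusing data through the cache, and the number of shuffle partitions) can also be expressed in code. The following is a minimal sketch, assuming a cleaned log with hypothetical columns `day`, `cmsType`, `cmsId`, and `traffic`; the object and method names are illustrative and are not taken from the project. Spark SQL defaults `spark.sql.shuffle.partitions` to 200, so the value 100 simply mirrors the `--conf` flag in the command above.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, count, sum}

// Hypothetical sketch: object, method, and column names are assumptions.
object ReuseAndPartitionsSketch {

  // Filter one day of video records once, cache the result, and reuse it
  // for two aggregations instead of re-reading the input for each one.
  def statsForDay(accessDF: DataFrame, day: String): (Array[Row], Array[Row]) = {
    val dayDF = accessDF
      .filter(col("day") === day && col("cmsType") === "video")
      .cache()

    val byTimes = dayDF.groupBy("cmsId")
      .agg(count("cmsId").as("times"))
      .collect()
    val byTraffic = dayDF.groupBy("cmsId")
      .agg(sum("traffic").as("traffics"))
      .collect()

    dayDF.unpersist() // release the cached data once both results exist
    (byTimes, byTraffic)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReuseAndPartitionsSketch")
      .master("local[2]") // local mode, for trying the sketch only
      // Same setting as --conf spark.sql.shuffle.partitions=100
      .config("spark.sql.shuffle.partitions", "100")
      .getOrCreate()
    import spark.implicits._

    // Toy rows standing in for the cleaned access log.
    val accessDF = Seq(
      ("20170511", "video", 1L, 100L),
      ("20170511", "video", 1L, 50L),
      ("20170511", "video", 2L, 30L),
      ("20170511", "article", 9L, 10L)
    ).toDF("day", "cmsType", "cmsId", "traffic")

    val (byTimes, byTraffic) = statsForDay(accessDF, "20170511")
    byTimes.foreach(println)
    byTraffic.foreach(println)
    spark.stop()
  }
}
```

Setting the value on the command line, as in the command above, keeps it out of the code so it can be tuned per run without rebuilding the jar.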