SparkStreaming_PVStatistics

A simulation of real-time log files collection system, and make a statistic on the logs, to show the information of PV reduced by several situation.

#Environment of the project

This project is a Spark Streaming test program, so the main environment of the Spark-Core is the version of spark-1.6.2-bin-hadoop2.6.
The scala version the Spark-Core depends on is the 2.10.4.
The HDFS is provided by the 2.7.1 version of the Hadoop.
The python environment is the version of the 2.7.13
The *.scala files and related jar files were built in the IDEA IDE, you can also use the Eclipse IDE, to rebuild this jar file.

#How to use the project

Copy the sample_web_log.py and the generate_logs.sh files into the same directory on the Linux File System.
Then copy the jar file into Linux File System, here the jar file name is FirstSpark.jar.
Before starting the generate_logs.sh, I recommend you to make two directories on your HDFS first, here according to the bash file, this directory is used for store the log files, and the two directories are /user/hadoop/spark/web_logs and /user/hadoop/spark/web_logs/tmp
Now you can run the generate_logs.sh file.
Then, use spark-submit command to analyze the PV data.

zhkmxx9302013/SparkStreaming_PVStatistics