A simulation of a real-time log file collection system that runs statistics on the collected logs, showing PV (page view) counts aggregated under several conditions.
# Environment of the project
- This project is a Spark Streaming test program, so the main environment is the Spark Core from `spark-1.6.2-bin-hadoop2.6`.
- The Scala version that this Spark Core depends on is `2.10.4`.
- The HDFS is provided by Hadoop version `2.7.1`.
- The Python environment is version `2.7.13`.
- The `*.scala` files and the related jar file were built in the IDEA IDE; you can also use the Eclipse IDE to rebuild the jar (a possible command-line alternative is sketched after this list).
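
The jar was built in an IDE, as noted above. If you prefer rebuilding it from the command line, a minimal `build.sbt` consistent with the versions in this list might look like the following; the sbt setup itself, and the project name, are assumptions rather than part of the original project:

```scala
// build.sbt — a hypothetical sbt alternative to the IDEA build described
// above; the layout and names here are assumptions, only the version
// numbers come from this README.
name := "FirstSpark"

version := "1.0"

// Scala version that this Spark distribution depends on
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided": the spark-1.6.2-bin-hadoop2.6 runtime supplies these jars
  "org.apache.spark" %% "spark-core"      % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.2" % "provided"
)
```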
# How to use the project
- Copy the `sample_web_log.py` and `generate_logs.sh` files into the same directory on the Linux file system.
- Then copy the jar file onto the Linux file system; here the jar file is named `FirstSpark.jar`.
- Before starting `generate_logs.sh`, I recommend making two directories on your HDFS first. According to the bash file, they are used to store the log files: `/user/hadoop/spark/web_logs` and `/user/hadoop/spark/web_logs/tmp` (for example, `hdfs dfs -mkdir -p /user/hadoop/spark/web_logs/tmp` creates both at once).
- Now you can run the `generate_logs.sh` file.
- Then use the `spark-submit` command to analyze the PV data; a sketch of what such a job might look like follows this list.
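
The actual analysis code lives inside `FirstSpark.jar` and is not reproduced in this README. As a rough illustration of the kind of job `spark-submit` would run here, below is a minimal Spark Streaming sketch against the versions listed above. The object name `PvStats`, the 30-second batch interval, and the assumption that the URL is the first whitespace-separated field of each log line are illustrative guesses, not taken from the project; the `tmp` directory is presumably where `generate_logs.sh` writes files before moving them into `web_logs`, which is the arrival pattern `textFileStream` expects.

```scala
// A minimal sketch of a PV-counting Spark Streaming job, written against
// Spark 1.6.2 / Scala 2.10.4 as listed above. The object name, batch
// interval, and log-line format are assumptions for illustration;
// this is NOT the source of FirstSpark.jar.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PvStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PvStats")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // Watch the HDFS directory that generate_logs.sh moves finished
    // log files into; each 30s batch picks up the newly arrived files.
    val lines = ssc.textFileStream("hdfs:///user/hadoop/spark/web_logs")

    // Assume one request per line with the requested URL in the first
    // whitespace-separated field (depends on sample_web_log.py's format).
    val pvPerUrl = lines
      .map(line => (line.split("\\s+")(0), 1))
      .reduceByKey(_ + _)

    // Print the per-batch PV counts to the driver's stdout.
    pvPerUrl.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```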
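With a job like this compiled into the jar, the final step would look something like `spark-submit --class PvStats --master yarn FirstSpark.jar`, where the class name and master are placeholders to adjust to your actual build and cluster.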