# Analysing Collected Server Log Data

This example demonstrates the following steps, in order:
- Generate log data in Hortonworks Sandbox
- Load the log data into Hadoop with Flume each time the log changes
- Query the data with a Python Hive client and load it into Root to visualize the analysis
Inspired by Hortonworks' official example: How to Refine and Visualize Server Log Data
I'm using WinSCP to access the server, which makes things easier to follow. You can find your server's IP address with the ifconfig command.
To install Flume, type the command below (in my case it was already installed the first time I started the virtual machine).
yum install -y flume
Start Flume after putting the configuration file into the /etc/flume/conf/ directory. Using PuTTY is optional.
flume-ng agent -c /etc/flume/conf -f /etc/flume/conf/flume.conf -n sandbox
According to our flume.conf configuration file, sandbox is our agent name and our source is /var/log/eventlog-demo.log, which means Flume listens for changes to this log. When the source receives an event, it stores it in our channel, which is also defined in the configuration file. The channel keeps the event until it is consumed by the Flume sink. The sink removes the event from the channel and, in this example, writes it into HDFS.
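The flume.conf file itself isn't listed in this post, but a minimal sketch of what such a configuration could look like is below. The agent name (sandbox), the source log (/var/log/eventlog-demo.log) and the HDFS target (/flume/events, where the Hive table created later points) come from this walkthrough; the channel name, the exec-source command and the capacity are my own assumptions.

```
# Hypothetical flume.conf sketch - not the original file from the post
sandbox.sources = eventlog
sandbox.channels = memoryChannel
sandbox.sinks = hdfsSink

# Source: tail the demo log so every new line becomes a Flume event
sandbox.sources.eventlog.type = exec
sandbox.sources.eventlog.command = tail -F /var/log/eventlog-demo.log
sandbox.sources.eventlog.channels = memoryChannel

# Channel: hold events in memory until the sink consumes them
sandbox.channels.memoryChannel.type = memory
sandbox.channels.memoryChannel.capacity = 1000

# Sink: take events off the channel and write them into HDFS
sandbox.sinks.hdfsSink.type = hdfs
sandbox.sinks.hdfsSink.channel = memoryChannel
sandbox.sinks.hdfsSink.hdfs.path = /flume/events
sandbox.sinks.hdfsSink.hdfs.fileType = DataStream
```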
After putting the generate_logs.py file on the server, generate a new log line with the python generate_logs.py command. I made this file write a single line so that beginners can see exactly what happens; a real-world source would produce much larger logs.
Our generate_logs.py file is now on the server. All we have to do is run the python command.
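The script itself isn't reproduced here, but a minimal sketch of the idea could look like the following, assuming the pipe-delimited time|ip|country layout expected by the Hive table created in the next step; the sample IPs and countries are made up for illustration.

```python
# A minimal sketch of what generate_logs.py might do: append one
# pipe-delimited "time|ip|country" line to the log that Flume tails.
# The field values below are made up for illustration.
import random
import time

LOG_FILE = "/var/log/eventlog-demo.log"   # the source Flume listens to

SAMPLE_ENTRIES = [
    ("192.168.1.10", "Turkey"),
    ("10.0.0.42", "Germany"),
    ("172.16.5.3", "United States"),
]

def append_log_line():
    ip, country = random.choice(SAMPLE_ENTRIES)
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(LOG_FILE, "a") as log:
        log.write("%s|%s|%s\n" % (timestamp, ip, country))

if __name__ == "__main__":
    append_log_line()
```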
The next step is creating the table to store the logs.
hcat -e "CREATE TABLE COUNTRY_LOGS(time STRING, ip STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/flume/events';"
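Once Flume has delivered a few events into /flume/events, a quick query is an easy way to confirm that the rows parse against this schema (this assumes the Hive CLI is available on the sandbox):

```
hive -e "SELECT * FROM COUNTRY_LOGS LIMIT 5;"
```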
As you can see from the image above, I got an error at first. It succeeded after running the usermod -aG hdfs hue and/or usermod -aG hdfs root commands on the server and restarting the services.
Eventually, the data is residing in HDFS peacefully...
The Python client pyhs2 executes the query via the HiveServer2 Thrift API and then fetches the query result. Finally, it stores the results in data files. Those data files are candidates for input to a Root histogram.
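The actual generate_data.py isn't shown in this post; below is a minimal sketch of a pyhs2 client, assuming HiveServer2 is reachable on localhost:10000 with PLAIN authentication and the sandbox's root/hadoop credentials, and assuming a per-country count is the quantity we want to histogram. The query and the output file name are illustrative, not the original ones.

```python
# A minimal pyhs2 sketch: run a Hive query over the HiveServer2 Thrift
# API and dump the rows into a plain-text file under ./data.
# Host, port, credentials, query and output path are assumptions.
import os
import pyhs2

QUERY = "SELECT country, COUNT(*) FROM COUNTRY_LOGS GROUP BY country"

with pyhs2.connect(host="localhost",
                   port=10000,
                   authMechanism="PLAIN",
                   user="root",
                   password="hadoop",
                   database="default") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        if not os.path.isdir("data"):
            os.makedirs("data")
        # Write one "country count" pair per line for later plotting.
        with open("data/country_counts.dat", "w") as out:
            for country, count in cur.fetch():
                out.write("%s %d\n" % (country, count))
```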
To install pyhs2 and its dependencies, run the python setup.py install command in the ./pyhs2 directory.
In the root-analysis directory, run the python generate_data.py command and it's done; the data files are in the ./data directory.
CERN's Root is a great framework for analysis, especially scientific analyses. The root htraffic.C command will make our histogram emerge.