
Analysing log data using Hadoop, Flume, Root.


# Analysing Collected Server Log Data

This example demonstrates the following steps, in order:

  • Generate log data in Hortonworks Sandbox
  • Load the log data into Hadoop with Flume each time it changes
  • Import the data into Root with a Python Hive client to visualize the analysis

Inspired by Hortonworks' official example: How to Refine and Visualize Server Log Data

Install and Run Flume

I'm using WinSCP to access the server, which makes the file layout easier to follow. You can find your server's IP address with the ifconfig command.


To install Flume, type the command below. In my case it was already installed the first time I started the virtual machine.

yum install -y flume

Start Flume after putting the configuration file into the /etc/flume/conf/ directory. Using PuTTY is optional.

flume-ng agent -c /etc/flume/conf -f /etc/flume/conf/flume.conf -n sandbox

According to our flume.conf configuration file, sandbox is the agent name and the source is /var/log/eventlog-demo.log, which means Flume listens for changes to this log. When the source receives an event, it stores it in the channel, which is also defined in the configuration file. The channel keeps the event until it is consumed by the Flume sink. The sink removes the event from the channel and, in this example, puts it into HDFS.
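
The source, channel and sink hand-off described above can be pictured with a small Python toy model. This is only an illustration of the flow, not Flume's API: the function names are made up, a queue stands in for the channel and a plain list stands in for HDFS.

```python
import queue


def source_receive(channel, event):
    """Source side: store an incoming event in the channel."""
    channel.put(event)


def sink_drain(channel, destination):
    """Sink side: remove every waiting event from the channel and deliver it."""
    delivered = 0
    while True:
        try:
            event = channel.get_nowait()
        except queue.Empty:
            break  # nothing left in the channel
        destination.append(event)
        delivered += 1
    return delivered


channel = queue.Queue()   # stand-in for the Flume channel
hdfs_stand_in = []        # stand-in for HDFS, just a list

# Two new lines show up in the watched log.
source_receive(channel, "first log line")
source_receive(channel, "second log line")

# The sink consumes them; the channel is empty afterwards.
sink_drain(channel, hdfs_stand_in)
```

The point of the channel is that events wait there safely until the sink is ready, so the source and the sink do not have to run at the same pace.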

Generating Server Log

After putting the generate_logs.py file on the server, generate a new log line with the python generate_logs.py command. I wrote this file to produce a single line per run so that beginners can follow each event easily. A real-world example would create much larger logs.


Our generate_logs.py file is now on the server. All we have to do is run the python command.
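
For readers who want a picture of what such a script does, here is a minimal sketch of appending one line to a log file. It is not the actual generate_logs.py: the function name append_log_line and the "timestamp|message" line format are assumptions made for illustration.

```python
import datetime


def append_log_line(path, message):
    """Append a single timestamped line to the log file and return it."""
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Hypothetical line format; the real script may use different fields.
    line = "{0}|{1}\n".format(timestamp, message)
    # Open in append mode so earlier lines are kept.
    with open(path, "a") as log_file:
        log_file.write(line)
    return line
```

Calling append_log_line("/var/log/eventlog-demo.log", "demo event") on the server would add one line to the file Flume is watching, which is enough to trigger one event.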


Creating HCatalog Table

The next step is to create the table that stores the logs.



As the image above shows, my first attempt failed with an error. It succeeded after I ran the usermod -aG hdfs hue and/or usermod -aG hdfs root commands on the server and restarted the services.


Eventually, the data resides peacefully in HDFS.

Fetching Data for Visualization with Python Client for Hive

The Python client pyhs2 executes a query via the HiveServer2 Thrift API and then fetches the query result. At the end, the result is written to data files. Those data files are possible candidates as input for a Root histogram.

To install the pyhs2 dependencies, run the python setup.py install command in the ./pyhs2 directory.

In the root-analysis directory, run the python generate_data.py command and it's done: the data files are in the ./data directory.
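
The fetch-and-store step can be sketched roughly as follows. This is not the actual generate_data.py: the helper names, the tab-separated file format and the connection values are assumptions. Only the pyhs2 calls themselves (connect, cursor, execute, fetch) follow the usage shown in the pyhs2 README.

```python
import os


def fetch_rows(query, host, user, password, database="default", port=10000):
    """Run a query through HiveServer2 and return all result rows."""
    import pyhs2  # imported lazily so write_data_file works without pyhs2
    with pyhs2.connect(host=host, port=port, authMechanism="PLAIN",
                       user=user, password=password,
                       database=database) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetch()


def write_data_file(rows, path):
    """Write rows to a tab-separated text file and return the row count."""
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)  # create ./data (or similar) if missing
    count = 0
    with open(path, "w") as data_file:
        for row in rows:
            data_file.write("\t".join(str(value) for value in row) + "\n")
            count += 1
    return count
```

A run would then look like write_data_file(fetch_rows(query, host, user, password), "./data/traffic.txt"), where the query string, the credentials and the output file name are placeholders you would replace with your own.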

Visualization with Root

CERN's Root is a great framework for data analysis, especially scientific analysis. Running the root htraffic.C command produces our histogram.



  1. Flume User Guide
  2. How to Refine and Visualize Server Log Data