hadoop-mapreduce-samples
Sample Programs representing different capabilities of Map Reduce Framework
User-access-log parser
Program is designed to parse user access log files and provide details about user's accessing the system within a given time frame.
Expectations
Program expects user-log files to be of following format
<user-name/id> <accesstime in System's current milliseconds>
Sample log files are provided in logfiles directory
Logfiles need to be copied to Hadoop's HDFS file system to work upon.
Pre-requisites
Working Installation of Hadoop. The program has been tested on Hadoop 1.0.3
Execution Steps
-
Compile the source file com.mercris.hadoop.mapreduce.samples.useraccess.UserAccessCount. It will require hadoop-core-xxx.jar and hadoop-client-xxx.jar in the CLASSPATH. These jars are part of hadoop distribution. An easy way to do would be to checkout the source code and create a Java project in your preferred IDE like Eclipse and Intellij. Configuring CLASSPATH of your project will compile all the source code automatically for you.
-
Create a JAR file of the compiled .class files. A simple shell script makejar.sh is available inside utils folder. It assumes all the class files are located inside bin folder. You may need to change it appropriately.
-
Copy user-access log files to Hadoop file-system.
-
Run the program using utils/runaccesslog.sh after making necessary changes related to your local environment. Following are the relevant parameters you may need to change:
a. Root folder location in HDFS file-system where log files are copied
b. Expected folder location in HDFS file-system where results will be stored. Note that the folder should not exist before-hand
c. Start time (in milliseconds) from where, the program should take into account user-access
d. End time (in milliseconds) till where, the program should take into account user-access
-
Access output at specified output-folder/part-00000 file in your HDFS file system. Output folder is specified in utils/runaccesslog.sh.
Sample Output
user10 1
user11 1
user12 1
user13 1
user14 1
user4 2
user5 2
user6 2
user7 1
user8 1
user9 1
Understanding result
During the time limits provided to the program, above mentioned users have accessed the system number of times mentioned along with their name/id.