
A big data platform for analyzing web access logs


HBase Actual Data Analysis System


1. Database design


  • LogData: this table stores the data after cleaning and transformation

  • Database type: HBase

  • Table Structure

    Rowkey prop
  • RowKey structure design description

The RowKey is the concatenation of the date, the last three digits of the company code, and a six-digit ID. Each part is described below:

| Field | Explanation | Example |
| --- | --- | --- |
| Date | The date the log file was generated (digits only, without spaces or hyphens) | 20170808 |
| Company code | The last three digits of the company code | 200 |
| ID | Six digits starting from 100000; uniquely identifies each record and keeps all row keys the same length | 100001 |

Complete example: 20170808200100000 is a request logged for company 200 on 2017-08-08.
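The RowKey layout described above can be sketched in Java roughly as follows. This is only an illustration: the class name `LogDataRowKey`, the method `build`, and the full company code `120200` used in the usage note are hypothetical and do not come from the project.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for LogData row keys; not the project's own class. */
final class LogDataRowKey {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20170808.
    private static final DateTimeFormatter DATE = DateTimeFormatter.BASIC_ISO_DATE;

    private LogDataRowKey() {}

    /** Concatenates date + last three digits of the company code + six-digit ID. */
    static String build(LocalDate logDate, String companyCode, int id) {
        if (companyCode == null || companyCode.length() < 3) {
            throw new IllegalArgumentException("company code needs at least three digits");
        }
        if (id < 100000 || id > 999999) {
            throw new IllegalArgumentException("ID must be six digits, starting from 100000");
        }
        String suffix = companyCode.substring(companyCode.length() - 3);
        return logDate.format(DATE) + suffix + id;
    }
}
```

For example, `LogDataRowKey.build(LocalDate.of(2017, 8, 8), "120200", 100001)` yields a 17-character key made of `20170808`, `200`, and `100001`. Rejecting IDs outside the six-digit range keeps every key the same length, which matters because HBase sorts row keys as byte strings.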

  • Create table statement

create "LogData", "prop"


  • LogAna: this table stores the analyzed data
  • Database type: HBase
  • Table Structure
rowkey, IPSumVal, IPTotalNum, IPList, URLList, MaxURL, BytesSecList, BytesHourList, TotalBytes, MethodList, StateList, ReqHourList, ReqSecList, ReqSum
  • Field description

| Field | Explanation | Example |
| --- | --- | --- |
| IPTotalNum | Number of distinct IPs (duplicates excluded) | 100 means that 100 different IPs visited the website that day |
| IPSumVal | Total number of IP visits (duplicates included) | 100 means that 100 visits were recorded; the same IP may be counted more than once |
| IPList | Ranking of IPs by number of visits; stored as JSON converted from a mutable.Map[String, Int] | `{"": 1000}`: the key is the IP and the value is the total number of requests it made to the website |
| URLList | The 10 most requested URLs, stored as JSON | `{"test.aj":100, "test2.aj":90, ...}` |
| MaxURL | The URL with the most requests (the front end no longer uses this field) | `{"test.aj": 100}` |
| BytesSecList | Traffic generated per second, in bytes; converted to MB for display on the front end | `{"2017-08-08 01:00:00":9000, "2017-08-08 01:00:00":500, ...}` |
| BytesHourList | Traffic generated per hour of the day, in bytes; converted to MB for display on the front end | `{"08": 9000, "09": 500, ...}`, where 08 means the traffic generated between 08:00 and 09:00 |
| TotalBytes | Total traffic generated in one day, in bytes; converted to MB for display on the front end | 3000 means that 3000 bytes of traffic were generated that day |
| MethodList | Number of requests per HTTP request method | `{"POST":3446,"OPTIONS":5,"HEAD":4}` |
| StateList | Number of requests per response status code | `{"501":8,"302":801,"404":1,"200":14738,"400":2,"405":4}` |
| ReqHourList | Number of requests per hour | `{"15":2350,"09":3503,"00":690,"11":1903}` |
| ReqSecList | Number of requests per second | `{"2017-08-08 10:44:08":1,"2017-08-08 09:45:05":4,"2017-08-08 10:06:58":4}` |
| ReqSum | Total number of requests in one day | 1000 means that there were 1000 requests that day |
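To make the difference between IPSumVal, IPTotalNum and IPList concrete, here is a minimal Java sketch that derives all three from a list of client IPs. The project itself does this analysis in Spark; the class `IpStats` and its methods are hypothetical names used only for illustration, and the addresses in the usage note are documentation-range placeholders.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the three IP indicators; not the project's code. */
final class IpStats {
    private IpStats() {}

    /** IPSumVal: every visit counts, so repeated IPs are included. */
    static int ipSumVal(List<String> ips) {
        return ips.size();
    }

    /** IPList: visits per IP, ordered from most to fewest visits. */
    static Map<String, Integer> ipList(List<String> ips) {
        Map<String, Integer> counts = new HashMap<>();
        for (String ip : ips) {
            counts.merge(ip, 1, Integer::sum);
        }
        // LinkedHashMap keeps the ranking order produced by the sort.
        Map<String, Integer> ranked = new LinkedHashMap<>();
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEachOrdered(e -> ranked.put(e.getKey(), e.getValue()));
        return ranked;
    }

    /** IPTotalNum: number of distinct IPs. */
    static int ipTotalNum(List<String> ips) {
        return ipList(ips).size();
    }
}
```

With the input `192.0.2.1, 192.0.2.2, 192.0.2.1`, IPSumVal is 3, IPTotalNum is 2, and IPList ranks `192.0.2.1` first with 2 visits. Serializing the map to JSON before writing it to HBase is left out of the sketch.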
  • RowKey structure design description

The RowKey is the concatenation of the date and the last three digits of the company code. Each part is described below:

| Field | Explanation | Example |
| --- | --- | --- |
| Date | The date the log file was generated (digits only, without spaces or hyphens) | 20170808 |
| Company code | The last three digits of the company code; 000 means the data of all websites for that day | 200 |

Examples: 20170808200 means all data for Tianjin High Court on 2017-08-08; 20170808000 means all data for all courts on 2017-08-08.
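A reader of the LogAna table has to split the key back into its two parts and treat 000 specially. The following Java sketch shows one way to do that; `LogAnaRowKey`, `parse` and `coversAllSites` are hypothetical names, not taken from the project.

```java
/** Hypothetical value class for a LogAna row key; not the project's own class. */
final class LogAnaRowKey {
    /** Company code that stands for "all websites of the day". */
    static final String ALL_SITES = "000";

    final String date;        // yyyyMMdd, e.g. 20170808
    final String companyCode; // last three digits of the company code

    private LogAnaRowKey(String date, String companyCode) {
        this.date = date;
        this.companyCode = companyCode;
    }

    /** Splits an 11-digit key into its 8-digit date and 3-digit company code. */
    static LogAnaRowKey parse(String rowKey) {
        if (rowKey == null || !rowKey.matches("\\d{11}")) {
            throw new IllegalArgumentException("expected 8 date digits + 3 code digits: " + rowKey);
        }
        return new LogAnaRowKey(rowKey.substring(0, 8), rowKey.substring(8));
    }

    /** True when the row aggregates every website for that day. */
    boolean coversAllSites() {
        return ALL_SITES.equals(companyCode);
    }
}
```

`LogAnaRowKey.parse("20170808200")` gives date `20170808` and company code `200`, while `LogAnaRowKey.parse("20170808000").coversAllSites()` is true.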

  • Create table statement

create "LogAna", "IP", "URL", "BYTES", "METHOD_STATE", "REQ"

2. Project code description

  • This project is divided into three sub-projects: data collection, data storage and display, and offline data analysis

Data collection

  • Project name: CollectTomcatLogs
  • Function Description:

    • Collect Tomcat logs under the specified path
    • Rename each file, then upload it to HDFS or an FTP server
    • Write a log entry recording whether the upload succeeded

  • Deployment instructions: Deploy on each server whose logs need to be collected; specify the company code and log path in my.properties
  • Configuration management: Maven
  • Main technologies: Java FTPClient, HDFS
  • Test case description: the tests mainly check that files are renamed correctly
  • File renaming: Prefix the court code to the localhost_XXXXX.txt file name to distinguish each company's data
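The renaming step could look roughly like the Java sketch below. The README does not say how the court code and the original name are joined, so the underscore separator is an assumption, as are the names `LogFileRenamer` and `uploadName` and the sample file name in the usage note.

```java
import java.nio.file.Path;

/** Hypothetical sketch of the renaming rule; not the project's own class. */
final class LogFileRenamer {
    // Assumption: the code and the original name are joined with "_".
    private static final String SEPARATOR = "_";

    private LogFileRenamer() {}

    /** Returns the name to upload under: court code + separator + original file name. */
    static String uploadName(String courtCode, Path logFile) {
        String name = logFile.getFileName().toString();
        if (!name.startsWith("localhost_") || !name.endsWith(".txt")) {
            throw new IllegalArgumentException("not a localhost_XXXXX.txt log file: " + name);
        }
        return courtCode + SEPARATOR + name;
    }
}
```

Calling `LogFileRenamer.uploadName("200", Paths.get("logs", "localhost_access_log.2017-08-08.txt"))` returns `200_localhost_access_log.2017-08-08.txt`. Only the name is computed here; the upload to HDFS or FTP is outside the sketch.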

Data storage and display

  • Project name: RestoreData
  • Function Description:

    • Data preprocessing: parsing, cleaning, and transforming the data
    • Data storage: collect the converted data in a List and insert it into HBase in batches
    • Front-end display: display the analyzed data
    • Data query: query the corresponding data according to the input conditions
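As an illustration of the preprocessing step, the Java sketch below parses one access-log line and drops lines that do not match. It assumes the logs use Tomcat's "common" access-log pattern (`%h %l %u %t "%r" %s %b`); the README does not state the actual pattern, and `AccessLogEntry` is a hypothetical class rather than one of the project's entities.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical entity for one parsed access-log line; not the project's own class. */
final class AccessLogEntry {
    // Assumed layout: %h %l %u %t "%r" %s %b (Tomcat's "common" pattern).
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) \\S+\" (\\d{3}) (\\d+|-)$");
    private static final DateTimeFormatter TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    final String ip;
    final OffsetDateTime time;
    final String method;
    final String url;
    final int status;
    final long bytes;

    private AccessLogEntry(String ip, OffsetDateTime time, String method,
                           String url, int status, long bytes) {
        this.ip = ip;
        this.time = time;
        this.method = method;
        this.url = url;
        this.status = status;
        this.bytes = bytes;
    }

    /** Parses and cleans in one step: malformed lines yield Optional.empty(). */
    static Optional<AccessLogEntry> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        try {
            OffsetDateTime time = OffsetDateTime.parse(m.group(2), TIME);
            // %b prints "-" when no body bytes were sent; treat that as 0.
            long bytes = "-".equals(m.group(6)) ? 0L : Long.parseLong(m.group(6));
            return Optional.of(new AccessLogEntry(m.group(1), time, m.group(3),
                    m.group(4), Integer.parseInt(m.group(5)), bytes));
        } catch (DateTimeParseException | NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

The parsed fields line up with the indicators in the LogAna table: `ip` feeds the IP counts, `url` the URL ranking, `bytes` the traffic totals, `method` and `status` the MethodList and StateList, and `time` the per-hour and per-second request counts.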

  • Development environment:

    • JDK 8
    • Hadoop 2.7
    • HBase 1.2
    • Tomcat 8

  • Deployment instructions: Configure the settings in my.properties; make sure the JDK and Hadoop versions are compatible
  • Configuration management: Maven
  • Main technology: Spring MVC / Hadoop / JSP
  • Test case description:

    • HbaseBatchInsertTest.java: tests batch insertion
    • HbaseConnectionTest.java: tests whether the HBase connection works
    • ParseLogTest.java: tests log parsing
    • ListBean.java: prints all beans; used to troubleshoot @Autowired failures

  • Front end part:

Code section

    • index.jsp: the default page; once loaded, it requests and shows all website data for the previous day
    • index.js: handles the requests and data analysis for index.jsp
    • queryData.jsp: queries the data of selected websites; the input is date + website, and multiple selection is supported
    • queryData.js: handles the requests and data analysis for queryData.jsp (to be completed)
    • dataGrid.jsp: displays data as a table (to be completed)
    • myCharts.js: draws charts with echarts (note that the DOM element is initialized externally)
    • inputCheck.js: checks whether the input is valid
    • mystyle.css: custom styles

Third-party libraries

    • Bootstrap: mainly for its grid system
    • Bootstrap-select: multiple-selection boxes
    • BootstrapDatepickr: date input
    • echarts: draws the charts
    • jQuery: framework
    • font-awesome: small icons

Offline data analysis

  • Project name: ScalaReadAndWrite
  • Function Description:

Offline analysis of the data, covering 13 indicators in total; see the design of the LogAna table for details

  • Development environment:

    • Scala 2.11
    • Spark 1.52
    • Hadoop 2.7

  • Special Note:

Spark offers only two kinds of shared (global) variables: broadcast variables and accumulators. This project uses accumulators. When writing a custom accumulator:

    • Make sure the input and output types are correct
    • Implement all six functions that must be overridden
    • Remember that an accumulator can carry only one type of value, although that value can be a complex object

If any of these rules is broken, the accumulator will not work!
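The README does not name the six functions. The Java sketch below assumes they correspond to the six abstract methods of Spark's `AccumulatorV2` class (`isZero`, `copy`, `reset`, `add`, `merge`, `value`), which belongs to the Spark 2.x API. To stay self-contained it is a plain-Java stand-in that mirrors that contract without extending the Spark class, and `CountMapAccumulator` is a hypothetical name.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Plain-Java stand-in mirroring the six-method accumulator contract.
 * Input type: String (one key per call). Output type: Map<String, Integer>.
 * Hypothetical class; it does not extend any Spark type.
 */
final class CountMapAccumulator {
    private final Map<String, Integer> counts = new HashMap<>();

    /** 1. isZero: true while nothing has been accumulated. */
    boolean isZero() {
        return counts.isEmpty();
    }

    /** 2. copy: an independent accumulator holding the same counts. */
    CountMapAccumulator copy() {
        CountMapAccumulator other = new CountMapAccumulator();
        other.counts.putAll(counts);
        return other;
    }

    /** 3. reset: back to the zero state. */
    void reset() {
        counts.clear();
    }

    /** 4. add: consumes one input value. */
    void add(String key) {
        counts.merge(key, 1, Integer::sum);
    }

    /** 5. merge: folds another accumulator of the same type into this one. */
    void merge(CountMapAccumulator other) {
        other.counts.forEach((k, v) -> counts.merge(k, v, Integer::sum));
    }

    /** 6. value: the accumulated result, exposed read-only. */
    Map<String, Integer> value() {
        return Collections.unmodifiableMap(counts);
    }
}
```

The single "complex object" carried here is a map of counts, the same shape as MethodList or StateList. The input type (`String`) and the output type (`Map<String, Integer>`) differ, which is the kind of type detail the note above warns about. In the sketch, `merge` plays the role of combining partial results from different tasks.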

  • Deployment instructions: None
  • Configuration management: Maven
  • Main technology: Spark
  • Description of project structure:

    • Accumulator: the custom accumulators
    • analysis: main analysis code
    • DAO: parses the entity classes and stores them in HBase
    • Entity: two entity classes
    • util: utility classes

3. Project screenshot:

  • HBase database screenshot image

  • Data display interface image

  • Data display interface image