/apache-log-parser

This is a simple Apache log parser with a flexible ability to group entries by column and/or filter them. Set up printing as you like!

Primary language: Python. License: MIT.

Apache log parser

'">-----X--S--S-------------------------S--Q--L----I--N--J-E--C-T-----

Hello friend!
This is a simple Apache log parser with a flexible ability to group 
entries by column and/or filter them. Set up printing as you like!

--------m--a--l--i--c--i--o--u--s----r--e--q--u--e--s--t--s--------<"'

I use it to track robot requests, attempts to hack the site, and for general statistics.

This is my first Python script. Please rub my nose in every horrible line of code you find. I want to get better. =)

Example outputs

One of the variants of grouping

Group by :  ['date', 'ip:20', ['code', 'method', 'uri:100']]

[  7.15 s,     107248 : (+     37680, -     69568) lines,  15001.93 l/s] 2018-01-13-site.net_access.log
[ 18.38 s,     185776 : (+    177003, -      8773) lines,  10105.29 l/s] 2018-02-15-site.net_access.log
[  3.19 s,      54966 : (+     21227, -     33739) lines,  17209.65 l/s] 2018-06-site.net_access.log
[  1.29 s,      16924 : (+     10484, -      6440) lines,  13093.75 l/s] 2018-07-site.net_access.log
[  0.09 s,       2640 : (+       178, -      2462) lines,  29022.45 l/s] site.net_access.log

Total
[ 30.11 s,     367554 : (+    246572, -    120982) lines, 12206.89 l/s]


2017-02-23
        5.18.223.132
                2       200     POST    /wp-admin/admin-ajax.php
2017-02-24
        141.8.184.105
                1       301     GET     /c
                1       301     GET     /d
        52.174.145.81
                1       404     GET     /effe
        62.16.25.217
                2       301     GET     /administrator/index.php
                1       404     GET     /admin.php
                2       404     GET     /administrator/
...

Another way of grouping (without filters, the speed is noticeably higher =))

Group by :  ['method', 'code']

[  2.01 s,     107248 : (+    107248, -         0) lines,  53453.60 l/s] 2018-01-13-site.net_access.log
[  3.31 s,     185776 : (+    185776, -         0) lines,  56139.12 l/s] 2018-02-15-site.net_access.log
[  0.99 s,      54966 : (+     54966, -         0) lines,  55528.67 l/s] 2018-06-site.net_access.log
[  0.30 s,      16924 : (+     16924, -         0) lines,  57197.19 l/s] 2018-07-site.net_access.log
[  0.05 s,       2640 : (+      2640, -         0) lines,  54716.42 l/s] site.net_access.log

Total
[  6.65 s,     367554 : (+    367554, -         0) lines, 55266.40 l/s]


GET
        117321  200
        84      206
        8287    301
        118     302
        848     304
        4       400
        4       403
        3007    404
        47      405
        16      500
HEAD
        175     200
        86      301
        1       302
        65      404
POST
        236071  200
        183     204
        195     302
        8       400
        1028    404
        6       500

Python version

I tested it with Python 2.7.15 and 3.7.0.

Usage

  1. Specify the filters and columns for grouping in the print_report function.

  2. Then run the script

    python log.py site.com.access*.log
    

    You can process multiple *.log files at once.

  3. Enjoy =)

print_report()

print_report(path_files=[] [, filters={} ]])

    print_report(
        files,
        # Group by columns
        ['date', 'ip:20', ['code', 'method', 'uri:100']],
        # Exclude filters
        {
            'exclude': [
                # requests from my own IPs (dots escaped and pattern anchored
                # so that e.g. 192.168.0.10 is not excluded by accident)
                { 'ip': r'^(?:127\.0\.0\.1|192\.168\.0\.1)$' },
                # requests to the main page "/"
                { 'uri': r'^/$' },
                # and a few legitimate requests: /about.html and /contact.html
                { 'uri': r'^/(?:about|contact)\.html$' },
            ],
            # Include filters
            'include': [
                # For example, this finds requests from bots or with an empty User-Agent
                { 'ua': r'bot' },
                { 'ua': r'^$' },
            ]
        }
    )

How does it work?

  • The script processes each file sequentially, line by line.
  • First, each line is parsed by the parse_apache_line(line, sep=' ') function, which splits it by spaces into named columns.
  • Next, the line is checked against the exclude filters, and then against the include filters.
  • If the line passes the filters, a grouping string is built from one or more named columns (several columns when a nested list is passed).
  • Finally, the line is counted together with all other lines that produce exactly the same string.

In fact, you can handle any type of log. To do this, add or modify the parsing function for the named columns (parse_apache_line) to match your format and column separator.
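
To make the steps above more concrete, here is a minimal sketch of the same parse, filter, group, count flow applied to a different log format. It is not the code from log.py: the 'ip<TAB>code<TAB>uri' format and the names parse_tsv_line and count_lines are made up for illustration, and the filters are simplified to one column per filter.

```python
import re
from collections import Counter

def parse_tsv_line(line, sep='\t'):
    """Hypothetical parser for a made-up 'ip<TAB>code<TAB>uri' log format."""
    ip, code, uri = line.rstrip('\n').split(sep)
    return {'ip': ip, 'code': code, 'uri': uri}

def count_lines(lines, parse, group_by, exclude=()):
    """Parse each line, drop excluded ones, and count identical grouping strings."""
    counts = Counter()
    for line in lines:
        columns = parse(line)
        # Simplification: each exclude filter holds a single column -> regex pair.
        if any(re.search(rx, columns[name], re.IGNORECASE)
               for flt in exclude for name, rx in flt.items()):
            continue
        # The grouping string is the selected column values joined by tabs.
        counts['\t'.join(columns[name] for name in group_by)] += 1
    return counts

sample = [
    '1.2.3.4\t200\t/\n',
    '1.2.3.4\t404\t/admin.php\n',
    '5.6.7.8\t404\t/admin.php\n',
]
counts = count_lines(sample, parse_tsv_line, ['code', 'uri'],
                     exclude=[{'uri': r'^/$'}])
for key in sorted(counts):
    print('{}\t{}'.format(counts[key], key))
```

Only the parser knows about the file format, so swapping it is enough to reuse the rest of the flow.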

Named columns

Each line is split by the parse_apache_line function into named columns, which are used for filtering and grouping.

For example this line:

66.249.64.73 - - [12/Jul/2018:05:29:02 +0300] "GET /robots.txt HTTP/1.0" 200 2 "https://ref-site.net/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36"

Column name   Value
ip            66.249.64.73
date          12/Jul/2018 when passing through the filters, 2018-07-12 when printed
code          200
method        GET
uri           /robots.txt
protocol      HTTP/1.0
request       GET /robots.txt HTTP/1.0
ua            Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36
ref           https://ref-site.net/
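
As an illustration, the named columns above can be extracted like this. Note that this sketch uses a regular expression instead of splitting by a separator as parse_apache_line does, so it shows the resulting columns rather than the script's actual code. The names COMBINED_RE, parse_combined_line and printable_date are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format, as in the example line.
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^:\]]+):[^\]]*\] '
    r'"(?P<request>(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+))" '
    r'(?P<code>\d{3}) \S+ "(?P<ref>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_combined_line(line):
    """Return a dict of named columns, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None

def printable_date(date):
    """Convert the filter-time form 12/Jul/2018 to the printed form 2018-07-12."""
    return datetime.strptime(date, '%d/%b/%Y').strftime('%Y-%m-%d')

line = ('66.249.64.73 - - [12/Jul/2018:05:29:02 +0300] '
        '"GET /robots.txt HTTP/1.0" 200 2 "https://ref-site.net/" '
        '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
        'Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36"')
columns = parse_combined_line(line)
print(columns['ip'], columns['method'], columns['uri'], columns['code'])
print(printable_date(columns['date']))
```

Lines with a malformed request (for example a bare "-") do not match this pattern and yield None, so a real parser would need a fallback for them.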

Filtering

Filtering is performed on named columns. Exclude filters are applied before include filters.

A filter is a string containing a case-insensitive regular expression.

Filters with AND logic

Filters with AND logic are passed as a dictionary.

{
    'method': r'POST',
    'uri': r'^/favicon\.ico$',
}

This matches POST requests for /favicon.ico.

Filters with OR logic

Filters with OR logic are passed as a list.

[    
    {'method': r'POST'},
    {'uri': r'^/favicon\.ico$'},
]

This matches any POST request OR any request for /favicon.ico.
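
The AND/OR rules can be sketched in a few lines. This is my reading of the behaviour described here, not the code from log.py: the helper names matches and passes are hypothetical, and I assume the patterns are applied with re.search and that a line is kept when no exclude filter matches and either no include filters are given or at least one of them matches.

```python
import re

def matches(filters, columns):
    """A dict means AND over its columns; a list means OR over its items."""
    if isinstance(filters, dict):
        return all(re.search(rx, columns.get(name, ''), re.IGNORECASE)
                   for name, rx in filters.items())
    return any(matches(item, columns) for item in filters)

def passes(columns, exclude=(), include=()):
    """Exclude filters are checked first, then include filters."""
    if exclude and matches(exclude, columns):
        return False
    return not include or matches(include, columns)

and_filter = {'method': r'POST', 'uri': r'^/favicon\.ico$'}
or_filter = [{'method': r'POST'}, {'uri': r'^/favicon\.ico$'}]

post_favicon = {'method': 'POST', 'uri': '/favicon.ico'}
get_favicon = {'method': 'GET', 'uri': '/favicon.ico'}
get_index = {'method': 'GET', 'uri': '/'}

print(matches(and_filter, post_favicon), matches(and_filter, get_favicon))
print(matches(or_filter, get_favicon), matches(or_filter, get_index))
```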

Grouping

Grouping defines a string whose occurrences are counted.

The string is the value of one or more named columns, separated by a tab character.

The output is sorted by these strings.

Grouping columns

Grouping columns are specified as a list: ['date', 'ip'].

The output will look something like this:

2018-07-01
        21      11.22.33.44
        1       11.22.33.55

2018-07-02
        6       11.22.33.55
        3       11.22.33.66

Grouping several columns

In one line, you can group several columns at once: ['date', ['code', 'ip']].

Output:

2018-07-03
        8       200     11.22.33.44
        2       404     11.22.33.44
        1       404     11.22.33.55
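
Here is a rough sketch of how such nested grouping can be counted and printed. It only approximates the output shown above and is not the script's implementation: group_rows and render are hypothetical names, each item of the grouping list is treated as one indentation level, and an inner list is joined with tabs into a single level.

```python
from collections import Counter

def group_rows(rows, group_by):
    """Count rows by a key with one string per grouping level."""
    counts = Counter()
    for columns in rows:
        key = tuple(
            '\t'.join(columns[name] for name in level) if isinstance(level, list)
            else columns[level]
            for level in group_by
        )
        counts[key] += 1
    return counts

def render(counts):
    """Print outer levels as indented headers and the count on the last level."""
    lines, previous = [], ()
    for key in sorted(counts):
        # Repeat a header only when that level differs from the previous key.
        for depth, value in enumerate(key[:-1]):
            if key[:depth + 1] != previous[:depth + 1]:
                lines.append('\t' * depth + value)
        lines.append('\t' * (len(key) - 1) + '{}\t{}'.format(counts[key], key[-1]))
        previous = key
    return '\n'.join(lines)

rows = [
    {'date': '2018-07-03', 'code': '200', 'ip': '11.22.33.44'},
    {'date': '2018-07-03', 'code': '200', 'ip': '11.22.33.44'},
    {'date': '2018-07-03', 'code': '404', 'ip': '11.22.33.44'},
]
report = render(group_rows(rows, ['date', ['code', 'ip']]))
print(report)
```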

Set width of column

After the column name, following a colon, you can specify the minimum and/or maximum width of the column.

Minimum column width: [['uri:20', 'code']]

Output:

473     /                       200
372     /                       301
1       /aaaaaabbbbbbccccdddeeeeeeffffffggggghhhhhhhhhhhhh/    404
1       /.well-known/assetlinks.json    404

Maximum column width: [['uri:.20', 'code']]

Output:

473     /       200
372     /       301
1       /aaaaaabbbbbbccccddd    404
1       /.well-known/assetli    404

Minimum and maximum column width: [['uri:20.20', 'code']]

Output:

473     /                       200
372     /                       301
1       /aaaaaabbbbbbccccddd    404
1       /.well-known/assetli    404
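
The width syntax looks the same as the width and precision part of Python's string format specification, so the behaviour above can be reproduced with the built-in format(). Whether log.py does it this way is an assumption on my part, and format_column is a hypothetical helper.

```python
def format_column(spec, columns):
    """Apply a 'name:min.max' spec: pad to min, truncate to max."""
    name, _, width = spec.partition(':')
    # For strings, '20' pads to 20 characters and '.20' truncates to 20.
    return format(columns[name], width)

short = {'uri': '/'}
long_ = {'uri': '/.well-known/assetlinks.json'}

print(repr(format_column('uri:20', short)))
print(repr(format_column('uri:.20', long_)))
print(repr(format_column('uri:20.20', long_)))
print(repr(format_column('uri', long_)))
```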

How I usually use it

  1. Browse all entries without filters
  2. Gradually add regexes for normal entries to the exclude list
  3. As a result, only abnormal entries remain
  4. Then, in another terminal, I extract specific requests

Examples

As an example, I left the filters I use for my own sites in the script.

Workflow for new site

Following the steps from How I usually use it:

  1. Browse all entries without filters

    print_report(files, [['code', 'method', 'uri:100']], {
        'exclude': [],
    })

    and run with less -S:

    python log.py site.net.access*.log | less -S
    
  2. Gradually add regexes for normal entries to the exclude list

    Uncomment the grouping variants in turn and fill in the filters. (This is not very convenient =( but it is usually done only once per site.)

    ## Grouping to easily find normal requests
    print_report(files, [['code', 'method', 'uri:100']], {
    ## Grouping to find normal requests more precisely
    # print_report(files, [['code', 'method', 'uri:100'], 'ip:20'], {
    # print_report(files, ['ip:20', ['code', 'method', 'uri:100']], {
    # print_report(files, ['ref:20', ['code', 'method', 'uri:100']], {
    # print_report(files, ['ua', ['code', 'method', 'uri:100']], {
    # print_report(files, ['ua', ['ip']], {
        'exclude': [
            # Site Pages
            {'uri': r'^/$'},
            {'uri': r'^/robots\.txt$'},
            {'uri': r'^/favicon\.ico$'},
            {'uri': r'^/css/style\.css$'},
            {'uri': r'^/poisk[^/\\#?.]+te\.html.+'},
            {'uri': r'^/(support|radio|music|song)(?:\.html|\/)?$'},
            {'uri': r'^/js/[^\/?#\\]+\.js$'},
            {'uri': r'^/(?:bio|music|song|short_story)/[^\/?#\\]+(?:\.html|\/)?$'},
            {'uri': r'^/img/[a-z\d\-\_]+\.(?:png|jpg|gif)$'},
        ],
    })
  3. As a result, only abnormal entries remain

    192.187.109.42
            1       200     POST    /wp-admin/admin-ajax.php
            1       200     GET     /wp-admin/admin-ajax.php?action=revslider_show_image&img=../wp-config.php
            1       404     GET     /wp-content/plugins/./simple-image-manipulator/controller/download.php?filepath=/etc/passwd
            1       404     GET     /wp-content/plugins/recent-backups/download-file.php?file_link=/etc/passwd
            1       404     POST    /uploadify/uploadify.php?folder=/
    
  4. Then, in another terminal, I extract specific requests

    print_report(files, [['code', 'method', 'uri:100'], 'ip:20'], {
        'exclude': [skip_my_ip],
        'include': {'uri': r'^/wp-admin/admin-ajax\.php'},
    })
  5. Look up the attacker's IP with whois and ban it

For daily view

## Grouping for daily view
# print_report(files, ['date', ['code', 'method', 'uri:100'], 'ip:20'], {
print_report(files, ['date', 'ip:20', ['code', 'method', 'uri:100']], {
    'exclude': [
        skip_my_ip,
        # site filters
        site['site1'],
    ],
})

To do

  • Make it convenient to switch between groupings and filters
  • Process *.gz and *.log files together
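
For the second item, one possible approach is to choose the opener by file extension. This is only a sketch of the idea and nothing like it exists in the script yet: open_log is a hypothetical helper, and the text-mode gzip.open call used here needs Python 3.

```python
import gzip
import io

def open_log(path):
    """Open a plain or gzip-compressed log as text, chosen by file extension."""
    if path.endswith('.gz'):
        # 'rt' makes gzip.open yield decoded text lines (Python 3 only).
        return gzip.open(path, 'rt', encoding='utf-8', errors='replace')
    return io.open(path, 'r', encoding='utf-8', errors='replace')

def read_lines(paths):
    """Yield lines from all given files, compressed or not, in order."""
    for path in paths:
        with open_log(path) as handle:
            for line in handle:
                yield line
```

With this, a call such as read_lines(['site.net_access.log', 'old.log.gz']) would feed both kinds of files into the same processing loop.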

P.S.

Pull requests are welcome =). I would be interested to hear about your experience.

Thank you for your attention!