MAD -- Malicious Application Detector

Before you start


System requirements

  • Ubuntu 16.04
  • root privileges (run as root)
  • Docker and docker-compose (preferred)

Software & Libraries

Python dependencies

  • Django
  • dpkt
  • peewee
  • PyMySQL
  • requests
  • scapy
  • tensorflow
  • user-agents

Bootstrap installation

git clone https://github.com/JarryShaw/mad.git
cd mad
# build pkt2flow
./bootstrap.sh
# do not build pkt2flow
./bootstrap.sh "anything"  # any non-empty first argument ($1) skips the pkt2flow build

Docker distribution

Develop Environment

git clone https://github.com/JarryShaw/mad.git
cd mad
# omit docker tags (default is <latest>)
./build.sh
# with certain tags (e.g. v0.1b1)
./build.sh "v0.1b1"

Distribution Environment

  1. Modify {app,gem,www}/init.sh if needed;
  2. Modify docker-compose.yml if needed;
  3. Run ./init.sh volumes to create the shared directories;
  4. Run ./init.sh archives to set up the CNN models and retrain dataset;
  5. Run docker-compose up --build -d to start the MAD services in detached mode.
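
Steps 3 to 5 above can be scripted once the init.sh and docker-compose.yml files are in their final state. The following is a minimal sketch, assuming it is run from the repository root; the names `distribution_commands`, `deploy` and the `dry_run` flag are illustrative and not part of MAD.

```python
import subprocess


def distribution_commands():
    """Return steps 3-5 of the list above as argv lists, in order."""
    return [
        ["./init.sh", "volumes"],                   # create shared directories
        ["./init.sh", "archives"],                  # set up CNN models and retrain dataset
        ["docker-compose", "up", "--build", "-d"],  # start services in detached mode
    ]


def deploy(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in distribution_commands():
        print("$", " ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing step
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    deploy(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which makes it safe to try before running the real deployment.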

Configurations (for Docker Compose only)


  • init.sh
    • db -- set up database tables
    • model -- set up CNN models (/home/traffic/db/apt_model)
    • report -- set up report directory (/home/traffic/db/apt_report)
    • retrain -- set up retrain dataset (/home/traffic/db/apt_retrain)
    • dataset -- set up dataset directory (./log/dataset)
    • volumes -- set up shared directories, i.e. report & dataset
    • archives -- set up archives, i.e. model & retrain
    • all -- set up everything, i.e. retrain & model & report & dataset & db
  • cleanup.sh
    • db -- reset database
    • log -- empty log (/home/traffic/pcapfile/apt_log.txt)
    • model -- reset CNN models (/home/traffic/db/apt_model)
    • report -- remove reports (/home/traffic/db/apt_report)
    • dataset -- remove datasets (./log/dataset)
    • retrain -- reset retrain dataset (/home/traffic/db/apt_retrain)
    • volumes -- clean up shared directories, i.e. report & dataset
    • archives -- reset archives, i.e. model & retrain
    • all -- clean up everything, i.e. retrain & model & report & dataset & log & db
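
The grouped targets (volumes, archives, all) are shorthand for sets of the individual targets listed above. As a rough illustration, the grouping can be written down as a lookup table; the names `INIT_GROUPS`, `CLEANUP_GROUPS` and `expand` are hypothetical and only mirror the lists in this section, not the actual shell scripts.

```python
# Grouped targets of init.sh, as listed above
INIT_GROUPS = {
    "volumes": ["report", "dataset"],
    "archives": ["model", "retrain"],
    "all": ["retrain", "model", "report", "dataset", "db"],
}

# Grouped targets of cleanup.sh; "all" additionally empties the log
CLEANUP_GROUPS = {
    "volumes": ["report", "dataset"],
    "archives": ["model", "retrain"],
    "all": ["retrain", "model", "report", "dataset", "log", "db"],
}


def expand(target, groups):
    """Return the individual targets that a (possibly grouped) target covers."""
    return groups.get(target, [target])


if __name__ == "__main__":
    for name in ("volumes", "archives", "all", "db"):
        print(name, "->", expand(name, INIT_GROUPS))
```

Note that the only difference between the two tables is the extra log target under all in cleanup.sh.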

mad_app -- main application

  • docker-compose.yml
    • CPU usage
      • 50% of available CPUs
      • 75% of CPU processing shares
    • Memory usage
      • 96G memory limit
      • 192G SWAP limit
    • Volume path
      • PCAP sources (/mad/pcap) in /home/traffic/pcapfile
      • dataset directory (/mad/dataset) in ./log/dataset
      • CNN models (/mad/model) in /home/traffic/db/apt_model
      • retrain dataset (/mad/retrain) in /home/traffic/db/apt_retrain
  • init.sh
    • Sample source: /mad/pcap
    • Rounds interval: 0s
    • Sampling interval: 0
    • Validation ratio: 10%
    • Process number: 15
    • MEMLOCK limit: unlimited
    • VMEM limit: unlimited
    • AS limit: unlimited
    • SWAP limit: unlimited
    • Validation: yes
    • Develop mode: no
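
The init.sh defaults above correspond to options of run_mad.py documented under "Entry points". The sketch below shows one way such a command line could be assembled. It assumes prediction mode (mode=3), because the listed runtime options apply to that mode; the memory limits are left out, since how "unlimited" is passed on the command line is not documented here. `DEFAULTS` and `build_argv` are illustrative names, not part of MAD.

```python
# Defaults copied from the init.sh list above (memory limits omitted on purpose)
DEFAULTS = {
    "path": "/mad/pcap",     # Sample source
    "wait_timeout": 0,       # Rounds interval, in seconds
    "sampling_interval": 0,  # 0 means sampling from all files
    "validate_ratio": 10,    # percent of CNN results to validate
    "process": 15,           # number of concurrent processes
    "validate": True,        # Validation: yes
    "devel": False,          # Develop mode: no
}


def build_argv(cfg):
    """Translate a settings dict into a run_mad.py argument list."""
    argv = [
        "python", "run_mad.py",
        "--mode", "3",  # assumption: prediction mode
        "--path", cfg["path"],
        "--wait-timeout", str(cfg["wait_timeout"]),
        "--sampling-interval", str(cfg["sampling_interval"]),
        "--validate-ratio", str(cfg["validate_ratio"]),
        "--process", str(cfg["process"]),
    ]
    if not cfg["validate"]:
        argv.append("--no-validate")
    if cfg["devel"]:
        argv.append("--devel")
    return argv


if __name__ == "__main__":
    print(" ".join(build_argv(DEFAULTS)))
```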

mad_gen -- report generator

  • docker-compose.yml
    • Volume path
      • report directory (/mad/report) in /home/traffic/db/apt_report
      • dataset directory (/mad/dataset) in ./log/dataset
  • init.sh
    • Cleanup reports: yes
    • Process number: 4
    • Sleep interval: 5m
    • API token: 6JJ0qCCNHzv6iLsPvUPQNst0Dpbh87io

mad_www -- web dashboard

mad_db -- MySQL database

  • docker-compose.yml
    • Volume path
      • initialisation script (/docker-entrypoint-initdb.d) in sql/MySQL.sql
      • database library (/var/lib/mysql) in /home/traffic/db/apt_db

Entry points


Main Application

$ python run_mad.py --help
usage: mad_app [-h] [-V] [-m {1,2,3,4,5}] [-p DIR] [-s FILE] [-n] [-o SEC]
               [-t INT] [-r PCT] [-c PROC] [-l MEM] [-v MEM] [-a MEM] [-w MEM]
               [-d] [-i] [-e SHELL]

Malicious Application Detector

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit

general arguments:
  -m {1,2,3,4,5}, --mode {1,2,3,4,5}
                        runtime mode
  -p DIR, --path DIR    input file name or directory (mode=1/2/3)
  -s FILE, --sample FILE
                        sample file(s) for model training (mode=2/5)

runtime arguments:
  -n, --no-validate     do not run validate process after prediction (mode=3)
  -o SEC, --wait-timeout SEC
                        wait for %SEC% seconds between each round (mode=3;
                        default is 0)
  -t INT, --sampling-interval INT
                        sample every %INT% file(s) (mode=3; default is 0, i.e.
                        sampling from all files)
  -r PCT, --validate-ratio PCT
                        validate %PCT% percent of CNN detection results
                        (mode=3; default is 10)

resource arguments:
  -c PROC, --process PROC
                        number of concurrent processes that may run (default
                        is %log2(CPU)%)
  -l MEM, --memlock MEM
                        number of bytes of memory that may be locked into RAM
                        (default is %MEMLOCK%)
  -v MEM, --vmem MEM    largest area of mapped memory which the process may
                        occupy (default is %VMEM%)
  -a MEM, --address-space MEM
                        maximum area (in bytes) of address space which may be
                        taken by the process (default is %AS%)
  -w MEM, --swap MEM    maximum size (in bytes) of the swap space that may be
                        reserved or used by all of this user id's processes
                        (default is %SWAP%)

development arguments:
  -d, --devel           run in develop mode (quit after first round)
  -i, --interactive     enter interactive mode (running SHELL)
  -e SHELL, --shell SHELL
                        shell for interactive mode (default is '/bin/sh')

Report Generator

$ python3 generate_report.py --help
usage: mad_gen [-h] [-c] [-f] [-i SEC] [-p NUM] [-t KEY]

positional arguments:
  -t, --token           shodan.io API token

optional arguments:
  -h, --help            show this help message and exit
  -c, --cleanup         remove processed CNN reports
  -f, --force-cleanup   remove CNN reports regardless of processing error
  -i, --interval        sleep interval between rounds
  -p, --process         process number (default is %log2(CPU)%)
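
Both entry points state that the default process number is %log2(CPU)%. A minimal sketch of such a default follows; rounding down and the lower bound of one process are assumptions, since the help text does not say how fractional values are handled, and `default_process_count` is a hypothetical name.

```python
import math
import os


def default_process_count(cpu_count=None):
    """Return floor(log2(number of CPUs)), but never less than 1."""
    cpus = cpu_count or os.cpu_count() or 1
    return max(1, int(math.log2(cpus)))


if __name__ == "__main__":
    print("CPUs:", os.cpu_count(), "-> processes:", default_process_count())
```

Under these assumptions a 16-core host would default to 4 processes, which is far below the 15 configured for mad_app in init.sh.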


from mad import main
main(mode=3, path='/mad/pcap', sample=None)


iface - str, network interface for sniffing (mode=1) c.f. scapy.all.sniff (deprecated)
mode - int, runtime mode
    |-- 1 -> initialisation
    |-- 2 -> migration
    |-- 3 -> prediction -- the main course (default)
    |-- 4 -> adaptation -- retrain the models
    |-- 5 -> regeneration -- dev only
path - str, input file name or directory (mode=1/2/3)
file - str, JSON file name w/ list of input file names (mode=3) (deprecated)
sample - str, path of training sample(s) (mode=2, 5)
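
The numeric modes are easier to read in calling code when given names. The following sketch wraps the table above in an enumeration; the `Mode` class and the `uses_sample` helper are illustrative and not part of the mad package.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Runtime modes as listed above."""
    INITIALISATION = 1
    MIGRATION = 2
    PREDICTION = 3    # the main course (default)
    ADAPTATION = 4
    REGENERATION = 5  # dev only


def uses_sample(mode):
    """Return True for the modes that take the sample argument (2 and 5)."""
    return Mode(mode) in (Mode.MIGRATION, Mode.REGENERATION)


if __name__ == "__main__":
    for mode in Mode:
        print(int(mode), mode.name.lower(), "sample" if uses_sample(mode) else "")
```

Because `Mode` is an `IntEnum`, a member can be passed wherever the integer is expected, e.g. `main(mode=Mode.PREDICTION, path='/mad/pcap')`.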




  1. Start initialisation (mode=1) with all (legacy PCAP) files under PATH.

    >>> from mad import main
    >>> main(mode=1, path=PATH)
  2. Run migration (mode=2) with all (legacy PCAP) files from PATH, and start live prediction for eth0 afterwards.

    >>> from mad import main
    >>> main(mode=2, path=PATH, iface='eth0')
  3. Run migration (mode=2) with all (legacy PCAP) files from PATH, and start prediction for PCAP files recorded in FILE (JSON list) afterwards.

    >>> from mad import main
    >>> main(mode=2, path=PATH, file=FILE)
    # FILE = 'data.json' -> ["foo.pcap", "bar.pcap", "boo.pcap", ...]
  4. Directly run live prediction for eth0.

    >>> from mad import main
    >>> main(mode=3, iface='eth0')
  5. Directly run prediction for legacy PCAP files recorded in FILE (JSON list).

    >>> from mad import main
    >>> main(mode=3, file=FILE)
    # FILE = 'data.json' -> ["foo.pcap", "bar.pcap", "boo.pcap", ...]
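
Examples 3 and 5 expect FILE to be a JSON file holding a flat list of PCAP file names. A small sketch for producing and checking such a file is given below; `write_file_list` and `read_file_list` are hypothetical helpers, and note that the file parameter is marked as deprecated above.

```python
import json


def write_file_list(json_path, pcap_names):
    """Write PCAP file names as a JSON list, e.g. ["foo.pcap", "bar.pcap"]."""
    with open(json_path, "w") as handle:
        json.dump(list(pcap_names), handle)


def read_file_list(json_path):
    """Load the list back and verify that it is a flat list of strings."""
    with open(json_path) as handle:
        names = json.load(handle)
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("expected a JSON list of file names")
    return names


if __name__ == "__main__":
    write_file_list("data.json", ["foo.pcap", "bar.pcap", "boo.pcap"])
    print(read_file_list("data.json"))
```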

Repo directory

Report directory

    |-- mad.log                                 # log file for RPC (0-start; 1-stop; 2-retrain; 3-ready; 4-error)
    |-- pcap/
    |   |-- apt_log.txt                         # log file
    |   |-- YYYY_MMDD_HHMM_SS.pcap              # PCAP files
    |   |-- ...
    |-- dataset/                                # where all datasets go
    |   |-- YYYY-MM-DDTHH:MM:SS.US/             # dataset named after ISO timestamp
    |   |   |-- groups.json                     # WebGraphic group record
    |   |   |-- filter.json                     # fingerprint filter report
    |   |   |-- record.json                     # flattened group record
    |   |   |-- report.json                     # detection report
    |   |   |-- stream.json                     # backup for stream.json in retrain
    |   |   |-- tmp/                            # temporary files generated by pkt2flow
    |   |   |   |-- tcp_syn/
    |   |   |   |   |-- IP_PORT_IP_PORT_TS.pcap
    |   |   |   |   |-- ...
    |   |   |   |-- tcp_nosyn/
    |   |   |   |   |-- IP_PORT_IP_PORT_TS.pcap
    |   |   |   |   |-- ...
    |   |   |-- stream/                         # where stream files go
    |   |   |   |-- IP_PORT_IP_PORT_TS.pcap     # temporary stream PCAP files
    |   |   |   |-- ...
    |   |   |-- Background_PC/                  # where Background_PC dataset files go
    |   |       |-- 0/                          # clean ones
    |   |       |   |-- IP_PORT_IP_PORT_TS.dat  # dataset file
    |   |       |   |-- ...
    |   |       |-- 1/                          # malicious ones
    |   |           |-- IP_PORT_IP_PORT_TS.dat  # dataset file
    |   |           |-- ...
    |   |-- ...
    |-- model/                                  # where CNN models go
    |   |-- fingerprint.pickle                  # pickled fingerprint database
    |   |-- Background_PC/                      # Background_PC models
    |   |   |-- ...
    |   |-- ...
    |-- report/                                 # where generated reports go
    |   |-- ...
    |-- retrain/                                # where CNN retrain data goes
        |-- Background_PC/                      # Background_PC retrain dataset
        |   |-- 0/                              # clean ones
        |   |   |-- YYYY-MM-DDTHH:MM:SS.US_IP_PORT_IP_PORT_TS.dat
        |   |   |-- ...
        |   |-- 1/                              # malicious ones
        |       |-- YYYY-MM-DDTHH:MM:SS.US_IP_PORT_IP_PORT_TS.dat
        |       |-- ...
        |-- stream.json                         # stream index for retrain
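
The naming patterns in the tree above can be handled with the standard library. The sketch below parses the timestamp of a PCAP file (YYYY_MMDD_HHMM_SS.pcap) and of a dataset directory (YYYY-MM-DDTHH:MM:SS.US), and counts the clean (0) and malicious (1) .dat files of one dataset. It assumes that US stands for a microsecond field; the function names are illustrative and not part of MAD.

```python
from datetime import datetime
from pathlib import Path

PCAP_FORMAT = "%Y_%m%d_%H%M_%S"            # YYYY_MMDD_HHMM_SS
DATASET_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"    # YYYY-MM-DDTHH:MM:SS.US (assumed microseconds)


def parse_pcap_name(name):
    """Return the capture time encoded in a PCAP file name."""
    return datetime.strptime(Path(name).stem, PCAP_FORMAT)


def parse_dataset_name(name):
    """Return the timestamp encoded in a dataset directory name."""
    return datetime.strptime(name, DATASET_FORMAT)


def count_labels(dataset_dir, kind="Background_PC"):
    """Count .dat files under <dataset_dir>/<kind>/0 (clean) and /1 (malicious)."""
    counts = {}
    for label in ("0", "1"):
        folder = Path(dataset_dir) / kind / label
        counts[label] = len(list(folder.glob("*.dat"))) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    print(parse_pcap_name("2019_0315_1422_07.pcap"))
    print(parse_dataset_name("2019-03-15T14:22:07.000123"))
```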


This software and associated documentation files (the "Software") are generally licensed under the GNU GPLv3 License. The original development branch of the MAD project (as hosted on GitHub) is licensed under the GNU GPLv3 License. The f2format-transformed distribution branch, as hosted on GitLab, is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. No permissions are granted in advance unless given by the authors and maintainers of the Software.