Bothound is an automatic DDoS attack detector and botnet classifier. Its purpose is to create a historical classification of the attacks with detailed information regarding the attackers (country-based, time-based, etc.).
Bothound's role is to detect and classify attacks (incidents) with the help of the anomaly-detection and machine-learning tool Grey Memory. When Grey Memory detects an anomaly, the Bothound attack classifier starts gathering live information from the Deflect network and computes a behaviour vector for every visitor of the network. Bothound groups the client IPs into clusters using unsupervised machine learning algorithms in order to profile the group of malicious visitors. It uses different measures to tag the groups which are more likely to be attackers. After that, it feeds the behaviour vectors of all bot IPs into a classifier to detect whether the botnet has attacked the Deflect network in the past. Finally, it generates a report based on its conclusions for Deflect's sysops and gets feedback to improve its classification performance.
Python 2.7 should be installed
The following libraries should be installed:
[sudo] apt-get install emacs python libmysqlclient-dev build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev python-matplotlib python-mysqldb python-geoip libffi-dev python-dnspython libssl-dev python-zmq
[sudo] apt-get install python-pip
[sudo] pip install -U scikit-learn
[sudo] apt-get install git
[sudo] apt-get install openjdk-7-jre
[sudo] apt-get install mysql-server
Install Adminer interface
- First make sure that you install Jupyter locally because nbextension has a bug and is only able to install if there is a local installation.
sudo pip install jupyter --user
- Install Jupyter system-wide
sudo pip install jupyter
- Install Jupyter nbextensions
pip install https://github.com/ipython-contrib/IPython-notebook-extensions/archive/master.zip
- The files are erroneously copied to the local folder. Copy them to the system-wide folder.
sudo cp -r /root/.local/share/jupyter /usr/local/share
sudo chmod -R a+r /usr/local/share/jupyter
git clone https://github.com/equalitie/bothound
cd bothound/
Install required packages from requirements.txt:
pip install -r requirements.txt
You need to create a configuration file bothound.yaml
- Make a copy of the example configuration file
- Rename the copy to bothound.yaml
- Update the file with your credentials.
bothound.yaml description:
- encryption_passphrase - the password for IP encryption
- hash_passphrase - the salt for the hash function used to hash IPs before they are stored in the database
- sniffles section - not supported yet
- elastic_db - Elastic search node credentials
- Make sure the MySQL server is up and running.
- To create the database, launch any script which instantiates a bothound_tools object, for example:
cd src
python session_computer.py
Make sure the database and the tables are created successfully.
- Create a test incident using the following SQL:
INSERT INTO incidents (start,stop,process,target) VALUES ('2016-06-01', '2016-06-02', 1, 'mysite.com');
- Run session_computer.py again. Make sure Bothound is processing data from the Elasticsearch server. You should see the following message if the test incident is processed correctly: "Incident 1 processed"
- Make sure the Jupyter instance is running on the Bothound server. To run the instance, launch this command:
jupyter notebook --no-browser --port=8889
- Establish a tunnel to the Jupyter instance from your local computer:
ssh -N -L 8889:127.0.0.1:8889 user@server
- Open the local URL http://localhost:8889/. Make sure you see a list of files and folders.
- Session - an IP and a vector of feature values recorded and calculated during a period of the IP activity
- Feature - an individual measurable property of a session
- Incident - a set of sessions recorded during a time interval
- Attack - a subset of sessions in an incident which was labeled as an attack
- Botnet - a list of IPs that participated in similar attacks
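The relationship between these terms can be sketched as a small data model. This is only an illustration: the names Session, Incident, attack_sessions and botnet_ips are hypothetical and do not come from the Bothound code base, where the data lives in MySQL tables.

```python
from collections import namedtuple

# Hypothetical in-memory model; Bothound itself stores these in MySQL tables.
Session = namedtuple("Session", ["ip", "features"])   # features: dict name -> value
Incident = namedtuple("Incident", ["id", "start", "stop", "sessions"])


def attack_sessions(incident, attack_labels, attack_id):
    """Return the sessions of an incident that carry the given attack ID.

    attack_labels maps a session index to an attack ID; sessions that are
    missing from the mapping are treated as regular traffic.
    """
    return [s for i, s in enumerate(incident.sessions)
            if attack_labels.get(i) == attack_id]


def botnet_ips(attacks):
    """A botnet as the sorted union of IPs seen across similar attacks."""
    return sorted({s.ip for sessions in attacks for s in sessions})


incident = Incident(
    id=1, start="2016-06-01", stop="2016-06-02",
    sessions=[
        Session("10.0.0.1", {"request_interval": 0.2}),
        Session("10.0.0.2", {"request_interval": 5.0}),
        Session("10.0.0.3", {"request_interval": 0.3}),
    ],
)
labels = {0: 1, 2: 1}  # sessions 0 and 2 were labeled as attack #1
attack = attack_sessions(incident, labels, attack_id=1)
print([s.ip for s in attack])
print(botnet_ips([attack]))
```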
Incidents are created manually using the Adminer interface. In the future, incidents will be created automatically based on messages from the Grey Memory anomaly detector.
- Insert a new record into the "incidents" table.
- Make sure you fill in at least the "start", "stop" and "target" fields.
- The target URL should not contain "www." at the beginning. If you have multiple targets, you can add them separated by a comma.
- Set "process" field to 1.
- Insert a new record into the "incidents" table.
- Make sure you fill in "file_name" with the full path to an nginx log file.
- Set "process" field to 1.
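The field rules above for both kinds of incident can be sketched in Python. The helpers normalize_targets, regular_incident and nginx_incident are hypothetical and only build the values you would type into Adminer; they do not talk to the database.

```python
def normalize_targets(targets):
    """Strip a leading "www." from each host and join the hosts with commas."""
    cleaned = []
    for host in targets:
        host = host.strip()
        if host.startswith("www."):
            host = host[len("www."):]
        cleaned.append(host)
    return ",".join(cleaned)


def regular_incident(start, stop, targets):
    """Field values for an incident processed from Elasticsearch data."""
    return {"start": start, "stop": stop,
            "target": normalize_targets(targets), "process": 1}


def nginx_incident(file_name):
    """Field values for an incident processed from an nginx log file."""
    return {"file_name": file_name, "process": 1}


record = regular_incident("2016-06-01", "2016-06-02",
                          ["www.mysite.com", "blog.mysite.com"])
print(record["target"])
print(nginx_incident("/var/log/nginx/access.log"))
```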
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. A notebook contains a list of cells (markdown, Python code, graphs). Use Shift+Enter to execute a cell. You can fold/unfold the content of a cell using the left "arrow" key.
The Session Computer calculates sessions for all the records in the incidents table containing "1" in the "Process" field.
- Run the Session Computer with:
python session_computer.py
- The Session Computer will recalculate all the incident records containing "1" in the "Process" field.
- For regular incidents, the Session Computer runs ElasticSearch queries. For nginx incidents, the Session Computer will parse the corresponding log file.
- The sessions will be stored in the "sessions" table.
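The selection logic described above can be sketched as follows. This is a simplified illustration, not the code of session_computer.py: the function names are hypothetical, and the assumption that a non-empty "file_name" marks an nginx incident is inferred from the incident creation steps.

```python
def data_source(incident):
    """Pick the input for an incident: an nginx log if file_name is set,
    otherwise Elasticsearch queries (assumed rule)."""
    return "nginx" if incident.get("file_name") else "elasticsearch"


def incidents_to_process(incidents):
    """Return (id, source) pairs for every incident with process == 1."""
    return [(inc["id"], data_source(inc))
            for inc in incidents if inc.get("process") == 1]


incidents = [
    {"id": 1, "process": 1, "target": "mysite.com", "file_name": None},
    {"id": 2, "process": 0, "target": "mysite.com", "file_name": None},
    {"id": 3, "process": 1, "file_name": "/var/log/nginx/access.log"},
]
print(incidents_to_process(incidents))
```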
For security reasons, Bothound stores only encrypted IPs in the session table, in the "ip_encrypted", "ip_iv", and "ip_tag" fields. The hash of the IP is also stored in the "ip" field. The encryption key is set in the configuration file "conf/bothound.yaml" ("encryption_passphrase"). Bothound supports multiple encryption keys. The encryption table contains the hash value of the key which was used to encrypt the IPs of an incident.
In order to get the decrypted IPs of the incident, use the extract_attack_ips() function in bothound_tools.py.
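Storing a salted hash next to the encrypted value makes it possible to compare IPs across incidents without decrypting them. The sketch below illustrates the idea only; this document does not state which hash function Bothound uses, so SHA-256 and the helper name hash_ip are assumptions.

```python
import hashlib


def hash_ip(ip, salt):
    """Return a salted SHA-256 hex digest of an IP address (assumed scheme)."""
    return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()


# Stands in for the hash_passphrase value from bothound.yaml.
salt = "example-hash-passphrase"

a = hash_ip("203.0.113.7", salt)
b = hash_ip("203.0.113.7", salt)
c = hash_ip("203.0.113.8", salt)
print(a == b)  # the same IP always maps to the same hash
print(a == c)  # different IPs map to different hashes
```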
Bothound uses clustering methods in order to separate attackers from regular traffic. This process of labelling a subset of incident sessions as an attack is manual. The user opens a Jupyter notebook, chooses an incident, clusters the sessions with different clustering algorithms and manually assigns an arbitrary attack number to the selected clusters.
- Open Jupyter interface URL: http://localhost:8889/
- Open src/Clustering.ipynb
- Execute "Initialization" chapter
- "Configuration" chapter: change the assignment of variable "id_incident = ..." to your incident number
- "Configuration" chapter: uncomment the features you want to use: "features = [...]"
- Execute "Configuration" chapter
- Execute "Load Data" chapter
- Execute DBSCAN Clustering chapter. After the clustering is done, you will see a bar plot of clusters. Y-axis corresponds to the size of the cluster. Every cluster has its own color from a predefined palette.
- Use plot3() function in the second cell of the chapter to create different 3D scatter plots of the calculated clusters:
plot3([0,1,3], X, clusters, [])
The first argument of this function is an array of indexes of the 3 features to display in the scatter plot. Note that these are indexes into the array of uncommented features from the "Configuration" chapter. If you have more than 3 uncommented features, choose different indexes and re-execute the plot3() cell.
- Choose your features carefully. It is always better to experiment and play with different feature subsets (uncommented in "Configuration" chapter). Clustering is very sensitive to feature selection. Different attacks might have different distinguishable features. If you change your feature selection in "Configuration" chapter, you must re-execute the "Configuration", "Load Data", and "Clustering" chapters.
- Double clustering. In some cases DBSCAN clustering is not good enough. The suspected cluster might have a weird shape and even contain two different botnets. In order to further divide such a cluster, you can use the second iteration, which we call "Double Clustering". You should choose the target cluster after the first clustering, as well as the number of clusters for K-Means clustering algorithm.
The second cell in this chapter is the same plot3() function which displays a 3D scatter plot of double clustering.
plot3([0,1,3], X2, clusters2, [])
Note that you should use X2 and clusters2 arguments.
- Choose your attack ID(s). Attack IDs are arbitrary numbers you assign to each botnet. The attack is identified by its incident ID and attack ID. It is possible to have more than one attack in a single incident.
- Modify the tools.label_attack() function arguments.
If you have more than one attack number to save, you should add a call to the label_attack() function for every attack.
For example, for attack #1 you choose cluster #3:
tools.label_attack(id_incident, attack_number = 1, selected_clusters = [3], selected_clusters2 = [])
If you use double clustering, don't forget to specify the indexes for selected_clusters2. For example, for attack #1 you will choose cluster #3 and double clusters #4 and #5:
tools.label_attack(id_incident, attack_number = 1, selected_clusters = [3], selected_clusters2 = [4,5])
- Execute "Save Attack" chapter.
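To make the cluster selection in the last steps more concrete, here is one plausible reading of how the two selections combine. The helper select_attack is hypothetical and is not the implementation of tools.label_attack(); it assumes that clusters2 holds -1 for sessions outside the cluster that was split a second time.

```python
def select_attack(clusters, clusters2, selected_clusters, selected_clusters2):
    """Return the indexes of sessions picked by either clustering pass.

    clusters holds the first-pass label of every session; clusters2 holds
    the second-pass label, or -1 where no second pass was applied (assumed).
    """
    picked = []
    for i, (c1, c2) in enumerate(zip(clusters, clusters2)):
        if c1 in selected_clusters or c2 in selected_clusters2:
            picked.append(i)
    return picked


clusters = [0, 3, 3, 1, 2, 2]
clusters2 = [-1, -1, -1, -1, 4, 5]  # cluster 2 was divided again

attack_1 = select_attack(clusters, clusters2,
                         selected_clusters=[3], selected_clusters2=[4, 5])
print(attack_1)  # [1, 2, 4, 5]
```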
In this section, users can explore the distribution of a single feature over the clusters to verify the quality of the clustering results.
box_plot_feature(clusters, num_clusters = 4, X = X, feature_index = 2)
The function will display a boxplot of feature values distribution per cluster.
Using this graph, you can get more insight into the quality of the clustering you used.
For instance, if you know in advance that the attack you are clustering has a significantly higher hit rate than regular traffic, you can expect a proper attack cluster to show a correspondingly distinct boxplot for the "request_interval" feature.
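The numbers behind such a boxplot can also be inspected directly. The following sketch is independent of box_plot_feature(): it groups one feature column by cluster label and prints the minimum, median and maximum per cluster. The helper name and the sample data are made up for illustration, and the sketch relies on the statistics module of Python 3.

```python
from collections import defaultdict
from statistics import median


def feature_by_cluster(clusters, X, feature_index):
    """Return {cluster label: (min, median, max)} for one feature column."""
    groups = defaultdict(list)
    for label, row in zip(clusters, X):
        groups[label].append(row[feature_index])
    return {label: (min(values), median(values), max(values))
            for label, values in sorted(groups.items())}


# Made-up sessions: column 0 plays the role of "request_interval".
X = [[1.0, 10], [3.0, 12], [5.0, 3], [4.0, 2], [6.0, 4]]
clusters = [0, 0, 1, 1, 1]

summary = feature_by_cluster(clusters, X, feature_index=0)
for label, stats in summary.items():
    print(label, stats)
```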
If two attacks share a significant portion of identical IPs, they are likely to belong to the same botnet.
plot_intersection(clusters, num_clusters, id_incident, ips, id_incident2 = ..., attack2 = -1)
This function will create a bar plot highlighting portions of the clusters which share identical IPs with another incident (specified by variable id_incident2). It is also possible to specify a particular attack index.
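The quantity shown in this bar plot can be approximated with plain set operations. The sketch below is not the code of plot_intersection(); shared_portion is a hypothetical helper that returns, for each cluster, the fraction of its IPs that also appear in another incident.

```python
def shared_portion(clusters, ips, other_ips):
    """Return {cluster label: fraction of the cluster's IPs found in other_ips}."""
    other = set(other_ips)
    totals, shared = {}, {}
    for label, ip in zip(clusters, ips):
        totals[label] = totals.get(label, 0) + 1
        if ip in other:
            shared[label] = shared.get(label, 0) + 1
    return {label: shared.get(label, 0) / float(totals[label])
            for label in totals}


clusters = [0, 0, 1, 1]
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
other_incident_ips = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]

portions = shared_portion(clusters, ips, other_incident_ips)
print(portions)
```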
This graph explores the country distribution over the clusters.
Even if an IP was banned during the incident, Bothound does not use this information for clustering. Nevertheless, the distribution of banned IPs over the clusters might be useful. This graph displays the portion of IPs banned by Banjax in each cluster.
When attack labeling is completed (see "Attacks" chapter), a set of analytic scripts may be executed from a separate Jupyter notebook:
- Open Jupyter interface URL: http://localhost:8889/
- Open src/Analytics_1.ipynb
- Execute "Initialization" chapter
- "Configuration" chapter: type the incident IDs to explore
- Execute "Read Data" chapter
In this section you can get the general information about the attacks in the selected incidents:
- number of unique IPs
- IDs of labeled attacks
- number of bots in each attack
Incident 29, num IPs = 14790, num Bots = 13013
Incident 42, num IPs = 10963, num Bots = 9023
Attack 1 = 13857 ips
Attack 4 = 2589 ips
Attack 7 = 11746 ips
A barplot of country distribution over the botnets.
A barplot of country distribution over the incidents.
The User Agent strings most frequently used by attackers.
This 3D scatter plot illustrates the distribution of attack sessions vs. the regular traffic. The first cell contains the code for preprocessing the plot. The first line in this cell defines an array with all the features.
features = [
    "request_interval",           #1
    "ua_change_rate",             #2
    "html2image_ratio",           #3
    "variance_request_interval",  #4
    "payload_average",            #5
    "error_rate",                 #6
    "request_depth",              #7
    "request_depth_std",          #8
    "session_length",             #9
    "percentage_cons_requests",   #10
]
...
The second cell contains the call to plot3() function (the same function used in "Clustering.ipynb" Jupyter notebook). Make sure you correctly specify the first argument: an array of 3 indexes from the features array.
plot3([3,2,5], X, incident_indexes, -1, "Attack ")
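Before plotting, it can help to check which features a given index triple selects. The sketch below assumes ordinary zero-based Python indexing into the features array, which is consistent with the plot3([0,1,3], ...) call used in Clustering.ipynb; note that the #1 to #10 comments in the array are one-based. The helper names_for is hypothetical.

```python
features = [
    "request_interval",
    "ua_change_rate",
    "html2image_ratio",
    "variance_request_interval",
    "payload_average",
    "error_rate",
    "request_depth",
    "request_depth_std",
    "session_length",
    "percentage_cons_requests",
]


def names_for(indexes, features):
    """Translate zero-based feature indexes into feature names."""
    return [features[i] for i in indexes]


axes = names_for([3, 2, 5], features)
print(axes)  # ['variance_request_interval', 'html2image_ratio', 'error_rate']
```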
The three basic metrics of the attacks:
- session length
- html/image ratio
- hit rate
Attack similarity is a very important measure. It gives you a quantitative measure of how close a selected attack is to previously processed attacks.
tools.calculate_distances(
id_incident = 29, # incident to explore
id_attack = 1, # attack to explore
id_incidents = [29,30,31,32,33,34,36,37,39,40,42], # incidents to compare with
features = [] # specify the features by name. Use all features if empty
)
The output is a list of previous attacks ordered by similarity or distance.
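The idea of ordering attacks by distance can be illustrated with a small stand-alone sketch. This document does not specify the metric used by tools.calculate_distances(), so the Euclidean distance between mean feature vectors used here is an assumption, and the helper names and sample vectors are made up.

```python
import math


def centroid(vectors):
    """Mean feature vector of a list of equally sized session vectors."""
    n = float(len(vectors))
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(target, others):
    """Return (distance, key) pairs for the other attacks, closest first."""
    center = centroid(target)
    return sorted((euclidean(center, centroid(vectors)), key)
                  for key, vectors in others.items())


# Keys are (incident ID, attack ID); the vectors are made-up sessions.
target_attack = [[0.0, 0.0], [2.0, 0.0]]
previous_attacks = {
    (30, 1): [[1.0, 0.0], [1.0, 2.0]],
    (31, 1): [[4.0, 4.0]],
}

ranking = rank_by_distance(target_attack, previous_attacks)
for distance, key in ranking:
    print(key, distance)
```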
The amount of common IPs with previously recorded attacks is another important metric. When a new attack shares a significant portion of IPs with another attack, it is a plausible sign that a single botnet is behind both attacks.
# common ips with other attacks
tools.calculate_common_ips(
incidents1 = [29,30], # incidents to explore
id_attack = 1, # attack to explore (use -1 for all attacks)
incidents2 = [36,37,39,40] # incidents to compare with
)
The output is a list of attacks, ordered by the portion of common IPs.
- The first number - "identical" - is the total number of common identical IPs
- The second number - % of attack - is the portion of identical IPs in the target attack
- The third number - % of incident IPs - is the portion of identical IPs in the incident botnet
Intersection with incidents:
[36, 37, 39, 40]
========================== Attack 1:
Num IPs in the attack 13857:
__________ Incident 36:
Num IPs in the incident 111:
# identical IPs: 134
% of attack IPs: 5.00%
% of incident IPs: 77.00%
__________ Incident 37:
Num IPs in the incident 2720:
# identical IPs: 4567
% of attack IPs: 12.00%
% of incident IPs: 7.00%
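The three numbers reported for each incident can be reproduced with set arithmetic. The sketch below is a simplified stand-in, not the code of tools.calculate_common_ips(); the helper name common_ip_stats and the sample addresses are made up.

```python
def common_ip_stats(attack_ips, incident_ips):
    """Return the count of identical IPs and its share of each side, in percent."""
    attack, incident = set(attack_ips), set(incident_ips)
    identical = attack & incident
    return {
        "identical": len(identical),
        "pct_of_attack": 100.0 * len(identical) / len(attack) if attack else 0.0,
        "pct_of_incident": 100.0 * len(identical) / len(incident) if incident else 0.0,
    }


attack_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
incident_ips = ["10.0.0.4", "10.0.0.9"]

stats = common_ip_stats(attack_ips, incident_ips)
print("# identical IPs: %d" % stats["identical"])
print("%% of attack IPs: %.2f%%" % stats["pct_of_attack"])
print("%% of incident IPs: %.2f%%" % stats["pct_of_incident"])
```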