Rob Osterburg, Galvanize Data Science Immersive Capstone
DNS maps domain names to IP addresses, eliminating the need for users to remember numeric addresses. This makes for happy users while also providing a perfect backchannel for attackers to communicate unhindered through firewalls.
Attackers use DNS to steal information assets and to run command and control for botnets. My project focuses on detecting malicious uses of the DNS protocol.
- Research methods and engineer features for detecting misuse of the DNS protocol.
- Select best features and evaluate models for predicting malicious use.
- Gain experience applying machine learning to cyber security.
Security Onion is a Linux distribution focused on network monitoring. It includes the Bro Network Security Monitor (BNSM), which I used to gather all data for this project. Monitoring my home network with BNSM, I captured roughly 40,000 DNS and 80,000 connection log entries over a period of two weeks.
After finding DNS tunneling “in the wild”, Eric Conrad shared the log files from three of these attacks on his blog. These log files contained roughly 6,500 DNS and 3,500 connection log entries. I used sampling with replacement to rebalance this minority class in my training datasets for each type of malware.
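The rebalancing step can be sketched as follows. This is a minimal illustration of sampling with replacement using only the Python standard library, not the project's actual code: the function name `oversample_minority`, the label strings, and the toy rows are all hypothetical.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, minority_label, seed=0):
    """Resample the minority class, with replacement, until it is as large
    as the rest of the data. Returns shuffled (rows, labels)."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y != minority_label]
    if not minority:
        raise ValueError("no rows carry the minority label")
    # Drawing with replacement lets the minority class grow past its
    # original size; individual rows will appear more than once.
    target = max(len(majority), len(minority))
    resampled = rng.choices(minority, k=target)
    combined = majority + [(r, minority_label) for r in resampled]
    rng.shuffle(combined)
    return [r for r, _ in combined], [y for _, y in combined]


# Toy data (made-up values): 8 benign entries and 2 tunneling entries.
rows = [{"query_length": n} for n in (12, 15, 9, 11, 14, 10, 13, 16, 120, 98)]
labels = ["benign"] * 8 + ["malicious"] * 2

balanced_rows, balanced_labels = oversample_minority(rows, labels, "malicious")
print(Counter(balanced_labels))
```

Resampling only the training split keeps duplicated attack rows out of the held-out data used for evaluation.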
- Producer consumer ratio (PCR) indicates the directionality of information transfer.
- Shannon entropy of the query string.
- URL reputation using VirusTotal RESTful API.
- Classify newly registered URLs using whois.
- Length of the query string and answer string.
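Three of the features above are simple enough to sketch directly. The code below is an illustration under stated assumptions rather than the project's implementation: it assumes PCR is computed from the originator and responder byte counts of a connection (called `orig_bytes` and `resp_bytes` here), that entropy is measured over the characters of the query string, and that a connection with no bytes gets a PCR of 0.0. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    # Sum of p * log2(1 / p) over each distinct character, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())


def producer_consumer_ratio(orig_bytes, resp_bytes):
    """PCR in [-1, 1]: +1 means the originator only sent data (producer),
    -1 means it only received data (consumer)."""
    total = orig_bytes + resp_bytes
    if total == 0:
        return 0.0  # assumed convention for connections that carried no bytes
    return (orig_bytes - resp_bytes) / total


def dns_features(query, answer):
    """Entropy and length features for one DNS query/answer pair."""
    return {
        "query_entropy": shannon_entropy(query),
        "query_length": len(query),
        "answer_length": len(answer),
    }


# Made-up examples: an ordinary lookup and a long, random-looking label
# of the kind a tunneling tool might produce.
print(dns_features("www.example.com", "93.184.216.34"))
print(dns_features("a9f3k2q8z7x1c5v0b4n6m.example.com", ""))
print(producer_consumer_ratio(orig_bytes=300, resp_bytes=100))
```

Long, high-entropy query strings and a PCR close to +1 on a host that normally consumes data are the kinds of signals these features are meant to expose.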
To select features, I first removed those that caused data leakage, including IP addresses and port numbers. Then I used sequential feature selection to identify a set of three features for each of the connection and DNS logs that maximized recall, tp / (tp + fn), since missing actual attacks represents the highest risk to the organization.
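Recall-driven sequential selection can be sketched in a few lines. The source does not say which library performed the selection, so this is a generic greedy forward search, not the project's code. The names `recall`, `forward_select`, and `toy_score` are hypothetical, and the toy weights stand in for a real scoring function that would train a model on the candidate features and return its cross-validated recall.

```python
def recall(y_true, y_pred):
    """Recall = tp / (tp + fn), with 1 marking the malicious class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def forward_select(features, score, k):
    """Greedy forward selection: start from an empty set and repeatedly add
    the feature that gives the highest score, until k features are chosen."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Stand-in scorer with made-up weights. A real scorer would fit a classifier
# on the given feature subset and return recall on held-out data.
toy_weights = {"pcr": 0.30, "query_entropy": 0.50, "query_length": 0.15, "duration": 0.05}


def toy_score(subset):
    return sum(toy_weights[f] for f in subset)


print(forward_select(list(toy_weights), toy_score, k=3))
print(recall([1, 1, 0, 1], [1, 0, 0, 1]))
```

Because leaky columns such as IP addresses and port numbers are dropped beforehand, they never enter the candidate list passed to the search.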
Boosting and random forest models performed best for classifying connections, and logistic regression performed best for DNS traffic. More feature engineering work is needed to improve the models' performance on the DNS log.
- Model the history of a connection using a Markov chain.
- Use information gain as a feature for classifying DNS queries and answers.
- Develop a virtual lab to generate more consistent datasets.
- Create an application to alert when attacks are detected.
- Conference Presentation Slides: PCR - A New Flow Metric, Producer Consumer Ratio by Carter Bullard and John Gerth
- Conference Talk Video: Security Onion 2016: C2 Phone Home - Eric Conrad
- Conference Talk Video: Machine learning applied to Bro by Chris McCubbin
- Conference Talk Video: Data Analysis, Machine Learning, Bro, and You! by Brian Wylie
- Documentation: BNSM Domain Name Log Data Elements
- Documentation: BNSM Connection Log Data Elements