Tovio - Galvanize Data Science Analytics capstone
- What do you do with highly correlated and partially Y-derived data?
- Why might it be worthwhile to make categorical predictions on derived features?
- Explored the kdd99 data
- Tested significant differences in ratios of attack categories
- Downsampled majority class
- Applied a logistic regression model to single features and recorded False Negative and Accuracy rates
- The data come from a machine learning competition held in 1999. See http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
- The task was to create a predictive model able to categorize a wide variety of bad connections or attacks into four main categories.
- Additionally, malicious connections should be categorized separately from "normal" connections.
- leveraged bash to pull arbitrary numbers of random samples from the data
- (>700000 rows)
- quick visualization and sanity checks
- abstracted functions into relevant classes
- reusability
- not trapped in jupyter
- Tried out a lot of things, abandoned almost everything
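The sampling step above was done in bash. As a rough Python counterpart, here is a minimal reservoir-sampling sketch that draws a fixed number of lines from a file too large to load at once. The function name `sample_lines` is an illustrative assumption and not part of the project's code.

```python
import random
from typing import Iterable, List, Optional


def sample_lines(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return up to k lines chosen uniformly at random from an iterable.

    Uses reservoir sampling (Algorithm R), so only k lines are held in
    memory no matter how long the input is.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir
```

A file handle can be passed directly, e.g. `sample_lines(open(path), 700_000, seed=42)`, where `path` stands for wherever the raw data file lives locally.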
- The data were generated artificially and shared incrementally as TCP dumps, totalling roughly 4GB.
- Many of the 42 features were derived and interrelated with the target
- Even when restricted to features with absolute correlation below 0.60, a basic logistic regression model predicts the training data with suspiciously high ("too high") accuracy.
- no null values
- mix of numeric and categorical types
- smurf attacks (DOS) comprise 57% of the rows
- many types of values for
- service
- flag
icmp 405033
tcp 267370
udp 27597
ICMP
- (from Wikipedia): is used by network devices, including routers, to send error messages and operational information indicating, for example, that a requested service is not available or that a host or router could not be reached.
- Keep in mind that many attack types only happen over certain protocols.
At 700000 samples:
land has only zero values
num_outbound_cmds has only zero values
is_host_login has only zero values
{('count', 'dst_host_same_src_port_rate'): 0.862
('dst_host_rerror_rate', 'dst_host_srv_rerror_rate'): 0.987
('dst_host_rerror_rate', 'srv_rerror_rate'): 0.985
('dst_host_same_src_port_rate', 'dst_host_same_srv_rate'): 0.676
...
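The two checks reported above (columns that never vary, and feature pairs with a high correlation coefficient) can be sketched in plain Python as follows. This is an illustration on toy columns, not the project's own code: the function names and the `columns` dict-of-lists layout are assumptions, and the 0.60 threshold is taken from the write-up.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def constant_columns(columns: Dict[str, List[float]]) -> List[str]:
    """Names of columns that hold a single repeated value."""
    return [name for name, values in columns.items() if len(set(values)) <= 1]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(
    columns: Dict[str, List[float]], threshold: float = 0.60
) -> Dict[Tuple[str, str], float]:
    """Feature pairs whose absolute correlation exceeds the threshold."""
    # Constant columns have zero variance, so correlation is undefined for them.
    skip = set(constant_columns(columns))
    names = sorted(name for name in columns if name not in skip)
    pairs: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(names, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs[(a, b)] = round(r, 3)
    return pairs


# Toy columns only, to show the shape of the output.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1], "z": [0, 0, 0, 0]}
print(constant_columns(toy))
print(correlated_pairs(toy))
```

On real data the same result is usually obtained from a dataframe's correlation matrix; the explicit loop is used here only to make the computation visible.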
- recall that `normal.` is not an attack
- we will get counts of normal and everything else
From the graphics above, it seems likely that ICMP garners a higher proportion of attacks than TCP or UDP. We can perform a chi-squared test for proportions to check if there is a statistical difference.
- ALPHA = 0.05
- df = 1
- total sample size: 700000
Number of icmp Connections: 405033
Number of Attacks: 403238
Number Normal: 1795
Cat Count
2 401455
0 1795
1 1783
Number of tcp Connections: 267832
Number of Attacks: 157741
Number Normal: 110091
Cat Count
2 153805
0 110091
1 3762
4 166
3 8
Number of udp Connections: 27850
Number of Attacks: 413
Number Normal: 27437
Cat Count
0 27437
1 279
2 134
- H0: the proportion of connections that are attacks is the same for ICMP and TCP
- HA: the proportion of connections that are attacks differs between ICMP and TCP
- ALPHA = 0.05
- df = 1
import numpy as np
from scipy.stats import chi2_contingency
obs = np.array([[icmp_obs_attacks, icmp_obs_normal],  # ICMP row: attacks, normal
                [tcp_obs_attacks, tcp_obs_normal]])   # TCP row: attacks, normal
chi2, p, dof, expected = chi2_contingency(obs)
- The null hypothesis of independence is rejected if chi-squared is large enough, because this means that observed frequencies and expected frequencies are far apart.
Reject H0, there is a significant difference
in attack/connection proportion between ICMP and TCP
chi-square test statistic: 192072.97
p-value : 0.00
deg of freedom: 1.00
- H0: the proportion of connections that are attacks is the same for TCP and UDP
- HA: the proportion of connections that are attacks differs between TCP and UDP
- ALPHA = 0.05
- df = 1
Reject H0, there is a significant difference in attack/connection
proportion between TCP and UDP
chi-square test statistic: 33421.43
p-value : 0.00
deg of freedom: 1.00
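To make the "observed vs. expected" reasoning concrete, here is a minimal standard-library sketch that computes the Pearson chi-squared statistic for a 2x2 table by hand, using the per-protocol counts listed earlier. The function name `chi2_2x2` is an illustrative assumption. No continuity correction is applied here, whereas `chi2_contingency` applies Yates' correction to 2x2 tables by default, so the statistic should come out slightly larger than the values reported above.

```python
from math import erfc, sqrt
from typing import Tuple


def chi2_2x2(a_attack: int, a_normal: int, b_attack: int, b_normal: int) -> Tuple[float, float]:
    """Pearson chi-squared statistic and p-value (df = 1) for a 2x2 table."""
    observed = [[a_attack, a_normal], [b_attack, b_normal]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if protocol and attack status were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For df = 1 the chi-squared survival function reduces to erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


ALPHA = 0.05
# (attacks, normal) counts copied from the per-protocol summaries.
icmp = (403238, 1795)
tcp = (157741, 110091)
udp = (413, 27437)

for label, first, second in [("ICMP vs TCP", icmp, tcp), ("TCP vs UDP", tcp, udp)]:
    chi2, p = chi2_2x2(*first, *second)
    verdict = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{label}: chi2 = {chi2:.2f}, p = {p:.3g}, {verdict}")
```

With samples this large, even small differences in proportion produce tiny p-values, so the size of the difference matters more than the fact that it is significant.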
- Create filters of services
- Test if proportions of services:attack types are significantly different
- Use top 10 count of services, so 10-row chi test
- Since denial of service floods servers with response requests, it was the majority category of attack type.
- Used only a logistic regression
- Built a reusable template that can take other models
- Iterated through and gathered Error Type metrics on selected list of features
- Normal: not an attack
- Probe: surveillance and other probing, e.g., port scanning
- DOS: denial-of-service, e.g. syn flood
- U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks
- R2L: unauthorized access from a remote machine, e.g., guessing password
- (some descriptions of the problem have 5 categories)
- 0 - Normal: 139294 categorized
- 1 - Probe: 5833 categorized
- 2 - DOS: 554699 categorized
- 3 - U2R: 8 categorized
- 4 - R2L: 166 categorized
- split target Y into 1 for DOS (Denial of Service) and 0 for everything else
- wrote methods to auto-drop any features with >.60 correlation coefficients
- dropped rows of the majority class to get 50/50 spread of Y values
- populate pandas dataframe with confusion matrix values
- output metrics on each feature
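The target split and the 50/50 rebalancing described above can be sketched as follows. The project worked on pandas dataframes; this version uses only the standard library so it runs anywhere, and the names `binarize` and `downsample` are illustrative assumptions, not the project's methods.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

DOS = 2  # category code used for denial-of-service in this write-up


def binarize(categories: Sequence[int], positive: int = DOS) -> List[int]:
    """Map the positive category to 1 and every other category to 0."""
    return [1 if c == positive else 0 for c in categories]


def downsample(
    rows: Sequence, y: Sequence[int], seed: Optional[int] = None
) -> Tuple[list, List[int]]:
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_keep = min(len(idxs) for idxs in by_class.values())
    keep: List[int] = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_keep))
    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [y[i] for i in keep]


# Toy labels only, to show the effect; not the project's data.
toy_categories = [2] * 8 + [0] * 3 + [1] * 1
y = binarize(toy_categories)
_, y_balanced = downsample(list(range(len(y))), y, seed=0)
print(Counter(y), "->", Counter(y_balanced))
```

Downsampling discards information from the majority class, which is acceptable here because the majority class is very large; class weights would be an alternative when data are scarce.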
- network service on the destination, e.g., http, telnet, etc.
#--------------------------------------------#
Running classifier on ['service']
#--------------------------------------------#
Modified Y to balance 1s and 0s
1 145236
0 145236
Name: attack_category, dtype: int64
True Positives: 28989
True Negatives: 26437
False Positives: 2551
False Negatives: 118
Accuracy: 0.954
Classification_error: 0.045
Recall: 0.995
Precision: 0.919
False Negative Rate: 0.004
confusion matrix
[[26437 2551]
[ 118 28989]]
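As a check on how the reported rates follow from the four counts, here is a small sketch that recomputes them from the confusion matrix above. The function name `error_metrics` is an illustrative assumption; the matrix is read as `[[TN, FP], [FN, TP]]`, which matches the counts listed with it.

```python
from typing import Dict


def error_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Standard binary-classification rates from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "classification_error": (fp + fn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "false_negative_rate": fn / (fn + tp),
    }


# Counts from the `service` run.
metrics = error_metrics(tn=26437, fp=2551, fn=118, tp=28989)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall and the false negative rate always sum to 1, so only one of the two needs tracking. The figures in the run output appear to be truncated to three decimals rather than rounded, which would explain small differences in the last digit.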
- Currently not very useful or meaningful
- should really use combinations of at least 2 features
- How well can derived features be discerned in real time?
- How few computed features do you need to make a good guess
- What is the computation cost on the derivation of any given feature in real time
- What is the cost of false negatives?
- If your home router is compromised
- Equihacks?
- Attempted to apply lasso, with little success
- Instead incorporated `l1` into `LogisticRegression(penalty='l1')`
- Built a pipeline, scrapped it in favor of utilities that call each other
- Tried crossval, but decided to get a working model first, ran out of time
- Modeling
- Run the model on the actual test data
- Train models on the other attack categories
- pass in kNN model maybe?
- Run Combinations on the features
- Real World
- Cost matrix for the theoretical ramifications of False Negatives
- Incorporate data related to computational cost for features derived in real time
- Work with current data set of attacks
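One way the cost-matrix idea above might look in practice is to weight each error type before averaging, so that a missed attack counts for more than a false alarm. The sketch below is illustrative only: the function name `expected_cost` and both cost values are placeholder assumptions, not figures from the project.

```python
def expected_cost(
    tn: int, fp: int, fn: int, tp: int, cost_fp: float = 1.0, cost_fn: float = 10.0
) -> float:
    """Average misclassification cost per connection.

    Correct predictions cost nothing; cost_fp and cost_fn are placeholder
    weights for false positives and false negatives.
    """
    total = tn + fp + fn + tp
    return (fp * cost_fp + fn * cost_fn) / total


# Counts from the `service` run, with hypothetical costs.
print(expected_cost(tn=26437, fp=2551, fn=118, tp=28989))
print(expected_cost(tn=26437, fp=2551, fn=118, tp=28989, cost_fn=100.0))
```

Comparing models on this number instead of accuracy makes the trade-off explicit: a model with slightly lower accuracy can still be preferable if it misses fewer attacks.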
feature name | description | type |
---|---|---|
duration | length (number of seconds) of the connection | continuous |
protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete |
service | network service on the destination, e.g., http, telnet, etc. | discrete |
src_bytes | number of data bytes from source to destination | continuous |
dst_bytes | number of data bytes from destination to source | continuous |
flag | normal or error status of the connection | discrete |
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
wrong_fragment | number of "wrong" fragments | continuous |
urgent | number of urgent packets | continuous |
feature name | description | type |
---|---|---|
hot | number of "hot" indicators | continuous |
num_failed_logins | number of failed login attempts | continuous |
logged_in | 1 if successfully logged in; 0 otherwise | discrete |
num_compromised | number of "compromised" conditions | continuous |
root_shell | 1 if root shell is obtained; 0 otherwise | discrete |
su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete |
num_root | number of "root" accesses | continuous |
num_file_creations | number of file creation operations | continuous |
num_shells | number of shell prompts | continuous |
num_access_files | number of operations on access control files | continuous |
num_outbound_cmds | number of outbound commands in an ftp session | continuous |
is_host_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete |
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete |
for more info: http://kdd.ics.uci.edu/databases/kddcup99/task.html