TrendMicro Malware Detection Competition Top 3% Solution

T_Brain_Malware_Detection_Competition

This competition was held by TrendMicro, a well-known antivirus software company, on the data science competition platform T-Brain. T-Brain was established this year and aims to create a competitive data science environment in Taiwan. It plans to launch a series of competitions in the coming days. If you are interested, take a look at this promising platform!

Problem Statement

Malware detection is a crucial issue in cyber security. Traditionally, malware is detected by matching file-specific signatures, which costs a lot of time and computing resources. TrendMicro therefore wants to use machine learning to detect malware in time and at lower cost.
Given three months of query logs, build a predictive model that detects malware in a file-agnostic setting, that is, from the query log alone rather than from the files themselves.

Dataset

  1. Query Log
    The query log contains 83,273,110 records covering 81,977 unique FileIDs, with the columns FileID, CustomerID, ProductID and querytime.
  2. Train
    The training dataset contains 52,559 unique FileIDs and their malware_or_not (target value) tags.
  3. Test
    The testing dataset contains 29,418 unique FileIDs.

Timeline

Starts at: Jan 22 2018
Closed on: Mar 23 2018

Measure

AUC (Area Under ROC Curve)
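
For reference, AUC can be read as the probability that a randomly chosen malware file receives a higher score than a randomly chosen benign one. Below is a minimal pure-Python sketch of that definition. The function name `auc` is only illustrative, and the pairwise loop is quadratic, so on real data a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```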

Method

I generated aggregated per-file features from the query log, creating almost 300 variables in total.

Feature Engineering

  1. Frequency-based features:
    First, I counted frequencies by FileID and each of the other categorical variables. Then I calculated aggregate features such as mean, variance, max, min, etc. on the frequency counts per FileID. For example, "Groupby(['FileID','CustomerID']).Count()" gives a file's customer usage distribution; taking the mean of those counts gives the FileID's mean customer usage frequency.
  2. Time series features:
    I calculated aggregate features such as mean, variance, max, min, etc. on QueryTime, grouped by FileID and the other categorical variables.
  3. Time difference features:
    I computed the time difference between each file's successive usages, then calculated aggregate features such as mean, variance, max, min, etc. on time_difference, grouped by FileID and the other categorical variables.
  4. Average response features:
    I generated each user's average response rate within each cross-validation fold and found it to be a key feature.
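
The frequency and time difference features (items 1 and 3) can be sketched in pandas roughly as follows. This is an illustration under assumptions, not the original code: the function names and output column prefixes are made up here, the toy log is invented, and `QueryTime` is assumed to be a numeric timestamp in seconds.

```python
import pandas as pd


def frequency_features(log: pd.DataFrame, by: str) -> pd.DataFrame:
    """Count queries per (FileID, `by`) pair, then summarise the counts per FileID."""
    counts = log.groupby(["FileID", by]).size().rename("cnt").reset_index()
    feats = counts.groupby("FileID")["cnt"].agg(["mean", "var", "max", "min"])
    return feats.add_prefix(f"{by}_freq_")


def time_diff_features(log: pd.DataFrame) -> pd.DataFrame:
    """Summarise the gaps between successive queries of the same file."""
    ordered = log.sort_values(["FileID", "QueryTime"])
    # diff() leaves NaN on each file's first query; the aggregations skip NaN.
    gaps = ordered.groupby("FileID")["QueryTime"].diff()
    feats = gaps.groupby(ordered["FileID"]).agg(["mean", "var", "max", "min"])
    return feats.add_prefix("timediff_")


# Toy query log (hypothetical values).
log = pd.DataFrame({
    "FileID": ["a", "a", "a", "b", "b"],
    "CustomerID": ["c1", "c1", "c2", "c1", "c3"],
    "QueryTime": [0, 10, 40, 5, 105],
})
freq = frequency_features(log, "CustomerID")
diffs = time_diff_features(log)
features = freq.join(diffs)  # one row per FileID
```

Calling `frequency_features` once per categorical column (for example `"ProductID"`) and joining the results is one way to build up a wide per-file feature table.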

Modeling

I used a stacking model to make the final prediction. In layer 1, XGBoost, LightGBM and random forest generate meta features on the train and test data. For the train data, I used 3-fold cross-validation to generate the features fold by fold; for the test data, I averaged the features from the 3 folds. In layer 2, a logistic regression model takes the meta features and classifies the test data.
I used 3-fold cross-validation with grid search to train the four models. The hyper-parameters I tuned are as follows:

  1. XGB: n_iteration, max_depth, learning rate.
  2. LGB: n_iteration, max_depth, learning rate.
  3. Random Forest: n_iteration, max_depth, min_samples_leaf.
  4. Logistic Regression: C, penalty.
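
The two-layer stacking scheme described above can be sketched as follows. This is a minimal illustration, not the competition code. To keep it dependent on scikit-learn only, `GradientBoostingClassifier` stands in for XGBoost and LightGBM. The data is synthetic, the helper name `stack_meta_features` is hypothetical, and the hyper-parameter values are arbitrary rather than the tuned ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def stack_meta_features(models, X_train, y_train, X_test, n_splits=3, seed=0):
    """Return out-of-fold meta features for train and fold-averaged ones for test."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    meta_train = np.zeros((X_train.shape[0], len(models)))
    meta_test = np.zeros((X_test.shape[0], len(models)))
    for j, make_model in enumerate(models):
        for fit_idx, hold_idx in folds.split(X_train, y_train):
            model = make_model()
            model.fit(X_train[fit_idx], y_train[fit_idx])
            # Train rows are only scored by a model that did not see them.
            meta_train[hold_idx, j] = model.predict_proba(X_train[hold_idx])[:, 1]
            # Test rows are scored by every fold model and averaged.
            meta_test[:, j] += model.predict_proba(X_test)[:, 1] / n_splits
    return meta_train, meta_test


# Synthetic stand-in for the aggregated per-file features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

# Layer 1: factories, so that each fold gets a freshly initialised model.
layer1 = [
    lambda: GradientBoostingClassifier(n_estimators=50, random_state=0),
    lambda: RandomForestClassifier(n_estimators=50, random_state=0),
]
meta_train, meta_test = stack_meta_features(layer1, X_train, y_train, X_test)

# Layer 2: logistic regression on the meta features.
layer2 = LogisticRegression().fit(meta_train, y_train)
test_scores = layer2.predict_proba(meta_test)[:, 1]
```

Any classifier that exposes `fit` and `predict_proba`, including the scikit-learn wrappers `xgboost.XGBClassifier` and `lightgbm.LGBMClassifier`, could be added to the `layer1` list in the same way.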

Result

My team name is BigPikachu. I placed 4th on the public leaderboard (AUC = 0.962997) and 7th on the private leaderboard (AUC = 0.967284).

Improvement

  1. More detailed variables:
    I should put more effort into EDA to understand the data trends in each dimension. I could then generate features such as "the number of times a FileID's time difference exceeds 300 seconds" to capture important trends.
  2. Matrix factorization method:
    Following other participants, I could try ALS and FFT to generate key features, which would reduce dimensionality while capturing File-Customer, File-Time or File-Product relationships.
  3. Dimension reduction:
    Although my score was higher on the private leaderboard, it improved less than other teams' scores did. I think that is because my model fit the training data too closely, which indicates slight over-fitting. Although I tried an autoencoder for dimension reduction, my local CV score did not improve. I think I should train the autoencoder for more iterations (more than the 500 I used) so that it converges.
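
The threshold feature suggested in item 1 could be computed along these lines. This is a sketch under assumptions: the function name `count_long_gaps` and the toy log are invented for illustration, and `QueryTime` is assumed to be a numeric timestamp in seconds.

```python
import pandas as pd


def count_long_gaps(log: pd.DataFrame, threshold: int = 300) -> pd.Series:
    """Per FileID, count gaps between successive queries longer than `threshold`."""
    ordered = log.sort_values(["FileID", "QueryTime"])
    gaps = ordered.groupby("FileID")["QueryTime"].diff()
    # The NaN on each file's first query compares as False, so it is not counted.
    long_gap = gaps > threshold
    return long_gap.groupby(ordered["FileID"]).sum().rename(f"gaps_over_{threshold}")


# Toy query log (hypothetical values).
log = pd.DataFrame({
    "FileID": ["a", "a", "a", "a", "b", "b", "c"],
    "QueryTime": [0, 100, 500, 1000, 0, 50, 7],
})
long_gap_counts = count_long_gaps(log)
```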

Reference

  1. T-Brain: Malware Detection: https://tbrain.trendmicro.com.tw/Competitions/Details/1
  2. Autoencoder: https://www.kaggle.com/deepspacelearning/autoencoder-for-dimensionality-reduction
  3. Stacking Model Introduction: http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/