/Zep-Task

Primary language: Jupyter Notebook

Predicting Data Exfiltration via DNS

Create a static machine learning model based on batch data. The dataset comes from top secret files obtained from our allies Ring Canada (RC) and the Cyber Threat Intelligence (CTI). It contains DNS traffic generated by exfiltrating various file types, ranging from small to large sizes.

The aim of the task is to implement a binary classifier that predicts data exfiltration via DNS.

Exploratory Data Analysis (EDA)

  • Loaded the file “static_dataset.csv”.
  • Checked the distribution of each feature and of the target variable using plots and statistical tools.
  • Checked the features for skewed distributions.
  • Checked whether the dataset is imbalanced.
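The skewness and class-balance checks above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper `summarize`, the column `query_length`, and the target name `label` are hypothetical, and a tiny synthetic frame stands in for “static_dataset.csv”.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str):
    """Return per-feature skewness and the class proportions of `target`."""
    numeric = df.drop(columns=[target]).select_dtypes(include="number")
    skewness = numeric.skew()  # values well above 0 indicate a long right tail
    balance = df[target].value_counts(normalize=True)  # share of each class
    return skewness, balance


# Synthetic stand-in; in the notebook this would be pd.read_csv("static_dataset.csv").
demo = pd.DataFrame({
    "query_length": [10, 12, 11, 13, 90],  # hypothetical feature name
    "label": [0, 0, 0, 0, 1],              # hypothetical target name
})
skewness, balance = summarize(demo, "label")
print(skewness)
print(balance)
```

A class share far from 0.5 in `balance` suggests the dataset is imbalanced, which in turn affects how the data should be split and which metrics are meaningful.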

Data cleaning

Analyzed the data in the .csv file (static_dataset.csv) and transformed the variables that contain string values so that all of them can be used in the model. Checked for missing values and categorical values.
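One possible way to carry out this cleaning step is sketched below. The encoding choice (integer category codes) and the median imputation are assumptions made for illustration, since the README does not say which transformation was used; the helper `clean` and the column names are hypothetical.

```python
import pandas as pd


def clean(df: pd.DataFrame):
    """Report missing values, encode non-numeric columns, impute numeric gaps."""
    out = df.copy()
    missing = out.isna().sum()  # missing-value count per column, before any fix
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: integer category codes; one-hot encoding is an alternative.
            out[col] = out[col].fillna("missing").astype("category").cat.codes
    # Assumption: fill remaining numeric gaps with the column median.
    out = out.fillna(out.median(numeric_only=True))
    return out, missing


# Synthetic stand-in for static_dataset.csv with hypothetical column names.
demo = pd.DataFrame({
    "longest_word": ["abc", None, "abc", "xyz"],
    "entropy": [2.1, 3.0, None, 2.5],
})
cleaned, missing = clean(demo)
print(missing)
print(cleaned)
```

Integer codes impose an arbitrary ordering on the categories, so for a linear model such as logistic regression one-hot encoding may be the safer choice when a string column has only a few distinct values.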

Feature engineering

Applied PCA dimensionality reduction to the dataset and found that the best number of components is 13.
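The README does not state how the number of components was chosen. A common approach, shown here only as a sketch, is to keep the smallest number of components whose cumulative explained variance reaches a threshold. The 95% threshold, the helper `components_for_variance`, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components reaching `threshold` variance."""
    X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index whose cumulative variance is >= threshold, as a 1-based count.
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(cumulative)), cumulative


# Synthetic data: 10 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

k, cumulative = components_for_variance(X)
print(k, cumulative[:k])
```

Once a value has been selected, `PCA(n_components=k)` can be fitted on the training split and applied to the test split, which avoids leaking test information into the projection.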

Model Training and Model evaluation

A logistic regression model is used for this binary classification problem. Split the data into training and test sets (the task asks for a suitable, justified split method), normalized the features, and trained the selected model.
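The split, normalization, and training steps might look like the sketch below. The stratified 80/20 split is an assumption (the README does not name the method); it is a reasonable choice when the classes are imbalanced because it keeps the class ratio the same in both sets. The data comes from scikit-learn's `make_classification`, not from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the cleaned DNS features (13 columns).
X, y = make_classification(
    n_samples=500, n_features=13, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Assumption: stratified hold-out split so both sets share the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

# The scaler sits inside the pipeline so it is fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)  # F1 is more informative than accuracy here
print(f"F1 on the held-out set: {f1:.3f}")
```

Keeping the scaler in the pipeline means the test set is normalized with statistics learned from the training set, which prevents the evaluation from being optimistically biased.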