Insider Risk Detection in PySpark

Introduction

This repo explores anomaly detection for insider risk using Kernel Density Estimation (KDE), MinHash, and K-Means. The implementation is built on PySpark 3.1.1 and runs in Google Colab.

KDE provides probability-based risk estimation for numerical features, while MinHash and K-Means are used to detect anomalous email content.
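To make the KDE idea concrete, here is a minimal pure-Python sketch, not the code from KDE_risk.ipynb. It fits a Gaussian KDE to past values of one numerical feature and scores a new value by its negative log-density, so rarer values get higher risk. The function names, the Gaussian kernel, the rule-of-thumb bandwidth, and the sample numbers are all assumptions made for illustration.

```python
import math


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth: 1.06 * std * n^(-1/5) (an assumed default)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return 1.06 * math.sqrt(var) * n ** (-0.2)


def gaussian_kde_pdf(samples, x, bandwidth):
    """Estimate the density at x as an average of Gaussian bumps on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def risk_score(samples, x, bandwidth=None):
    """Negative log-density: the less likely x is under the KDE, the higher the score."""
    if bandwidth is None:
        bandwidth = rule_of_thumb_bandwidth(samples)
    density = gaussian_kde_pdf(samples, x, bandwidth)
    # A tiny floor avoids log(0) when x is extremely far from every sample.
    return -math.log(density + 1e-300)


# Hypothetical history: number of files a user copied per day.
history = [10, 12, 11, 9, 10, 11, 13, 10]
typical = risk_score(history, 10.5)
unusual = risk_score(history, 40)
```

Here `unusual` comes out larger than `typical`, since 40 lies far from anything in the history. In a Spark setting the same scoring could be applied per user and per feature, but how the notebook organizes that is not described here.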

Dataset

The Insider Threat Test Dataset, provided by the CERT Division, is a collection of synthetic insider threat test datasets that contain both background and malicious-actor data. It covers 1000 users over a period of 17 months.

For more background on this data, please see the paper, Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data.

Usage

Please download the dataset from CMU KiltHub and unzip it. Then put the CSV files into the folder ./data/.
For the KDE-based method, please open KDE_risk.ipynb and follow the instructions inside.
For the MinHash and K-Means based method, please open Kmeans_email.ipynb and follow the instructions inside.
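
For readers new to MinHash, the following pure-Python sketch shows the core idea before opening Kmeans_email.ipynb. It is an illustration under assumptions, not the notebook's code: the word-shingle size, the MD5-based seeded hash, the signature length of 64, and the sample emails are all hypothetical choices. Each text becomes a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical email bodies.
email_a = "please review the attached quarterly budget report today"
email_b = "please review the attached quarterly budget report tomorrow"
email_c = "lunch menu for friday includes pizza and salad"

sig_a = minhash_signature(shingles(email_a))
sig_b = minhash_signature(shingles(email_b))
sig_c = minhash_signature(shingles(email_c))

near_duplicate = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
```

The near-duplicate pair scores well above the unrelated pair. Signatures like these can then be clustered, and emails far from every cluster center treated as candidates for review. PySpark provides `pyspark.ml.feature.MinHashLSH` and `pyspark.ml.clustering.KMeans` for this at scale; whether the notebook uses exactly those classes is an assumption.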

Others

Because of Colab's limitations, we cannot call the customized Spark backend. As a result, the notebook email_IF.ipynb, which attempts to apply the Isolation Forest algorithm, does not work yet.

If you have any ideas, please let me know in Issues. Thank you!