Feature extraction for CERT insider threat test dataset

Primary language: Python. License: MIT.

This script extracts features (in CSV format) from the CERT insider threat test dataset [1], [2], versions 4.1 to 6.2. For more details, see the paper "Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning".

[1] Lindauer, Brian (2020): Insider Threat Test Dataset. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/12841247.v1

[2] J. Glasser and B. Lindauer, "Bridging the Gap: A Pragmatic Approach to Generating Insider Threat Data," 2013 IEEE Security and Privacy Workshops, San Francisco, CA, 2013, pp. 98-104, doi: 10.1109/SPW.2013.37.

Run the script

  • Requires Python 3, numpy, pandas, and joblib. The script was written and tested on Linux only.
  • By default, the script extracts week, day, session, and sub-session data (as in the paper).
  • To run the script, place it in the folder of a CERT dataset release (e.g., r4.2, decompressed from the downloaded r4.2.tar.bz2), then run python3 featureExtraction.py
  • To change the number of cores used for parallelization (default 8), run python3 featureExtraction.py numberOfCores, e.g., python3 featureExtraction.py 16.
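The optional core-count argument described above can be pictured with the following minimal sketch. It is not the code of featureExtraction.py: the names parse_num_cores, process_week, and run are hypothetical, and the per-week work is a placeholder. It only illustrates how a command-line value with a default of 8 might be passed to joblib.

```python
import sys
from joblib import Parallel, delayed

DEFAULT_CORES = 8  # default stated in this README


def parse_num_cores(argv):
    """Return the core count from the first CLI argument, or the default."""
    if len(argv) > 1:
        return int(argv[1])
    return DEFAULT_CORES


def process_week(week):
    # Hypothetical placeholder for the per-week extraction work.
    return week * 2


def run(argv):
    n_jobs = parse_num_cores(argv)
    # joblib distributes the calls over n_jobs workers and keeps input order.
    return Parallel(n_jobs=n_jobs)(delayed(process_week)(w) for w in range(4))


if __name__ == "__main__":
    print(run(sys.argv))
```

Under these assumptions, calling the sketch with no argument uses 8 workers, and passing 16 as the first argument uses 16.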

Extracted Data

Extracted data is stored in the ExtractedData subfolder.

In the extracted data, insider is the label indicating the insider threat scenario (0 is normal). Some extracted columns (subs_ind, starttime, endtime, sessionid, user, day, week) are informational; whether to use them as inputs when training machine learning models is left to the user.
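As an illustration of the column roles described above, here is a minimal sketch that separates the informational columns and the insider label from the remaining features. The helper split_features and the feature names n_logon and n_email are hypothetical, and the small frame is made-up data standing in for a real extracted CSV, which could be loaded with pandas.read_csv.

```python
import pandas as pd

# Informational columns and label name as listed in this README.
INFO_COLUMNS = ["subs_ind", "starttime", "endtime", "sessionid", "user", "day", "week"]
LABEL_COLUMN = "insider"


def split_features(df):
    """Split a frame into features, scenario labels, and binary labels."""
    # Only drop informational columns that exist at this granularity level.
    info_present = [c for c in INFO_COLUMNS if c in df.columns]
    X = df.drop(columns=info_present + [LABEL_COLUMN])
    y_scenario = df[LABEL_COLUMN]
    # 0 is normal; any other value is an insider threat scenario.
    y_binary = (y_scenario != 0).astype(int)
    return X, y_scenario, y_binary


# Made-up rows; n_logon and n_email are hypothetical feature names.
demo = pd.DataFrame({
    "user": ["u1", "u2", "u3"],
    "week": [0, 0, 1],
    "n_logon": [5, 7, 2],
    "n_email": [10, 3, 8],
    "insider": [0, 2, 0],
})

X, y_scenario, y_binary = split_features(demo)
print(list(X.columns))    # ['n_logon', 'n_email']
print(y_binary.tolist())  # [0, 1, 0]
```

Keeping the scenario label alongside a binary label allows either multi-class or normal-versus-malicious training; which one suits a given experiment is a choice this sketch does not make.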

Pre-extracted data from CERT insider threat test dataset r5.2 (gzipped) is also available.

Citation

If you use the source code or the extracted datasets, please cite the following paper:

D. C. Le, N. Zincir-Heywood and M. I. Heywood, "Analyzing Data Granularity Levels for Insider Threat Detection Using Machine Learning," in IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 30-44, March 2020, doi: 10.1109/TNSM.2020.2967721.