/DSCI-550-Assignment-1

📧 Analysis of Cyber Phishing Emails: Fraudulent Emails and Social Engineering.

Primary language: Jupyter Notebook. License: MIT.

Project Report can be seen here.

Prerequisite

  1. The Python virtual environment is managed with pipenv, so you need pipenv installed (learn more about installation and usage).
  2. We have already converted the data into JSON files using Tika, but you may want to do it yourself. To learn more, see the notes below and the Tika documentation.
  3. There are several other packages and tools you may want to use along the way; see the instructions for this assignment.

Usage

  1. First and foremost, build the pipenv environment by running the pipenv install command in this working directory. We use Jupyter notebooks for all of our code, so you may want to install ipykernel as well. To do so:

    $ pipenv shell # this takes you into the virtual environment
    $ python -m ipykernel install --user --name=<my-virtualenv-name> # change the kernel name as you see fit
    $ jupyter lab # run a jupyterlab instance on your localhost
  2. [Task 4] Download the fraudulent emails dataset from Kaggle and put it in the data directory.

    Convert the data to JSON with Tika:

    $ java -jar tika-app-2.0.0-ALPHA.jar -J -t -r data/fradulent_emails.txt > data/fradulent_emails_t.json

Explanation (learn more):

  • -J produces recursive JSON.
    • [doc] -J or --jsonRecursive Output metadata and content from all embedded files (choose content type with -x, -h, -t or -m; default is -x)
  • -t outputs plain text content.
    • [doc] -t or --text Output plain text content
  • -r pretty-prints the output.

We generated output with each of the content-type flags, but we mainly used the -t output.
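
Once the JSON file exists, it can be read back in Python. The following is a minimal sketch, not code from the notebooks: it assumes that Tika's recursive JSON output is a list of metadata objects and that the extracted text sits under the `X-TIKA:content` key. The function names are hypothetical.

```python
import json
from pathlib import Path

# Key under which Tika's recursive JSON output stores extracted text
# (an assumption; inspect a record from your own output to confirm).
CONTENT_KEY = "X-TIKA:content"


def load_tika_records(path):
    """Load Tika -J output and always return a list of metadata dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # -J normally yields a list; wrap a lone object so callers
    # can iterate in either case.
    return data if isinstance(data, list) else [data]


def extract_contents(records):
    """Return the stripped text of every record that has any content."""
    return [rec[CONTENT_KEY].strip() for rec in records if rec.get(CONTENT_KEY)]
```

For example, `extract_contents(load_tika_records("data/fradulent_emails_t.json"))` would return the text extracted by the command above, provided the file was generated first.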

  1. [Task 5] Jupyter notebooks in Task 5

    Run each cell in the notebooks. Each notebook either generates a new feature JSON file or uploads features to Firebase, where our team stores the data. As long as you use the virtual environment kernel set up in step 1 of Usage, you should have all the packages you need.

  2. [Task 6] Jupyter notebooks in Task 6

    Run each cell in the three notebooks; each notebook handles one dataset. We used Firebase to store our data, but to accommodate the grader we also provide a local version created with a JSON dump.

  3. [Task 7] The exported PDF files are in the visualization directory. We offer circle packing and dynamic circle packing cluster visualizations.

    We have also saved the circle.json and cluster.json files for each similarity metric.

    To re-run the visualization:

    Sample visualizations (edit-distance, dynamic circle packing): edit-distance-viz

    Sample visualizations (cosine, circle packing): cosine-viz

  4. [Task 8] TSV generation: Jupyter notebook [here](notebooks/Task7-TikaSimilarity/TSV generation & data for tika-smilarity.ipynb)

    The output is in the data directory.
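
The local JSON copy mentioned under Task 6 can be illustrated with a short round trip. This is a minimal sketch under assumptions: the helper names and the shape of the `features` mapping are hypothetical, and the notebooks may organize their dumps differently.

```python
import json
from pathlib import Path


def dump_local(features, path):
    """Write a features mapping to a local JSON file and return its path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII email text readable on disk.
        json.dump(features, handle, ensure_ascii=False, indent=2)
    return target


def load_local(path):
    """Read a JSON file written by dump_local back into Python objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

Keeping the same mapping in a local file means the notebooks can be re-run without network access to Firebase.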

How to Access Additional Data

Firebase URL: https://copydsci550.firebaseio.com/

We stored additional data in Firebase. There is a local backup here. If you want to access the data through the REST API, you can use curl:

$ curl '<firebase-URL>.json'
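
The same request can be made from Python with the standard library. This is a minimal sketch: the helper names are hypothetical, any child path you pass is an assumption about how the data is laid out, and the request only succeeds if the database still allows public reads.

```python
import json
from urllib.request import urlopen


def firebase_json_url(base_url, path=""):
    """Build a REST URL by appending .json to a database path."""
    return f"{base_url.rstrip('/')}/{path.strip('/')}.json"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(firebase_json_url("https://copydsci550.firebaseio.com/"))` requests the root of the database, which is the equivalent of the curl command above.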

Notes

  1. Python virtual environment has been set up using pipenv. You need pipenv installed (learn more). Then run:

    $ pipenv install

    pipenv will install all Python packages in the virtual environment. In the future, use

    $ pipenv install <wanted-package>

    to install a Python package; pipenv records it in the Pipfile, which keeps track of the packages used in the project.

  2. fradulent_emails.txt has been made read-only. To modify the data, run this command in the data directory:

    $ new_file_name="<your-new-file-name>" bash -c 'cp fradulent_emails.txt "${new_file_name}" && chmod 0644 "${new_file_name}"'

    The command will make a copy of the data that can be read and written.
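
If you prefer to do this from a notebook, the same copy can be sketched in Python. This is a minimal sketch assuming a POSIX file system; the function name is hypothetical.

```python
import os
import shutil


def make_writable_copy(src, dst):
    """Copy src to dst and give the copy 0644 permissions."""
    # copyfile copies contents only, so the read-only bits of the
    # original are not carried over to the new file.
    shutil.copyfile(src, dst)
    # 0o644: owner can read and write, everyone else can read.
    os.chmod(dst, 0o644)
    return dst
```

Calling `make_writable_copy("fradulent_emails.txt", "<your-new-file-name>")` from the data directory mirrors the shell command above.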

FAQ & Pull Requests

Please feel free to fork the repo and open a pull request. If you encounter any problems, feel free to email me.

About

This is Assignment 1 for DSCI 550, Spring 2021, at the USC Viterbi School of Engineering. The repo is a collaboration among a group of six.

Team members: Zixi Jiang, Peizhen Li, Xiaoyu Wang, Xiuwen Zhang, Yuchen Zhang, Nat Zheng