A Clustering Algorithm for Detecting Email Fraud
- Tim Strauven
- Kivanc Gunduz
- Anjali Tiwari
- Rosyidah Nadia
Analyze the Enron email dataset, which consists of:
- ~500,000 messages
- 150 users, mostly senior management of Enron
- Organized into mailbox folders
- 2.5 GB uncompressed
- link: https://www.cs.cmu.edu/~enron/
Facilitate the exploration of the dataset by:
- Organising it by topic (clustering, topic modeling)
- Without being overwhelming (small number of clusters)
- Remaining as relevant as possible
- In practice, the clustering may need to be hierarchical:
- Start with broad, fuzzy clusters and refine them step by step
- Build a tool as user friendly and demonstrable as possible:
- (deployed) web interface > command line > notebook (still good though)
- 10-minute demo per team on Friday @ 14.00
- Everybody will present
- Download data
- Create GitHub repo
- First look at data
- Add literature research on the fraud business to provide a different perspective or an additional analysis of the current project; this can help identify gaps in current knowledge and highlight future directions
- Data cleaning of half a million emails
- Data preprocessing (see the sketch after this list)
- Decide on which algorithm to use: K-means or LDA (topic modeling) look like good candidates
- User interface
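Since preprocessing comes up repeatedly below, here is a minimal sketch of the kind of cleaning we apply to message bodies; the specific filters (stop words, minimum token length) are our assumptions rather than settled choices:

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(text: str) -> list[str]:
    """Lowercase, strip addresses and punctuation, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)   # remove email addresses
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return [t for t in text.split()
            if t not in ENGLISH_STOP_WORDS and len(t) > 2]

print(preprocess("Meeting re: the Q3 forecast -- mail john.doe@enron.com by Friday"))
# ['meeting', 'forecast', 'mail', 'friday']
```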
General
- Download data
- Create Github Repo
- First look at data
- Brainstorm for ideas and approach at 12.05
Jobdesk
- Tim: Creating the dataset as a CSV file (see the parsing sketch after this list)
- Kivanc: Analysing K-means clustering and vectorization (see the TF-IDF + K-means sketch further below)
- Anjali: Exploring the Enron case and the structure of topic modelling
- Nadia: Research on clustering and topic modelling
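For the CSV step, a minimal parsing sketch using Python's standard email module; the maildir path, output filename, and column choices are assumptions, not the final schema:

```python
import csv
import os
from email import message_from_file
from email.policy import default

MAILDIR = "maildir"  # assumed local path to the unpacked Enron corpus

with open("enron_emails.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "from", "to", "subject", "date", "body"])
    for root, _, files in os.walk(MAILDIR):
        for name in files:
            path = os.path.join(root, name)
            with open(path, encoding="latin-1") as fh:
                msg = message_from_file(fh, policy=default)
            body = msg.get_body(preferencelist=("plain",))
            writer.writerow([
                os.path.relpath(path, MAILDIR),
                msg.get("From", ""), msg.get("To", ""),
                msg.get("Subject", ""), msg.get("Date", ""),
                body.get_content() if body else "",
            ])
```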
Challenge for today
- Which clustering method to use? Answer: topic modelling and K-means.
- What is the expected end result? Answer: data visualizations of the topic modelling and K-means results.
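For the vectorization and K-means part, a small sketch on a toy corpus standing in for the email bodies; the feature and cluster counts are placeholder assumptions for the real run:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for the email bodies loaded from the CSV.
docs = [
    "power trading contract price curve",
    "meeting schedule friday agenda lunch",
    "gas pipeline contract delivery",
    "lunch meeting friday",
    "price curve trading desk",
    "pipeline delivery gas storage",
]

vectorizer = TfidfVectorizer(stop_words="english")  # e.g. max_features=10_000 on the real data
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Show the highest-weighted terms in each cluster centroid.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[-4:][::-1]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))
```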
General
- Brainstorming at 09.00 and 12.00
- Data cleaning
- Literature review in Clustering method
Jobdesk
- Tim: Literature review and applying the clustering method to the CSV file
- Kivanc: Literature review and applying the clustering method to the CSV file
- Anjali: Laying out clear steps towards the goal; data cleaning; topic modelling (preprocessing, model application, and evaluation)
- Nadia: Data cleaning and visualization
- In the afternoon, all members work on data visualization based on the K-means and LDA models.
Challenge for today
- Tim: To create a first cluster using K-means, plot it, and understand the clustering process.
- Kivanc: To implement on time and understand clustering.
- Anjali: Understands all preprocessing steps for topic modelling overall, but facing an issue visualizing the model (it throws an error).
- Nadia: The LDA visualization comes up empty, possibly caused by the data processing (see the sketch after this list).
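A minimal LDA + pyLDAvis sketch that also guards against the empty-visualization issue (documents that end up empty after preprocessing are a likely cause); the toy corpus and parameter values are assumptions:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Toy corpus standing in for the preprocessed Enron bodies (lists of tokens).
tokenized_docs = [
    ["power", "trading", "contract", "price"],
    ["meeting", "schedule", "friday", "agenda"],
    ["gas", "pipeline", "contract", "delivery"],
    ["lunch", "meeting", "friday"],
    ["price", "curve", "trading", "desk"],
]
tokenized_docs = [d for d in tokenized_docs if d]  # empty docs can blank out the charts

dictionary = Dictionary(tokenized_docs)
# On the full corpus: dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(d) for d in tokenized_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda.html")  # open in a browser rather than inline in VS Code
```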
Next day goal
- Extract insights from the models (improving data processing if required), compare the insights of both models (do they make sense?), finish the modelling part, and start the user interface (if possible)
- Tim: Extract insights from the models (improve data processing if required)
- Kivanc: Extract insights from the models (improve data processing if required)
- Anjali: Solve the visualization issue, extract insights from the model (improve data processing if required), and find the optimal number of topics
- Nadia: Extract insights from the models (improve data processing if required)
General
- Brainstorming at 09.00 and 12.00
- Deadline: Run both models and compare/analyse their outcomes based on visualization (improve the models if required)
Jobdesk
- Tim: Get AgglomerativeClustering to work on the whole dataset (see the sketch after this list) and review the LDA code to improve it
- Kivanc: Implement K-means clustering for the whole dataset
- Anjali: Topic modelling visualisation and starting the user interface
- Nadia: Better visualisation of the first 10,000 rows of the dataset
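One way to make AgglomerativeClustering feasible at scale is sketched below: reduce the TF-IDF matrix with truncated SVD first, since the algorithm needs dense input and scales quadratically in memory. The corpus and sizes here are toy values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering

docs = [
    "power trading contract price",
    "meeting schedule friday agenda",
    "gas pipeline contract delivery",
    "lunch meeting friday",
    "price curve trading desk",
    "pipeline delivery gas storage",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# AgglomerativeClustering needs a dense matrix and O(n^2) memory, so on the
# full corpus reduce dimensionality first (and subsample if still too large).
X_dense = TruncatedSVD(n_components=3, random_state=42).fit_transform(X)  # ~100 on real data
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_dense)
print(labels)
```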
Challenge for today
- Tim: Hyperparameter tuning of the LDA model (see the coherence sweep after this list)
- Kivanc: Managing the whole dataset
- Anjali: Issues with the model output
- Nadia: Got an error in the visualisation, but solved it by restarting VS Code
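One hedged approach to the tuning challenge: sweep the number of topics and score each model with a gensim coherence measure; the toy corpus and candidate values are assumptions:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Same kind of tokenized corpus as in the LDA sketch above.
tokenized_docs = [
    ["power", "trading", "contract", "price"],
    ["meeting", "schedule", "friday", "agenda"],
    ["gas", "pipeline", "contract", "delivery"],
    ["price", "curve", "trading", "desk"],
    ["lunch", "meeting", "friday"],
]
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(d) for d in tokenized_docs]

# Sweep the number of topics and keep the most coherent model. u_mass is
# quick for a toy run; on the full corpus coherence="c_v" (with
# texts=tokenized_docs) is the more common choice.
for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=42)
    score = CoherenceModel(model=lda, corpus=corpus,
                           coherence="u_mass").get_coherence()
    print(f"num_topics={k}: u_mass={score:.3f}")
```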
Next day goal
- Preparing the PowerPoint and the user interface
General
- Brainstorming at 09.00 and 12.00
- Deadline : User interface
Jobdesk
- Tim: Agglomerative clustering to connect the tokenized words back to the emails, and the user interface using Streamlit (see the sketch after this list)
- Kivanc: User interface and LDA data visualization
- Anjali: Data visualisation for a 50,000-email subset and the slide presentation
- Nadia: Working on a bigger subset of the dataset and the slide presentation. Compared to yesterday, the topic modelling result is less interesting; the rough conclusion is that a smaller dataset gives a better overview
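A minimal Streamlit sketch of the kind of interface we have in mind; the CSV filename and column names ("cluster", "subject", etc.) are assumptions about our output schema:

```python
# app.py -- run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Enron Email Explorer")

df = pd.read_csv("enron_emails_clustered.csv")  # assumed file with a 'cluster' column
cluster = st.sidebar.selectbox("Cluster", sorted(df["cluster"].unique()))
query = st.sidebar.text_input("Search subject")

view = df[df["cluster"] == cluster]
if query:
    view = view[view["subject"].str.contains(query, case=False, na=False)]
st.write(f"{len(view)} emails in cluster {cluster}")
st.dataframe(view[["date", "from", "subject"]])
```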
Challenge for today
- Data visualization (a bigger dataset means longer processing time) and user interface preparation
Next day goal
- Preparing the user interface
General
- Brainstorming at 09.00 and 12.00
- Deadline : Presentation
Jobdesk
- Tim : User Interface
- Kivanc : User Interface
- Anjali : Powerpoint
- Nadia : Powerpoint
Challenge for today
- To deliver good and understandable content for the audience
- Programming Language: Python
- IDE: VS Code
- Presentation: Jupyter Notebook
- Communication: Discord