Email Network Analysis with Neo4j

This project explores email communication networks using Neo4j, a graph database well-suited for representing and analyzing relationships between entities.

Overview

The project focuses on:

  • Data Generation: Creating a realistic email dataset with different categories (work, personal, spam) and controlled distribution of senders and recipients.
  • Data Ingestion: Importing the generated email data into Neo4j.
  • Tiering Algorithm: Implementing a custom algorithm to categorize email addresses into tiers based on their communication directionality with a central email address (John Doe in this case).

Requirements

  • Python 3
  • Faker library (pip install Faker)
  • Neo4j Desktop or Neo4j Browser
  • APOC plugin for Neo4j (the import query calls apoc.date.parse)

Instructions

Data Generation

The generate_email_network_dataset.py script generates a CSV file containing simulated email data. It uses the Faker library to create realistic email addresses and subject lines. The script allows for customization of the distribution of senders and recipients, the number of emails generated, and the labels used.

Output File:

  • The generated CSV file will be saved in the data directory with the filename john_doe_emails_network.csv.

Customization Options:

  • Adjust the distribution of senders and recipients in the sender_distribution and recipient_distribution variables.
  • Modify the number of emails generated by changing the loop range (for i in range(1000)).
  • Alter the labels used in the labels variable.
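
The customization points above can be illustrated with a minimal sketch. It is not the project's script: it swaps Faker for a small hard-coded address pool so that it runs on the standard library alone, and the column names (email, sender, recipient, date, subject, labels) are an assumption inferred from the import query later in this README. The helper names weighted_pick, generate_rows and write_csv, the example addresses and the weights are all hypothetical.

```python
import csv
import io
import random
from datetime import datetime, timedelta

# Hypothetical configuration, named after the variables mentioned above.
CENTRAL = "john.doe@personal.com"
sender_distribution = {CENTRAL: 0.4, "alice@example.com": 0.3, "bob@example.com": 0.3}
recipient_distribution = {CENTRAL: 0.5, "alice@example.com": 0.25, "carol@example.com": 0.25}
labels = ["work", "personal", "spam"]
FIELDNAMES = ["email", "sender", "recipient", "date", "subject", "labels"]


def weighted_pick(rng, distribution, exclude=None):
    """Draw one address according to its weight, never returning `exclude`."""
    candidates = {addr: w for addr, w in distribution.items() if addr != exclude}
    return rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]


def generate_rows(n_emails, seed=0):
    """Build n_emails rows; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    rows = []
    for i in range(n_emails):
        sender = weighted_pick(rng, sender_distribution)
        # Exclude the sender so nobody emails themselves.
        recipient = weighted_pick(rng, recipient_distribution, exclude=sender)
        sent_at = start + timedelta(minutes=rng.randrange(60 * 24 * 365))
        rows.append({
            "email": i,
            "sender": sender,
            "recipient": recipient,
            # Same layout the import query parses: yyyy-MM-dd HH:mm:ss
            "date": sent_at.strftime("%Y-%m-%d %H:%M:%S"),
            "subject": f"Message {i}",
            "labels": rng.choice(labels),
        })
    return rows


def write_csv(rows, handle):
    """Write the rows with a header line to any text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_csv(generate_rows(5), buffer)
print(buffer.getvalue())
```

Changing the weights in the two distribution dictionaries shifts how often each address appears, and the argument to generate_rows plays the role of the loop range in the real script.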

Running the Script:

To generate the email dataset, run the following command in your terminal:

python generate_email_network_dataset.py

Dependencies:

  • The script requires the Faker library. Install it using:

    pip install Faker

Data Ingestion

Prerequisites:

  • You have generated the email dataset CSV file using the generate_email_network_dataset.py script.
  • You have Neo4j Desktop or Neo4j Browser installed and running.

Option 1: Using Neo4j Desktop

  1. Open Neo4j Desktop.
  2. Create a new project or open an existing one.
  3. In the Manage section, go to the Databases tab.
  4. Click on the database where you want to import the data.
  5. In the Import section, click From CSV.
  6. Select your CSV file ("john_doe_emails_network.csv").
  7. Click Import.

Option 2: Using Neo4j Browser

  1. Open the Neo4j Browser (http://localhost:7474 by default).

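  2. Execute the following Cypher query, replacing the URL with the location of your own CSV file. LOAD CSV accepts http(s) URLs as well as file:/// URLs that point into the database's import directory: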

     LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/veteranbv/neo-tiering-algorithm/main/data/john_doe_emails_network.csv' AS row
     MERGE (sender:Email {address: row.sender})
     MERGE (recipient:Email {address: row.recipient})
     CREATE (sender)-[:SENT {
       id: row.email,
       date: apoc.date.parse(row.date, 'ms', 'yyyy-MM-dd HH:mm:ss'),
       subject: row.subject,
       labels: row.labels
     }]->(recipient)

Notes:

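  • The Cypher query creates Email nodes and SENT relationships from the rows of your CSV file. MERGE reuses the existing node when an address has already been seen, while CREATE adds one relationship per row, so running the import twice duplicates the relationships.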
  • Adjust the node label and relationship type if your data model is different.
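
Because the import query passes 'ms' to apoc.date.parse, the date property stored on each SENT relationship is a number of milliseconds since the Unix epoch, not a string. The sketch below reproduces that conversion in Python so you can predict or compare the stored values. It assumes the timestamps are interpreted as UTC, which is my understanding of APOC's default when no time zone is passed; the function name to_epoch_millis is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return milliseconds since the epoch."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # Integer division of two timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis("1970-01-02 00:00:00"))  # 86400000
```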

Tiering Algorithm

This section describes the Cypher queries used to implement the tiering algorithm for analyzing the email network, along with scoring mechanisms for each tier.

Assumptions:

  • You have successfully imported the email data into Neo4j.
  • John Doe's email address is represented by the node with the property address: 'john.doe@personal.com'.

Tier Definitions:

  • Tier 1: Nodes that have both sent emails to and received emails from John Doe. These represent direct communication partners.
  • Tier 2: Nodes that have only received emails from John Doe. These represent recipients of information from John Doe.
  • Tier 3: Nodes that have only sent emails to John Doe. These represent senders of information to John Doe.

Tiering Queries with Scoring:

Tier 1: Direct Communication Partners

MATCH (johnDoe:Email {address: 'john.doe@personal.com'})
MATCH (johnDoe)-[sentByJohnDoe:SENT]->(tier1Node:Email)-[sentToJohnDoe:SENT]->(johnDoe)
WITH tier1Node, 
     COUNT(DISTINCT sentByJohnDoe) AS sentByJohnDoeCount, 
     COUNT(DISTINCT sentToJohnDoe) AS sentToJohnDoeCount
SET tier1Node:Tier1, tier1Node.score = sentByJohnDoeCount * sentToJohnDoeCount
RETURN tier1Node
ORDER BY tier1Node.score DESC

Tier 2: Recipients of Information from John Doe

MATCH (johnDoe:Email {address: 'john.doe@personal.com'})
MATCH (johnDoe)-[sentByJohnDoe:SENT]->(tier2Node:Email)
WHERE NOT EXISTS ((tier2Node)-[:SENT]->(johnDoe))
WITH tier2Node, COUNT(sentByJohnDoe) AS sentByJohnDoeCount
SET tier2Node:Tier2, tier2Node.score = sentByJohnDoeCount
RETURN tier2Node
ORDER BY tier2Node.score DESC

Tier 3: Senders of Information to John Doe

MATCH (johnDoe:Email {address: 'john.doe@personal.com'})
MATCH (tier3Node:Email)-[sentToJohnDoe:SENT]->(johnDoe)
WHERE NOT EXISTS ((johnDoe)-[:SENT]->(tier3Node))
WITH tier3Node, COUNT(sentToJohnDoe) AS sentToJohnDoeCount
SET tier3Node:Tier3, tier3Node.score = sentToJohnDoeCount
RETURN tier3Node
ORDER BY tier3Node.score DESC
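
To sanity-check the results of the three queries outside Neo4j, the same tiering and scoring rules can be applied to plain (sender, recipient) pairs read from the CSV. The following is a sketch under stated assumptions rather than part of the project: the function name tier_scores and the sample addresses are hypothetical, and self-addressed emails are ignored.

```python
from collections import Counter

CENTRAL = "john.doe@personal.com"


def tier_scores(pairs, central=CENTRAL):
    """Return {address: (tier, score)} from an iterable of (sender, recipient) pairs."""
    pairs = list(pairs)
    # Emails the central address sent to each contact, and received from each contact.
    sent_by_central = Counter(r for s, r in pairs if s == central and r != central)
    sent_to_central = Counter(s for s, r in pairs if r == central and s != central)

    result = {}
    for address in set(sent_by_central) | set(sent_to_central):
        out_count = sent_by_central[address]
        in_count = sent_to_central[address]
        if out_count and in_count:
            # Tier 1: both directions, score is the product of the two counts.
            result[address] = ("Tier1", out_count * in_count)
        elif out_count:
            # Tier 2: only received from the central address.
            result[address] = ("Tier2", out_count)
        else:
            # Tier 3: only sent to the central address.
            result[address] = ("Tier3", in_count)
    return result


sample = [
    (CENTRAL, "alice@example.com"),
    (CENTRAL, "alice@example.com"),
    ("alice@example.com", CENTRAL),
    (CENTRAL, "bob@example.com"),
    ("carol@example.com", CENTRAL),
    ("carol@example.com", CENTRAL),
    ("carol@example.com", CENTRAL),
]
for address, (tier, score) in sorted(tier_scores(sample).items()):
    print(address, tier, score)
# alice@example.com Tier1 2
# bob@example.com Tier2 1
# carol@example.com Tier3 3
```

Emails between two addresses that never involve the central address do not affect any tier, which matches the queries: every pattern is anchored on the johnDoe node.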

Running the Queries:

  1. Open the Neo4j Browser.
  2. Execute each query separately to retrieve and score the nodes in each tier.
  3. You can visualize the results in the Browser or use further Cypher queries to analyze the properties and relationships of these nodes.