/email-archive-analysis

Processing PST/EML/MBOX mail archives for Investigative Research

The Researcher's Toolkit for Email Archive Analysis

Introduction

Mailbox archives are one of the most common formats found in data leaks, and investigative journalists often have to work through them. These archives typically consist of hundreds of files totalling gigabytes in size.

There are various archive formats. Some of them are easily indexed and searched using standard tools, while others can only be read after importing them into specialized software such as Outlook or Thunderbird.

While there are numerous online services for converting individual files, this approach is impractical for gigabytes of data.

This guide explains how to research large volumes of non-standard mailbox archives, using the Metprom email breach as an example.

Reference: mailbox formats
  • PST (Personal Storage Table) - Used by Microsoft Outlook to store email messages, attachments, folders, and other items on a local computer.

  • OST (Offline Storage Table) - Also used by Microsoft Outlook; it allows users to work offline and then synchronize changes with the email server once online.

  • MBOX - A generic file format for storing email messages where each file can contain multiple emails. It is used by various email clients such as Mozilla Thunderbird, Apple Mail, and formerly by Eudora and Entourage.

  • EML - Files typically used by email clients like Microsoft Outlook Express, Windows Mail, and others where each file contains a single email message.

  • DBX - Used by older versions of Microsoft Outlook Express, where each DBX file represents a mail folder.

  • MBX - Older file format used by Eudora and some other email clients to store emails.

  • NSF (Notes Storage Facility) - Used by IBM Lotus Notes to store emails along with other items like calendar entries and contacts.

  • MSG - Represents a single email message saved in Microsoft Outlook. It can include attachments and rich text formatting.
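
Before choosing a conversion path, it can help to see which of these formats a download actually contains. The following is a minimal sketch and not part of any tool mentioned in this guide. The function name count_by_extension is an illustrative assumption, and the check relies only on file extensions, which may be missing or wrong in a leak.

```shell
# Print "<count> <extension>" for every file extension found under a directory,
# most frequent first. Extensions are lowercased so .PST and .pst are merged.
count_by_extension() {
  find "${1:-.}" -type f -name '*.*' \
    | awk -F. '{ n[tolower($NF)]++ } END { for (e in n) print n[e], e }' \
    | sort -rn
}
```

Example: count_by_extension ~/Downloads/leak lists the extensions present, so you can tell whether the PST, MBOX, or EML sections below apply.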

Preparation of mailbox archives

Large volumes of email archives are often distributed via torrents. However, a torrent may have no seeders, in which case you have to download the files manually from a web interface. This process can be automated with download manager software.

Environment: macOS

Installation:

brew install wget

Downloading:

wget -m -np -nH --cut-dirs=1 -A '*' https://data.ddosecrets.com/Metprom%20Group/

Add the -c flag to resume downloading after an interruption.
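
Long downloads over an unstable connection may fail more than once. As a rough sketch, a small wrapper can rerun a command until it succeeds. The function name retry and the RETRY_DELAY variable are assumptions made for this example, not wget features.

```shell
# Run a command repeatedly until it succeeds or the attempt limit is reached.
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
# RETRY_DELAY (hypothetical variable, seconds between attempts) defaults to 30.
retry() {
  max_attempts="$1"
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-30}"
  done
}
```

Example: retry 10 wget -c -m -np -nH --cut-dirs=1 -A '*' https://data.ddosecrets.com/Metprom%20Group/ reruns the download up to ten times, and -c lets each attempt pick up partially downloaded files.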

Convert PST to EML

PST is not the most widely used email archive format, but it often appears in data leaks. Popular tools such as Datashare and Pinpoint do not recognize or index it, so the messages first have to be extracted from the PST files automatically.

Environment: macOS

brew install libpst

Convert one file:

readpst -D -S -r -e -o OUTPUT_DIR PST_FILENAME

Convert all PST files in the current folder to EML and save them to folder ~/Datashare:

for f in *.pst; do mkdir -p ~/Datashare/"$f" && readpst -D -o ~/Datashare/"$f" -S -r -e "$f"; done
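
Mirrored leaks usually keep PST files in nested folders, which the loop above does not reach. The sketch below searches a directory tree instead, reusing the same readpst flags. The function name convert_all_pst and its two arguments are assumptions made for this example.

```shell
# Convert every PST file found under SRC_DIR (searched recursively) into EML
# files, with one output folder per PST under OUT_DIR.
# Usage: convert_all_pst SRC_DIR OUT_DIR
convert_all_pst() {
  src_dir="$1"
  out_dir="$2"
  find "$src_dir" -type f -iname '*.pst' | while IFS= read -r pst; do
    target="$out_dir/$(basename "$pst")"
    mkdir -p "$target"
    # </dev/null keeps readpst from consuming the file list on stdin.
    readpst -D -S -r -e -o "$target" "$pst" </dev/null \
      || echo "readpst failed: $pst" >&2
  done
}
```

Example: convert_all_pst . ~/Datashare. Note that PST files with the same name in different folders end up sharing one output folder, so rename them first if that matters for your research.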

Convert EML/MSG to PDF

Despite EML being a popular format and easily opened by email clients, your search tool may only support regular PDFs.

We will use the email-to-pdf-converter project, which offers both CLI and GUI interfaces for converting files.

Environment: macOS (you need Java installed)

brew install wkhtmltopdf wget
wget https://github.com/nickrussler/email-to-pdf-converter/releases/download/2.6.0/emailconverter-2.6.0-all.jar

Convert one file + extract attachments:

java -jar emailconverter-2.6.0-all.jar FILENAME_EML_OR_MSG -a
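
To process a whole folder of extracted messages, the single-file command above can be wrapped in a loop. This is a minimal sketch under the assumption that the converter is invoked once per file exactly as shown. The function name convert_all_to_pdf and its arguments are illustrative.

```shell
# Run the converter on every EML and MSG file under MAIL_DIR (recursively),
# extracting attachments with -a as in the single-file command.
# Usage: convert_all_to_pdf JAR_PATH MAIL_DIR
convert_all_to_pdf() {
  jar_path="$1"
  mail_dir="$2"
  find "$mail_dir" -type f \( -iname '*.eml' -o -iname '*.msg' \) | while IFS= read -r msg; do
    # </dev/null keeps java from consuming the file list on stdin.
    java -jar "$jar_path" "$msg" -a </dev/null \
      || echo "conversion failed: $msg" >&2
  done
}
```

Example: convert_all_to_pdf emailconverter-2.6.0-all.jar ~/Datashare. Each file starts a new Java process, so expect this to be slow on large archives.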

Run the converter in the graphical interface:

java -jar emailconverter-2.6.0-all.jar -gui

Convert PST to MBOX/EML

Converting PST to MBOX is a rather unusual task, because MBOX files are large, monolithic files that hold email headers, HTML, and Base64-encoded email bodies together. However, this conversion might be necessary to prepare MBOX files for import into your email client.

pstconv

Warning: pstconv does not automatically handle directory encodings, which is critical for archives in languages other than English.

Environment: macOS (you need Java installed)

wget https://github.com/soxoj/pstconv/raw/main/pstconv-0.9.7.jar

Convert one file to MBOX format (alternatively, pass eml to -f to get EML output):

java -jar pstconv-0.9.7.jar -i INPUT_FILE.pst -o OUTPUT_DIR -f mbox

Convert all PST files in the current folder to MBOX and save them to folder ~/Datashare:

for f in *.pst; do java -jar pstconv-0.9.7.jar -i "$f" -o ~/Datashare -f mbox; done

Convert MBOX to EML

Sometimes you may encounter archives consisting of MBOX files. These can be imported into Thunderbird; however, if you plan to use programs for automated searching, the large MBOX files may not be processed correctly and may not be fully indexed.

To resolve this issue, you can convert an MBOX file into multiple EML files.

mboxzilla

We will use mboxzilla, a powerful tool that works with MBOX files in several ways: splitting, converting to EML, cleaning, etc.

Environments: prebuilt downloads are available for macOS, Linux, and Windows x64.

Extract all emails from MBOX file:

mboxzilla -f MBOX_FILENAME -o OUTPUT_DIR -e
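
After extraction, it is worth checking that no messages were lost. In an MBOX file every message starts with a separator line beginning with "From " (with a trailing space), so counting those lines gives an approximate message count to compare with the number of extracted files. The helper names below are assumptions for this sketch, and the count can be slightly too high if a mail client left body lines starting with "From " unescaped.

```shell
# Approximate number of messages in an MBOX file, counted by separator lines.
count_mbox_messages() {
  grep -c '^From ' "$1"
}

# Number of regular files under a directory (for example, extracted EML files).
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}
```

Example: compare the output of count_mbox_messages MBOX_FILENAME with count_files OUTPUT_DIR. A large difference suggests that the extraction was incomplete.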