How to import a big mbox file to Gmail
Scenario
My hoster allows to create a zip file of all IMAP folders to be downloaded from my ftp webspace. This zip file is 3GB (~70000 mails).
Importing this amount of mails takes very long and there is no resume feature, so importing is not feasible from a local machine. Instead I use a EC2 instance.
Setup EC2 instance
Use Ubuntu 14.04, I tried Amazon Linux but failed to install PyOpenSSL (no idea why). A micro instance is fine but use 32GB HD.
Download the archive from FTP
SSH into your EC2 instance.
$ ftp
ftp> open <your_ftp_server> # will prompt to username and password
ftp> passive # enter passive mode, otherwise stuff broke for be
ftp> dir # show directory
ftp> get <your_archive_filename>
That'll take some time. But still way fater then downloading it to your local machine.
Copy to S3
Copy your mailbox archive to S3 so you don't need to redownload from FTP if stuff goes wrong (copying to S3 is very fast).
Install aws-cli
sudo apt-get install awscli
Configure aws-cli
aws configure
Will prompt for Access Key ID and Secret Access Key, make sure that this user has read/write access from/to your S3 backup bucket. Will also prompt for the region.
Start copy
aws s3 cp <your_mailbox_archive> s3://<your_bucket>/<your_mailbox_archive>
To later copy archive to machine use:
aws s3 cp s3://<your_bucket>/<your_mailbox_archive> <your_mailbox_archive>
import-mailbox-to-gmail
Install I suggest you make yourself familar with import-mailbox-to-gmail
on your local machine.
sudo apt-get update
sudo apt-get install python-pip python-dev libffi-dev libssl-dev libxml2-dev libxslt1-dev libjpeg8-dev zlib1g-dev # otherwise PyOpenSSL install failed
sudo pip install --upgrade google-api-python-client PyOpenSSL
Sent label issue
We want to properly import the sent mails, since we want to have conversation view and stuff. import-mailbox-to-gmail
does not propertly import sent email.
Gmail determines what email is a "sent one" by reading the From
header (read more). import-mailbox-to-gmail
uses import()
to get the mails from the mbox file into Gmail. Import is generally very useful, it does some checks like spam or that duplicates are detected. However, when using import()
, the From
header is not read correctly.
Solution: use insert()
. It is more low level (similar to IMAP APPEND
) and header (including From
) are read correctly. In code import()
can simply replaced by insert()
. That will work.
Discard imap folder
I don't want to tranfer the IMAP folder into GMail labels.
Solution: replace this line with metadata_object = {}
.