Freezes on ð character in subject line
benfrancis opened this issue · 9 comments
Thanks for this tool!
I just ran the script with the following command, on an .mbox file from Google Takeout containing approximately 7,000 emails:
$ python3 imap_upload.py --gmail --box imported takeout.mbox
It seems to have got stuck on an email with a subject line containing an ð ("eth") character. The full subject line is "FW: Youtube Job Wants You 👉 $20K/Month Potential! 80272150" (yes, it appears to be a spam message).
Is there anything I can do to recover from this? If I run the script a second time, will it upload duplicate emails?
I tried cleaning up some of the spam emails and re-exporting. This time it hung on a "â" character.
> It seems to have got stuck on an email with a subject line containing an ð ("eth") character.
Sounds like a character encoding issue. Do you have the exact error message? Including the line of code throwing the error?
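One possible explanation, offered as an assumption rather than something confirmed from the mbox file, is that the ð and â characters are the first bytes of multi-byte UTF-8 sequences displayed through a single-byte code page. A minimal Python sketch of that effect, using the emoji from the reported subject line:

```python
# Assumption: the subject contained U+1F449 (the pointing-hand emoji) as UTF-8.
emoji = "\U0001F449"
raw = emoji.encode("utf-8")  # four bytes: f0 9f 91 89

# A single-byte code page maps every byte to its own character,
# so the first byte 0xF0 shows up as U+00F0 (eth).
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# Many UTF-8 punctuation marks (for example U+2019) start with byte 0xE2,
# which the same code pages display as U+00E2 (a with circumflex).
apostrophe_raw = "\u2019".encode("utf-8")
apostrophe_first = apostrophe_raw.decode("latin-1")[0]

print(raw.hex(" "))
print(ascii(as_cp1252[0]), ascii(as_latin1[0]), ascii(apostrophe_first))
```

If that is what happened, the output stopped at the start of a multi-byte character rather than at a literal eth.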
> If I run the script a second time, will it upload duplicate emails?
I would believe so. You might be able to recover by searching for emails added at a specific date. Also, you could check to see if Gmail has added any specific labels for uploaded emails.
If this doesn't work, maybe we could add a feature to delete emails that exist in a specific mbox.
> Sounds like a character encoding issue. Do you have the exact error message? Including the line of code throwing the error?
I'm afraid it didn't actually print an error to the console, it just stopped printing any output to the console after that character in the subject line.
It might be possible to reproduce by exporting an .mbox file with an email containing a ð or â character in its subject line. I think that was a real subject line designed to evade spam filters, not garbled output caused by your script. But I agree it seems like a character encoding issue in that something is crashing on certain UTF-8 characters.
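A small test mailbox can also be built locally instead of exporting one. The sketch below uses only the Python standard library; the addresses, Message-IDs, subjects, and the helper names `write_test_mbox` and `read_subjects` are made up for illustration, and it is not known whether such a file triggers the same behaviour in imap_upload.py.

```python
import mailbox
import os
import tempfile
from email.header import decode_header, make_header
from email.message import EmailMessage


def write_test_mbox(path, subjects):
    """Write one small message per subject into an mbox file at `path`."""
    box = mailbox.mbox(path)
    try:
        for i, subject in enumerate(subjects):
            msg = EmailMessage()
            msg["From"] = "sender@example.com"        # placeholder address
            msg["To"] = "recipient@example.com"       # placeholder address
            msg["Subject"] = subject                  # non-ASCII is RFC 2047-encoded on write
            msg["Message-ID"] = f"<test-{i}@example.com>"
            msg.set_content("test body")
            box.add(msg)
        box.flush()
    finally:
        box.close()


def read_subjects(path):
    """Read the mbox back and return the decoded subject of each message."""
    box = mailbox.mbox(path, create=False)
    try:
        return [str(make_header(decode_header(m["Subject"]))) for m in box]
    finally:
        box.close()


path = os.path.join(tempfile.mkdtemp(), "repro.mbox")
write_test_mbox(path, ["plain subject", "FW: job offer \U0001F449 ...", "caf\u00e9 \u00f0 \u00e2"])
subjects = read_subjects(path)
print(len(subjects), "messages written to", path)
```

The resulting `repro.mbox` could then be passed to the script in place of `takeout.mbox`.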
> I would believe so. You might be able to recover by searching for emails added at a specific date. Also, you could check to see if Gmail has added any specific labels for uploaded emails.
> If this doesn't work, maybe we could add a feature to delete emails that exist in a specific mbox.
In the end I fixed the problem by re-exporting the .mbox without the offending emails, but it took a while to get rid of them all, and I had to delete several thousand emails each time I ran the script to avoid duplication. Fortunately the uploaded emails were labelled as "imported" by Gmail, which made that easy to do.
It would be useful if the script could de-duplicate emails when uploading, but I don't know how hard that is and how it would affect performance.
I've since discovered that Google have a couple of tools for this called mail importer and import-mailbox-to-gmail. They both look harder to use than your script, but the former features de-duplication and the latter has a --from_message parameter to re-start from a certain message number in the mailbox if something goes wrong.
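The restart idea behind a parameter like --from_message can be sketched in a few lines. This is only an illustration: the helper name `messages_from` is hypothetical, and the zero-based numbering is an assumption that may not match what Google's tool uses.

```python
from itertools import islice


def messages_from(messages, start_index):
    """Yield (index, message) pairs, skipping everything before `start_index`.

    Indexes are zero-based here (an assumption for this sketch).
    """
    return islice(enumerate(messages), start_index, None)


# Demo with placeholder strings standing in for parsed messages.
demo = ["msg0", "msg1", "msg2", "msg3", "msg4"]
resumed = list(messages_from(demo, 3))
print(resumed)
```

Logging the index next to each uploaded message would give the number to resume from after a failure.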
> I'm afraid it didn't actually print an error to the console, it just stopped printing any output to the console after that character in the subject line.
The script might successfully terminate without logging a message to the console. Can you confirm only a subset of your emails were uploaded?
Also, if you're using Google Takeout, have you tried using the --google-takeout-* arguments?
> The script might successfully terminate without logging a message to the console. Can you confirm only a subset of your emails were uploaded?
Yes, it terminated after about 4,000 of 23,000 emails had been uploaded.
> Also, if you're using Google Takeout, have you tried using the --google-takeout-* arguments?
No I didn't use that option because I didn't need to preserve labels for this particular upload. I may need that for future uploads though, so will try it next time thanks.
I think it's unlikely: the .mbox file was about 600 MB and the PC the script was running on has 16 GB of RAM. It would also be a bit of a coincidence that it appeared to stop at unusual characters every time.
You could try running the script in the new dry-run mode, and capturing all the output.
Our newest branch, https://github.com/btactic/imap-upload/tree/google_takeout_codepages_fixes_v1, which we will merge soon thanks to #54, deals better with wrong encodings in subject lines.
Prior to this improvement, I had never seen the program terminate because of a subject-encoding problem. In my tests, the only effect was that the status line for that particular email was not written and the email was skipped; the next email was then processed.
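For readers wondering what tolerant handling of a badly encoded subject can look like, here is a minimal sketch using the standard `email.header` module. It is not taken from the branch; the function name `decode_subject` and the fallback choices (replacement characters, latin-1 for unknown charset labels) are assumptions made for this example.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def decode_subject(raw_subject):
    """Decode an RFC 2047 subject header without raising on bad input."""
    try:
        chunks = decode_header(raw_subject)
    except HeaderParseError:
        return raw_subject  # malformed encoded word: keep the raw text

    parts = []
    for value, charset in chunks:
        if isinstance(value, bytes):
            try:
                # Undecodable bytes become U+FFFD instead of raising.
                parts.append(value.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label: latin-1 can decode any byte.
                parts.append(value.decode("latin-1"))
        else:
            parts.append(value)
    return "".join(parts)


good = decode_subject("FW: =?utf-8?b?8J+RiQ==?= offer")
bad_bytes = decode_subject("=?utf-8?q?=FF=FEabc?=")
bad_charset = decode_subject("=?x-unknown-charset?q?abc?=")
print(ascii(good), ascii(bad_bytes), ascii(bad_charset))
```

With this kind of fallback a status line can still be printed for every message, even when the subject cannot be decoded cleanly.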
So... why don't you give it a go with the google_takeout_codepages_fixes_v1 branch and give us feedback?