Primary language: Python. License: GNU General Public License v2.0 (GPL-2.0).

getMailTopKeywords

Scans mail folders and reports the words that appear most often, together with how many times each one occurs. I use it on a Gmail backup made with "Backup Gmail": https://code.launchpad.net/~cfraire/backup-gmail/devel

Requires the pyzmail library: http://www.magiksys.net/pyzmail/
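
For readers who want a feel for the scan-and-count step, here is a minimal sketch. It is not the script's actual code: it uses the standard-library email package instead of pyzmail, assumes each file under the folder holds one raw RFC 822 message, and the names body_text, count_words and WORD_RE are made up for this illustration.

```python
import re
from collections import Counter
from email import message_from_bytes, policy
from pathlib import Path

# Runs of three or more letters (Unicode-aware, so accented words match too).
# The minimum length of 3 is an assumption of this sketch.
WORD_RE = re.compile(r"[^\W\d_]{3,}")


def body_text(raw: bytes) -> str:
    """Return the text/plain body of one raw message, or '' if it has none."""
    msg = message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def count_words(folder: str) -> Counter:
    """Count lowercased words across every file found under `folder`."""
    counts: Counter = Counter()
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            body = body_text(path.read_bytes())
            counts.update(word.lower() for word in WORD_RE.findall(body))
    return counts
```

Calling count_words on a backup folder returns a Counter that maps each word to its number of occurrences. Messages with unusual or broken encodings may need extra error handling that this sketch leaves out.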

usage: getMailTopKeywords.py [-h] [-c COUNT] [-a] [-l LANG] [-n] folder

positional arguments:
  folder                the folder to process

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        show words with count greater or equal than this
                        number. Default: 10
  -a, --aspell          whether to check the words using Aspell. Default: no
                        Requires python-aspell library. See
                        https://github.com/WojciechMula/aspell-python.
  -l LANG, --lang LANG  check words in that language (only if -a is set).
                        Needs aspell dictionary in that language installed.
                        Default: en
  -n, --nltk            use Natural Language Toolkit. Default: no
                        Requires nltk library.
                        See http://nltk.org/index.html. 
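
The options above could be declared with argparse roughly as follows. This is a reconstruction from the help text, not the script's own source: the function name build_parser is hypothetical, and treating COUNT as an integer is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options listed in the usage text."""
    parser = argparse.ArgumentParser(prog="getMailTopKeywords.py")
    parser.add_argument("folder", help="the folder to process")
    # Assumption: COUNT is parsed as an integer threshold.
    parser.add_argument("-c", "--count", type=int, default=10,
                        help="minimum count a word needs to be shown")
    parser.add_argument("-a", "--aspell", action="store_true",
                        help="check the words using Aspell")
    parser.add_argument("-l", "--lang", default="en",
                        help="language of the Aspell dictionary")
    parser.add_argument("-n", "--nltk", action="store_true",
                        help="use Natural Language Toolkit")
    return parser


# Parse the same arguments as the first example run in this README.
args = build_parser().parse_args(
    ["-a", "-c", "10", "-l", "es", "/home/USER/GmailBackup/2005-12"]
)
print(args.count, args.aspell, args.lang, args.nltk, args.folder)
```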

Example:

./getMailTopKeywords.py -a -c 10 -l es /home/USER/GmailBackup/2005-12
Reading from /home/USER/GmailBackup/2005-12 ... 
Analyzing 42 mails...
[==================================================] 100%
Building a list of the top keywords...
Top keywords:
  mas 21
  correo 20
  foto 18
  infantil 15
  seguridad 13
  enero 13
  parque 11
  servicios 11
  nuevos 11
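
The listing above is what you get once the counts have been filtered by the -c threshold and, with -a, by a dictionary check. A minimal sketch of that selection step, assuming the counts live in a collections.Counter; the function top_keywords and its is_word parameter are hypothetical names, and the real script may order or filter differently.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def top_keywords(
    counts: Counter,
    min_count: int = 10,
    is_word: Optional[Callable[[str], bool]] = None,
) -> List[Tuple[str, int]]:
    """Return (word, count) pairs with count >= min_count, highest first.

    `is_word` is an optional dictionary check. With python-aspell installed,
    one candidate would be aspell.Speller("lang", "es").check; that wiring is
    an assumption and is not exercised in this sketch.
    """
    return [
        (word, n)
        for word, n in counts.most_common()
        if n >= min_count and (is_word is None or is_word(word))
    ]


# Illustrative counts only, not taken from a real mailbox.
sample = Counter({"correo": 20, "mas": 21, "hola": 3, "xyzzy": 12})
print(top_keywords(sample, min_count=10, is_word=lambda w: w != "xyzzy"))
```

Keeping the dictionary check as a plain callable means the same function works with or without Aspell.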

Example using NLTK, which reduces each word to its stem so that different forms of the same word are counted together:

./getMailTopKeywords.py -a -c 10 -l es -n /home/USER/GmailBackup/2005-12
Reading from /home/USER/GmailBackup/2005-12 ... 
Analyzing 42 mails...
[==================================================] 100%
Building a list of the top keywords...
Processing using Natural Language Toolkit...
Top keywords:
  mas 21
  envi 20
  corre 20
  fot 20
  nuev 19
  infantil 15
  servici 14
  activ 14
  ener 13
  segur 13
  mes 12
  monitor 12
  parqu 11
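
Some counts are higher with -n because words that share a stem are added together. A minimal sketch of that merging step follows. The names merge_by_stem and toy_stem are hypothetical, and toy_stem is a deliberately crude stand-in so the example runs without NLTK. With NLTK installed, nltk.stem.SnowballStemmer("spanish").stem could be passed instead, although this README does not say which stemmer the script uses.

```python
from collections import Counter
from typing import Callable


def merge_by_stem(counts: Counter, stem: Callable[[str], str]) -> Counter:
    """Add up the counts of all words that share the same stem."""
    merged: Counter = Counter()
    for word, n in counts.items():
        merged[stem(word)] += n
    return merged


def toy_stem(word: str) -> str:
    """Crude stand-in stemmer: drop a final 's', then one final a/e/o."""
    if word.endswith("s"):
        word = word[:-1]
    if word and word[-1] in "aeo":
        word = word[:-1]
    return word


# Illustrative counts only, not taken from a real mailbox.
sample = Counter({"foto": 3, "fotos": 2, "parque": 1})
print(merge_by_stem(sample, toy_stem))
```

Here "foto" and "fotos" both reduce to "fot", so their counts are summed under that stem.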