scrapers: A Python repository from mernisse

Horde Financial Scrapers
========================

The scripts in this distribution fetch account balance/value information
from various financial web sites. Python, mechanize
(http://wwwsearch.sourceforge.net/mechanize/), and BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) are required.

Scrapers are included for:

American Express (americanexpress.com)
American Funds (americanfunds.com)
AT&T Wireless (wireless.att.com) [unmaintained]
Canandaigua National Bank & Trust (cnbank.com)
Citizens Bank (Charter One in some markets) (citizensbank.com)
Fidelity 401(k) (401k.fidelity.com)
HSBC (hsbcdirect.com)
mypaystub.info
T-Mobile (my.tmobile.com)
T. Rowe Price 401(k) (rps.troweprice.com) [unmaintained]
Treasury Securities (using yahoo.com) [unmaintained]
Vanguard (personal.vanguard.com)


Please note that these scrapers work for me, but may require modification to
work in other situations (e.g., if you have multiple accounts with HSBC).


Please Be Polite
================

I've done my best to read the applicable agreements for each site; the
agreements I've received with my accounts do not prohibit me from using
software like this for the reasons I use it. Before using these scrapers,
I strongly recommend reading the agreements you've been provided to
determine whether they allow you to use these scrapers.

Even when their use is allowed, please take care to be polite to these
sites. Don't place undue load on them by running these scrapers every five
minutes; run them a couple of times a day at most. Your graphs don't
*really* need to be that up-to-date, right?


Setup and Use
=============

I use these scrapers with Cricket (http://cricket.sourceforge.net/), but
they work with any monitoring software that can execute a command and read
its output. By default, balances are stored in /var/cache/cricket, but this
can be changed by modifying the *_TAB variables at the top of each scraper.

I generally run them twice a day, around market open and after market close.
For example:

    0 9,21 * * * trp; chgrp cricket /var/cache/cricket/trp.tab; chmod 640 /var/cache/cricket/trp.tab

I also run the scrapers themselves as a dedicated user to avoid exposing
login credentials to other accounts, such as the web server role account.
This user is a member of the cricket group, so the tab files in
/var/cache/cricket can be made readable to the cricket user. **UNDER NO
CIRCUMSTANCES should the scraper be world-readable or readable by the user
your monitoring software or web server runs as.**

For example, the T. Rowe Price scraper (trp) runs as a dedicated user (not
cricket, www-data, httpd, or any other shared user, but a user created
solely for running these scrapers), while Cricket runs trp-cat as the
cricket user to fetch the data stored by trp.
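
As a quick way to audit the permission rule above, the following Python 3
sketch reports whether a file is world-readable. The function name
is_world_readable is hypothetical and not part of this distribution, and the
check is deliberately narrow: it does not detect whether a specific user
(such as cricket) can read the file through group membership.

```python
import os
import stat


def is_world_readable(path):
    """Return True if the 'other' read bit is set on path.

    Hypothetical helper for illustration; it only inspects the mode bits
    and ignores group membership and ACLs.
    """
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)
```

Running it against each scraper script (for example, the installed trp
script) should return False; if it returns True, tighten the mode with
something like chmod 700 before storing credentials in the file.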

The included Defaults file configures Cricket to work with these scrapers.
It also adds two RRA definitions so daily averages will be kept for ten
years instead of Cricket's default of one year. The Defaults file must be
modified for some of the scrapers (American Funds and T. Rowe Price in
particular) to reflect the funds you hold in those accounts.


Design Details
==============

Each scraper has two parts: the scraper itself that fetches data from the
site and writes it to a file, and a script that emits the stored data from
the file.

This allows the scraper itself to run much less frequently than the
monitoring software (Cricket, for example) polls for data; there's no need
to fetch your balance information every five minutes, which is Cricket's
default polling interval. It also enhances security, since permissions on
the scraper (which contains your authentication credentials) can be made
much more restrictive, and the scraper can run as a dedicated user. Under no
circumstances should the scraper be world-readable or readable by the user
your monitoring software or web server runs as.
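
The two-part split described above can be sketched in Python 3 roughly as
follows. This is an illustration under stated assumptions, not the code the
scrapers actually use: the function names write_tab and read_tab, the
tab-separated "name, value" line format, and the 0640 mode are all
hypothetical choices made for the example.

```python
import os
import tempfile


def write_tab(path, balances):
    """Scraper half: store name/value pairs, one per line (assumed format).

    The data is written to a temporary file in the same directory and then
    renamed into place, so a reader never sees a half-written file.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            for name, value in sorted(balances.items()):
                handle.write("%s\t%.2f\n" % (name, value))
        # Owner read/write, group read, nothing for others (assumed policy).
        os.chmod(tmp_path, 0o640)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise


def read_tab(path):
    """Cat half: return the stored values without contacting the site."""
    balances = {}
    with open(path) as handle:
        for line in handle:
            name, value = line.rstrip("\n").split("\t")
            balances[name] = float(value)
    return balances
```

Only the writing half needs credentials and network access, so it can live
in a tightly restricted file run from cron; the reading half touches nothing
but the tab file and is safe for the monitoring user to execute on every
poll.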