404 crawler

A web crawler written in Python. Its purpose is to find bad links (404, 403, etc.) and pages that are missing the metadata field.

This will crawl a website and parse out links to store in a database.
It will follow those links and find more links to crawl.
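
The link-extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the code in mycrawler.py: it uses Python 3's standard-library `html.parser` in place of BeautifulSoup/html5lib so that it runs without extra packages, and the names `LinkAndMetaParser`, `scan_page`, and the `meta_name` parameter (defaulting to `description`) are hypothetical, since this README does not say which metadata field the crawler checks.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkAndMetaParser(HTMLParser):
    """Collect absolute http(s) links and note whether a <meta name=...> tag is present."""

    def __init__(self, base_url, meta_name="description"):
        super().__init__()
        self.base_url = base_url
        self.meta_name = meta_name  # assumption: the field to look for
        self.links = []
        self.has_meta = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(self.base_url, attrs["href"]))
            is_web_link = urlsplit(absolute).scheme in ("http", "https")
            if is_web_link and absolute not in self.links:
                self.links.append(absolute)
        elif tag == "meta" and (attrs.get("name") or "").lower() == self.meta_name:
            # Count the field as present only if it carries non-empty content.
            self.has_meta = bool(attrs.get("content"))


def scan_page(html, base_url, meta_name="description"):
    """Return (links, has_meta) for one page's HTML."""
    parser = LinkAndMetaParser(base_url, meta_name)
    parser.feed(html)
    return parser.links, parser.has_meta
```

Each link returned by `scan_page` would then be stored together with the page it was found on, and fetched in turn to discover more links.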

##Requirements

- Python 2.7.6
- MySQL Python driver (used MySQL-python-1.2.3)
- MySQL server (used MySQL 5.6.22)
- BeautifulSoup
- HTML parser (used html5lib)
  - You'll need to upgrade Python's six package to >=1.9.0 for html5lib to work.
  - On Mac, pip won't replace the system version of six, so bypass it by installing six from source:

        wget https://pypi.python.org/packages/b3/b2/238e2590826bfdd113244a40d9d3eb26918bd798fc187e2360a8367068db/six-1.10.0.tar.gz#md5=34eed507548117b2ab523ab14b2f8b55
        tar -xzf six-1.10.0.tar.gz
        cd six-1.10.0
        python setup.py install

##Database

See database.sql for the schema.

- Pages table: contains the links crawled and the status checked.
- Ignore table: add special-case pages that crash the program.

##Starting the Application

Load your database

    mysql -u <user> -p < database.sql

Seed the database with the first URL by running this SQL statement:

    INSERT INTO `pages` (`source`, `link`) 
    VALUES ( 'http://www.lanl.gov/collaboration/index.php', 'http://www.lanl.gov/about/index.php' );

Edit mycrawler.py and set the following config:

    protocol = 'https:'
    baseurl = protocol + '//some_baseurl.com'
    timeout = 15
    mysqlParam = {
        'host': 'localhost',
        'user': 'root',
        'passwd': 'SOME_PASSWORD',
        'db': 'crawler',
        'port': 3306
    }
    #.
    #.
    #.
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')]  # Chrome 58.0.3029.110
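
To show how the `timeout` and `opener.addheaders` settings might be used when checking a link, here is a hedged sketch. It assumes `opener` is a urllib-style opener and is written for Python 3's `urllib.request` (under Python 2.7 the equivalent module is `urllib2`). The helper names `check_status` and `is_bad` are hypothetical and are not taken from mycrawler.py, and the User-Agent string is a placeholder.

```python
import urllib.error
import urllib.request

# Values mirroring the config above; the User-Agent string is a placeholder.
timeout = 15
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0 (placeholder)")]


def check_status(opener, url, timeout=timeout):
    """Return the HTTP status code for url, or None if no response was received."""
    try:
        response = opener.open(url, timeout=timeout)
        try:
            return response.getcode()
        finally:
            response.close()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def is_bad(status):
    """Treat missing responses and 4xx/5xx codes (404, 403, ...) as bad links."""
    return status is None or status >= 400


# Example call (performs a real request, so it is left commented out):
# print(check_status(opener, "http://www.lanl.gov/about/index.php"))
```

Catching `HTTPError` before `URLError` matters because the former is a subclass of the latter; reversing the order would hide the status code.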

Run the crawler

    python mycrawler.py
