
spidy Web Crawler

Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler.
Given a list of web links, it uses the Python requests library to query the webpages.
Spidy then uses lxml to extract all links from the page and adds them to its list.
Pretty simple!

[spidy logo]

Created by rivermont (/rɪvɜːrmɒnt/) and FalconWarriorr (/fælcʌnraɪjɔːr/), and developed with help from these awesome people.
Looking for technical documentation? Check out DOCS.md
Looking to contribute to this project? Have a look at CONTRIBUTING.md, then check out the docs.

Version: 1.5.0.1 · Release: 1.4.0 · License: GPL v3 · Python 3.3+ · All platforms · Lines of code: 1437 · Lines of docs: 563


🎉 New Features!

PyPI

Install spidy with one line: pip3 install spidy-web-crawler!

Automatic Testing with Travis CI

Release v1.4.0 - #31663d3

spidy Web Crawler Release 1.4

Domain Limiting - #e229b01

Scrape only a single site instead of the whole internet. May use slightly less space on your disk.
See config/wsj.cfg for an example.

How it Works

Spidy has two working lists, TODO and DONE.
TODO is the list of URLs it hasn't yet visited.
DONE is the list of URLs it has already been to.
The crawler visits each page in TODO, scrapes the DOM of the page for links, and adds those back into TODO.
It can also save each page, because datahoarding 😜.
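
For the curious, here is a minimal sketch of that two-list loop using the same requests and lxml libraries mentioned above. It is illustrative only; the seed URL is a placeholder, and the real crawler.py does far more (error handling, saving, logging, and so on):

import requests
from lxml import html

todo = ["http://example.com/"]  # URLs spidy hasn't visited yet (placeholder seed)
done = set()                    # URLs spidy has already been to

while todo:
    url = todo.pop(0)
    if url in done:
        continue
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        done.add(url)
        continue
    done.add(url)
    # Scrape the DOM for links and add them back into TODO.
    page = html.fromstring(response.content)
    for href in page.xpath("//a/@href"):
        link = requests.compat.urljoin(url, href)
        if link not in done:
            todo.append(link)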

Features

We built a lot of the functionality in spidy by watching the console scroll by and going, "Hey, we should add that!"
Here are some features we figure are worth noting.

  • Error Handling: We have tried to recognize all of the errors spidy runs into and create custom error messages and logging for each. There is a set cap so that after accumulating too many errors the crawler will stop itself.
  • Cross-Platform Compatibility: spidy will work on all three major operating systems: Windows, Mac OS X, and Linux!
  • Frequent Timestamp Logging: Spidy logs almost every action it takes to both the console and one of two log files.
  • Browser Spoofing: Make requests using User Agents from 4 popular web browsers, use the custom spidy bot agent, or create your own! (See the sketch after this list.)
  • Portability: Move spidy's folder and its contents somewhere else and it will run right where it left off.
  • User-Friendly Logs: Both the console and log file messages are simple and easy to interpret, but packed with information.
  • Webpage Saving: Spidy downloads each page that it runs into, regardless of file type. The crawler uses the HTTP Content-Type header returned with most files to determine the file type.
  • File Zipping: When autosaving, spidy can archive the contents of the saved/ directory to a .zip file, and then clear saved/.
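
As a rough illustration of the browser-spoofing and Content-Type ideas above, a single request might look like the sketch below. The User-Agent string and the extension mapping are examples, not spidy's actual values:

import requests

# Example User-Agent; spidy ships with its own set of browser and bot agents.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0"}

response = requests.get("http://example.com/", headers=headers, timeout=10)

# The Content-Type header tells us what kind of file the server returned.
content_type = response.headers.get("Content-Type", "")
if "text/html" in content_type:
    extension = ".html"
elif "application/pdf" in content_type:
    extension = ".pdf"
else:
    extension = ""  # fall back to whatever the URL itself suggests

print(content_type, extension)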

Tutorial

The way that you will run spidy depends on the way you have Python installed.
Spidy can be run from the command line (on Mac and Linux systems), from a Python IDE, or (on Windows systems) by launching the .bat file.

Python Installation

Windows and Mac

There are many different versions of Python, and hundreds of different installations for each of them.
Spidy is developed for Python v3.5.2, but should run without errors in other versions of Python 3.

Anaconda

We recommend the Anaconda distribution.
It comes pre-packaged with lots of goodies, including lxml, which is required for spidy to run and is not included in a standard Python installation.

Python Base

You can also just install default Python, and install the external libraries separately.
This can be done with pip:

pip install -r requirements.txt
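
If you aren't working from a copy of the repository (and so don't have requirements.txt handy), installing the two libraries named above directly should work just as well:

pip install requests lxml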

Linux

Python 3 should come preinstalled with most flavors of Linux, but if not, simply run

sudo apt update
sudo apt install python3 python3-lxml python3-requests

Then cd into the crawler's directory and run python3 crawler.py.

Crawler Installation

If you have git or GitHub Desktop installed, you can clone the repository from here. If not, download the latest source code or grab the latest release.
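
For example, with git on the command line (assuming the repository's GitHub location, github.com/rivermont/spidy):

git clone https://github.com/rivermont/spidy.git
cd spidy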

Launching

Use cd to navigate to the directory that spidy is located in, then run:

python crawler.py

Running

Spidy logs a lot of information to the command line throughout its life.
Once started, a bunch of [INIT] lines will print.
These announce where spidy is in its initialization process.

Config

On startup, spidy asks for input regarding certain parameters it will run with.
However, you can also use one of the configuration files, or even create your own.

To use spidy with a configuration file, input the name of the file when the crawler asks.

The config files included with spidy are:

  • blank.txt: Template for creating your own configurations.
  • default.cfg: The default version.
  • heavy.cfg: Run spidy with all of its features enabled.
  • infinite.cfg: The default config, but it never stops itself.
  • light.cfg: Disable most features; only crawls pages for links.
  • rivermont.cfg: My personal favorite settings.
  • rivermont-infinite.cfg: My favorite, never-ending configuration.

Start

Sample start log.

Autosave

Sample log after hitting the autosave cap.

Force Quit

Sample log after performing a ^C (CONTROL + C) to force quit the crawler.

Contributors

License

We used the GNU General Public License (see LICENSE) as it was the license that best suited our needs.
Honestly, if you link to this repo, credit rivermont and FalconWarriorr, and aren't selling spidy in any way, then we would love for you to distribute it.
Thanks!