scrape

a command-line web scraping tool

scrape is a rule-based web crawler and information extraction tool that can manipulate and merge new and existing documents. Rules for filtering content and for traversing the web are written in XML Path Language (XPath) and regular expressions. Output can be written as text, CSV, PDF, and/or HTML.
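To make the idea of rule-based filtering concrete, here is a minimal sketch of combining an XPath rule with regular expression rules. It is not scrape's actual implementation: it uses only the Python standard library (whose XPath support is limited), the sample page and the `extract_text` helper are hypothetical, and it assumes that a piece of text must match every regex rule to be kept.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page; kept well-formed so the stdlib XML parser can read it.
PAGE = """
<html>
  <body>
    <div class="post"><p>Python 3 release notes</p></div>
    <div class="post"><p>Weekly newsletter</p></div>
    <div class="ad"><p>Buy Python books</p></div>
  </body>
</html>
"""


def extract_text(page, xpath, filters):
    """Return the text of nodes selected by `xpath` that matches all regex `filters`.

    Hypothetical helper for illustration only; requiring *all* filters to
    match is an assumption, not documented behavior of scrape.
    """
    root = ET.fromstring(page)
    lines = []
    for node in root.findall(xpath):
        # Join the text of the node and its descendants.
        text = "".join(node.itertext()).strip()
        if all(re.search(rule, text) for rule in filters):
            lines.append(text)
    return lines


# The XPath rule drops the "ad" block; the regex rule drops the newsletter.
result = extract_text(PAGE, ".//div[@class='post']/p", [r"Python"])
print(result)
```

The XPath rule narrows the document to the elements of interest first, and the regex rules then decide which of the remaining text to keep.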

Installation

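Install from PyPI:

pip install scrape

or install directly from the GitHub repository:

pip install git+https://github.com/huntrar/scrape.git#egg=scrape

or clone the repository and install from source:

git clone https://github.com/huntrar/scrape
cd scrape
python setup.py install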

Saving files as PDF requires wkhtmltopdf to be installed.

Usage

usage: scrape.py [-h] [-a [ATTRIBUTES [ATTRIBUTES ...]]] [-all]
                 [-c [CRAWL [CRAWL ...]]] [-C] [--csv] [-cs [CACHE_SIZE]]
                 [-f [FILTER [FILTER ...]]] [--html] [-i] [-m]
                 [-max MAX_CRAWLS] [-n] [-ni] [-no] [-o [OUT [OUT ...]]] [-ow]
                 [-p] [-pt] [-q] [-s] [-t] [-v] [-x [XPATH]]
                 [QUERY [QUERY ...]]

a command-line web scraping tool

positional arguments:
  QUERY                 URLs/files to scrape

optional arguments:
  -h, --help            show this help message and exit
  -a [ATTRIBUTES [ATTRIBUTES ...]], --attributes [ATTRIBUTES [ATTRIBUTES ...]]
                        extract text using tag attributes
  -all, --crawl-all     crawl all pages
  -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                        regexp rules for following new pages
  -C, --clear-cache     clear requests cache
  --csv                 write files as csv
  -cs [CACHE_SIZE], --cache-size [CACHE_SIZE]
                        size of page cache (default: 1000)
  -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                        regexp rules for filtering text
  --html                write files as HTML
  -i, --images          save page images
  -m, --multiple        save to multiple files
  -max MAX_CRAWLS, --max-crawls MAX_CRAWLS
                        max number of pages to crawl
  -n, --nonstrict       allow crawler to visit any domain
  -ni, --no-images      do not save page images
  -no, --no-overwrite   do not overwrite files if they exist
  -o [OUT [OUT ...]], --out [OUT [OUT ...]]
                        specify outfile names
  -ow, --overwrite      overwrite a file if it exists
  -p, --pdf             write files as pdf
  -pt, --print          print text output
  -q, --quiet           suppress program output
  -s, --single          save to a single file
  -t, --text            write files as text
  -v, --version         display current version
  -x [XPATH], --xpath [XPATH]
                        filter HTML using XPath
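
The crawl options above (-c, -n and -max) describe which links get followed. The sketch below shows one plausible reading of those descriptions; it is an assumption, not scrape's source code. The `select_links` helper, its parameters, and the example URLs are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse


def select_links(base_url, hrefs, crawl_rules, nonstrict=False, max_crawls=None):
    """Pick which links found on a page should be followed.

    Hypothetical helper mirroring the help text:
      crawl_rules -> -c   (regexp rules for following new pages)
      nonstrict   -> -n   (allow crawler to visit any domain)
      max_crawls  -> -max (max number of pages to crawl)
    """
    base_domain = urlparse(base_url).netloc
    selected = []
    for href in hrefs:
        # Stop once the crawl limit is reached.
        if max_crawls is not None and len(selected) >= max_crawls:
            break
        url = urljoin(base_url, href)  # resolve relative links
        # Assumed default: stay on the starting domain unless nonstrict is set.
        if not nonstrict and urlparse(url).netloc != base_domain:
            continue
        # Assumed semantics: a link is followed if any crawl rule matches it.
        if crawl_rules and not any(re.search(rule, url) for rule in crawl_rules):
            continue
        if url not in selected:
            selected.append(url)
    return selected


links = select_links(
    "https://example.com/blog/",
    ["post/1", "post/2", "/about", "https://other.example.org/post/3"],
    crawl_rules=[r"/post/"],
)
print(links)
```

With the default settings only the two on-domain posts are selected; passing `nonstrict=True` would also admit the link on the other domain, and `max_crawls=1` would stop after the first match.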

Author

Hunter Hammond (huntrar@gmail.com)

Notes