# prcrawler

**Company Press Release Web Crawler**

A Python 3 Scrapy web crawler for collecting company press releases.
## Getting Started

Place CSV files listing the firms to crawl in the `./data/` directory.
### File Template

Required columns:

- `industry` [string]: The firm's industry
- `firm` [string]: The firm's name
- `pdf` [0,1]: Flag indicating whether the firm's press releases are formatted as PDFs rather than HTML text (1 = yes, 0 = no)
- `start_url` [string]: An example press release URL where the crawler starts following links on that firm's domain
Column order does not matter. Other columns (e.g., notes for reference) may be included in the data file but will not be processed.
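A data file following this template might look like the sketch below; the firm names and URLs are hypothetical placeholders, not real entries.

```csv
industry,firm,pdf,start_url,notes
semiconductors,Acme Semiconductor,0,https://www.acmesemi.example/news/q3-results,HTML press room
pharmaceuticals,Initech Pharma,1,https://www.initech.example/press/2020-01-15.pdf,releases are PDFs
```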
## Collecting Press Releases

Run a batch of simultaneous asynchronous `prcrawler`s by executing the `run_industry_crawler.py` script from the command line.

Process all files in the `./data/` directory:

```
$ python run_industry_crawler.py
```

Process specific files in `./data/` with the optional files argument `-f` (or `--files`):

```
$ python run_industry_crawler.py -f datafile1.csv datafile2.csv datafile3.csv
```
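The file-selection behavior described above (all CSVs in `./data/` by default, or only those named with `-f`/`--files`) could be implemented roughly as follows; this is a hedged sketch, not the actual contents of `run_industry_crawler.py`, and the function name `select_data_files` is an assumption.

```python
import argparse
import glob
import os


def select_data_files(argv=None, data_dir="./data"):
    """Return the CSV data files to process.

    By default, every *.csv in data_dir is selected; with -f/--files,
    only the named files are selected. (Illustrative sketch only.)
    """
    parser = argparse.ArgumentParser(description="Run a batch of prcrawlers.")
    parser.add_argument(
        "-f", "--files", nargs="+", default=None,
        help="Specific CSV files in ./data/ to process",
    )
    args = parser.parse_args(argv)
    if args.files:
        # Resolve the given file names relative to the data directory.
        return [os.path.join(data_dir, name) for name in args.files]
    # No -f argument: process every CSV file in the data directory.
    return sorted(glob.glob(os.path.join(data_dir, "*.csv")))
```

Each selected file would then be handed to a crawl for that industry's firms; Scrapy's `CrawlerProcess` can run multiple spiders concurrently in one batch.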
## Logs and Debugging

Log files for each run are written to `./logs/`.