The `docker-compose up` command will start a crawler service.
Downloader Middleware (2*)
process_request()
- Modify the request to check cached copies: Google/Bing cache, Wayback Machine, etc. (see the sketch below)
- Rewrite some URLs to get static HTML versions
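A minimal sketch of what that could look like as a downloader middleware; the class name, meta flags, and the Wayback rewrite are illustrative assumptions, not settled design:

```python
class CacheFallbackMiddleware:
    """Rewrite flagged requests to a public mirror (here: the Wayback Machine)."""

    def process_request(self, request, spider):
        # Only touch requests the spider explicitly flagged, and only once,
        # so the rewritten request doesn't get rewritten again.
        if request.meta.get("use_cache_mirror") and not request.meta.get("mirror_tried"):
            # web.archive.org/web/<url> redirects to the latest snapshot.
            return request.replace(
                url="https://web.archive.org/web/" + request.url,
                meta={**request.meta, "mirror_tried": True},
            )
        return None  # fall through to normal downloading
```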
process_response()
Spider Middleware (1*)
process_spider_input()
process_spider_output()
Item Pipeline (3*)
process_item()
- Write HTML/PDF to disk or a database (sketch below)
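A rough sketch of that pipeline; the storage root and the item fields (`url`, `body`) are placeholders pending the I/O decision below:

```python
import hashlib
from pathlib import Path

class WriteToDiskPipeline:
    """Persist fetched HTML/PDF bodies to the local filesystem."""

    storage_root = Path("data/pages")  # assumed location

    def open_spider(self, spider):
        self.storage_root.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        # Hash the URL so filenames are stable and filesystem-safe.
        name = hashlib.sha1(item["url"].encode()).hexdigest()
        (self.storage_root / name).write_bytes(item["body"])
        return item
```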
Scrapy architecture (the numbered markers show where the hooks above run):

                        [Spiders]
                            |
                 (Spider Middleware)(1*)
                            |
[Item Pipeline](3*) <-- [Engine] <--> (2*)(Downloader Middleware) <--> [Downloader]
                            |
                       [Scheduler]
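For orientation, the three numbered hook points correspond to these Scrapy settings (the module paths are placeholders):

```python
# settings.py -- where each numbered component from the diagram is registered
SPIDER_MIDDLEWARES = {                # (1*) process_spider_input/output
    "crawler.middlewares.MySpiderMiddleware": 500,
}
DOWNLOADER_MIDDLEWARES = {            # (2*) process_request/response
    "crawler.middlewares.CacheFallbackMiddleware": 543,
}
ITEM_PIPELINES = {                    # (3*) process_item
    "crawler.pipelines.WriteToDiskPipeline": 300,
}
```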
- Figure out the I/O story (one option sketched below)
- Local FS storage? Database only? An I/O queue, à la ZeroMQ?
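If the queue option wins out, one sketch using pyzmq; the endpoint address is an assumption, and items must be JSON-serializable:

```python
import zmq

class ZmqExportPipeline:
    """Push scraped items to a separate writer process over a ZeroMQ PUSH socket."""

    def open_spider(self, spider):
        self.context = zmq.Context()
        self.socket = self.context.socket(zmq.PUSH)
        self.socket.connect("tcp://127.0.0.1:5557")  # assumed writer endpoint

    def close_spider(self, spider):
        self.socket.close()
        self.context.term()

    def process_item(self, item, spider):
        self.socket.send_json(dict(item))  # assumes JSON-serializable fields
        return item
```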
Restart/Resume Crawl
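Scrapy's built-in job persistence covers the basic case: point JOBDIR at a directory and the scheduler queue plus dedupe state survive restarts.

```python
# settings.py -- enable pause/resume; the directory name is arbitrary
JOBDIR = "crawls/run-1"
```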
Track success/failure of URL fetch and HTML extraction
Try different fetch methods based on URL fetch status (use login, headless browser, Google cache, etc.); see the sketch below
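A combined sketch of both items: count outcomes in Scrapy's stats collector and escalate failures to the cache-mirror path above. The stat names and the escalation rule are assumptions:

```python
class FetchStatusMiddleware:
    """Track per-fetch success/failure and retry failures via a fallback method."""

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def __init__(self, stats):
        self.stats = stats

    def process_response(self, request, response, spider):
        if response.status < 400:
            self.stats.inc_value("fetch/success")
            return response
        self.stats.inc_value("fetch/failure")
        # First escalation: re-route through the cache mirror sketched earlier.
        if not request.meta.get("use_cache_mirror"):
            return request.replace(
                meta={**request.meta, "use_cache_mirror": True},
                dont_filter=True,
            )
        return response  # out of fallbacks; let the spider see the failure
```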
Fetch JS-enabled pages
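One candidate for this is scrapy-playwright; a sketch, assuming the package is installed and its download handler is enabled, with a placeholder URL:

```python
import scrapy

class JsPageSpider(scrapy.Spider):
    """Fetch a JS-rendered page via scrapy-playwright.

    Needs in settings.py:
        DOWNLOAD_HANDLERS = {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        }
        TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    """

    name = "jspages"

    def start_requests(self):
        # The meta flag hands the fetch to a headless Playwright browser.
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        # response.text now holds the JS-rendered HTML.
        yield {"url": response.url, "body": response.text}
```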
Fetch PDFs (links, documents)
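For PDFs, Scrapy's stock FilesPipeline already does download-and-store via its standard `file_urls`/`files` item fields; a sketch, with placeholder start URL and store path:

```python
import scrapy

class PdfSpider(scrapy.Spider):
    """Collect PDF links and let FilesPipeline download them.

    Needs in settings.py:
        ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
        FILES_STORE = "data/files"  # local path; S3/GCS URIs also work
    """

    name = "pdfs"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # FilesPipeline downloads everything listed in `file_urls` and
        # records the results under `files` on the emitted item.
        links = response.css('a[href$=".pdf"]::attr(href)').getall()
        yield {"file_urls": [response.urljoin(u) for u in links]}
```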