A web crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.
NOTE: This project is under active development
Features | Install | Usage | Documentation | Support
- Write spiders in YAML configuration (see the sketch below this list).
- Define multiple extractors per spider.
- Traverse between multiple websites.
- Use standard extractors to scrape data such as tables, paragraphs, and page metadata.
- Define custom extractors in the YAML config to scrape data in the format you want.
- Write Python extractors for advanced extraction strategies.
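To give a rough idea of what a YAML spider definition can look like, here is a minimal, purely illustrative sketch. The field and extractor names below are assumptions, not the documented schema; refer to doc/index.md and the examples/ folder for the actual configuration format.

```yaml
# Illustrative sketch only: field names are assumptions, not the documented schema.
# See doc/index.md for the real configuration reference.
spider:
  name: example-spider
  start_urls:
    - https://example.com/articles
  extractors:
    # a standard extractor that pulls tables from the page (assumed name)
    - extractor_type: TableContentExtractor
      extractor_id: tables
    # a custom extractor describing the fields you want (assumed structure)
    - extractor_type: CustomContentExtractor
      extractor_id: article
      data_selectors:
        - selector_id: title
          selector: h1.title
          selector_type: css
        - selector_id: summary
          selector: p.summary
          selector_type: css
```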
pip install git+https://github.com/crawlerflow/crawlerflow#egg=crawlerflow
# This project is under constant development and might break any previous implementation.
To run a single-website spider, extracting information from one website only:
crawlerflow --path examples/ --type=web
Refer to the examples in the examples/ folder or check doc/index.md for more details.
A few features, such as IP rotation, headless browsing, data backups, scheduling, and monitoring, are available in our CrawlerFlow Cloud version.
For any further queries or dedicated support, please feel free to contact us.