CrawlerFlow

A web crawler orchestration framework that lets you create datasets from multiple web sources with YAML configurations.

NOTE: This project is under active development


Features | Install | Usage | Documentation | Support

Features

  1. Write spiders as YAML configurations (see the sketch after this list).

  2. Define multiple extractors per spider.

  3. Traverse between multiple websites.

  4. Use standard extractors to scrape common data such as tables, paragraphs, and page metadata.

  5. Define custom extractors in the YAML config to scrape data in the format you want.

  6. Write Python extractors for advanced extraction strategies.
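
Below is a minimal sketch of what a YAML spider definition could look like. The field names (name, start_urls, extractors, selector, python) are illustrative assumptions, not the exact CrawlerFlow schema; refer to the examples/ folder for real configurations.

# Hypothetical spider definition -- field names are illustrative,
# not the exact CrawlerFlow schema.
name: books-spider
start_urls:
  - https://example.com/books
extractors:
  - name: book_titles                               # custom extractor defined in YAML
    selector: "h3 a::attr(title)"
  - type: tables                                    # standard extractor (tables, paragraphs, metadata)
  - python: my_project.extractors.PriceExtractor    # hypothetical Python extractor for advanced extraction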

Install

pip install git+https://github.com/crawlerflow/crawlerflow#egg=crawlerflow
# This project is under constant development and might break previous implementations.

Usage

To run a single-website spider that extracts information from one website only:

crawlerflow --path examples/ --type=web

Documentation

Refer to the examples in the examples/ folder or check doc/index.md for more details.

Support

A few features, such as IP rotation, headless browsing, data backups, scheduling, and monitoring, are available in our CrawlerFlow Cloud version.

For any further queries or dedicated support, please feel free to contact us.