GBIF Crawler

This project is responsible for coordinating dataset crawling. The coordinator, crawler, and CLI modules work together to perform the actual crawls.

The webservice and webservice client present crawl status as recorded in ZooKeeper.
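
As a rough illustration of where that status lives, the sketch below reads crawl nodes directly from ZooKeeper using Apache Curator, the client library the coordinator is built on. The connection string and the /crawls path are assumptions made for the example, not the project's actual layout.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.nio.charset.StandardCharsets;

public class CrawlStatusReader {
  public static void main(String[] args) throws Exception {
    // Connection string and root path are placeholders; adjust them to your installation.
    try (CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1.example.org:2181", new ExponentialBackoffRetry(1000, 3))) {
      client.start();
      client.blockUntilConnected();
      // Each child of the (assumed) /crawls node would represent one crawl job.
      for (String child : client.getChildren().forPath("/crawls")) {
        byte[] data = client.getData().forPath("/crawls/" + child);
        System.out.println(child + ": " + new String(data, StandardCharsets.UTF_8));
      }
    }
  }
}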

The Crawler project includes:

  1. crawler: Contains the actual crawlers that speak the various XML and DWC-A/ABCD-A/CamtrapDP dialects needed for crawling the GBIF network
  2. crawler-cleanup: Used to delete crawl jobs in ZooKeeper (see the sub-module README for usage details)
  3. crawler-cli: Provides the services that listen to RabbitMQ for instructions to crawl resources
  4. crawler-coordinator: Coordinates crawl jobs via ZooKeeper (Apache Curator)
  5. crawler-ws: Exposes read-only crawl status and access to logs (see the request sketch after this list)
  6. crawler-ws-client: Java client to the WS
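
As a minimal sketch of querying the read-only webservice without the Java client, the example below issues a plain HTTP request with the JDK's HttpClient. The base URL and path are placeholders only; consult the crawler-ws README for the real endpoints.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CrawlStatusWsExample {
  public static void main(String[] args) throws Exception {
    // Base URL and path are illustrative, not the documented crawler-ws routes.
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(
            URI.create("http://localhost:8080/crawls"))
        .GET()
        .build();
    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}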

Building

See the individual sub-module READMEs for specific details, but in general it is enough to build all components with:

mvn clean package
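
To build a single module together with the modules it depends on, the standard Maven reactor options -pl and -am can be used; the module name below is assumed to match the sub-module directory, for example:

mvn clean package -pl crawler-ws -am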

Sequence

Darwin Core Archive

  • Downloader
    • Validator
      • Metasync
        • Pipelines (all archives)
        • Normalizer (Checklist)

More information is available in the crawler-cli README.
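
To make the handoff between stages concrete, here is a minimal sketch of the pattern the CLI services follow: consume a message announcing that one stage has finished and publish a message for the next. The host, queue names, and message body are assumptions made for illustration; see the crawler-cli README for the real message flow.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class HandoffSketch {
  public static void main(String[] args) throws Exception {
    // Host and queue names are placeholders used only to illustrate the pattern.
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost");
    Connection connection = factory.newConnection();
    Channel channel = connection.createChannel();
    channel.queueDeclare("crawl.downloaded", true, false, false, null);
    channel.queueDeclare("crawl.validate", true, false, false, null);

    // When the downloader reports a finished archive, hand it to the validation stage.
    DeliverCallback onDownloaded = (consumerTag, delivery) -> {
      String datasetKey = new String(delivery.getBody(), StandardCharsets.UTF_8);
      channel.basicPublish("", "crawl.validate", null,
          datasetKey.getBytes(StandardCharsets.UTF_8));
      System.out.println("Queued validation for dataset " + datasetKey);
    };
    channel.basicConsume("crawl.downloaded", true, onDownloaded, consumerTag -> { });
  }
}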