Scraping Monitoring

The logs are analysed to monitor any scraping anomalies.

Architecture

schema-monitoring.png

Description

Input

extractor.py and pensieve_interface.py extract collect data from the Pensieve database via the GET /by_collect_date endpoint, given a range of dates.
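
As a rough illustration of the input step, the request for a range of dates could be assembled as below. This is only a sketch: the base URL, the query parameter names `start_date` and `end_date`, and the helper name `build_collect_url` are assumptions and are not taken from extractor.py or pensieve_interface.py.

```python
from datetime import date
from urllib.parse import urlencode


def build_collect_url(base_url: str, start: date, end: date) -> str:
    """Build a by_collect_date URL (query parameter names are hypothetical)."""
    if start > end:
        raise ValueError("start date must not be after end date")
    query = urlencode({"start_date": start.isoformat(), "end_date": end.isoformat()})
    return f"{base_url.rstrip('/')}/by_collect_date?{query}"


# Hypothetical base URL, for illustration only.
url = build_collect_url("https://pensieve.example", date(2021, 3, 1), date(2021, 3, 18))
print(url)
```

Validating the range before building the URL keeps an inverted range from silently returning an empty result.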

Transformations

exhaustive_transformer.py
This transformation was used for testing. It reproduces Pensieve's database by normalizing the infos field, and also calculates error rates.

first_indicator_transformer.py
For each territory, detects collects with issues based on these criteria:

  • the updated_at interval exceeds 12 days
  • item_scraped_count drops from more than 10 to zero
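
The two criteria above can be sketched as follows. This is not the code of first_indicator_transformer.py: it assumes that the "interval" is the gap between two consecutive collects of the same territory, and the function name, the reason labels and the sample data are made up for illustration.

```python
from datetime import date, timedelta

MAX_UPDATE_GAP = timedelta(days=12)
MIN_PREVIOUS_ITEMS = 10


def flag_collects(collects):
    """Return (collect_id, reason) pairs for collects matching either criterion.

    `collects` is a list of dicts for ONE territory, each with the keys
    collect_id, updated_at (date) and item_scraped_count (int).
    """
    ordered = sorted(collects, key=lambda c: c["updated_at"])
    flagged = []
    # Compare each collect with the one that precedes it in time.
    for previous, current in zip(ordered, ordered[1:]):
        if current["updated_at"] - previous["updated_at"] > MAX_UPDATE_GAP:
            flagged.append((current["collect_id"], "update_gap"))
        if (previous["item_scraped_count"] > MIN_PREVIOUS_ITEMS
                and current["item_scraped_count"] == 0):
            flagged.append((current["collect_id"], "items_dropped_to_zero"))
    return flagged


# Made-up sample data for one territory.
sample = [
    {"collect_id": 1, "updated_at": date(2021, 3, 1), "item_scraped_count": 25},
    {"collect_id": 2, "updated_at": date(2021, 3, 4), "item_scraped_count": 0},
    {"collect_id": 3, "updated_at": date(2021, 3, 18), "item_scraped_count": 12},
]
flags = flag_collects(sample)
print(flags)
```

In this sample, collect 2 is flagged because the item count fell from 25 to 0, and collect 3 because 14 days passed since the previous collect.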

Analysis

The variables that would probably be the most useful for detecting problems are:

  • collect_id
  • territory_uid
  • website
  • updated_at
  • request_count (number of paths available on the website)
  • response_count (number of paths visited by the spider)
  • response_status_count/4XX or 5XX (number of responses with a particular error code)
  • item_scraped_count (number of items scraped)
  • sqs_pushed_count (number of items pushed further in the pipeline)

To prepare the analysis, we need to normalize the JSON containing the scraping statistics and merge it with the columns containing the metadata:

  • Normalize: convert infos (JSON) to a dataframe
  • Relabel: rename the columns
  • Clean: filter when multiple collects happen on a single day for a website
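
The three preparation steps can be sketched with the standard library alone, using a list of dicts in place of a dataframe. Several points are assumptions: the nested shape of infos in the sample, the '/' separator, the `labels` mapping, and the choice to keep only the latest collect when a website has several on the same day (the text above does not say which one is kept).

```python
import json


def flatten(nested, prefix=""):
    """Flatten nested dicts into one level, joining keys with '/'."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def prepare(rows, labels):
    """Normalize infos, relabel columns, keep the latest collect per website and day."""
    latest = {}
    for row in rows:
        # Normalize: merge metadata columns with the flattened infos JSON.
        record = {k: v for k, v in row.items() if k != "infos"}
        record.update(flatten(json.loads(row["infos"])))
        # Relabel: rename columns listed in `labels`, keep the others as-is.
        record = {labels.get(k, k): v for k, v in record.items()}
        # Clean: one record per (website, day); ISO timestamps compare as strings.
        key = (record["website"], record["updated_at"][:10])
        if key not in latest or record["updated_at"] > latest[key]["updated_at"]:
            latest[key] = record
    return list(latest.values())


# Made-up rows: two collects of the same website on the same day.
rows = [
    {"collect_id": 1, "website": "a.example", "updated_at": "2021-03-18T08:00:00",
     "infos": '{"item_scraped_count": 5, "downloader": {"response_count": 40}}'},
    {"collect_id": 2, "website": "a.example", "updated_at": "2021-03-18T20:00:00",
     "infos": '{"item_scraped_count": 7, "downloader": {"response_count": 42}}'},
]
prepared = prepare(rows, {"downloader/response_count": "response_count"})
print(prepared)
```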

Analyse statistics for each collect and their evolution for a given website:

  • errors_count: total number of errors, and number of 4XX and 5XX errors
  • errors_rate: number of errors divided by the number of responses
  • errors_progression: evolution of errors over time for a website
  • more indicators to be determined
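
A minimal sketch of these indicators is given below. It assumes flat statistics whose keys look like `response_status_count/<code>`, and it takes errors_progression to be the difference in error rate between consecutive collects. Both are assumptions, as are the function names and the sample values.

```python
def error_stats(stats):
    """Compute error counts and the error rate from flat per-collect statistics."""
    prefix = "response_status_count/"
    count_4xx = sum(v for k, v in stats.items() if k.startswith(prefix + "4"))
    count_5xx = sum(v for k, v in stats.items() if k.startswith(prefix + "5"))
    errors = count_4xx + count_5xx
    responses = stats.get("response_count", 0)
    # Avoid dividing by zero when a collect received no response at all.
    rate = errors / responses if responses else 0.0
    return {
        "errors_count": errors,
        "errors_4xx": count_4xx,
        "errors_5xx": count_5xx,
        "errors_rate": rate,
    }


def errors_progression(rates):
    """Change in error rate between each collect and the previous one."""
    return [current - previous for previous, current in zip(rates, rates[1:])]


# Made-up statistics for two successive collects of one website.
first = error_stats({
    "response_count": 100,
    "response_status_count/200": 90,
    "response_status_count/404": 8,
    "response_status_count/500": 2,
})
second = error_stats({
    "response_count": 50,
    "response_status_count/200": 30,
    "response_status_count/403": 15,
    "response_status_count/503": 5,
})
progression = errors_progression([first["errors_rate"], second["errors_rate"]])
print(first, second, progression)
```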

Storage

Store a JSON file containing the statistics for each day of collect in an S3 bucket.
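
One possible way to prepare one JSON document per day of collect is sketched below. The key layout (`monitoring/<day>.json`), the function name and the sample records are hypothetical, and the upload itself is left out.

```python
import json
from collections import defaultdict


def daily_payloads(records, prefix="monitoring"):
    """Group records by collect day and return {object_key: json_body}."""
    by_day = defaultdict(list)
    for record in records:
        # The day is the date part of an ISO timestamp, e.g. "2021-03-18".
        by_day[record["updated_at"][:10]].append(record)
    return {
        f"{prefix}/{day}.json": json.dumps(day_records, sort_keys=True)
        for day, day_records in sorted(by_day.items())
    }


# Made-up records spanning two days.
payloads = daily_payloads([
    {"collect_id": 1, "updated_at": "2021-03-17T09:00:00", "errors_rate": 0.1},
    {"collect_id": 2, "updated_at": "2021-03-18T09:00:00", "errors_rate": 0.4},
    {"collect_id": 3, "updated_at": "2021-03-18T10:00:00", "errors_rate": 0.0},
])
print(list(payloads))
```

Each key and body pair could then be uploaded with an S3 client, for instance boto3's `put_object(Bucket=..., Key=key, Body=body)`; the bucket name is not given in this document.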

Visualisation

Google Sheet with a detailed sheet and a summary sheet.

Output

Google Sheet Output Test
Google Sheet Output

Alert date   Alert ID   Department name   Code          Name            URL in Pensieve   Collect status
18-03-2021   S1-0001    Normandie         FRCOMM76600   Le Havre        http://           finished
18-03-2021   S1-0001    Normandie         FRCOMM76290   Montivilliers   http://           closespider_timeout
18-03-2021   S1-0001    Normandie         FRCOMM76000   Rouen           http://           finished

Test the tool

In ws-pensieve (test data in prod):

  export DATABASE_HOST=rdb.pensieve.explain.fr

In monitoring_collect_docs_admins (local):

  unset PENSIEVE_URL