The logs are analysed to monitor any scraping anomalies.
extractor.py and pensieve_interface.py
Extract collection data from the Pensieve database via the endpoint GET /by_collect_date, given a range of dates.
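As a rough illustration of the extraction step, the sketch below only builds the request URL for a date range. The query parameter names `start_date` and `end_date`, the helper `build_collect_url`, and the fallback base URL are assumptions made for the example; the real interface in pensieve_interface.py may differ.

```python
import os
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Hypothetical fallback, used only when PENSIEVE_URL is not set.
DEFAULT_BASE_URL = "http://localhost:8000"


def build_collect_url(start: date, end: date, base_url: Optional[str] = None) -> str:
    """Build the URL for GET /by_collect_date over a date range."""
    if start > end:
        raise ValueError("start must not be after end")
    base = base_url or os.environ.get("PENSIEVE_URL") or DEFAULT_BASE_URL
    # "start_date" and "end_date" are assumed parameter names.
    query = urlencode({"start_date": start.isoformat(), "end_date": end.isoformat()})
    return f"{base.rstrip('/')}/by_collect_date?{query}"
```

The returned URL can then be passed to whichever HTTP client the project uses.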
exhaustive_transformer.py
This transformation was used for the tests. It reproduces Pensieve's database by normalizing the infos field, and it also calculates error rates.
first_indicator_transformer.py
For each territory, detects collects with issues based on these criteria:
- updated_at: the interval is higher than 12 days
- item_scraped_count: changes from a count higher than 10 to nothing
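The two criteria can be sketched as follows. This is a minimal illustration, not the code of first_indicator_transformer.py: it assumes the interval is measured between consecutive collects of one territory, and the function name, dict layout and reason labels are hypothetical.

```python
from datetime import datetime, timedelta

MAX_INTERVAL = timedelta(days=12)  # threshold on the updated_at interval
MIN_PREVIOUS_ITEMS = 10            # previous count must be higher than this


def detect_issues(collects):
    """Return (collect_id, reason) pairs for the collects of one territory.

    Each collect is a dict with the keys "collect_id", "updated_at"
    (datetime) and "item_scraped_count" (int).
    """
    ordered = sorted(collects, key=lambda c: c["updated_at"])
    issues = []
    # Compare each collect with the one immediately before it.
    for previous, current in zip(ordered, ordered[1:]):
        if current["updated_at"] - previous["updated_at"] > MAX_INTERVAL:
            issues.append((current["collect_id"], "updated_at_gap"))
        if (previous["item_scraped_count"] > MIN_PREVIOUS_ITEMS
                and current["item_scraped_count"] == 0):
            issues.append((current["collect_id"], "items_dropped_to_zero"))
    return issues


# Made-up example data for a single territory.
example_collects = [
    {"collect_id": 1, "updated_at": datetime(2021, 3, 1), "item_scraped_count": 25},
    {"collect_id": 2, "updated_at": datetime(2021, 3, 2), "item_scraped_count": 0},
    {"collect_id": 3, "updated_at": datetime(2021, 3, 18), "item_scraped_count": 30},
]
example_issues = detect_issues(example_collects)
```

In this example, collect 2 is flagged for dropping from 25 items to none, and collect 3 for arriving 16 days after the previous collect.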
The variables that would probably be the most useful for detecting problems are:
- collect_id
- territory_uid
- website
- updated_at
- request_count (number of paths available on the website)
- response_count (number of paths visited by the spider)
- response_status_count/4XX or 5XX (number of responses with particular error code)
- item_scraped_count (number of items scraped)
- sqs_pushed_count (number of items pushed further in the pipeline)
To prepare the analysis, we need to normalize the JSON containing the scraping statistics and merge it with the columns containing the metadata:
- Normalize: convert infos (JSON) to a dataframe
- Relabel: rename the columns
- Clean: filter out duplicates when multiple collects happen in a single day for a website
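The preparation steps above could look like the sketch below, written with plain dicts rather than a dataframe to stay dependency-free. The raw key names (assumed here to follow Scrapy-style stat names), the `RENAMES` mapping, and the choice to keep the latest collect of the day are all assumptions made for the example.

```python
import json
from datetime import datetime

# Hypothetical mapping from raw stat names to shorter column labels.
RENAMES = {
    "downloader/request_count": "request_count",
    "downloader/response_count": "response_count",
}


def normalize(row):
    """Merge the metadata columns with the parsed infos JSON."""
    flat = {key: value for key, value in row.items() if key != "infos"}
    flat.update(json.loads(row["infos"]))
    return flat


def relabel(row):
    """Rename columns according to RENAMES, leaving other names unchanged."""
    return {RENAMES.get(key, key): value for key, value in row.items()}


def clean(rows):
    """Keep only the latest collect per website and calendar day."""
    latest = {}
    for row in rows:
        key = (row["website"], row["updated_at"].date())
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: (r["website"], r["updated_at"]))


def prepare(rows):
    return clean([relabel(normalize(row)) for row in rows])


# Made-up rows: two collects on the same day, one on the next day.
example_rows = [
    {"collect_id": 1, "website": "a.fr", "updated_at": datetime(2021, 3, 18, 8),
     "infos": '{"downloader/request_count": 10, "item_scraped_count": 3}'},
    {"collect_id": 2, "website": "a.fr", "updated_at": datetime(2021, 3, 18, 20),
     "infos": '{"downloader/request_count": 12, "item_scraped_count": 4}'},
    {"collect_id": 3, "website": "a.fr", "updated_at": datetime(2021, 3, 19, 8),
     "infos": '{"downloader/request_count": 9}'},
]
prepared = prepare(example_rows)
```

Here collect 1 is dropped because collect 2 ran later on the same day for the same website.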
Analyse some statistics on each collect and their evolution for one website:
- errors_count: total number of errors, and number of 4XX and 5XX errors
- errors_rate: number of errors / number of responses
- errors_progression: evolution of errors on a website
- more indicators to be determined
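A minimal sketch of these indicators, assuming each collect's statistics are available as a flat dict whose keys look like `response_status_count/404`. Treating errors_progression as the difference in error count between consecutive collects is one possible reading, not a confirmed definition, and the function names are hypothetical.

```python
def error_stats(stats):
    """Compute error counts and the error rate for one collect."""
    prefix = "response_status_count/"
    errors_4xx = sum(v for k, v in stats.items() if k.startswith(prefix + "4"))
    errors_5xx = sum(v for k, v in stats.items() if k.startswith(prefix + "5"))
    errors_count = errors_4xx + errors_5xx
    responses = stats.get("response_count", 0)
    return {
        "errors_count": errors_count,
        "errors_4xx": errors_4xx,
        "errors_5xx": errors_5xx,
        # Avoid dividing by zero when the spider received no response.
        "errors_rate": errors_count / responses if responses else 0.0,
    }


def errors_progression(error_counts):
    """Change in error count between consecutive collects of one website."""
    return [after - before for before, after in zip(error_counts, error_counts[1:])]


# Made-up statistics for one collect.
example_stats = {
    "response_count": 200,
    "response_status_count/200": 180,
    "response_status_count/403": 5,
    "response_status_count/404": 10,
    "response_status_count/500": 5,
}
example_result = error_stats(example_stats)
```

A positive value in the progression means more errors than in the previous collect of the same website.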
Store a JSON file containing the statistics per day of collect in an S3 bucket.
Google Sheet with a detailed sheet and a summary sheet.
Google Sheet Output Test
Google Sheet Output
Alert date | Alert ID | Department name | Code | Name | URL in Pensieve | Collect status |
---|---|---|---|---|---|---|
18-03-2021 | S1-0001 | Normandie | FRCOMM76600 | Le Havre | http:// | finished |
18-03-2021 | S1-0001 | Normandie | FRCOMM76290 | Montivilliers | http:// | closespider_timeout |
18-03-2021 | S1-0001 | Normandie | FRCOMM76000 | Rouen | http:// | finished |
In ws-pensieve (test data in prod)
export DATABASE_HOST=rdb.pensieve.explain.fr
In monitoring_collect_docs_admins (local)
unset PENSIEVE_URL