pip install arche
Arche (pronounced Arkey) helps verify scraped data using a set of defined rules, for example:
- Validation with JSON schema
- Coverage (items, fields, categorical data, including booleans and enums)
- Duplicates
- Garbage symbols
- Comparison of two jobs
We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.
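To make the rules listed above more concrete, here is a minimal sketch of three of them (field coverage, duplicates, and garbage symbols) in plain Python. It does not use Arche's API and is not how Arche implements these checks. The item fields, the helper names (`field_coverage`, `find_duplicates`, `find_garbage`), and the definition of "garbage" as leftover HTML tags or entities are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical scraped items; the field names are made up for illustration.
items = [
    {"url": "https://example.com/1", "name": "Blue mug", "price": 4.5},
    {"url": "https://example.com/2", "name": "Red mug", "price": None},
    {"url": "https://example.com/1", "name": "Blue mug", "price": 4.5},
    {"url": "https://example.com/3", "name": "<b>Green mug</b>&nbsp;"},
]


def field_coverage(items):
    """Return the share of items in which each field is present and not None."""
    counts = Counter(
        field
        for item in items
        for field, value in item.items()
        if value is not None
    )
    return {field: counts[field] / len(items) for field in sorted(counts)}


def find_duplicates(items, key):
    """Return the values of `key` that occur in more than one item."""
    seen = Counter(item.get(key) for item in items)
    return sorted(v for v, n in seen.items() if v is not None and n > 1)


# Assumption: "garbage symbols" are leftover HTML tags or entities.
GARBAGE = re.compile(r"<[^>]+>|&\w+;")


def find_garbage(items, field):
    """Return the indexes of items whose `field` contains leftover markup."""
    return [
        i
        for i, item in enumerate(items)
        if isinstance(item.get(field), str) and GARBAGE.search(item[field])
    ]


coverage = field_coverage(items)
duplicates = find_duplicates(items, "url")
garbage = find_garbage(items, "name")
print(coverage, duplicates, garbage)
```

Each helper returns plain data (a dict or a list), so the results can be turned into pass/fail rules by comparing them against thresholds that suit the dataset.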
Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.
For JupyterLab, you will also need to install the plotly extensions.
Then run pip install arche
Arche is meant for checking the quality of scraped data continuously. For example, after scraping a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up Spidermon.
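As a rough illustration of validating scraped items against a schema after each crawl, the sketch below checks items against a small schema-like dict. It is a simplified stand-in that handles only `required` and `type`. It is not a full JSON schema validator, and it uses neither Arche's nor Spidermon's API. The schema, the item fields, and the helper names (`validate_item`, `check_job`) are hypothetical.

```python
# Simplified stand-in for JSON schema validation: only "required" and
# "type" are checked. A real setup would use a full validator.
SCHEMA = {
    "required": ["url", "name"],
    "properties": {
        "url": {"type": "string"},
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
}

# Mapping from schema type names to Python types (only the two used here).
TYPES = {"string": str, "number": (int, float)}


def matches_type(value, type_name):
    """Check a value against a schema type name."""
    if isinstance(value, bool):
        # bool is an int subclass in Python, but not a JSON number or string.
        return False
    return isinstance(value, TYPES[type_name])


def validate_item(item, schema):
    """Return a list of error messages for one item (empty if it is valid)."""
    errors = []
    for field in schema["required"]:
        if field not in item:
            errors.append(f"{field}: required field is missing")
    for field, rule in schema["properties"].items():
        if field in item and not matches_type(item[field], rule["type"]):
            errors.append(f"{field}: expected {rule['type']}")
    return errors


def check_job(items, schema):
    """Map the index of every failing item to its error messages."""
    report = {}
    for index, item in enumerate(items):
        errors = validate_item(item, schema)
        if errors:
            report[index] = errors
    return report


# Hypothetical output of one crawl; the second item is deliberately broken.
items = [
    {"url": "https://example.com/1", "name": "Blue mug", "price": 4.5},
    {"url": "https://example.com/2", "price": "4.5"},
]

report = check_job(items, SCHEMA)
print(report)
```

Running a check like this after every crawl, and failing the run when the report is not empty, is one simple way to make the validation continuous.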
For development, install the dependencies, enter the virtual environment, and run the tests:
pipenv install --dev
pipenv shell
tox
Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.