A simple web scraper in Python adhering to best practices.
- Clone this repository on your local machine
- Run
docker build --tag python-scraper app/
- Run
docker run --mount type=bind,source="$(pwd)/app",target=/app/ --env ENVIRONMENT="prod" python-scraper --website http(s)://www.<anywebsite>.com/
The Docker container should output two files in the app/ directory:
- "content_file.json" -> This is a JSON list of all the external URLs loaded by the page provided
- "word_count_file.json" -> This is a JSON dictionary with the frequency of each word appearing in the privacy policy
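
To inspect the two output files after a run, something like the following could be used. This is a minimal sketch, not part of the repository: the helper name `summarize_outputs` and its parameters are hypothetical, and it assumes only what is described above (a JSON list of URLs and a JSON dictionary mapping words to counts).

```python
import json
from pathlib import Path


def summarize_outputs(app_dir="app", top_n=5):
    """Return (number of external URLs, top_n most frequent words).

    Hypothetical helper: assumes content_file.json holds a JSON list and
    word_count_file.json holds a JSON dictionary of word -> frequency.
    """
    app_path = Path(app_dir)

    # content_file.json: JSON list of external URLs
    with open(app_path / "content_file.json", encoding="utf-8") as f:
        urls = json.load(f)

    # word_count_file.json: JSON dictionary mapping each word to its count
    with open(app_path / "word_count_file.json", encoding="utf-8") as f:
        word_counts = json.load(f)

    # Sort by frequency, highest first, and keep the first top_n entries
    top_words = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return len(urls), top_words
```

Calling `summarize_outputs("app")` from the repository root would then report how many external URLs were found and which words dominate the privacy policy.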
You can change the env variable to "dev" for more thorough logs like so:
docker run --env ENVIRONMENT="dev" python-scraper --website http(s)://www.<anywebsite>.com/
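
How the scraper turns ENVIRONMENT into log verbosity is not shown here. One plausible approach, sketched under the assumption that "dev" maps to DEBUG and anything else to INFO, is below. The function names `log_level_for` and `configure_logging` are hypothetical and may not match the repository's code.

```python
import logging
import os


def log_level_for(environment):
    """Map an ENVIRONMENT value to a logging level (assumed mapping)."""
    # Assumption: "dev" means verbose logs; every other value is treated as prod
    return logging.DEBUG if environment.lower() == "dev" else logging.INFO


def configure_logging():
    """Configure the root logger from the ENVIRONMENT variable."""
    # Default to "prod" when the variable is not set (an assumption)
    environment = os.environ.get("ENVIRONMENT", "prod")
    level = log_level_for(environment)
    logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(message)s")
    return level
```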
## Running tests:
- Install
pytest
- Run
python3 -m pytest
and the tests should run :)
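
For reference, pytest collects functions named `test_*` from files named `test_*.py` or `*_test.py`. The sketch below shows what a test for the word-frequency step might look like. The `count_words` function is a hypothetical stand-in written for this example, not the repository's actual implementation.

```python
import re
from collections import Counter


def count_words(text):
    """Hypothetical stand-in: count lowercase word occurrences in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(words))


def test_count_words():
    # pytest would discover this function if it lived in a test_*.py file
    result = count_words("Privacy matters. Privacy!")
    assert result == {"privacy": 2, "matters": 1}
```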