├── db
│   ├── imgs.db
│   ├── imgs.dump.sql
│   └── schema.sql
├── img-original
│   └── {website-hostname}
│       └── {image-filename}
├── img-rotated
│   └── {website-hostname}
│       └── {image-filename}
├── log
│   └── {YYYY-MM-DD--HH-MM-SS}.log
├── src
│   ├── tests
│   │   └── test_urlutils.py
│   ├── fsutils.py
│   ├── main.py
│   └── urlutils.py
├── README.md
├── config.ini   # script config
└── input.txt    # a list of websites (full URLs) to fetch images from
- Both `config.ini` and `input.txt` are readable and their content is valid.
- Network is available, but it's not given that all webpages are reachable.
- There's enough RAM to handle all the images.
- The local FS is writable and has enough free space to store the files.
- The possibility of the same image being accessible under different URLs can be neglected (images are treated as unique as long as their URLs differ).
- Even though it's not specified in the assignment, the script tries to take images equally from all websites, aiming for robust output (idempotence depends on webpage content, not on network latency); see the sketch after this list.
- The script is executed exactly once. Whatever happens, it aims to fetch/process a given number of images. There are scenarios where it'd make sense to re-execute the script (failures, interruptions, input/content changes), but a cleanup would then be needed to avoid mixing outputs from different runs.
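One way to realise the even distribution mentioned above is to round-robin over per-site URL lists before scheduling downloads. A minimal sketch, assuming the image URLs are already grouped per website (the `urls_by_site` shape is a placeholder, not the script's actual data structure):

```python
from itertools import chain, zip_longest


def interleave_per_site(urls_by_site):
    """Round-robin over per-site URL lists so each website contributes
    images in turn, regardless of which responses arrive first."""
    # urls_by_site: {"example.com": [url1, url2, ...], ...} -- hypothetical shape
    columns = zip_longest(*urls_by_site.values())
    return [url for url in chain.from_iterable(columns) if url is not None]
```

Because the result depends only on the parsed page content (dict insertion order is stable), the selection stays deterministic across runs with the same input.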
Orchestrate image processing in a pool (concurrent queue) instead of batching until success or exhaustion. Apart from not having to wait for each batch to finish, this would also cap the number of outbound connections.
It could be implemented by managing a `tasks = set()` of `asyncio.create_task()` calls and `asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)` in a `while` loop against a counter.
However, figuring out a sensible limit value would be a task of its own when optimising for throughput. For the current setup, given a rather small number of images to fetch and a list of somewhat reliable websites, almost every script run ends up producing only a single batch of parallel requests.
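A minimal sketch of that pool, assuming a `process_image(url)` coroutine that downloads and rotates a single image (both `process_image` and `MAX_CONCURRENCY` are placeholders, not names from `src/`):

```python
import asyncio

MAX_CONCURRENCY = 10  # the "sensible limit" discussed above; needs tuning


async def run_pool(image_urls, process_image):
    """Keep at most MAX_CONCURRENCY tasks in flight instead of fixed batches."""
    tasks = set()
    results = []
    pending_urls = list(image_urls)
    while pending_urls or tasks:
        # top up the pool until the concurrency limit is reached
        while pending_urls and len(tasks) < MAX_CONCURRENCY:
            tasks.add(asyncio.create_task(process_image(pending_urls.pop())))
        # wait for at least one task to finish, then refill on the next iteration
        done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        results.extend(t.result() for t in done)
    return results
```

Compared to fixed batches, the pool refills as soon as any single task completes, so a slow website doesn't stall the whole run.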
A typical run takes ~23s to process ~9MB of images from 5 websites, where the first ~3s are spent on fetching and parsing website content. Download speed is ~225Mb/s. Request timeout is configured to 2s.
Manage a pool of reusable HTTP/2 connections. This would also require some experimentation with single vs. multiple connections per origin, to measure which is faster: pipelining requests or running them concurrently.
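This isn't something the current `requests`-based transport can do. Purely as an illustration, a sketch of what it could look like with `httpx` (which would need the `httpx[http2]` extra as a new dependency):

```python
import asyncio

import httpx


async def fetch_all(urls):
    # One client = one shared connection pool; with http2=True requests to the
    # same origin are multiplexed over a single connection when the server
    # supports it. Limits and timeout values here are illustrative only.
    limits = httpx.Limits(max_connections=10, max_keepalive_connections=5)
    async with httpx.AsyncClient(http2=True, limits=limits, timeout=2.0) as client:
        return await asyncio.gather(
            *(client.get(url) for url in urls), return_exceptions=True
        )
```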
If a re-execution scenario were considered, it'd make sense to first check whether some images were already processed, or at least downloaded, before scheduling them for fetching and rotation.
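A sketch of such a pre-check; the `imgs`/`url` table and column names are assumptions, not taken from `db/schema.sql`:

```python
import sqlite3
from pathlib import Path
from urllib.parse import urlparse


def filter_already_processed(image_urls, db_path="./db/imgs.db",
                             rotated_dir="./img-rotated"):
    """Drop URLs that already have a DB row or a rotated file on disk."""
    with sqlite3.connect(db_path) as conn:
        # assumed schema: table "imgs" with a "url" column
        known = {row[0] for row in conn.execute("SELECT url FROM imgs")}
    pending = []
    for url in image_urls:
        parsed = urlparse(url)
        # mirrors the img-rotated/{website-hostname}/{image-filename} layout
        rotated = Path(rotated_dir) / parsed.hostname / Path(parsed.path).name
        if url not in known and not rotated.exists():
            pending.append(url)
    return pending
```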
- HTTP transport: `requests`
- HTML parsing: `beautifulsoup4`
- image processing: `pillow`
- SQL toolkit: `sqlalchemy`
- python versioning: `pyenv`
- env/deps management: `poetry`
- linter: `pycodestyle`
- unit tests: `pytest`
# python version and DB
$ brew install pyenv sqlite
$ pyenv install 3.10.2
# python deps
$ curl -sSL https://install.python-poetry.org | python3 -
$ poetry install
# init DB
$ sqlite3 ./db/imgs.db < ./db/schema.sql
# run
$ time poetry run python ./src/main.py
# check results
$ tree ./img-rotated/
$ du -s ./img-*
$ sqlite3 ./db/imgs.db 'SELECT COUNT(*) FROM imgs;'
$ sqlite3 ./db/imgs.db 'SELECT * FROM imgs;'
$ sqlite3 ./db/imgs.db .dump > ./db/imgs.dump.sql
# lint
$ poetry run pycodestyle --show-source ./src/
# unit tests
$ poetry run pytest ./src/
# cleanup
$ rm ./log/*.log
$ rm -rf ./img-*