Move post card images to S3
simonw opened this issue · 16 comments
I can increase the resolution of the images too when I do this, since they'll no longer need to be kept small to save space.
I can use the til.simonwillison.net S3 bucket for this.
Images are currently generated by shot-scraper, run from this Python script (lines 15 to 36 in e2e4819).
Huh... those are PNGs. I bet they'd be a lot smaller if they were JPEGs, and even retina JPEGs might be smaller while still displaying well.
Ran this locally:
```
datasette . --get /sqlite/multiple-indexes > generate.html
```
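(`datasette --get` executes a single request against the instance without starting a server and writes the response body to standard output, which makes it easy to capture the rendered page as a static HTML file.)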
Then:
```
shot-scraper shot generate.html -w 800 -h 400 --retina
```
That produced a 216KB image.
Tried a JPEG too - quality 80 was almost as big, but this got a smaller image (159KB):
```
shot-scraper shot generate.html -w 800 -h 400 --retina --quality 60
```
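To settle on a quality setting, a quick sweep like this makes the size trade-off visible (a sketch, not from the original workflow; it assumes shot-scraper is on the PATH and generate.html is the file produced above):

```python
# Sketch: compare JPEG sizes at a few quality settings.
import subprocess
from pathlib import Path

for quality in (80, 70, 60):
    out = f"generate-q{quality}.jpg"
    subprocess.run(
        [
            "shot-scraper", "shot", "generate.html",
            "-w", "800", "-h", "400", "--retina",
            "--quality", str(quality),
            "-o", out,
        ],
        check=True,
    )
    print(out, Path(out).stat().st_size, "bytes")
```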
The biggest question to decide is how to tell whether an image has already been created in S3.
I'm tempted to do it based on the filename: use the shot hash as that name, do a quick list-files operation to see what files exist already, create the ones that don't.
That should run in GitHub Actions, generating JPEGs for every post and uploading them to S3.
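Roughly, the loop could look like this (a minimal sketch, not the actual generate_screenshots.py: boto3, the shots/ directory of pre-rendered HTML, and md5-of-HTML as the shot hash are all assumptions here):

```python
# Sketch: create and upload only the screenshots missing from S3.
import hashlib
import pathlib
import subprocess

import boto3

BUCKET = "til.simonwillison.net"
s3 = boto3.client("s3")

# One list operation up front to learn which shots already exist
existing = {
    obj["Key"]
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET)
    for obj in page.get("Contents", [])
}

# Assumes each post was pre-rendered with "datasette . --get ... > shots/<name>.html"
for html in pathlib.Path("shots").glob("*.html"):
    key = hashlib.md5(html.read_bytes()).hexdigest() + ".jpg"
    if key in existing:
        continue  # already in S3, skip the expensive screenshot
    jpeg = html.with_suffix(".jpg")
    subprocess.run(
        [
            "shot-scraper", "shot", str(html),
            "-w", "800", "-h", "400", "--retina", "--quality", "60",
            "-o", str(jpeg),
        ],
        check=True,
    )
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=jpeg.read_bytes(),
        ContentType="image/jpeg",
        ACL="public-read",  # the public s3.amazonaws.com URLs depend on this
    )
```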
https://github.com/simonw/til/actions/runs/4842339363/jobs/8629221973
It's working...
```
% s3-credentials list-bucket til.simonwillison.net
[
    {
        "Key": "0cf1e455f161435a4aea07480c27da89.jpg",
        "LastModified": "2023-04-30 03:54:06+00:00",
        "ETag": "\"c1ef69673fda4ebf1cd1cfa41d8dc255\"",
        "Size": 90039,
        "StorageClass": "STANDARD"
    },
    {
        "Key": "1447c8cdd4caa68e5514a1bb5b9f9f49.jpg",
        "LastModified": "2023-04-30 03:54:12+00:00",
        "ETag": "\"4adfdd03def8e54c651451f5b56e43b9\"",
        "Size": 111841,
        "StorageClass": "STANDARD"
    },
    {
        "Key": "14e4b902d5511a639a6c8d1e91d3dabb.jpg",
        "LastModified": "2023-04-30 03:54:35+00:00",
        "ETag": "\"2d3e29f3eaca62ba688c04a82d923fba\"",
        "Size": 118002,
        "StorageClass": "STANDARD"
    },
```
Generated image example: http://s3.amazonaws.com/til.simonwillison.net/f19a4a99ca28b20786ed7e35d8f9a8e7.jpg
To see how many are done:
```
% s3-credentials list-bucket til.simonwillison.net | jq length
43
```
410 total.
Partial logs from that GitHub Actions run:
```
Stored 96126 byte JPEG for github-actions_grep-tests.md shot hash 3e71efb58ec2d72ce37d6c93d7ace74e
Stored 70990 byte JPEG for github-actions_commit-if-file-changed.md shot hash 3b4a2012993962434fc8f5853cf5396b
Stored 72935 byte JPEG for bash_loop-over-csv.md shot hash d06963c31326ae773a8e7face614668c
```
It finished. All 410 images should be there now.
This query shows all the images on one page:
```sql
select
  json_object(
    'img_src',
    'https://s3.amazonaws.com/til.simonwillison.net/' || shot_hash || '.jpg',
    'width',
    400
  ) as img
from
  til
```
https://til.simonwillison.net/tils
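(The json_object with an img_src key is the convention the datasette-json-html plugin uses to render a JSON value as an <img> tag, which is what turns this query result into a page full of images.)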
I scrolled through and they all look good. This one was a favourite: https://s3.amazonaws.com/til.simonwillison.net/990ce33b65e40356be0035f185b3484c.jpg
Last steps:
- Remove the datasette-media plugin and configuration
- Delete the old cached images
- Update the template to reference the new ones (oh no! That's going to require regenerating them all, since the template hash will change)
Oops broke it:
```
Traceback (most recent call last):
  File "generate_screenshots.py", line 92, in <module>
    generate_screenshots(root)
  File "generate_screenshots.py", line 55, in generate_screenshots
    shot_html_hash.update(filepath.read_text().encode("utf-8"))
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1236, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/til/til/main/templates/row.html'
```
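The failing line is the one that feeds the row.html template into the shot hash, which is also why changing the template forces regenerating every image; the error itself is just the script looking for templates/row.html at a path that doesn't exist in the Actions checkout. A reconstruction of that hashing step from the traceback (md5 assumed from the 32-character S3 keys; not the actual code):

```python
# Reconstruction: the shot hash covers the page HTML plus the template,
# so editing row.html changes the hash (and filename) of every image.
import hashlib
from pathlib import Path

def shot_hash(page_html: str, template_path: Path) -> str:
    h = hashlib.md5()
    h.update(page_html.encode("utf-8"))
    # This read is what raised FileNotFoundError when templates/row.html
    # wasn't where the script expected it in the Actions checkout.
    h.update(template_path.read_text().encode("utf-8"))
    return h.hexdigest()
```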
That's deployed now.
Wrote this up as a TIL: https://til.simonwillison.net/shot-scraper/social-media-cards