Move post card images to S3
simonw opened this issue · 16 comments
I can increase the resolution of the images too when I do this, since they'll no longer need to be kept small to save space.
I can use the til.simonwillison.net S3 bucket for this.
Images are currently generated by shot-scraper, run from this Python script (lines 15 to 36 in e2e4819).
Huh... those are PNGs. I bet they'd be a lot smaller if they were JPEGs, and even retina JPEGs might be smaller while still displaying well.
Ran this locally:
```
datasette . --get /sqlite/multiple-indexes > generate.html
```
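(`datasette --get` executes a single request against the instance without starting a server and writes the response body to standard output, which makes it easy to capture the rendered page as a static HTML file.)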
Then:
```
shot-scraper shot generate.html -w 800 -h 400 --retina
```
That produced a 216KB image.
Tried a JPEG too - quality 80 was almost as big, but this got a smaller image (159KB):
```
shot-scraper shot generate.html -w 800 -h 400 --retina --quality 60
```
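To settle on a quality setting, a quick sweep like this makes the size trade-off visible (a sketch, not from the original workflow; it assumes shot-scraper is on the PATH and generate.html is the file produced above):

```python
# Sketch: compare JPEG sizes at a few quality settings.
import subprocess
from pathlib import Path

for quality in (80, 70, 60):
    out = f"generate-q{quality}.jpg"
    subprocess.run(
        [
            "shot-scraper", "shot", "generate.html",
            "-w", "800", "-h", "400", "--retina",
            "--quality", str(quality),
            "-o", out,
        ],
        check=True,
    )
    print(out, Path(out).stat().st_size, "bytes")
```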
The biggest question to decide is how to tell whether an image has already been created in S3.
I'm tempted to do it based on the filename: use the shot hash as that name, do a quick list-files operation to see what files exist already, create the ones that don't.
That should run in GitHub Actions, generating JPEGs for every post and uploading them to S3.
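Roughly, the loop could look like this (a minimal sketch, not the actual generate_screenshots.py: boto3, the shots/ directory of pre-rendered HTML, and md5-of-HTML as the shot hash are all assumptions here):

```python
# Sketch: create and upload only the screenshots missing from S3.
import hashlib
import pathlib
import subprocess

import boto3

BUCKET = "til.simonwillison.net"
s3 = boto3.client("s3")

# One list operation up front to learn which shots already exist
existing = {
    obj["Key"]
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET)
    for obj in page.get("Contents", [])
}

# Assumes each post was pre-rendered with "datasette . --get ... > shots/<name>.html"
for html in pathlib.Path("shots").glob("*.html"):
    key = hashlib.md5(html.read_bytes()).hexdigest() + ".jpg"
    if key in existing:
        continue  # already in S3, skip the expensive screenshot
    jpeg = html.with_suffix(".jpg")
    subprocess.run(
        [
            "shot-scraper", "shot", str(html),
            "-w", "800", "-h", "400", "--retina", "--quality", "60",
            "-o", str(jpeg),
        ],
        check=True,
    )
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=jpeg.read_bytes(),
        ContentType="image/jpeg",
        ACL="public-read",  # the public s3.amazonaws.com URLs depend on this
    )
```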
https://github.com/simonw/til/actions/runs/4842339363/jobs/8629221973
It's working...
```
% s3-credentials list-bucket til.simonwillison.net
[
    {
        "Key": "0cf1e455f161435a4aea07480c27da89.jpg",
        "LastModified": "2023-04-30 03:54:06+00:00",
        "ETag": "\"c1ef69673fda4ebf1cd1cfa41d8dc255\"",
        "Size": 90039,
        "StorageClass": "STANDARD"
    },
    {
        "Key": "1447c8cdd4caa68e5514a1bb5b9f9f49.jpg",
        "LastModified": "2023-04-30 03:54:12+00:00",
        "ETag": "\"4adfdd03def8e54c651451f5b56e43b9\"",
        "Size": 111841,
        "StorageClass": "STANDARD"
    },
    {
        "Key": "14e4b902d5511a639a6c8d1e91d3dabb.jpg",
        "LastModified": "2023-04-30 03:54:35+00:00",
        "ETag": "\"2d3e29f3eaca62ba688c04a82d923fba\"",
        "Size": 118002,
        "StorageClass": "STANDARD"
    },
```
Generated image example: http://s3.amazonaws.com/til.simonwillison.net/f19a4a99ca28b20786ed7e35d8f9a8e7.jpg
To see how many are done:
```
% s3-credentials list-bucket til.simonwillison.net | jq length
43
```
410 total.
Partial logs from that GitHub Actions run:
```
Stored 96126 byte JPEG for github-actions_grep-tests.md shot hash 3e71efb58ec2d72ce37d6c93d7ace74e
Stored 70990 byte JPEG for github-actions_commit-if-file-changed.md shot hash 3b4a2012993962434fc8f5853cf5396b
Stored 72935 byte JPEG for bash_loop-over-csv.md shot hash d06963c31326ae773a8e7face614668c
```
It finished. All 410 images should be there now.
This query shows all the images on one page:
```sql
select
  json_object(
    'img_src',
    'https://s3.amazonaws.com/til.simonwillison.net/' || shot_hash || '.jpg',
    'width',
    400
  ) as img
from
  til
```
https://til.simonwillison.net/tils
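(The json_object with an img_src key is the convention the datasette-json-html plugin uses to render a JSON value as an <img> tag, which is what turns this query result into a page full of images.)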
I scrolled through and they all look good. This one was a favourite: https://s3.amazonaws.com/til.simonwillison.net/990ce33b65e40356be0035f185b3484c.jpg
Last steps:
- Remove the datasette-media plugin and configuration
- Delete the old cached images
- Update the template to reference the new ones (oh no! That's going to require regenerating them all, since the template hash will change)
Oops broke it:
```
Traceback (most recent call last):
  File "generate_screenshots.py", line 92, in <module>
    generate_screenshots(root)
  File "generate_screenshots.py", line 55, in generate_screenshots
    shot_html_hash.update(filepath.read_text().encode("utf-8"))
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1236, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/til/til/main/templates/row.html'
```
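The failing line is the one that feeds the row.html template into the shot hash, which is also why changing the template forces regenerating every image; the error itself is just the script looking for templates/row.html at a path that doesn't exist in the Actions checkout. A reconstruction of that hashing step from the traceback (md5 assumed from the 32-character S3 keys; not the actual code):

```python
# Reconstruction: the shot hash covers the page HTML plus the template,
# so editing row.html changes the hash (and filename) of every image.
import hashlib
from pathlib import Path

def shot_hash(page_html: str, template_path: Path) -> str:
    h = hashlib.md5()
    h.update(page_html.encode("utf-8"))
    # This read is what raised FileNotFoundError when templates/row.html
    # wasn't where the script expected it in the Actions checkout.
    h.update(template_path.read_text().encode("utf-8"))
    return h.hexdigest()
```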
That's deployed now.
Wrote this up as a TIL: https://til.simonwillison.net/shot-scraper/social-media-cards