senderle/bookworm-compose

Write script for build command

Closed this issue · 5 comments

A topic for a phone call is whether to split the build and API images. A middle ground between my original approach and yours here would be to use the same Dockerfile for both instances, but with different commands, variables, and mounted dirs. My reasoning here is:

  1. The build needs the MySQL root password; the API does not. Safest to make sure the API just never sees the thing.
  2. /corpus may contain copyrighted materials that might not get deleted and will be online during the build. Same reasoning as above: make sure there's no possible way that even some unforeseen bug in the API code could expose those files.

Both of these reasons are pretty weak--I can't actually imagine what the security vulnerability would be, and some of the existing non-docker instances aren't this careful.

But also:

  1. Resource usage differs significantly between the two (though Docker's resource allocation remains a little magical to me).

Something like this:

```yaml
  bookworm-build:
    build:
      context: .
      dockerfile: ./compose/bookworm/Dockerfile
    environment:
      # Override this in docker-compose-override.yml
      # (no quotes: compose treats quotes here as part of the value)
      - MYSQL_ROOT_PASSWORD=insecure_dev_root_password
    depends_on:
      - mariadb
    volumes:
      - ./corpus:/corpus
    # List-form commands don't go through a shell, so wrap the
    # `cd ... && ...` chain in `sh -c`
    command: ["sh", "-c", "cd /corpus && bookworm build"]
  bookworm-api:
    build:
      context: .
      dockerfile: ./compose/bookworm/Dockerfile
    depends_on:
      - mariadb
    command: ["bookworm", "serve"]
```

My only worry about this is that generally docker-compose up/start starts all containers every time — so we'll be running the import container every time we start up the stack. We might be able to add some logic so that the container looks to see if it has run before, and then does nothing if it has? Seems a little awkward but certainly doable. There might be other approaches I'm missing. But generally I find that docker compose works more smoothly when every service is actually a service, and one-off jobs are run using docker-compose run.
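The "looks to see if it has run before" logic could be a small marker-file guard in the build container's command or entrypoint. A minimal sketch; the `run_once` name and the marker path are my own inventions, not anything in bookworm:

```shell
# run_once: execute a command only if a marker file is absent, then
# create the marker so later invocations become no-ops. Put the marker
# on a mounted volume (e.g. under /corpus) so it survives restarts.
run_once() {
    marker="$1"
    shift
    if [ -e "$marker" ]; then
        echo "already ran, skipping: $*"
        return 0
    fi
    "$@" && touch "$marker"
}

# In the build container, something like:
# run_once /corpus/.build_done bookworm build
```

With this in place, `docker-compose up` can start the build service every time and the second and later runs exit immediately.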

I see what you mean about the password, but in general, the worry about passwords is having them stored in the image. Having them in memory in the container (as happens with environment variables provided by compose) is not a big issue, because as I understand it, if an attacker can read memory in the container, they can already get root access on the host machine and do whatever they want; the battle is already lost.

In terms of resource usage, this entry ensures that no resources related to the corpus itself are used:

```yaml
volumes:
  - ./corpus:/corpus
```

This just maps the local corpus folder and its contents to the folder /corpus in the container. Nothing is copied, nothing is stored in the container. If you delete the files in the corpus folder, they are gone.

OK, sure, I hadn't really looked into docker-compose run. It looks as though it might even do basically what I'm suggesting (starting a separate container from the image and deleting it afterwards) if you do `docker-compose run --rm bookworm sh -c "cd corpus && bookworm init && bookworm build all"`.

Yeah, that's essentially what it does: it runs the command in a disposable container. If we need to cd into the corpus dir, then it's probably easiest to write a simple bash script for this, unless you want to modify bookworm init to take a corpus argument? Either way could make sense.

Looking at the code, I see bookworm init now barely does anything (after revisions to the codebase). I think I'll probably deprecate it and remove it from the codebase. If you include a file at bookworm.cnf in corpus that defines the database name, it's not necessary to run it at all.

I'll think about how best to handle the cd. Probably a shell script at first.
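The shell script for the cd could be as small as a subshell function, so the caller's working directory is left untouched. A sketch under my own naming assumptions (`build_in_corpus` and the `CORPUS_DIR` variable are not part of bookworm):

```shell
# build_in_corpus: run a command from inside the corpus directory.
# The function body is a subshell, so the cd does not leak out to
# the caller. CORPUS_DIR defaults to the /corpus mount point used
# in the compose file.
build_in_corpus() (
    cd "${CORPUS_DIR:-/corpus}"
    "$@"
)

# e.g., inside the container:
# build_in_corpus bookworm build
```

A wrapper like this also keeps the compose `command` a plain argv list, with no `sh -c` chaining needed.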

OK, this is scripted in a sub-optimal way as `bin/run bookworm_build`.