Create a shared docker-based infrastructure for deploying the OBO PURL server
kltm opened this issue · 30 comments
Currently, the OBO PURL server is held on a privately held AWS EC2 instance. We'd like to move this into commonly held infrastructure (AWS EC2, IAM, on GO master payer) based on docker images.
Tackling this issue would include:
- a docker image for the PURL server that could be used for development and production
- a well-tested automated system (Ansible?) or a very tight SOP doc for deploying a production PURL server with correct DNS
- includes adding in the fixes from #63
- setting up the appropriate IAM users and groups to give all parties sufficient access to accomplish the tasks
- distribute ssh keys
I might also suggest:
- external uptime monitoring. Uptimerobot? Something else? CloudWatch?
- a failback server (@jamesaoverton 's old server?), in case we're nervous
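To make the external monitoring suggestion concrete, here is a minimal sketch of an uptime probe that could run from cron on a separate host. The function names (`classify_code`, `check_purl`) are hypothetical, and the choice of tooling (UptimeRobot vs. CloudWatch vs. home-grown) is still an open question:

```shell
# Classify an HTTP status code: redirects are normal for a PURL server,
# so 2xx and 3xx both count as "UP".
classify_code() {
  case "$1" in
    2*|3*) echo "UP" ;;
    *)     echo "DOWN" ;;
  esac
}

# Probe a PURL and report its status; suitable for a cron entry such as:
#   */5 * * * * /usr/local/bin/check_purl.sh
check_purl() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  echo "$(classify_code "$code") ($code) $1"
}

# Example invocation (requires network access):
# check_purl http://purl.obolibrary.org/obo/CHEBI_15377
```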
Stages of development may include:
- getting a local docker instance of the PURL server working
- updating current travis to GHA using this image
- getting a docker-based instance working on AWS EC2
- getting a docker-based instance working on AWS EC2 under "dummy domains"
- aim production DNS at new system
- confirm automatic deployment elements are in place
Tagging @jamesaoverton @cmungall for feedback on the above before moving ahead.
@jamesaoverton We're thinking about starting a Dockerfile from scratch, but I seem to recall that you may have already started on something? Is there anything worth basing work from that already exists?
@kltm I just pushed this docker-experiments branch.
@jamesaoverton Great, thank you!
Talking to @cmungall , it sounds like this ticket is the common understanding of our recent conversation, so I'm going to move it into our "ready" hopper.
Tagging @abessiari for further discussion as we progress.
A bit more...
My top priority is to replace Travis, which is broken for this repo, with GitHub Actions, so we have CI for PRs again. The experiment branch was working toward that, but then I got clobbered by other deadlines.
About my docker-experiment branch:
- `tools/site.yml`: the Ansible script we've been using, which should still work
- `Makefile`: working on some improvements that are not relevant to this issue
- `Dockerfile` (new): almost empty; really just needs the latest Ubuntu LTS
- `install.sh` (new): should do the same thing as the Ansible script, but more concise; it should also be idempotent. I guess I didn't set up `cron` here. Maybe easier to use than Ansible in GitHub Actions.
- `run.sh` (new): build the Dockerfile, then run whatever command you give inside Docker; for local development
If you're going to use Ansible, then I'm not sure these new files are relevant.
@jamesaoverton @kltm
Thanks, I will take a look. I remember Travis did work when I made my changes ...
@jamesaoverton Thank you for the information.
I'm assuming that @cmungall would be fine with expanding a little bit to include getting travis->gha working w/the docker images.
Going from that, I guess a final question would be: should there be automatic updates in a final production system, or should that be left to a human?
The key human interaction is to merge the PR, and the rest should be automatic. The current production server uses `cron` to check every 10 minutes whether the `master` branch has a new passing build on Travis, and if so it updates. So I'm happy with anything similar: when `master` is green, it should automatically be deployed to production.
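The update-check logic described above can be sketched as follows. This is a hedged illustration, not the production script: `should_deploy` and the SHA-comparison convention are my assumptions.

```shell
# Decide whether a deploy is needed: print "deploy" when the remote SHA
# exists and differs from the last-deployed SHA, "skip" otherwise.
should_deploy() {
  remote_sha="$1"
  deployed_sha="$2"
  if [ -n "$remote_sha" ] && [ "$remote_sha" != "$deployed_sha" ]; then
    echo "deploy"
  else
    echo "skip"
  fi
}

# In a cron job, remote_sha might come from something like:
#   git ls-remote https://github.com/OBOFoundry/purl.obolibrary.org.git refs/heads/master | cut -f1
# gated on a green CI status for that commit before pulling and rebuilding.
```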
@abessiari As discussed on Wednesday, here are some of the projects where we've started using GitHub Actions:
- https://github.com/geneontology/go-site/tree/master/.github/workflows runs standard integration tests and does a forced document update
- https://github.com/geneontology/go-ontology/tree/master/.github/workflows runs standard tests and is bound to PRs to `master`
- https://github.com/geneontology/neo/tree/master/.github/workflows tests bound to PRs and pushes to `master`

Looking around, we do not seem to have anything that uses a remote API, but the "on" declarations seem to be fairly powerful, and there is likely more there that we have not dug into.
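A minimal workflow along these lines for this repo might look like the following; the file name, checkout action version, and make targets are assumptions for illustration, not the final workflow:

```yaml
# .github/workflows/ci.yml (hypothetical)
name: CI
on:
  pull_request:
    branches: [master]
  push:
    branches: [master]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the Apache config and run the PURL tests
        run: |
          make all
          make test
```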
From @abessiari . Current work at: #765 (comment)
For next steps: @abessiari will try testing w/docker deployment on test instance in AWS w/test URL test-purl.obofoundry.io.
I just merged #765 to master. Hopefully that will make the testing easier. Sorry for the delay.
Catching up with @abessiari, we're now fairly close to the end and will want to work out how to flip to a new production site, as well as sharing credentials and responsibilities. Ideally, multiple people can fix/redeploy this service in case of issues. We may want to work out monitoring as well (see above).
To round off some discussion from yesterday about log compression and upload to S3, I would note that we've had some timeout issues for some of our larger logs, even when compressed. I think you'll likely have an easier time (smaller logs and being in AWS already), but it might be worth keeping an eye on.
Will do. Thanks.
@abessiari @jamesaoverton I was trying to do a little testing of the docker image, etc., so I wanted to put together a little set of test cases just to confirm function. In doing so, I found some things that went against my intuition and wanted to figure out what's going on.
Going through the docker README (https://github.com/OBOFoundry/purl.obolibrary.org/tree/master/docker), I started with the command `docker run --name my_purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash` (noting the dummy/empty credentials file, as I do not want logrotate working while testing), and tried out some URLs:
localhost, local docker:
sjcarbon@moiraine:~/local/src/git/purl.obolibrary.org[master]$:) http http://localhost:8080/obo/CHEBI_15377
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 297
Content-Type: text/html; charset=iso-8859-1
Date: Sat, 31 Jul 2021 00:09:06 GMT
Keep-Alive: timeout=5, max=100
Location: http://purl.oclc.org/obo/CHEBI_15377
Server: Apache/2.4.41 (Ubuntu)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://purl.oclc.org/obo/CHEBI_15377">here</a>.</p>
<hr>
<address>Apache/2.4.41 (Ubuntu) Server at localhost Port 8080</address>
</body></html>
public purl server:
sjcarbon@moiraine:~/local/src/git/purl.obolibrary.org[master]$:) http http://purl.obolibrary.org/obo/CHEBI_15377
HTTP/1.1 303 See Other
Connection: Keep-Alive
Content-Length: 350
Content-Type: text/html; charset=iso-8859-1
Date: Sat, 31 Jul 2021 00:10:09 GMT
Keep-Alive: timeout=5, max=100
Location: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377
Server: Apache/2.4.18 (Ubuntu)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>303 See Other</title>
</head><body>
<h1>See Other</h1>
<p>The answer to your request is located <a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377">here</a>.</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at purl.obolibrary.org Port 80</address>
</body></html>
I'd note that the 302 resolution of http://purl.oclc.org/obo/CHEBI_15377 from the local docker does not seem to go anywhere useful. Is this a case of a bad update that wasn't propagated to the production server but visible in testing, or something else?
It would be good to collect a set of URLs for testing. Any favorite terms or ontologies @jamesaoverton @cmungall ?
http://purl.oclc.org/ is the global failover. It's what we were using until 2015 when we deployed our own PURL service. So something is wrong.
My guess is that `make` has not yet been run to actually build the Apache config.
Yep, the test server should give exactly the same responses as the current live one; in this case the 303 to the EBI site is the intended behavior.
Thanks for testing this. I will take closer look.
Okay, just poking a little bit more to orient myself and to spell out a little more what's in the README, a good way to test:
docker rm purl && docker run --name purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash
sudo su
cd /var/www/purl.obolibrary.org/
make test
cat tests/development/go.tsv
For "external confirmation" while the server is running locally:
{ http -h http://purl.obolibrary.org/obo/go.owl & http -h http://localhost:8080/obo/go.owl; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/cob.owl & http -h http://localhost:8080/obo/cob.owl; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/GO_0022008 & http -h http://localhost:8080/obo/GO_0022008; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/CHEBI_15377 & http -h http://localhost:8080/obo/CHEBI_15377; } | grep Location
currently giving:
Location: http://purl.oclc.org/obo/go.owl
Location: http://current.geneontology.org/ontology/go.owl
Location: http://purl.oclc.org/obo/cob.owl
Location: https://raw.githubusercontent.com/OBOFoundry/COB/master/cob.owl
Location: http://purl.oclc.org/obo/GO_0022008
Location: http://www.ontobee.org/browser/rdf.php?o=GO&iri=http://purl.obolibrary.org/obo/GO_0022008
Location: http://purl.oclc.org/obo/CHEBI_15377
Location: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377
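The one-off checks above could be wrapped into a small loop. This is a sketch using curl rather than httpie; `location_of` and `compare_locations` are helpers introduced here, not anything in the repo:

```shell
# Extract the Location header value from raw HTTP response headers on stdin.
location_of() {
  grep -i '^Location:' | head -n 1 | cut -d' ' -f2- | tr -d '\r'
}

# Compare production against the local container for a few sample PURLs,
# printing OK when the Location headers agree and MISMATCH when they don't.
compare_locations() {
  for path in obo/go.owl obo/cob.owl obo/GO_0022008 obo/CHEBI_15377; do
    prod=$(curl -sI --max-time 10 "http://purl.obolibrary.org/$path" | location_of)
    dev=$(curl -sI --max-time 10 "http://localhost:8080/$path" | location_of)
    if [ "$prod" = "$dev" ]; then
      echo "OK        $path"
    else
      echo "MISMATCH  $path"
    fi
  done
}

# compare_locations   # requires the local container and network access
```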
My best guess is still that the `make all` task needs to be run before `make test`. I haven't had time to replicate, sorry.
@jamesaoverton @abessiari
Okay, that does indeed seem to be the issue. So:
docker rm purl && docker run --name purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash
sudo su
cd /var/www/purl.obolibrary.org/
make all
make test
cat tests/development/go.tsv
With this, everything seems to work from the inside and outside.
Planning on continuing once we've wrapped #771 to make testing safe and easy all within Route 53.
@jamesaoverton, #771 is now wrapped up: we have full domain control, TTL at 300s, and @abessiari has put up a testing server (that you should now have credentials for). Given that the unit/self tests work, is there a protocol that you'd like to follow for the switchover, or should we just go ahead and try it out?
I checked the server and things look good to me. All my specific concerns have been addressed, and now I just have vague worries. My suggestion would be that we schedule a time when the three of us are working (maybe tomorrow afternoon?) and do the migration. We can coordinate on Slack or Signal.
@jamesaoverton I'll leave exact scheduling to you and @abessiari--I'm fairly flexible.
As we can now switch pretty easily, I think it would look something like the following:
- Drop record TTL and give it a chance to propagate (done--already at five minutes)
- At a coordinated time, switch; previous machine is left up
- Test; if okay, leave both machines up; if not okay, revert
- At our leisure, go through documentation, editing where needed, and bring up another instance according to the word of the documentation
- Test
- Check that all people involved have credentials
- Bring down all machines but the current target of purl.obolibrary.org, safe in the knowledge that we have an SOP for anybody to bring another up and switch
- Sleep well
@abessiari @jamesaoverton Excellent!
I've gone through a bit of documentation and done a little testing and things seem good so far. It would be good to get somebody else's feedback and testing in here to close this issue out.