OBOFoundry/purl.obolibrary.org

Create a shared docker-based infrastructure for deploying the OBO PURL server

kltm opened this issue · 30 comments

kltm commented

Currently, the OBO PURL server is held on a privately held AWS EC2 instance. We'd like to move this into commonly held infrastructure (AWS EC2, IAM, on GO master payer) based on docker images.

Tackling this issue would include:

  • a docker image for the PURL server that could be used for development and production
  • a well-tested automated system (Ansible?) or a very tight SOP doc for deploying a production PURL server with correct DNS
  • includes adding in the fixes from #63
  • Setting the appropriate IAM users and groups to give all parties sufficient access to accomplish the tasks
  • distribute ssh keys

I might also suggest:

  • external uptime monitoring. UptimeRobot? Something else? CloudWatch?
  • a fallback server (@jamesaoverton's old server?), in case we're nervous
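If we went the self-hosted route for the monitoring bullet, a minimal probe could be as simple as the sketch below. Everything in it is an assumption (the probed URL, the health policy, the alert command), not an agreed design; a hosted service like UptimeRobot or CloudWatch would replace it entirely.

```shell
#!/bin/sh
# Hedged sketch of an external uptime probe for the PURL server.
# The URL and alert command below are placeholders, not an agreed design.

# PURLs normally answer with redirects (302/303), so treat any 2xx/3xx
# response as healthy rather than requiring a literal 200.
is_up() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# Cron could then run something like (hypothetical alert address):
# code=$(curl -s -o /dev/null -w '%{http_code}' http://purl.obolibrary.org/obo/CHEBI_15377)
# is_up "$code" || echo "PURL unhealthy: $code" | mail -s 'PURL alert' ops@example.org
```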

Stages of development may include:

  • getting a local docker instance of the PURL server working
  • updating the current Travis setup to GHA using this image
  • getting a docker-based instance working on AWS EC2
  • getting a docker-based instance working on AWS EC2 under "dummy domains"
  • aim production DNS at new system
  • confirm automatic deployment elements are in place

Tagging @jamesaoverton @cmungall for feedback on the above before moving ahead.

kltm commented

Noting work for logs/analysis from #63 on #747

kltm commented

@jamesaoverton We're thinking about starting a Dockerfile from scratch, but I seem to recall that you may have already started on something? Is there anything worth basing work from that already exists?

@kltm I just pushed this docker-experiments branch.

kltm commented

@jamesaoverton Great, thank you!
Talking to @cmungall, it sounds like this ticket is the common understanding of our recent conversation, so I'm going to move it into our "ready" hopper.
Tagging @abessiari for further discussion as we progress.

A bit more...

My top priority is to replace Travis, which is broken for this repo, with GitHub Actions, so we have CI for PRs again. The experiment branch was working toward that, but then I got clobbered by other deadlines.

About my docker-experiment branch:

  • tools/site.yml: the Ansible script we've been using, which should still work
  • Makefile: working on some improvements that are not relevant to this issue
  • Dockerfile (new)
    • almost empty
    • really just needs the latest Ubuntu LTS
  • install.sh (new)
    • should do the same thing as the Ansible script, but more concise
    • should also be idempotent
    • I guess I didn't set up cron here
    • maybe easier to use than Ansible in GitHub Actions
  • run.sh (new):
    • build the Dockerfile then run whatever command you give inside Docker
    • for local development
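As a rough illustration, a run.sh like the one described might look something like the sketch below; the image tag, port mapping, and the DOCKER override are my assumptions, not the contents of the actual file.

```shell
#!/bin/sh
# Hedged sketch of a run.sh as described above: build the Dockerfile, then
# run whatever command was given inside the container. Image tag and port
# mapping are assumptions, not the real file.
set -e

DOCKER=${DOCKER:-docker}   # overridable for dry runs/testing

run_in_purl() {
  "$DOCKER" build -t purl:latest . &&
  "$DOCKER" run --rm -p 8080:80 purl:latest "$@"
}

# e.g.: run_in_purl /bin/bash   # drop into a shell inside the container
```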

If you're going to use Ansible, then I'm not sure these new files are relevant.

@jamesaoverton @kltm
Thanks, I will take a look. I remember Travis did work when I made my changes...

kltm commented

@jamesaoverton Thank you for the information.
I'm assuming that @cmungall would be fine with expanding a little bit to include getting travis->gha working w/the docker images.
Going from that, I guess a final question would be: should there be automatic updates in a final production system, or should that be left to a human?

The key human interaction is to merge the PR, and the rest should be automatic. The current production server uses cron to check every 10 minutes whether the master branch has a new passing build on Travis, and if so it updates. So I'm happy with anything similar: when master is green, it should automatically be deployed to production.
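A sketch of that cron-driven flow, with the "is this build green?" check against CI elided, and with all paths, branch names, and build commands assumed rather than taken from the real server:

```shell
#!/bin/sh
# Hedged sketch of the cron-driven auto-deploy described above. Paths,
# branch, and build step are assumptions, and the "passing build" check
# against CI is elided.

# Deploy only when the deployed commit differs from the latest master commit.
needs_deploy() {
  [ "$1" != "$2" ]
}

# A cron entry (say, */10 * * * * /usr/local/bin/update-purl.sh) could run:
update_purl() {
  cd /var/www/purl.obolibrary.org || return 1
  git fetch -q origin master
  if needs_deploy "$(git rev-parse HEAD)" "$(git rev-parse origin/master)"; then
    git merge --ff-only origin/master && make all
  fi
}
```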

kltm commented

@abessiari As discussed on Wednesday, here are some of the projects where we've started using GitHub Actions:

Looking them over, we do not seem to have anything that uses a remote API, but the "on" declarations seem to be fairly powerful, and there is likely more there that we have not dug into.

kltm commented

From @abessiari . Current work at: #765 (comment)

kltm commented

For next steps: @abessiari will try testing w/docker deployment on test instance in AWS w/test URL test-purl.obofoundry.io.

I just merged #765 to master. Hopefully that will make the testing easier. Sorry for the delay.

kltm commented

Catching up with @abessiari, we're now fairly close to the end and will want to work out how to flip to a new production site, as well as how to share credentials and responsibilities. Ideally, multiple people can fix/redeploy this service in case of issues. We may want to work out monitoring as well (see above).

kltm commented

To round off some discussion from yesterday about log compression and upload to S3, I would note that we've had some timeout issues for some of our larger logs, even when compressed. I think you'll likely have an easier time (smaller logs and being in AWS already), but it might be worth keeping an eye on.

Will do, thanks.

kltm commented

@abessiari @jamesaoverton I was trying to do a little testing of the docker image, etc., so I wanted to put together a little set of test cases just to confirm function. In doing so, I found some things that went against my intuition and wanted to figure out what's going on.

Going through the docker README (https://github.com/OBOFoundry/purl.obolibrary.org/tree/master/docker) and starting with the command docker run --name my_purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash (noting the dummy/empty credentials file, as I do not want logrotate working while testing), I tried out some URLs:

localhost, local docker:

sjcarbon@moiraine:~/local/src/git/purl.obolibrary.org[master]$:) http http://localhost:8080/obo/CHEBI_15377
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 297
Content-Type: text/html; charset=iso-8859-1
Date: Sat, 31 Jul 2021 00:09:06 GMT
Keep-Alive: timeout=5, max=100
Location: http://purl.oclc.org/obo/CHEBI_15377
Server: Apache/2.4.41 (Ubuntu)

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://purl.oclc.org/obo/CHEBI_15377">here</a>.</p>
<hr>
<address>Apache/2.4.41 (Ubuntu) Server at localhost Port 8080</address>
</body></html>

public purl server:

sjcarbon@moiraine:~/local/src/git/purl.obolibrary.org[master]$:) http http://purl.obolibrary.org/obo/CHEBI_15377
HTTP/1.1 303 See Other
Connection: Keep-Alive
Content-Length: 350
Content-Type: text/html; charset=iso-8859-1
Date: Sat, 31 Jul 2021 00:10:09 GMT
Keep-Alive: timeout=5, max=100
Location: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377
Server: Apache/2.4.18 (Ubuntu)

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>303 See Other</title>
</head><body>
<h1>See Other</h1>
<p>The answer to your request is located <a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377">here</a>.</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at purl.obolibrary.org Port 80</address>
</body></html>

I'd note that the 302 resolution of http://purl.oclc.org/obo/CHEBI_15377 from the local docker does not seem to go anywhere useful. Is this a case of a bad update that wasn't propagated to the production server but visible in testing, or something else?

It would be good to collect a set of URLs for testing. Any favorite terms or ontologies @jamesaoverton @cmungall ?

http://purl.oclc.org/ is the global failover. It's what we were using until 2015 when we deployed our own PURL service. So something is wrong.

My guess is that make has not yet been run to actually build the Apache config.

Yep, the test server should give exactly the same responses as the current live one; in this case the 303 to the EBI site is the intended behavior.

Thanks for testing this. I will take a closer look.

kltm commented

Okay, just poking a little bit more to orient myself and to spell out a little more of what's in the README, a good way to test:

docker rm purl && docker run --name purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash
sudo su
cd /var/www/purl.obolibrary.org/
make test
cat tests/development/go.tsv

For "external confirmation" while the server is running locally:

{ http -h http://purl.obolibrary.org/obo/go.owl & http -h http://localhost:8080/obo/go.owl; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/cob.owl & http -h http://localhost:8080/obo/cob.owl; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/GO_0022008 & http -h http://localhost:8080/obo/GO_0022008; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/CHEBI_15377 & http -h http://localhost:8080/obo/CHEBI_15377; } | grep Location

currently giving:

Location: http://purl.oclc.org/obo/go.owl
Location: http://current.geneontology.org/ontology/go.owl
Location: http://purl.oclc.org/obo/cob.owl
Location: https://raw.githubusercontent.com/OBOFoundry/COB/master/cob.owl
Location: http://purl.oclc.org/obo/GO_0022008
Location: http://www.ontobee.org/browser/rdf.php?o=GO&iri=http://purl.obolibrary.org/obo/GO_0022008
Location: http://purl.oclc.org/obo/CHEBI_15377
Location: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377
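To script comparisons like the one above, a small helper that pulls the Location header out of raw response headers might help (a hypothetical helper, not part of the repo):

```shell
#!/bin/sh
# Hedged helper for the checks above: read raw HTTP response headers on
# stdin and print the Location header value. Hypothetical, not in the repo.
location_of() {
  awk 'tolower($1) == "location:" { print $2; exit }' | tr -d '\r'
}

# e.g. compare production vs. the local container for one PURL:
# [ "$(curl -sI http://purl.obolibrary.org/obo/CHEBI_15377 | location_of)" = \
#   "$(curl -sI http://localhost:8080/obo/CHEBI_15377 | location_of)" ]
```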

My best guess is still that the make all task needs to be run before make test. I haven't had time to replicate, sorry.

kltm commented

@jamesaoverton @abessiari
Okay, that does indeed seem to be the issue. So:

docker rm purl && docker run --name purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash
sudo su
cd /var/www/purl.obolibrary.org/
make all
make test
cat tests/development/go.tsv

With this, everything seems to work from the inside and outside.

@kltm @jamesaoverton

Yes indeed, make all fixes the problem.
Please see PR #773.

kltm commented

Planning on continuing once we've wrapped #771 to make testing safe and easy all within Route 53.

kltm commented

@jamesaoverton, #771 is now wrapped up: we have full domain control, TTL at 300s, and @abessiari has put up a testing server (that you should now have credentials for). Given that the unit/self tests work, is there a protocol that you'd like to follow for the switchover, or should we just go ahead and try it out?

I checked the server and things look good to me. All my specific concerns have been addressed, and now I just have vague worries 😄. My suggestion would be that we schedule a time when the three of us are working (maybe tomorrow afternoon?) and do the migration. We can coordinate on Slack or Signal.

kltm commented

@jamesaoverton I'll leave exact scheduling to you and @abessiari--I'm fairly flexible.
As we can now switch pretty easily, I think it would look something like the following:

  • Drop record TTL and give it a chance to propagate (done--already at five minutes)
  • At a coordinated time, switch; previous machine is left up
  • Test; if okay, leave both machines up; if not okay, revert
  • At our leisure, go through the documentation, editing where needed, and bring up another instance following the documentation to the letter
  • Test
  • Check that all people involved have credentials
  • Bring down all machines but the current target of purl.obolibrary.org, safe in the knowledge that we have an SOP for anybody to bring another up and switch
  • Sleep well

@kltm
Switch was done. So far so good.

kltm commented

I've gone through a bit of documentation and done a little testing and things seem good so far. It would be good to get somebody else's feedback and testing in here to close this issue out.