OBOFoundry/purl.obolibrary.org

Create a shared docker-based infrastructure for deploying the OBO PURL server

kltm opened this issue · 30 comments

kltm commented

Currently, the OBO PURL server is held on a privately held AWS EC2 instance. We'd like to move this into commonly held infrastructure (AWS EC2, IAM, on GO master payer) based on docker images.

Tackling this issue would include:

  • a docker image for the PURL server that could be used for development and production
  • a well-tested automated system (Ansible?) or a very tight SOP doc for deploying a production PURL server with correct DNS
  • includes adding in the fixes from #63
  • Setting the appropriate IAM users and groups to give all parties sufficient access to accomplish the tasks
  • distribute ssh keys

I might also suggest:

  • external uptime monitoring. UptimeRobot? Something else? CloudWatch?
  • a fallback server (@jamesaoverton's old server?), in case we're nervous
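If we went the self-hosted route for the monitoring bullet, a minimal probe could be as simple as the sketch below. Everything in it is an assumption (the probed URL, the health policy, the alert command), not an agreed design; a hosted service like UptimeRobot or CloudWatch would replace it entirely.

```shell
#!/bin/sh
# Hedged sketch of an external uptime probe for the PURL server.
# The URL and alert command below are placeholders, not an agreed design.

# PURLs normally answer with redirects (302/303), so treat any 2xx/3xx
# response as healthy rather than requiring a literal 200.
is_up() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# Cron could then run something like (hypothetical alert address):
# code=$(curl -s -o /dev/null -w '%{http_code}' http://purl.obolibrary.org/obo/CHEBI_15377)
# is_up "$code" || echo "PURL unhealthy: $code" | mail -s 'PURL alert' ops@example.org
```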

Stages of development may include:

  • getting a local docker instance of the PURL server working
  • updating the current Travis setup to GHA using this image
  • getting a docker-based instance working on AWS EC2
  • getting a docker-based instance working on AWS EC2 under "dummy domains"
  • aim production DNS at new system
  • confirm automatic deployment elements are in place

Tagging @jamesaoverton @cmungall for feedback on the above before moving ahead.

kltm commented

Noting work for logs/analysis from #63 on #747

kltm commented

@jamesaoverton We're thinking about starting a Dockerfile from scratch, but I seem to recall that you may have already started on something? Is there anything worth basing work from that already exists?

@kltm I just pushed this docker-experiments branch.

kltm commented

@jamesaoverton Great, thank you!
Talking to @cmungall, it sounds like this ticket is the common understanding of our recent conversation, so I'm going to move it into our "ready" hopper.
Tagging @abessiari for further discussion as we progress.

A bit more...

My top priority is to replace Travis, which is broken for this repo, with GitHub Actions, so we have CI for PRs again. The experiment branch was working toward that, but then I got clobbered by other deadlines.

About my docker-experiment branch:

  • tools/site.yml: the Ansible script we've been using, which should still work
  • Makefile: working on some improvements that are not relevant to this issue
  • Dockerfile (new)
    • almost empty
    • really just needs the latest Ubuntu LTS
  • install.sh (new)
    • should do the same thing as the Ansible script, but more concise
    • should also be idempotent
    • I guess I didn't set up cron here
    • maybe easier to use than Ansible in GitHub Actions
  • run.sh (new):
    • build the Dockerfile then run whatever command you give inside Docker
    • for local development
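As a rough illustration, a run.sh like the one described might look something like the sketch below; the image tag, port mapping, and the DOCKER override are my assumptions, not the contents of the actual file.

```shell
#!/bin/sh
# Hedged sketch of a run.sh as described above: build the Dockerfile, then
# run whatever command was given inside the container. Image tag and port
# mapping are assumptions, not the real file.
set -e

DOCKER=${DOCKER:-docker}   # overridable for dry runs/testing

run_in_purl() {
  "$DOCKER" build -t purl:latest . &&
  "$DOCKER" run --rm -p 8080:80 purl:latest "$@"
}

# e.g.: run_in_purl /bin/bash   # drop into a shell inside the container
```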

If you're going to use Ansible, then I'm not sure these new files are relevant.

@jamesaoverton @kltm
Thanks, I will take a look. I remember Travis did work when I made my changes...

kltm commented

@jamesaoverton Thank you for the information.
I'm assuming that @cmungall would be fine with expanding a little bit to include getting travis->gha working w/the docker images.
Going from that, I guess a final question would be: should there be automatic updates in a final production system, or should that be left to a human?

The key human interaction is to merge the PR, and the rest should be automatic. The current production server uses cron to check every 10 minutes whether the master branch has a new passing build on Travis, and if so it updates. So I'm happy with anything similar: when master is green, it should automatically be deployed to production.
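A sketch of that cron-driven flow, with the "is this build green?" check against CI elided, and with all paths, branch names, and build commands assumed rather than taken from the real server:

```shell
#!/bin/sh
# Hedged sketch of the cron-driven auto-deploy described above. Paths,
# branch, and build step are assumptions, and the "passing build" check
# against CI is elided.

# Deploy only when the deployed commit differs from the latest master commit.
needs_deploy() {
  [ "$1" != "$2" ]
}

# A cron entry (say, */10 * * * * /usr/local/bin/update-purl.sh) could run:
update_purl() {
  cd /var/www/purl.obolibrary.org || return 1
  git fetch -q origin master
  if needs_deploy "$(git rev-parse HEAD)" "$(git rev-parse origin/master)"; then
    git merge --ff-only origin/master && make all
  fi
}
```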

kltm commented

@abessiari As discussed on Wednesday, here are some of the projects where we've started using GitHub Actions:

Looking them over, we do not seem to have anything that uses a remote API, but the "on" declarations seem to be fairly powerful, and there is likely more there that we have not dug into.

kltm commented

From @abessiari . Current work at: #765 (comment)

kltm commented

For next steps: @abessiari will try testing w/docker deployment on test instance in AWS w/test URL test-purl.obofoundry.io.

I just merged #765 to master. Hopefully that will make the testing easier. Sorry for the delay.

kltm commented

Catching up with @abessiari, we're now fairly close to the end and will want to work out how to flip to a new production site, as well as how to share credentials and responsibilities. Ideally, multiple people can fix/redeploy this service in case of issues. We may want to work out monitoring as well (see above).

kltm commented

To round off some discussion from yesterday about log compression and upload to S3, I would note that we've had some timeout issues for some of our larger logs, even when compressed. I think you'll likely have an easier time (smaller logs and being in AWS already), but it might be worth keeping an eye on.

Will do, thanks.

kltm commented

@abessiari @jamesaoverton I was trying to do a little testing of the docker image, etc., so I wanted to put together a little set of test cases just to confirm function. In doing so, I found some things that went against my intuition and wanted to figure out what's going on.

Going through the docker README (https://github.com/OBOFoundry/purl.obolibrary.org/tree/master/docker) and starting with the command docker run --name my_purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash (noting the dummy/empty credentials file, as I do not want logrotate working while testing), I tried out some URLs:

localhost, local docker:

sjcarbon@moiraine:~/local/src/git/purl.obolibrary.org[master]$:) http http://localhost:8080/obo/CHEBI_15377
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 297
Content-Type: text/html; charset=iso-8859-1
Date: Sat, 31 Jul 2021 00:09:06 GMT
Keep-Alive: timeout=5, max=100
Location: http://purl.oclc.org/obo/CHEBI_15377
Server: Apache/2.4.41 (Ubuntu)

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://purl.oclc.org/obo/CHEBI_15377">here</a>.</p>
<hr>
<address>Apache/2.4.41 (Ubuntu) Server at localhost Port 8080</address>
</body></html>

public purl server:

sjcarbon@moiraine:~/local/src/git/purl.obolibrary.org[master]$:) http http://purl.obolibrary.org/obo/CHEBI_15377
HTTP/1.1 303 See Other
Connection: Keep-Alive
Content-Length: 350
Content-Type: text/html; charset=iso-8859-1
Date: Sat, 31 Jul 2021 00:10:09 GMT
Keep-Alive: timeout=5, max=100
Location: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377
Server: Apache/2.4.18 (Ubuntu)

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>303 See Other</title>
</head><body>
<h1>See Other</h1>
<p>The answer to your request is located <a href="http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377">here</a>.</p>
<hr>
<address>Apache/2.4.18 (Ubuntu) Server at purl.obolibrary.org Port 80</address>
</body></html>

I'd note that the 302 resolution of http://purl.oclc.org/obo/CHEBI_15377 from the local docker does not seem to go anywhere useful. Is this a case of a bad update that wasn't propagated to the production server but visible in testing, or something else?

It would be good to collect a set of URLs for testing. Any favorite terms or ontologies @jamesaoverton @cmungall ?

http://purl.oclc.org/ is the global failover. It's what we were using until 2015 when we deployed our own PURL service. So something is wrong.

My guess is that make has not yet been run to actually build the Apache config.

Yep, the test server should give exactly the same responses as the current live one; in this case the 303 to the EBI site is the intended behavior.

Thanks for testing this. I will take a closer look.

kltm commented

Okay, just poking a little bit more to orient myself and to spell out a little more of what's in the README, a good way to test:

docker rm purl && docker run --name purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash
sudo su
cd /var/www/purl.obolibrary.org/
make test
cat tests/development/go.tsv

For "external confirmation" while the server is running locally:

{ http -h http://purl.obolibrary.org/obo/go.owl & http -h http://localhost:8080/obo/go.owl; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/cob.owl & http -h http://localhost:8080/obo/cob.owl; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/GO_0022008 & http -h http://localhost:8080/obo/GO_0022008; } | grep Location && \
{ http -h http://purl.obolibrary.org/obo/CHEBI_15377 & http -h http://localhost:8080/obo/CHEBI_15377; } | grep Location

currently giving:

Location: http://purl.oclc.org/obo/go.owl
Location: http://current.geneontology.org/ontology/go.owl
Location: http://purl.oclc.org/obo/cob.owl
Location: https://raw.githubusercontent.com/OBOFoundry/COB/master/cob.owl
Location: http://purl.oclc.org/obo/GO_0022008
Location: http://www.ontobee.org/browser/rdf.php?o=GO&iri=http://purl.obolibrary.org/obo/GO_0022008
Location: http://purl.oclc.org/obo/CHEBI_15377
Location: http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377
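To script comparisons like the one above, a small helper that pulls the Location header out of raw response headers might help (a hypothetical helper, not part of the repo):

```shell
#!/bin/sh
# Hedged helper for the checks above: read raw HTTP response headers on
# stdin and print the Location header value. Hypothetical, not in the repo.
location_of() {
  awk 'tolower($1) == "location:" { print $2; exit }' | tr -d '\r'
}

# e.g. compare production vs. the local container for one PURL:
# [ "$(curl -sI http://purl.obolibrary.org/obo/CHEBI_15377 | location_of)" = \
#   "$(curl -sI http://localhost:8080/obo/CHEBI_15377 | location_of)" ]
```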

My best guess is still that the make all task needs to be run before make test. I haven't had time to replicate, sorry.

kltm commented

@jamesaoverton @abessiari
Okay, that does indeed seem to be the issue. So:

docker rm purl && docker run --name purl -v /tmp/foo.txt:/opt/credentials/s3cfg -p 8080:80 -it purl:latest /bin/bash
sudo su
cd /var/www/purl.obolibrary.org/
make all
make test
cat tests/development/go.tsv

With this, everything seems to work from the inside and outside.

@kltm @jamesaoverton

Yes indeed, make all fixes the problem.
Please see PR #773.

kltm commented

Planning on continuing once we've wrapped #771 to make testing safe and easy all within Route 53.

kltm commented

@jamesaoverton, #771 is now wrapped up: we have full domain control, TTL at 300s, and @abessiari has put up a testing server (that you should now have credentials for). Given that the unit/self tests work, is there a protocol that you'd like to follow for the switchover, or should we just go ahead and try it out?

I checked the server and things look good to me. All my specific concerns have been addressed, and now I just have vague worries 😄. My suggestion would be that we schedule a time when the three of us are working (maybe tomorrow afternoon?) and do the migration. We can coordinate on Slack or Signal.

kltm commented

@jamesaoverton I'll leave exact scheduling to you and @abessiari--I'm fairly flexible.
As we can now switch pretty easily, I think it would look something like the following:

  • Drop record TTL and give it a chance to propagate (done--already at five minutes)
  • At a coordinated time, switch; previous machine is left up
  • Test; if okay, leave both machines up; if not okay, revert
  • At our leisure, go through the documentation, editing where needed, and bring up another instance following the documentation to the letter
  • Test
  • Check that all people involved have credentials
  • Bring down all machines but the current target of purl.obolibrary.org, safe in the knowledge that we have an SOP for anybody to bring another up and switch
  • Sleep well

@kltm
Switch was done. So far so good.

kltm commented

I've gone through a bit of documentation and done a little testing and things seem good so far. It would be good to get somebody else's feedback and testing in here to close this issue out.