Knowledge-Graph-Hub/kg-covid-19

Problem pushing blazegraph journal to SPARQL endpoint in Jenkins

Closed · 9 comments

Describe the bug

In the last "Deploy blazegraph" stage of the Jenkins pipeline, I'm getting an ssh authentication error, see here:

19:35:53  + pwd
19:35:53  + HOME=/var/lib/jenkins/workspace/dge-graph-hub_kg-covid-19_master/ansible
19:35:53  + ansible-playbook update-kg-hub-endpoint.yaml --inventory=hosts.local-rdf-endpoint --private-key=**** -e target_user=bbop --extra-vars=endpoint=internal
19:35:54  [DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to 
19:35:54  allow bad characters in group names by default, this will change, but still be 
19:35:54  user configurable on deprecation. This feature will be removed in version 2.10.
19:35:54   Deprecation warnings can be disabled by setting deprecation_warnings=False in 
19:35:54  ansible.cfg.
19:35:54  [WARNING]: Invalid characters were found in group names but not replaced, use
19:35:54  -vvvv to see details
19:35:54  
19:35:54  PLAY [pipeline-rdf] ************************************************************
19:35:54  
19:35:54  TASK [Gathering Facts] *********************************************************
19:35:54  [WARNING]: Unhandled error in Python interpreter discovery for host
19:35:54  pan.lbl.gov: Failed to connect to the host via ssh: Host key verification
19:35:54  failed.
19:35:54  fatal: [pan.lbl.gov]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"pan.lbl.gov\". Make sure this host can be reached over ssh: Host key verification failed.\r\n", "unreachable": true}
19:35:54  
19:35:54  PLAY RECAP *********************************************************************
19:35:54  pan.lbl.gov                : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   
19:35:54  

@kltm, can you remind me how authentication works in order for the ansible playbook to execute properly? Do we need to provide this Docker container an ssh key or something?

To Reproduce

Go here

Expected behavior

Should push blazegraph journal to our SPARQL endpoint

Version

This commit

kltm commented

Yup--a private key for the jenkins user (or accessible to whatever user inside a container). Should be ansible-bbop-local-slave.

kltm commented

It might be worth testing outside of the container to make sure it works. If it does, there is likely some combination of path manipulation ("HOME=pwd"?) or docker weirdness that is causing problems.

Thanks @kltm

It might be worth testing outside of the container to make sure it works.

Well, it's been working outside the container for a year or so, so the problem most likely has to do with the container. I'm giving Jenkins the credentials here, so maybe it can't find this file when it's running in the docker container.

If it does, there is likely some combination of path manipulation ("HOME=pwd"?)

FWIW, this isn't the issue - I've removed the HOME=pwd business and it fails in the same way:

13:57:09  pan.lbl.gov: Failed to connect to the host via ssh: Host key verification
13:57:09  failed.

kltm commented

Okay, I think the issue is that the file that you're trying to use in this case either 1) does not exist or 2) has the wrong permissions/mode to be used for the given task.
What you're working with is the "file" credential binding (https://www.jenkins.io/doc/pipeline/steps/credentials-binding/). This file (should) exist for real on the filesystem for this to work. I don't believe that Jenkins copies anything into the docker image, but rather binds things in various cute ways through runtime variables and volume mounts. E.g.

$ docker run -t -d -u 114:120 -w /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins -v /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins:/var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins:rw,z -v /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins@tmp:/var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins@tmp:rw,z -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** justaddcoffee/ubuntu20-python-3-8-5-dev:4 cat

The credential that you need likely exists in one of the mounted volumes, with its exact runtime-bound location hidden in one of those variables.

The question here is how to probe all of this without actually exposing any secrets that we wouldn't want public. Please be conservative in using messages for debugging here--an accidental exposure would be painful.

Possibilities:

  • You get the system to that point and have it take a looong nop; somebody then invades the image to try and figure out what's going on. (Lowest risk, but time consuming and annoying to coordinate.)
  • You add in something like sh "ls -AlF $DEPLOY_LOCAL_IDENTITY" or whatever to try and figure out more about the exact disposition of the file. (Easy, but you are essentially trying to bypass Jenkins security at this point, so please be careful about accidental exposures.)
  • You don't use an overall docker agent but rather a per stage docker agent and then no docker agent at all in the deploy phase, bypassing this issue. (No idea about difficulty, but will work.)
  • Deploy from a different pipeline or by some other mechanism that avoids this. (Annoying and dirty, but will work.)

If I had to guess at this point, I'd say that playing with permissions/users elsewhere could have caused something like this to happen--the wrong perms or user would prevent an ssh key from getting used by an alien caller. OTOH, given the shell game of passing files through different levels, something getting lost doesn't strike me as too too unlikely either, even though Jenkins is supposedly designed for this.
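For the second possibility above, a lower-exposure variant of the "ls" probe might look like the sketch below. It only prints a one-word status and never echoes the path or the file contents. The function name describe_identity_file is made up for illustration; the only name taken from this thread is $DEPLOY_LOCAL_IDENTITY, and the sketch assumes that variable holds the path of the bound credential file.

```shell
#!/bin/sh
# Report whether a credential file looks usable, without printing its
# path or contents. describe_identity_file is a hypothetical helper name.
describe_identity_file() {
    path=$1
    if [ -z "$path" ]; then
        echo "unset"                # variable was empty or not bound
    elif [ ! -e "$path" ]; then
        echo "missing"              # path does not exist in this container
    elif [ ! -f "$path" ]; then
        echo "not-a-regular-file"   # e.g. a directory or a device
    elif [ ! -r "$path" ]; then
        echo "unreadable"           # current user cannot read the file
    elif [ ! -s "$path" ]; then
        echo "empty"                # file exists but has zero bytes
    else
        echo "ok"
    fi
}
```

In the deploy stage this could be called as describe_identity_file "${DEPLOY_LOCAL_IDENTITY:-}". Note that "ok" only means the file is present and readable; ssh also refuses private keys whose mode is readable by group or others, which this check does not cover.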

Thanks very much @kltm - I have done this:

You get the system to that point and have it take a looong nop; somebody then invades the image to try and figure out what's going on. (Lowest risk, but time consuming and annoying to coordinate.)

Here with an infinite loop:
https://build.berkeleybop.io/job/knowledge-graph-hub/job/kg-covid-19/job/check_ansible_run_jenkins/14/console
Would you mind invading that image and having a look to see why it can't find $DEPLOY_LOCAL_IDENTITY?

kltm commented

Okay, yeah. What I'm seeing does not seem that great. Do you have a channel where we could chat?

BBOP slack?

Long story short, the ansible-playbook command fails because the container has no entry in ~/.ssh/known_hosts for pan.lbl.gov (since this is running in Docker, the file starts out empty or missing).

Fix is fairly simple - just do this before we run ansible:

                        sh 'mkdir -p ~/.ssh/'
                        sh 'ssh-keyscan -H pan.lbl.gov >> ~/.ssh/known_hosts'
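If the workspace or home directory is reused between builds, the ">>" above appends a new entry on every run. A rough sketch of an append-only-if-missing variant follows. add_known_hosts is a hypothetical helper name, not part of OpenSSH, and the sketch assumes key lines arrive on stdin. It compares whole lines, so it only deduplicates unhashed output (ssh-keyscan without -H); hashed entries get a fresh random salt on each scan and would never match.

```shell
#!/bin/sh
# Append host-key lines from stdin to a known_hosts file, skipping lines
# that are already present, so repeated builds do not grow the file.
# add_known_hosts is a hypothetical helper name, not part of OpenSSH.
add_known_hosts() {
    known_hosts_file=$1
    mkdir -p "$(dirname "$known_hosts_file")"
    touch "$known_hosts_file"
    while IFS= read -r line; do
        # Skip blank lines and comment lines.
        case $line in
            ''|'#'*) continue ;;
        esac
        # -x: match the whole line, -F: treat the line as a fixed string.
        if ! grep -qxF -- "$line" "$known_hosts_file"; then
            printf '%s\n' "$line" >> "$known_hosts_file"
        fi
    done
}
```

Usage would be along the lines of: ssh-keyscan pan.lbl.gov | add_known_hosts "$HOME/.ssh/known_hosts". Either way, ssh-keyscan accepts whatever key the host presents at scan time, so this is trust-on-first-use; comparing the result against a fingerprint obtained some other way is safer if that matters here.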

Thanks again @kltm for help in sorting this out