Problem pushing blazegraph journal to SPARQL endpoint in Jenkins
Closed this issue · 9 comments
Describe the bug
In the last "Deploy blazegraph" stage of the Jenkins pipeline, I'm getting an ssh authentication error, see here:
19:35:53 + pwd
19:35:53 + HOME=/var/lib/jenkins/workspace/dge-graph-hub_kg-covid-19_master/ansible
19:35:53 + ansible-playbook update-kg-hub-endpoint.yaml --inventory=hosts.local-rdf-endpoint --private-key=**** -e target_user=bbop --extra-vars=endpoint=internal
19:35:54 [DEPRECATION WARNING]: The TRANSFORM_INVALID_GROUP_CHARS settings is set to
19:35:54 allow bad characters in group names by default, this will change, but still be
19:35:54 user configurable on deprecation. This feature will be removed in version 2.10.
19:35:54 Deprecation warnings can be disabled by setting deprecation_warnings=False in
19:35:54 ansible.cfg.
19:35:54 [WARNING]: Invalid characters were found in group names but not replaced, use
19:35:54 -vvvv to see details
19:35:54
19:35:54 PLAY [pipeline-rdf] ************************************************************
19:35:54
19:35:54 TASK [Gathering Facts] *********************************************************
19:35:54 [WARNING]: Unhandled error in Python interpreter discovery for host
19:35:54 pan.lbl.gov: Failed to connect to the host via ssh: Host key verification
19:35:54 failed.
19:35:54 fatal: [pan.lbl.gov]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"pan.lbl.gov\". Make sure this host can be reached over ssh: Host key verification failed.\r\n", "unreachable": true}
19:35:54
19:35:54 PLAY RECAP *********************************************************************
19:35:54 pan.lbl.gov : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
19:35:54
@kltm, can you remind me how authentication works in order for the ansible playbook to execute properly? Do we need to provide this Docker container an ssh key or something?
To Reproduce
Go here
Expected behavior
Should push blazegraph journal to our SPARQL endpoint
Yup--a private key for the jenkins user (or one accessible to whatever user is inside the container). It should be ansible-bbop-local-slave.
It might be worth testing outside of the container to make sure it works. If it does, there is likely some combination of path manipulation ("HOME=pwd"?) or docker weirdness that is causing problems.
Thanks @kltm
It might be worth testing outside of the container to make sure it works.
Well, it's been working for a year or so outside the container, so it's likely something to do with the container. I'm giving Jenkins the credentials here, so maybe it can't find this file when it's in the docker container
If it does, there is likely some combination of path manipulation ("HOME=pwd"?)
FWIW, this isn't the issue - I've removed the HOME=pwd business and it fails in the same way:
13:57:09 pan.lbl.gov: Failed to connect to the host via ssh: Host key verification
13:57:09 failed.
Okay, I think the issue is that the file that you're trying to use in this case either 1) does not exist or 2) has the wrong permissions/mode to be used for the given task.
What you're working with is the "file" credential binding (https://www.jenkins.io/doc/pipeline/steps/credentials-binding/). This file should exist for real on the filesystem for this to work. I don't believe that Jenkins copies anything into the docker image; rather, it binds things in various cute ways through runtime variables and volume mounts. E.g.
$ docker run -t -d -u 114:120 -w /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins -v /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins:/var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins:rw,z -v /var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins@tmp:/var/lib/jenkins/workspace/vid-19_check_ansible_run_jenkins@tmp:rw,z -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** justaddcoffee/ubuntu20-python-3-8-5-dev:4 cat
The credential that you need likely exists in one of the mounted volumes, and its exact runtime-bound location is hidden in one of those variables.
The question here is how to probe all of this without actually exposing any secrets that we wouldn't want public. Please be conservative in using messages for debugging here--an accidental exposure would be painful.
Possibilities:
- You get the system to that point and have it take a looong nop; somebody then invades the image to try and figure out what's going on. (Lowest risk, but time consuming and annoying to coordinate.)
- You add in something like "sh "ls -AlF $DEPLOY_LOCAL_IDENTITY"" or whatever to try and figure out more about the exact disposition of the file. (Easy, but you are essentially trying to bypass Jenkins security at this point, so please be careful about accidental exposures.)
- You don't use an overall docker agent but rather a per stage docker agent and then no docker agent at all in the deploy phase, bypassing this issue. (No idea about difficulty, but will work.)
- Deploy from a different pipeline or by some other mechanism that avoids this. (Annoying and dirty, but will work.)
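The probe in the second option above could look something like the sketch below. It prints only metadata (permissions, owner, size), never the key material itself; a throwaway temp file stands in here for the real $DEPLOY_LOCAL_IDENTITY credential path so the demo involves nothing secret.

```shell
#!/bin/sh
# Sketch of a metadata-only probe of the bound credential file.
# DEPLOY_LOCAL_IDENTITY is a throwaway temp file in this demo; in the
# real pipeline it would be the Jenkins file-credential variable.
DEPLOY_LOCAL_IDENTITY=$(mktemp)
chmod 600 "$DEPLOY_LOCAL_IDENTITY"

# Permissions, owner, and size are enough to tell whether ssh could use
# the key -- without ever printing the key itself.
ls -lF "$DEPLOY_LOCAL_IDENTITY"
stat -c '%a %U' "$DEPLOY_LOCAL_IDENTITY"   # GNU stat; expect mode 600 for a usable private key
wc -c < "$DEPLOY_LOCAL_IDENTITY"           # size only, not the contents

rm -f "$DEPLOY_LOCAL_IDENTITY"
```

ssh refuses private keys with group/world-readable modes, so a stat showing anything other than 600 (or the wrong owner) would already explain the failure.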
If I had to guess at this point, I'd say that playing with permissions/users elsewhere could have caused something like this to happen--the wrong perms or user would prevent an ssh key from getting used by an alien caller. OTOH, given the shell game of passing files through different levels, something getting lost doesn't strike me as too too unlikely either, even though Jenkins is supposedly designed for this.
Thanks very much @kltm - I have done this:
You get the system to that point and have it take a looong nop; somebody then invades the image to try and figure out what's going on. (Lowest risk, but time consuming and annoying to coordinate.)
Here with an infinite loop:
https://build.berkeleybop.io/job/knowledge-graph-hub/job/kg-covid-19/job/check_ansible_run_jenkins/14/console
would you mind invading that image and having a look to see why it can't find $DEPLOY_LOCAL_IDENTITY?
Okay, yeah. What I'm seeing does not seem that great. Do you have a channel where we could chat?
BBOP slack?
Long story short, the ansible-playbook command fails because it doesn't have an entry in ~/.ssh/known_hosts for pan.lbl.gov (since this is running in Docker).
The fix is fairly simple - just do this before we run ansible:
sh 'mkdir -p ~/.ssh/'
sh 'ssh-keyscan -H pan.lbl.gov >> ~/.ssh/known_hosts'
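If the same home directory ever persists between runs, the keyscan step can be guarded so known_hosts doesn't accumulate duplicate entries. A minimal sketch, using a throwaway HOME and a stand-in host-key line in place of real ssh-keyscan output:

```shell
#!/bin/sh
# Idempotent variant of the fix: only append the host key if an
# identical line isn't already present. HOME is a throwaway directory
# and the key line is a stand-in; the real pipeline would generate the
# entry with `ssh-keyscan -H pan.lbl.gov`.
HOME=$(mktemp -d)

add_host_key() {
    mkdir -p "$HOME/.ssh"
    entry='pan.lbl.gov ssh-ed25519 AAAA...stand-in-key'
    grep -qxF "$entry" "$HOME/.ssh/known_hosts" 2>/dev/null \
        || printf '%s\n' "$entry" >> "$HOME/.ssh/known_hosts"
}

add_host_key
add_host_key                        # second call is a no-op
wc -l < "$HOME/.ssh/known_hosts"    # still a single entry

rm -rf "$HOME"
```

For a fresh container the plain append is fine (known_hosts starts empty every build); the guard only matters when the workspace or home directory survives across runs.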
Thanks again @kltm for help in sorting this out