Jenkins hanging while writing blazegraph journal from NT file
Closed this issue · 9 comments
Describe the bug
When running Jenkins pipeline, the process hangs for >3 days on the blazegraph journal stage - see here for Jenkins logs. Here's the chatter before/during the hang:
10:43:02 + export JAVA_OPTS=-Xmx128G
10:43:02 + ./target/universal/stage/bin/blazegraph-runner load --informat=ntriples --journal=../merged-kg.jnl --use-ontology-graph=true ../data/merged/merged-kg.nt
10:43:03 log4j:WARN No appenders could be found for logger (com.bigdata.config.Configuration).
10:43:03 log4j:WARN Please initialize the log4j system properly.
10:43:03 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
10:43:03 ERROR: com.bigdata.util.config.LogUtil : Could not initialize Log4J logging utility.
10:43:03 Set system property '-Dlog4j.configuration=file:bigdata/src/resources/logging/log4j.properties'
10:43:03 and / or
10:43:03 Set system property '-Dlog4j.primary.configuration=file:<installDir>/bigdata/src/resources/logging/log4j.properties'
10:43:03
10:43:03 BlazeGraph(TM) Graph Engine
10:43:03
10:43:03 Flexible
10:43:03 Reliable
10:43:03 Affordable
10:43:03 Web-Scale Computing for the Enterprise
10:43:03
10:43:03 Copyright SYSTAP, LLC DBA Blazegraph 2006-2016. All rights reserved.
10:43:03
10:43:03 26234271d5bc
10:43:03 Tue Aug 17 17:43:02 UTC 2021
10:43:03 Linux/4.15.0-142-generic amd64
10:43:03 Intel(R) Xeon(R) CPU X5675 @ 3.07GHz Family 6 Model 44 Stepping 2, GenuineIntel #CPU=24
10:43:03 Private Build 14.0.2
10:43:03 freeMemory=2097767840
10:43:03 buildVersion=2.1.4
10:43:03 gitCommit=738d05f08cffd319233a4bfbb0ec2a858e260f9c
10:43:03
10:43:03 Dependency License
10:43:03 ICU http://source.icu-project.org/repos/icu/icu/trunk/license.html
10:43:03 bigdata-ganglia http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 blueprints-core https://github.com/tinkerpop/blueprints/blob/master/LICENSE.txt
10:43:03 colt http://acs.lbl.gov/software/colt/license.html
10:43:03 commons-codec http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 commons-fileupload http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 commons-io http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 commons-logging http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 dsiutils http://www.gnu.org/licenses/lgpl-2.1.html
10:43:03 fastutil http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 flot http://www.opensource.org/licenses/mit-license.php
10:43:03 high-scale-lib http://creativecommons.org/licenses/publicdomain
10:43:03 httpclient http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 httpclient-cache http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 httpcore http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 httpmime http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 jackson-core http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 jetty http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 jquery https://github.com/jquery/jquery/blob/master/MIT-LICENSE.txt
10:43:03 jsonld https://raw.githubusercontent.com/jsonld-java/jsonld-java/master/LICENCE
10:43:03 log4j http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 lucene http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 nanohttp http://elonen.iki.fi/code/nanohttpd/#license
10:43:03 rexster-core https://github.com/tinkerpop/rexster/blob/master/LICENSE.txt
10:43:03 river http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 semargl https://github.com/levkhomich/semargl/blob/master/LICENSE
10:43:03 servlet-api http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03 sesame http://www.openrdf.org/download.jsp
10:43:03 slf4j http://www.slf4j.org/license.html
10:43:03 zookeeper http://www.apache.org/licenses/LICENSE-2.0.html
10:43:03
10:43:04 2021.08.17 17:43:04:244 [main ] [INFO ] org.renci.blazegraph.Load.runUsingConnection:43 - Loading ../data/merged/merged-kg.nt
To Reproduce
Run pipeline on 157cd65
Expected behavior
Should run to completion
Version
Additional context
Discussed with @kltm and had a look at the Docker container while the job was hanging. It did not seem to actually be writing anything to the blazegraph journal file. Not clear why.
It looked as if it was constantly doing something to /var/lib/jenkins/workspace/dge-graph-hub_kg-covid-19_master/gitrepo/merged-kg.jnl, but it did not change the file size. It could have just been very aggressively bit flipping.
There was definitely a fair amount of write going on, but I'm not exactly sure what was being written:
28125 be/4 jenkins 0.00 B/s 17.39 M/s 0.00 % 93.13 % java -Xmx128G -cp /var/lib/jenkins/workspace/dge~graph-hub_kg-covid-19_master/gi [com.bigdata.rws]
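For next time, a rough way to check the "file size is not changing" observation is to sample the size twice. This is only a sketch: `file_is_growing` is a made-up helper, not something in the pipeline, and a constant size is just a hint, since a store can write into space it has already allocated.

```shell
#!/bin/sh
# Hypothetical helper (not part of the pipeline): report whether a file's
# size grows between two samples taken INTERVAL seconds apart.
file_is_growing() {
    file=$1
    interval=${2:-60}   # seconds between samples; 60 is an arbitrary default
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 2; }
    # tr strips the padding some wc implementations put around the count
    before=$(wc -c < "$file" | tr -d ' ')
    sleep "$interval"
    after=$(wc -c < "$file" | tr -d ' ')
    echo "size before=$before after=$after bytes"
    # Exit status 0 only if the file got bigger
    [ "$after" -gt "$before" ]
}
```

Something like `file_is_growing ../merged-kg.jnl 60 || echo "no size change in 60s"` could be run from inside the container next to the load step.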
I would also note that at the docker level, it looked like:
bbop@stove:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
26234271d5bc justaddcoffee/ubuntu20-python-3-8-5-dev:4 "cat" 3 days ago Up 3 days upbeat_knuth
which would be consistent with an issue like https://stackoverflow.com/questions/54585747/jenkins-docker-container-simply-hangs-and-never-executes-steps
Whatever happened prevented docker from being able to safely close out the running container with stop and kill, which also prevented us from re-invading the container to see what was going on (possibly the processes needed to invade it had already been offed).
In the end, the machine was power cycled. Due to other restrictions on the machine, we'll not be lightly doing that again.
I would try running that a second time first.
"... in case this was something that rebooting might fix"
The only thing a reboot would fix is cruft in temp filesystems; otherwise the problem could occur again in the future under some circumstance. Also noting that I manually purged the part of the filesystem that a reboot would not have touched.
Interestingly, it failed the second time, but this time because the journal file from the previous run seems to still exist, which is puzzling:
+ pigz ../merged-kg.jnl
17:14:52 pigz: abort: write error on ../merged-kg.jnl.gz (Inappropriate ioctl for device)
^^ this is pigz's unhelpful way of saying that it wants to ask whether it should overwrite the file, but it can't because this isn't an interactive terminal
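If that reading of the error is right, one way around it is to tell the compressor to overwrite without asking. A minimal sketch, assuming it is acceptable to clobber the stale archive: `compress_journal` is a hypothetical wrapper, and the gzip fallback is only there so the snippet also runs where pigz is not installed.

```shell
#!/bin/sh
# Hypothetical wrapper (not in the pipeline): compress a journal file
# without ever prompting, even when "<journal>.gz" already exists.
compress_journal() {
    journal=$1
    if command -v pigz >/dev/null 2>&1; then
        # -f (--force) overwrites an existing output file instead of asking
        pigz -f "$journal"
    else
        # Assumption: gzip as a stand-in; its -f has the same meaning
        gzip -f "$journal"
    fi
}
```

In the pipeline that would be `compress_journal ../merged-kg.jnl` in place of the bare pigz call. Removing leftovers before the run starts is the other option and avoids the stale file entirely.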
I had a feeling something like that might happen. Wanna guess where my current theory is heading ;)
I think once a workspace has been dirtied in certain ways, things get "weird". We can try and track down these as specific cases until a pattern comes up.
Wanna guess where my current theory is heading ;)
Haha, before we go blaming my beloved jenkinsuser, I am going to try cleanWs() and see if it removes files from previous runs...
It's evident that previous runs in a given workspace are causing some issues in this Jenkins run:
- download stages are clearly using existing data from previous runs instead of downloading fresh data
- blazegraph journal file from previous runs still exists, causing unexpected behavior (pigz dies with an unhelpful message because it wants to ask an interactive terminal whether it should overwrite the existing file)
- git repo from previous run still exists, causing git clone to fail
Although it's not clear if this is the cause of the blazegraph step hanging.
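For the leftovers listed above, a shell-level cleanup at the start of the run would look roughly like this. It is a sketch, not what the pipeline actually does: `clean_stale_artifacts` is a hypothetical helper, and which paths to pass is an assumption (the `gitrepo` directory name comes from the workspace path earlier in this thread).

```shell
#!/bin/sh
# Hypothetical pre-build cleanup (not in the pipeline): delete leftovers
# from earlier runs, given a workspace directory and paths relative to it.
clean_stale_artifacts() {
    workspace=$1
    shift
    # Refuse to run without a real workspace directory
    [ -n "$workspace" ] && [ -d "$workspace" ] || {
        echo "workspace not found: $workspace" >&2
        return 1
    }
    for rel in "$@"; do
        # Skip empty entries so the workspace itself is never removed
        [ -n "$rel" ] || continue
        rm -rf -- "$workspace/$rel"
    done
}
```

From a Jenkins sh step this could be called as `clean_stale_artifacts "$WORKSPACE" gitrepo` before the clone. cleanWs() wipes the whole workspace instead, which is simpler if nothing needs to survive between runs.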
Confirming that this is fixed. I am assuming (without absolute proof) that my failure to remove files left over from previous runs was causing this problem.