Knowledge-Graph-Hub/kg-covid-19

Jenkins hanging while writing blazegraph journal from NT file


Describe the bug

When running the Jenkins pipeline, the process hangs for more than 3 days at the blazegraph journal stage (see the Jenkins logs). Here's the output before and during the hang:

10:43:02  + export JAVA_OPTS=-Xmx128G
10:43:02  + ./target/universal/stage/bin/blazegraph-runner load --informat=ntriples --journal=../merged-kg.jnl --use-ontology-graph=true ../data/merged/merged-kg.nt
10:43:03  log4j:WARN No appenders could be found for logger (com.bigdata.config.Configuration).
10:43:03  log4j:WARN Please initialize the log4j system properly.
10:43:03  log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
10:43:03  ERROR: com.bigdata.util.config.LogUtil : Could not initialize Log4J logging utility.
10:43:03  Set system property '-Dlog4j.configuration=file:bigdata/src/resources/logging/log4j.properties'
10:43:03    and / or 
10:43:03  Set system property '-Dlog4j.primary.configuration=file:<installDir>/bigdata/src/resources/logging/log4j.properties'
10:43:03  
10:43:03  BlazeGraph(TM) Graph Engine
10:43:03  
10:43:03                     Flexible
10:43:03                     Reliable
10:43:03                    Affordable
10:43:03        Web-Scale Computing for the Enterprise
10:43:03  
10:43:03  Copyright SYSTAP, LLC DBA Blazegraph 2006-2016.  All rights reserved.
10:43:03  
10:43:03  26234271d5bc
10:43:03  Tue Aug 17 17:43:02 UTC 2021
10:43:03  Linux/4.15.0-142-generic amd64
10:43:03  Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz Family 6 Model 44 Stepping 2, GenuineIntel #CPU=24
10:43:03  Private Build 14.0.2
10:43:03  freeMemory=2097767840
10:43:03  buildVersion=2.1.4
10:43:03  gitCommit=738d05f08cffd319233a4bfbb0ec2a858e260f9c
10:43:03  
10:43:03  Dependency         License                                                                 
10:43:03  ICU                http://source.icu-project.org/repos/icu/icu/trunk/license.html          
10:43:03  bigdata-ganglia    http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  blueprints-core    https://github.com/tinkerpop/blueprints/blob/master/LICENSE.txt         
10:43:03  colt               http://acs.lbl.gov/software/colt/license.html                           
10:43:03  commons-codec      http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  commons-fileupload http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  commons-io         http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  commons-logging    http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  dsiutils           http://www.gnu.org/licenses/lgpl-2.1.html                               
10:43:03  fastutil           http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  flot               http://www.opensource.org/licenses/mit-license.php                      
10:43:03  high-scale-lib     http://creativecommons.org/licenses/publicdomain                        
10:43:03  httpclient         http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  httpclient-cache   http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  httpcore           http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  httpmime           http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  jackson-core       http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  jetty              http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  jquery             https://github.com/jquery/jquery/blob/master/MIT-LICENSE.txt            
10:43:03  jsonld             https://raw.githubusercontent.com/jsonld-java/jsonld-java/master/LICENCE
10:43:03  log4j              http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  lucene             http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  nanohttp           http://elonen.iki.fi/code/nanohttpd/#license                            
10:43:03  rexster-core       https://github.com/tinkerpop/rexster/blob/master/LICENSE.txt            
10:43:03  river              http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  semargl            https://github.com/levkhomich/semargl/blob/master/LICENSE               
10:43:03  servlet-api        http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  sesame             http://www.openrdf.org/download.jsp                                     
10:43:03  slf4j              http://www.slf4j.org/license.html                                       
10:43:03  zookeeper          http://www.apache.org/licenses/LICENSE-2.0.html                         
10:43:03  
10:43:04  2021.08.17 17:43:04:244 [main      ] [INFO ] org.renci.blazegraph.Load.runUsingConnection:43 - Loading ../data/merged/merged-kg.nt

To Reproduce

Run pipeline on 157cd65

Expected behavior

Should run to completion

Version

157cd65

Additional context

Discussed with @kltm and had a look at the Docker container while it was hanging. It did not seem to actually be writing anything to the blazegraph journal file. Not clear why.

kltm commented

It looked as if it was constantly doing something to /var/lib/jenkins/workspace/dge-graph-hub_kg-covid-19_master/gitrepo/merged-kg.jnl, but it did not change the file size. It could have just been very aggressively bit flipping.

There was definitely a fair amount of write going on, but I'm not exactly sure what was being written:

28125 be/4 jenkins 0.00 B/s 17.39 M/s 0.00 % 93.13 % java -Xmx128G -cp /var/lib/jenkins/workspace/dge~graph-hub_kg-covid-19_master/gi [com.bigdata.rws]
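One way to repeat the "is the journal actually growing?" check without attaching to the container is to sample the file size twice. This is only a minimal sketch: the function name journal_is_growing and the default interval are made up for illustration, and an unchanged size does not prove that nothing is being written (the iotop line above shows writes happening while the size stayed flat).

```shell
# Hypothetical helper: report whether a file grows over an interval.
# Usage: journal_is_growing PATH [SECONDS]
# Returns 0 if the file grew, 1 if it did not, 2 if the file is missing.
journal_is_growing() {
    jig_file=$1
    jig_interval=${2:-60}
    [ -f "$jig_file" ] || return 2

    # wc -c pads its output on some platforms, so strip whitespace.
    jig_before=$(wc -c < "$jig_file" | tr -d '[:space:]')
    sleep "$jig_interval"
    jig_after=$(wc -c < "$jig_file" | tr -d '[:space:]')

    echo "size before=$jig_before after=$jig_after"
    [ "$jig_after" -gt "$jig_before" ]
}

# Example call (path taken from the comment above):
# journal_is_growing /var/lib/jenkins/workspace/dge-graph-hub_kg-covid-19_master/gitrepo/merged-kg.jnl 300
```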

I would also note that at the docker level, it looked like:

bbop@stove:~$ docker ps
CONTAINER ID   IMAGE                                       COMMAND   CREATED      STATUS      PORTS     NAMES
26234271d5bc   justaddcoffee/ubuntu20-python-3-8-5-dev:4   "cat"     3 days ago   Up 3 days             upbeat_knuth

which would be consistent with an issue like https://stackoverflow.com/questions/54585747/jenkins-docker-container-simply-hangs-and-never-executes-steps

Whatever happened prevented docker from being able to safely close out the running container with stop and kill, which also prevented us from re-invading the container to see what was going on (possibly the processes needed to invade the container had already been offed).

In the end, the machine was power cycled. Due to other restrictions on the machine, we'll not be lightly doing that again.
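A hang like this could at least be bounded by running the load step under a time limit, so the build fails instead of sitting for days. A rough sketch, assuming GNU coreutils timeout is available in the build container; the name run_bounded and the 60 second grace period are hypothetical, and a process stuck badly enough that docker stop and kill could not remove it may not respond to this either.

```shell
# Hypothetical wrapper: run a command with an upper bound on wall-clock time.
# Usage: run_bounded DURATION COMMAND [ARGS...]
# Relies on GNU coreutils `timeout`, which exits with 124 when the limit is hit.
run_bounded() {
    rb_limit=$1
    shift
    rb_status=0
    # Send TERM at the limit, then KILL if the command is still alive 60s later.
    timeout --kill-after=60 "$rb_limit" "$@" || rb_status=$?
    if [ "$rb_status" -eq 124 ]; then
        echo "command exceeded $rb_limit and was terminated: $*" >&2
    fi
    return "$rb_status"
}

# Example call (the 12h limit is an assumption, not a measured load time):
# run_bounded 12h ./target/universal/stage/bin/blazegraph-runner load \
#     --informat=ntriples --journal=../merged-kg.jnl \
#     --use-ontology-graph=true ../data/merged/merged-kg.nt
```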

So this PR, which processes only a small subset of all the data, runs okay. I'm going to start a build of master in case this was something that rebooting might fix.

kltm commented

I would try running that a second time first.

"... in case this was something that rebooting might fix"
The only thing a reboot would fix is cruft in temp filesystems; otherwise it could occur again in the future under some circumstance. Also noting that there was a manual purge of a part of the filesystem that a reboot would not have touched.

Interestingly, it failed the second time, but this time because the journal file from the previous run seems to still exist, which is puzzling:

+ pigz ../merged-kg.jnl
17:14:52  pigz: abort: write error on ../merged-kg.jnl.gz (Inappropriate ioctl for device)

^^ this is pigz's unhelpful way of saying that it wants to ask whether it should overwrite the file, but it can't because this isn't an interactive terminal
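If that reading is right, the compress step needs to cope with a leftover .gz without prompting. A minimal sketch, assuming the stale archive is safe to discard: the helper name compress_fresh and the COMPRESSOR variable are hypothetical and not part of the pipeline. Passing pigz its -f (force overwrite) flag would be another way to avoid the prompt.

```shell
# Hypothetical helper: compress FILE, replacing any FILE.gz left by an earlier run.
# COMPRESSOR is an assumed override knob; it defaults to pigz as in the pipeline.
# Usage: compress_fresh FILE
compress_fresh() {
    cf_target=$1
    if [ -e "$cf_target.gz" ]; then
        echo "removing stale $cf_target.gz from a previous run" >&2
        rm -f -- "$cf_target.gz"
    fi
    # With no existing output file, the compressor has nothing to ask about.
    "${COMPRESSOR:-pigz}" "$cf_target"
}

# Example call:
# compress_fresh ../merged-kg.jnl
```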

kltm commented

I had a feeling something like that might happen. Wanna guess where my current theory is heading ;)
I think once a workspace has been dirtied in certain ways, things get "weird". We can try and track down these as specific cases until a pattern comes up.

"Wanna guess where my current theory is heading ;)"

Haha before we go blaming my beloved jenkinsuser, I am going to try cleanWs() and see if it removes files from previous runs...

It's evident that previous runs in a given workspace are causing some issues in this Jenkins run:

  • download stages are clearly using existing data from previous runs instead of downloading fresh data
  • the blazegraph journal file from previous runs still exists, causing unexpected behavior (pigz dies with an unhelpful message because it wants to talk to an interactive terminal to ask whether it should overwrite the existing file)
  • the git repo from the previous run still exists, causing git clone to fail

Although it's not clear if this is the cause of the blazegraph step hanging.
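To turn the leftovers listed above into an early, readable failure rather than odd behavior several stages later, a stage could check for them up front. This is only a sketch: assert_no_leftovers is a made-up name, and the paths in the example call are assumptions based on the logs in this issue, not the pipeline's actual layout.

```shell
# Hypothetical pre-flight check: fail if any listed path survives from an earlier run.
# Usage: assert_no_leftovers PATH...
# Returns 0 when none of the paths exist, 1 otherwise (each hit is reported on stderr).
assert_no_leftovers() {
    anl_found=0
    for anl_path in "$@"; do
        # -L also catches dangling symlinks, which -e reports as missing.
        if [ -e "$anl_path" ] || [ -L "$anl_path" ]; then
            echo "leftover from a previous run: $anl_path" >&2
            anl_found=1
        fi
    done
    [ "$anl_found" -eq 0 ]
}

# Example call at the start of a build (paths are assumptions):
# assert_no_leftovers gitrepo merged-kg.jnl merged-kg.jnl.gz || exit 1
```

Reporting every leftover instead of stopping at the first one makes it easier to see which kinds of dirtied workspace keep coming back.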

#432 may have fixed this. The next Jenkins run should help confirm it.

Confirming that this is fixed. I am assuming (without absolute proof) that my failure to remove stuff from previous runs was causing this problem