ajs6f/fcrepo3-rdf-extractor

Memory errors

Closed this issue · 27 comments

When running a build of master (a560428) using the command
java -jar fcrepo3-rdf-extractor.jar -a /home/ubuntu/akubra-llstore.xml -o /mnt/rdf/juno.nq -g '<info:ca.umanitoba.fedora#ri>' -n 2 -s 2 --logback /home/ubuntu/fcrepo-rdf-extractor-logback.xml

I got the following errors.
https://gist.github.com/whikloj/79ca322a06d61391889d3c7063b6c423

I was able to run the RDF extractor at the HEAD - 1 commit, so perhaps the change you made for me is unnecessary. I think that, due to our NAS setup, it takes longer to produce the initial set of files to stream, but once that completes, it does work.

ajs6f commented

In a sense it is unnecessary, but if done correctly (unlike the buggy way I did it), it should improve performance all around. So I'm going to redo it right.

ajs6f commented

@whikloj can you try this?

ajs6f commented

That branch takes a different direction. A couple of differences: the results will now appear in as many quad files as there are extraction threads (remember that for N-Quads, merging is just concatenation, so just cat file1.nq file2.nq file3.nq > all.nq), so you need to give an output directory, not an output file. This avoids having to synchronize the various threads for serialization and should provide a speedup. The queuing now occurs between reading and parsing, so much earlier. And there should be no problems with GC, I don't think.
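As a concrete sketch of the merge step (filenames invented for illustration; a real run would produce these in the output directory):

```shell
# Two stand-in per-thread output files, one quad each.
printf '<s:a> <p:x> <o:1> <g:g> .\n' > quads0.nq
printf '<s:b> <p:x> <o:2> <g:g> .\n' > quads1.nq

# N-Quads merging is plain concatenation: no parsing or re-serialization.
cat quads0.nq quads1.nq > all.nq
grep -c . all.nq   # line count of the merged file is the sum of the parts
```

This works because each N-Quads line is a complete, self-delimiting statement, so file order and boundaries carry no meaning.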

ajs6f commented

@ruebot, I know you have a lot to do, but if you have time, I would appreciate you trying out that branch too. I think you might find some nice performance improvement.

Output of my run of CorrectingStreaming

INFO 2017-02-27 17:01:34.871 [main] (Extract) Using 2 threads for extraction and a queue size of 1048576.
INFO 2017-02-27 17:01:34.875 [main] (Extract) with Akubra configuration from /home/ubuntu/akubra-llstore.xml.
INFO 2017-02-27 17:01:35.168 [main] (Extract) Beginning extraction.
INFO 2017-02-27 17:01:35.170 [pool-2-thread-1] (ObjectProcessor) Operating on object URI: -s
INFO 2017-02-27 17:01:35.170 [pool-2-thread-2] (ObjectProcessor) Operating on object URI: 2
ERROR 2017-02-27 17:01:35.183 [pool-2-thread-2] (ObjectProcessor) Error reading from object: 2
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:c8/2')
	at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
	at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:103)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:62)
	at edu.si.fcrepo.Extract.lambda$null$4(Extract.java:231)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
ERROR 2017-02-27 17:01:35.184 [pool-2-thread-1] (ObjectProcessor) Error reading from object: -s
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:40/-s')
	at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
	at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:103)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:62)
	at edu.si.fcrepo.Extract.lambda$null$4(Extract.java:231)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
INFO 2017-02-27 17:01:35.186 [main] (Extract) Finished extraction.

I'm guessing my use of -s 2 caused a problem. The command was
java -jar /home/ubuntu/fcrepo3-rdf-extractor/target/fcrepo3-rdf-extractor-0.0.1-SNAPSHOT.jar -a /home/ubuntu/akubra-llstore.xml -o /mnt/rdf/juno.nq -g <info:ca.umanitoba.fedora#ri> -n 2 -s 2 --logback /home/ubuntu/fcrepo3-rdf-extractor-logback.xml

I probably shouldn't have deleted those 137 million triples from my last run 🤦‍♂️
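(One possible explanation, an assumption on my part rather than anything confirmed in this thread: in the second command the graph URI is not quoted, so the shell treats the angle brackets as redirection operators and the remaining arguments get reshuffled, which could explain the extractor seeing -s and 2 as object URIs. Quoting keeps the URI intact as a single argument:)

```shell
# show_args is a throwaway helper just to display the argument list.
show_args() { printf '[%s]' "$@"; echo; }

# Quoted, the URI reaches the program as one argument; unquoted, the
# shell would instead interpret < and > as input/output redirection.
show_args -g '<info:example#ri>' -n 2
```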

ajs6f commented

That problem with -s is weird; maybe a bug. But as an aside, why are you using such a low value for the queue size? Can you try with no -s flag (i.e. the default)?

I was running it low to see if that helped account for our slow NAS connection. I am trying now without the -s and -n arguments.

That seems to have run to completion, but it also seems to have ended up way short: I got a directory of 5 files, totaling 71,559,316 lines.

Running the HEAD - 1 commit last week, I got around 137,000,000.

Logs show lots of triples, but also lots of the following two Parser ERRORs:
ERROR 2017-02-27 22:13:45.526 [pool-2-thread-3] (Parser) Compiler warnings:

ERROR 2017-02-27 22:13:45.556 [pool-2-thread-3] (Parser) WARNING: 'org.apache.xerces.jaxp.SAXParserImpl: Property 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'

ajs6f commented

Urg, that's not a prob at all, that's Xerces sucking. But the missing triples are another matter.

Can you use uniq to determine whether there are duplicates in the 137Mt run?
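A sketch of that check (sample data invented; note that uniq only detects adjacent repeats, so the file has to be sorted first):

```shell
# Tiny stand-in for the real quad file, containing one duplicate.
printf '%s\n' \
  '<s:a> <p:x> <o:1> <g:g> .' \
  '<s:b> <p:x> <o:2> <g:g> .' \
  '<s:a> <p:x> <o:1> <g:g> .' > sample.nq

# sort groups identical quads together; uniq -d prints one copy of
# every line that occurs more than once.
sort sample.nq | uniq -d
```

Comparing `sort -u sample.nq | wc -l` against the raw line count gives the number of distinct quads, which is the figure to check against the triplestore's count.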

I'll have to run it again, as I stupidly deleted it before trying this run. I'll get it going now.

I will say that my actual triplestore was reporting about 137M triples too.

ajs6f commented

Erm, that does indeed imply that the new branch is just wrong. But the question is: wrong how?

ajs6f commented

Note: In IRC @whikloj agreed to run the extractors against a stable dev instance with fewer triples as a start towards getting this thing nailed down.

ajs6f commented

@whikloj Did you ever have a chance to do the further test runs we talked about for the CorrectingStreaming branch?

Finally running this again as I have a new production server and I want to migrate to it.

The only errors I am seeing are

Caused by: javax.xml.stream.XMLStreamException: The namespace URI "http://islandora.ca/ontology/relsext#" has not been bound to a prefix.

But I think I've mentioned those before and you weren't concerned. Can't remember so good anymore.

ajs6f commented

Not sure; perhaps you are thinking of #4 (comment) and my reply? I don't quite recognize what you are reporting now, but here's the first question: are you getting the right results? If so, this is probably some sort of "TMI, Xerces" deal, but if not, it might mean something more problematic. In any event, feel free to open a new ticket, because it doesn't seem to be the memory problem originally reported, right?

No memory problems; it ran all night, but it seemingly froze when I came in and resumed the screen session. I'll have to try running it again. The namespace URI error above seems related to this chunk from some of our FOXML files:

<foxml:datastream ID="RELS-INT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
<foxml:datastreamVersion ID="RELS-INT.0" LABEL="Fedora Relationship Metadata." CREATED="2016-07-09T00:20:13.830Z" MIMETYPE="application/rdf+xml" FORMAT_URI="info:fedora/fedora-system:FedoraRELSExt-1.0" SIZE="286">
<foxml:xmlContent>
<rdf:RDF xmlns:islandora="http://islandora.ca/ontology/relsint#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/uofm:2211108/JP2">
    <width xmlns="http://islandora.ca/ontology/relsext#">4478</width>
  </rdf:Description>
</rdf:RDF>
</foxml:xmlContent>
</foxml:datastreamVersion>
<foxml:datastreamVersion ID="RELS-INT.1" LABEL="Fedora Relationship Metadata." CREATED="2016-07-09T00:20:14.014Z" MIMETYPE="application/rdf+xml" FORMAT_URI="info:fedora/fedora-system:FedoraRELSExt-1.0" SIZE="358">
<foxml:xmlContent>
<rdf:RDF xmlns:islandora="http://islandora.ca/ontology/relsint#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/uofm:2211108/JP2">
    <width xmlns="http://islandora.ca/ontology/relsext#">4478</width>
    <height xmlns="http://islandora.ca/ontology/relsext#">6762</height>
  </rdf:Description>
</rdf:RDF>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>

Those triples seem to have been lost:

[whikloj@jujo]/var/indexes/triples% grep 'http://islandora.ca/ontology/relsext#width' quads?.nq               
[whikloj@jujo]/var/indexes/triples% grep 'http://islandora.ca/ontology/relsext#height' quads?.nq
[whikloj@jujo]/var/indexes/triples% grep 'height' quads?.nq 
quads0.nq:<info:fedora/uofm:2625315> <http://purl.org/dc/elements/1.1/relation> "Manitoba Historical Society Resources:  Historic Sites of Manitoba--http://www.mhs.mb.ca/docs/sites/riverheightsschool.shtml" <ca.umanitoba.dam.fedora#ri> .
quads0.nq:<info:fedora/uofm:2629607> <http://purl.org/dc/elements/1.1/relation> "Winnipeg Architecture Foundation--http://www.winnipegarchitecture.ca/silver-heights-gates/" <ca.umanitoba.dam.fedora#ri> .
quads1.nq:<info:fedora/uofm:2631159> <http://purl.org/dc/elements/1.1/relation> "Manitoba Historical Society Resources:  Historic Sites of Manitoba--http://www.mhs.mb.ca/docs/sites/riverheightstelephone.shtml" <ca.umanitoba.dam.fedora#ri> .
[whikloj@jujo]/var/indexes/triples% grep 'width' quads?.nq
quads1.nq:<info:fedora/uofm:1414465> <http://purl.org/dc/elements/1.1/description> "The annotated back of a photograph of St. John's College at the University of Manitoba's Fort Garry Campus, showing the building while under construction with a large sign dominating the foreground. Annotations read: \"4. 3 col width. 6\" wide, 4 1/2\" deep. 2554.\"" <ca.umanitoba.dam.fedora#ri> .
[whikloj@jujo]/var/indexes/triples%

ajs6f commented

I think we're hitting this. I will cut a branch with a proposed fix pronto. Before I do that, can you open a new ticket so we can close this one?

As far as freezing on resume goes, that is weird. I haven't ever seen that, but I normally use nohup and tail -f the file to which it redirects SYSOUT, instead of trying to watch the SYSOUT of the utility itself.
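A minimal sketch of that pattern (the echo stands in for the real java -jar invocation):

```shell
# Detach the long-running job from the terminal so a dropped screen or
# SSH session can't take it down; capture stdout and stderr in a log.
nohup sh -c 'echo extraction started; echo done' > extract.log 2>&1 &
wait $!

# In real use you would follow along with: tail -f extract.log
cat extract.log
```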

ajs6f commented

P.S. For testing with this code, it's worth noting that you can use the CLI flags to select only a certain list of objects to index.

ajs6f commented

@whikloj Closing this in favor of #5 .

No memory errors with this branch. 👍

ajs6f commented

Thanks, @whikloj ! Does "this branch" mean https://github.com/ajs6f/fcrepo3-rdf-extractor/tree/CorrectingStreaming? Because that has been merged into master, so we should be good to continue working with master. In fact, because of the last paragraph of #5 (comment), we really definitely surely very much ought to be working with master.

Yes, "this branch" means CorrectingStreaming, and I will re-run a set of objects (say 100) today using master.

ajs6f commented

Thanks!