ajs6f/fcrepo3-rdf-extractor

Memory errors

Closed this issue · 27 comments

When running a build of master (a560428) using the command
java -jar fcrepo3-rdf-extractor.jar -a /home/ubuntu/akubra-llstore.xml -o /mnt/rdf/juno.nq -g '<info:ca.umanitoba.fedora#ri>' -n 2 -s 2 --logback /home/ubuntu/fcrepo-rdf-extractor-logback.xml

I got the following errors.
https://gist.github.com/whikloj/79ca322a06d61391889d3c7063b6c423

I was able to run the RDF extractor at the HEAD - 1 commit, so perhaps the change you made for me is unnecessary. I think that, due to our NAS setup, it takes longer to produce the initial set of files to stream, but once that completes, it does work.

ajs6f commented

In a sense it is unnecessary, but if done correctly (unlike the buggy way I did it), it should improve performance all around. So I'm going to redo it right.

ajs6f commented

@whikloj can you try this?

ajs6f commented

That branch takes a different direction. A couple of differences: the results will now appear in as many quad files as there are extraction threads (remember that for N-Quads, merging is just concatenation, so just cat file1.nq file2.nq file3.nq > all.nq), so you need to give an output directory, not an output file. This avoids having to synchronize the various threads for serialization and should provide a speedup. The queuing now occurs between reading and parsing, so much earlier. And there should be no problems with GC, I don't think.
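As a concrete sketch of the merge step (filenames invented for illustration; a real run would produce these in the output directory):

```shell
# Two stand-in per-thread output files, one quad each.
printf '<s:a> <p:x> <o:1> <g:g> .\n' > quads0.nq
printf '<s:b> <p:x> <o:2> <g:g> .\n' > quads1.nq

# N-Quads merging is plain concatenation: no parsing or re-serialization.
cat quads0.nq quads1.nq > all.nq
grep -c . all.nq   # line count of the merged file is the sum of the parts
```

This works because each N-Quads line is a complete, self-delimiting statement, so file order and boundaries carry no meaning.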

ajs6f commented

@ruebot, I know you have a lot to do, but if you have time, I would appreciate you trying out that branch too. I think you might find some nice performance improvement.

Output of my run of CorrectingStreaming

INFO 2017-02-27 17:01:34.871 [main] (Extract) Using 2 threads for extraction and a queue size of 1048576.
INFO 2017-02-27 17:01:34.875 [main] (Extract) with Akubra configuration from /home/ubuntu/akubra-llstore.xml.
INFO 2017-02-27 17:01:35.168 [main] (Extract) Beginning extraction.
INFO 2017-02-27 17:01:35.170 [pool-2-thread-1] (ObjectProcessor) Operating on object URI: -s
INFO 2017-02-27 17:01:35.170 [pool-2-thread-2] (ObjectProcessor) Operating on object URI: 2
ERROR 2017-02-27 17:01:35.183 [pool-2-thread-2] (ObjectProcessor) Error reading from object: 2
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:c8/2')
	at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
	at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:103)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:62)
	at edu.si.fcrepo.Extract.lambda$null$4(Extract.java:231)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
ERROR 2017-02-27 17:01:35.184 [pool-2-thread-1] (ObjectProcessor) Error reading from object: -s
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:40/-s')
	at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
	at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:103)
	at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:62)
	at edu.si.fcrepo.Extract.lambda$null$4(Extract.java:231)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
INFO 2017-02-27 17:01:35.186 [main] (Extract) Finished extraction.

I'm guessing my use of -s 2 caused a problem. The command was
java -jar /home/ubuntu/fcrepo3-rdf-extractor/target/fcrepo3-rdf-extractor-0.0.1-SNAPSHOT.jar -a /home/ubuntu/akubra-llstore.xml -o /mnt/rdf/juno.nq -g <info:ca.umanitoba.fedora#ri> -n 2 -s 2 --logback /home/ubuntu/fcrepo3-rdf-extractor-logback.xml

I probably shouldn't have deleted those 137 million triples from my last run 🤦‍♂️
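(One possible explanation, an assumption on my part rather than anything confirmed in this thread: in the second command the graph URI is not quoted, so the shell treats the angle brackets as redirection operators and the remaining arguments get reshuffled, which could explain the extractor seeing -s and 2 as object URIs. Quoting keeps the URI intact as a single argument:)

```shell
# show_args is a throwaway helper just to display the argument list.
show_args() { printf '[%s]' "$@"; echo; }

# Quoted, the URI reaches the program as one argument; unquoted, the
# shell would instead interpret < and > as input/output redirection.
show_args -g '<info:example#ri>' -n 2
```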

ajs6f commented

That problem with -s is weird; maybe a bug. But as an aside, why are you using such a low value for the queue size? Can you try with no -s flag (i.e. the default)?

I was running it low to see if that helped account for our slow NAS connection. I am trying now without the -s and -n arguments.

That seems to have run to completion, but it also seems to have ended up way short: I got a directory of 5 files, totaling 71,559,316 lines.

Running the HEAD - 1 commit last week, I got around 137,000,000.

Logs show lots of triples, but also lots of the following two Parser ERRORs:
ERROR 2017-02-27 22:13:45.526 [pool-2-thread-3] (Parser) Compiler warnings:

ERROR 2017-02-27 22:13:45.556 [pool-2-thread-3] (Parser) WARNING: 'org.apache.xerces.jaxp.SAXParserImpl: Property 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'

ajs6f commented

Urg, that's not a prob at all, that's Xerces sucking. But the missing triples are another matter.

Can you use uniq to determine whether there are duplicates in the 137Mt run?
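A sketch of that check (sample data invented; note that uniq only detects adjacent repeats, so the file has to be sorted first):

```shell
# Tiny stand-in for the real quad file, containing one duplicate.
printf '%s\n' \
  '<s:a> <p:x> <o:1> <g:g> .' \
  '<s:b> <p:x> <o:2> <g:g> .' \
  '<s:a> <p:x> <o:1> <g:g> .' > sample.nq

# sort groups identical quads together; uniq -d prints one copy of
# every line that occurs more than once.
sort sample.nq | uniq -d
```

Comparing `sort -u sample.nq | wc -l` against the raw line count gives the number of distinct quads, which is the figure to check against the triplestore's count.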

I'll have to run it again, as I stupidly deleted it before trying this run. I'll get it going now.

I will say that my actual triplestore was reporting about 137M triples too.

ajs6f commented

Erm, that does indeed imply that the new branch is just wrong. But the question is: wrong how?

ajs6f commented

Note: In IRC @whikloj agreed to run the extractors against a stable dev instance with fewer triples as a start towards getting this thing nailed down.

ajs6f commented

@whikloj Did you ever have a chance to do the further test runs we talked about for the CorrectingStreaming branch?

Finally running this again as I have a new production server and I want to migrate to it.

The only errors I am seeing are

Caused by: javax.xml.stream.XMLStreamException: The namespace URI "http://islandora.ca/ontology/relsext#" has not been bound to a prefix.

But I think I've mentioned those before and you weren't concerned. Can't remember so good anymore.

ajs6f commented

Not sure; perhaps you are thinking of #4 (comment) and my reply? I don't quite recognize what you are reporting now, but here's the first question: are you getting the right results? If so, this is probably some sort of "TMI, Xerces" deal, but if not, it might mean something more problematic. In any event, feel free to open a new ticket, because it doesn't seem to be the memory problem originally reported, right?

No memory problems; it ran all night, but it seemingly froze when I came in and resumed the screen session. I'll have to try running it again. The namespace URI error above seems related to this chunk from some of our FOXML files:

<foxml:datastream ID="RELS-INT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
<foxml:datastreamVersion ID="RELS-INT.0" LABEL="Fedora Relationship Metadata." CREATED="2016-07-09T00:20:13.830Z" MIMETYPE="application/rdf+xml" FORMAT_URI="info:fedora/fedora-system:FedoraRELSExt-1.0" SIZE="286">
<foxml:xmlContent>
<rdf:RDF xmlns:islandora="http://islandora.ca/ontology/relsint#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/uofm:2211108/JP2">
    <width xmlns="http://islandora.ca/ontology/relsext#">4478</width>
  </rdf:Description>
</rdf:RDF>
</foxml:xmlContent>
</foxml:datastreamVersion>
<foxml:datastreamVersion ID="RELS-INT.1" LABEL="Fedora Relationship Metadata." CREATED="2016-07-09T00:20:14.014Z" MIMETYPE="application/rdf+xml" FORMAT_URI="info:fedora/fedora-system:FedoraRELSExt-1.0" SIZE="358">
<foxml:xmlContent>
<rdf:RDF xmlns:islandora="http://islandora.ca/ontology/relsint#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="info:fedora/uofm:2211108/JP2">
    <width xmlns="http://islandora.ca/ontology/relsext#">4478</width>
    <height xmlns="http://islandora.ca/ontology/relsext#">6762</height>
  </rdf:Description>
</rdf:RDF>
</foxml:xmlContent>
</foxml:datastreamVersion>
</foxml:datastream>

Those triples seem to have been lost:

[whikloj@jujo]/var/indexes/triples% grep 'http://islandora.ca/ontology/relsext#width' quads?.nq               
[whikloj@jujo]/var/indexes/triples% grep 'http://islandora.ca/ontology/relsext#height' quads?.nq
[whikloj@jujo]/var/indexes/triples% grep 'height' quads?.nq 
quads0.nq:<info:fedora/uofm:2625315> <http://purl.org/dc/elements/1.1/relation> "Manitoba Historical Society Resources:  Historic Sites of Manitoba--http://www.mhs.mb.ca/docs/sites/riverheightsschool.shtml" <ca.umanitoba.dam.fedora#ri> .
quads0.nq:<info:fedora/uofm:2629607> <http://purl.org/dc/elements/1.1/relation> "Winnipeg Architecture Foundation--http://www.winnipegarchitecture.ca/silver-heights-gates/" <ca.umanitoba.dam.fedora#ri> .
quads1.nq:<info:fedora/uofm:2631159> <http://purl.org/dc/elements/1.1/relation> "Manitoba Historical Society Resources:  Historic Sites of Manitoba--http://www.mhs.mb.ca/docs/sites/riverheightstelephone.shtml" <ca.umanitoba.dam.fedora#ri> .
[whikloj@jujo]/var/indexes/triples% grep 'width' quads?.nq
quads1.nq:<info:fedora/uofm:1414465> <http://purl.org/dc/elements/1.1/description> "The annotated back of a photograph of St. John's College at the University of Manitoba's Fort Garry Campus, showing the building while under construction with a large sign dominating the foreground. Annotations read: \"4. 3 col width. 6\" wide, 4 1/2\" deep. 2554.\"" <ca.umanitoba.dam.fedora#ri> .
[whikloj@jujo]/var/indexes/triples%

ajs6f commented

I think we're hitting this. I will cut a branch with a proposed fix pronto. Before I do that, can you open a new ticket so we can close this one?

As far as freezing on resume goes, that is weird. I haven't ever seen that, but I normally use nohup and tail -f the file to which it redirects SYSOUT, instead of trying to watch the SYSOUT of the utility itself.
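A minimal sketch of that pattern (the echo stands in for the real java -jar invocation):

```shell
# Detach the long-running job from the terminal so a dropped screen or
# SSH session can't take it down; capture stdout and stderr in a log.
nohup sh -c 'echo extraction started; echo done' > extract.log 2>&1 &
wait $!

# In real use you would follow along with: tail -f extract.log
cat extract.log
```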

ajs6f commented

P.S. For testing with this code, it's worth noting that you can use the CLI flags to select only a certain list of objects to index.

ajs6f commented

@whikloj Closing this in favor of #5 .

No memory errors with this branch. 👍

ajs6f commented

Thanks, @whikloj ! Does "this branch" mean https://github.com/ajs6f/fcrepo3-rdf-extractor/tree/CorrectingStreaming? Because that has been merged into master, so we should be good to continue working with master. In fact, because of the last paragraph of #5 (comment), we really definitely surely very much ought to be working with master.

Yes, "this branch" means CorrectingStreaming, and I will re-run a set of objects (say 100) today using master.

ajs6f commented

Thanks!