ajs6f/fcrepo3-rdf-extractor

Can't find datastream

Closed this issue · 28 comments

Testing this out on a subset of URIs, I have hit one where it can't find the DC datastream.

Error log says

INFO 2018-05-30 10:27:48.123 [pool-2-thread-1] (ObjectProcessor) Operating on object URI: info:fedora/uofm:2939588
ERROR 2018-05-30 10:27:48.138 [pool-2-thread-1] (ObjectProcessor) Couldn't find datastream DC from object info:fedora/uofm:2939588! Caused by:
org.akubraproject.MissingBlobException: (Missing blob with id = 'file:0f/uofm%3A2939588%2BDC%2BDC.0')
        at org.akubraproject.fs.FSBlob.openInputStream(FSBlob.java:100)
        at org.akubraproject.impl.BlobWrapper.openInputStream(BlobWrapper.java:93)
        at edu.si.fcrepo.ObjectProcessor.getDatastreamContent(ObjectProcessor.java:205)
        at edu.si.fcrepo.ObjectProcessor.consume(ObjectProcessor.java:180)
        at edu.si.fcrepo.ObjectProcessor.accept(ObjectProcessor.java:152)
        at edu.si.fcrepo.Extract.lambda$null$3(Extract.java:240)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
INFO 2018-05-30 10:27:48.169 [main] (Extract) Reached 101 objects at end of objects with 0 in-queue after 1 errors.
INFO 2018-05-30 10:27:48.169 [main] (Extract) Finished extraction.

The file does not exist in the 0f directory, nor does it exist there with the name info%3Afedora%2Fuofm%3A2939588%2BDC%2BDC.0. There is a managed DC datastream that I can access via Fedora.

I'll generate a new list of random pids to test with.

For my own sanity later the objectStore file is located at ./f3/info%3Afedora%2Fuofm%3A2939588

ajs6f commented

Weird. So you're saying that the indexer is reading the FOXML incorrectly and trying to find a managed datastream under the wrong URI?

I don't know that it is reading it incorrectly. I don't fully understand the HashPathIdMapper, so I can't figure out what might be the correct hash for info%3Afedora%2Fuofm%3A2939588%2BDC%2BDC.0 versus uofm%3A2939588%2BDC%2BDC.0.

What I know is that there is a DC datastream defined in the objectXML and Fedora can retrieve the location of it, but the fcrepo3-rdf-extractor is looking in the wrong spot for it.

ajs6f commented

The one that begins with info is almost certainly right-- I'm not sure how the other one is being derived within the indexer. Can you get me the FOXML?

ajs6f commented

Also, can you look and see if the URI is correct in your repo SQL db? I suspect that what we have here is a situation in which the db is right (so the repo, which uses the db for datastream dissemination, is cool), but the FOXML is wrong, so the indexer (which knows nothing about the db) snarls and pukes.

Yep (this is why I marked down the objectXML location 😉). Do you want the entire thing or just the DC datastreamVersion?

ajs6f commented

How about the whole DC datastream element?

<foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
   <foxml:datastreamVersion ID="DC.0" LABEL="DC Record" CREATED="2018-03-19T17:48:13.685Z" MIMETYPE="application/xml" SIZE="750">
       <foxml:contentLocation TYPE="INTERNAL_ID" REF="uofm:2939588+DC+DC.0"/>
   </foxml:datastreamVersion>
</foxml:datastream>

ajs6f commented

Yuuup. According to the FOXML, the indexer is looking in the right spot. IOW, the FOXML is wrong-ish (that is, the datastream could be at that location, but it's really unlikely because that's not how the code is written).
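To see what an indexer has to work with, here is a minimal Python sketch that pulls the `contentLocation` reference out of a FOXML fragment. It is not the extractor's code. The function name `content_locations` and the wrapper `foxml:digitalObject` element are illustrative assumptions; the wrapper is only there because the fragment quoted above does not carry its own namespace declaration.

```python
import xml.etree.ElementTree as ET

# FOXML namespace URI. The fragment quoted in this thread omits the
# declaration, so the sample below wraps it in an element that declares it.
FOXML_NS = "info:fedora/fedora-system:def/foxml#"


def content_locations(foxml_text):
    """Return (datastream ID, version ID, REF) for each INTERNAL_ID contentLocation."""
    root = ET.fromstring(foxml_text)
    ns = {"foxml": FOXML_NS}
    found = []
    for ds in root.iter(f"{{{FOXML_NS}}}datastream"):
        for version in ds.findall("foxml:datastreamVersion", ns):
            loc = version.find("foxml:contentLocation", ns)
            if loc is not None and loc.get("TYPE") == "INTERNAL_ID":
                found.append((ds.get("ID"), version.get("ID"), loc.get("REF")))
    return found


SAMPLE = f"""<foxml:digitalObject xmlns:foxml="{FOXML_NS}">
<foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
   <foxml:datastreamVersion ID="DC.0" LABEL="DC Record" CREATED="2018-03-19T17:48:13.685Z" MIMETYPE="application/xml" SIZE="750">
       <foxml:contentLocation TYPE="INTERNAL_ID" REF="uofm:2939588+DC+DC.0"/>
   </foxml:datastreamVersion>
</foxml:datastream>
</foxml:digitalObject>"""

print(content_locations(SAMPLE))
```

The only location information in the fragment is the `REF` attribute, which is why a tool that reads nothing but FOXML ends up looking for `uofm:2939588+DC+DC.0`.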

I will bet you a loonie vs. a Sacagawea dollar that if you look in the SQL db, that entry will be "correct" (will have the info prefix). If so, you can just correct the FOXML.

Where am I looking in the SQL db? Most of the tables seem empty.

ajs6f commented

Incidentally, the hashpath mapper takes the raw URI (e.g. info:fedora/uom:4354534) and hashes it for a pair-tree, then URL-encodes the original URI for a filename.
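A rough Python sketch of that mapping, for anyone following along. It assumes an MD5 hash and a two-character directory, which is what the `file:0f/...` path in the error log suggests. `hash_path` is a made-up name, and `urllib.parse.quote` is only a stand-in for the mapper's own encoder, which may treat some characters (underscores, for example) differently.

```python
import hashlib
from urllib.parse import quote


def hash_path(blob_id, prefix_len=2):
    """Approximate the relative on-disk path for a blob id.

    Directory: leading hex characters of the MD5 of the raw id.
    Filename: the URL-encoded id. quote() is a stand-in for the mapper's
    own encoder and may differ for some characters.
    """
    digest = hashlib.md5(blob_id.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{quote(blob_id, safe='')}"


# The id the extractor was asking for, taken from the FOXML REF attribute.
print(hash_path("uofm:2939588+DC+DC.0"))
```

If the sketch matches the mapper, the printed path should be the same one that appears in the MissingBlobException message.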

MySQL [UML_DAM]> show tables;
+---------------------+
| Tables_in_UML_DAM   |
+---------------------+
| datastreamPaths     |
| dcDates             |
| doFields            |
| doRegistry          |
| fcrepoRebuildStatus |
| modelDeploymentMap  |
| objectPaths         |
| pidGen              |
+---------------------+
8 rows in set (0.00 sec)

MySQL [UML_DAM]> describe datastreamPaths;
+-----------+--------------+------+-----+---------+----------------+
| Field     | Type         | Null | Key | Default | Extra          |
+-----------+--------------+------+-----+---------+----------------+
| tokenDbID | int(11)      | NO   | PRI | NULL    | auto_increment |
| token     | varchar(199) | NO   | UNI |         |                |
| path      | varchar(255) | NO   |     |         |                |
+-----------+--------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

MySQL [UML_DAM]> select * from datastreamPaths limit 2;
Empty set (0.00 sec)

ajs6f commented

Do you use disseminations from the CMA? If not, many of those tables won't be in use. But I think you want doRegistry. I'll have to go look at the code...

MySQL [UML_DAM]> select * from doRegistry where doPID = 'uofm:2939588';
+--------------+---------------+-------------------------------------+-------------+-----------------------------------+
| doPID        | systemVersion | ownerId                             | objectState | label                             |
+--------------+---------------+-------------------------------------+-------------+-----------------------------------+
| uofm:2939588 |             6 | the ownerID field is no longer used | A           | the label field is no longer used |
+--------------+---------------+-------------------------------------+-------------+-----------------------------------+
1 row in set (0.01 sec)

Or maybe it only exists in the database?

MySQL [UML_DAM]> describe doFields;
+---------------+--------------+------+-----+---------+-------+
| Field         | Type         | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+-------+
| pid           | varchar(64)  | NO   | MUL | NULL    |       |
| label         | varchar(255) | YES  |     | NULL    |       |
| state         | varchar(1)   | NO   |     | A       |       |
| ownerId       | varchar(64)  | YES  |     | NULL    |       |
| cDate         | bigint(20)   | NO   |     | NULL    |       |
| mDate         | bigint(20)   | NO   |     | NULL    |       |
| dcmDate       | bigint(20)   | YES  |     | NULL    |       |
| dcTitle       | text         | YES  |     | NULL    |       |
| dcCreator     | text         | YES  |     | NULL    |       |
| dcSubject     | text         | YES  |     | NULL    |       |
| dcDescription | text         | YES  |     | NULL    |       |
| dcPublisher   | text         | YES  |     | NULL    |       |
| dcContributor | text         | YES  |     | NULL    |       |
| dcDate        | text         | YES  |     | NULL    |       |
| dcType        | text         | YES  |     | NULL    |       |
| dcFormat      | text         | YES  |     | NULL    |       |
| dcIdentifier  | text         | YES  |     | NULL    |       |
| dcSource      | text         | YES  |     | NULL    |       |
| dcLanguage    | text         | YES  |     | NULL    |       |
| dcRelation    | text         | YES  |     | NULL    |       |
| dcCoverage    | text         | YES  |     | NULL    |       |
| dcRights      | text         | YES  |     | NULL    |       |
+---------------+--------------+------+-----+---------+-------+
22 rows in set (0.00 sec)

MySQL [UML_DAM]> select * from doFields where pid = 'uofm:2939588';
+--------------+----------+-------+---------+---------------+---------------+---------------+-------------+-----------+-----------+---------------+-------------+-----------------------------+---------+-------------------------------+----------------------+-----------------+----------+------------+------------+---------------------+-------------------------------------------------------------------------------------+
| pid          | label    | state | ownerId | cDate         | mDate         | dcmDate       | dcTitle     | dcCreator | dcSubject | dcDescription | dcPublisher | dcContributor               | dcDate  | dcType                        | dcFormat             | dcIdentifier    | dcSource | dcLanguage | dcRelation | dcCoverage          | dcRights                                                                            |
+--------------+----------+-------+---------+---------------+---------------+---------------+-------------+-----------+-----------+---------------+-------------+-----------------------------+---------+-------------------------------+----------------------+-----------------+----------+------------+------------+---------------------+-------------------------------------------------------------------------------------+
| uofm:2939588 | untitled | a     | whikloj | 1521481693685 | 1521484097596 | 1521481693685 |  untitled . | NULL      |   .       | NULL          | NULL        |  schappert, rachel, 1987- . |  2011 . |  stillimage mural paintings . |  latex on concrete . |  uofm:2939588 . | NULL     | NULL       | NULL       |  49.8102,-97.1316 . |  requests to use/reproduce this work should be sent to liv.valmestad@umanitoba.ca . |
+--------------+----------+-------+---------+---------------+---------------+---------------+-------------+-----------+-----------+---------------+-------------+-----------------------------+---------+-------------------------------+----------------------+-----------------+----------+------------+------------+---------------------+-------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

ajs6f commented

Something is wrong here. I don't think datastreamPaths can be empty. Can you check objectPaths?

Yeah, I did. This is why I could never understand this DB.

MySQL [UML_DAM]> select * from objectPaths limit 2;
Empty set (0.00 sec)

So if I do

> php -r 'print hash("md5", "uofm:2939588+DC+DC.0");' 
0fd62a45ed9c9125721d74b631a7b5d4

that seems to match where the rdf-extractor was looking, but if I try

> php -r 'print hash("md5", "info:fedora/uofm:2939588+DC+DC.0");'
09fdd3ed8f9f3405909931a4c87c481b

I still don't find the datastream

Found it.

> php -r 'print hash("md5", "info:fedora/uofm:2939588/DC/DC.0");'
acf73e7fb57fdd8d1df22c0eae4546dc
[whikloj@jujo]/local/dam/productionDAM/datastreamStore/ac% ls info%3Afedora%2Fuofm%3A2939588*
info%3Afedora%2Fuofm%3A2939588%2FDC%2FDC.0

ajs6f commented

Right. So my theory is correct: the FOXML is wrong (which is why the indexer is wrong) but the db is right (which is why the repo works). So you should be okay to just fix the FOXML and move on, but if you want to try to find the entry in the db to confirm, we can do that.

Also, you might want to think a bit about how the FOXML went wrong, because if it's an ongoing process, you might have a bug to fix. Nine times out of ten when I've seen this sort of thing with a repo it's because there is some non-repo code or process that is editing FOXML directly.

So I didn't modify the FOXML. They were ingested via Islandora, but I don't believe there was anything unusual about them. But...maybe.

My larger concern is there is nothing in the database. So how the heck is it finding these things?

ajs6f commented

Just because I do this all the time: are you sure you're looking at the right db, and the right schema, and using a db role that has full access?

ajs6f commented

If you have to, run SELECT COUNT(*) over all the tables and you should find the data.
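One way to script that check, as a sketch only: a small helper that runs one `SELECT COUNT(*)` per table through a Python DB-API connection. The name `table_counts` is made up, and the demo uses an in-memory SQLite database as a stand-in for the MySQL schema so it can run anywhere; against the real db you would pass a MySQL connection and the names from `show tables`.

```python
import sqlite3


def table_counts(conn, tables):
    """Return {table: row count} using one SELECT COUNT(*) per table.

    Table names cannot be bound as query parameters, so each one is checked
    against a simple identifier rule before being interpolated.
    """
    counts = {}
    cur = conn.cursor()
    for name in tables:
        if not name.isidentifier():
            raise ValueError(f"unexpected table name: {name!r}")
        cur.execute(f"SELECT COUNT(*) FROM {name}")
        counts[name] = cur.fetchone()[0]
    return counts


# Demo: in-memory SQLite standing in for the MySQL schema shown above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doRegistry (doPID TEXT)")
conn.execute("CREATE TABLE datastreamPaths (token TEXT, path TEXT)")
conn.execute("INSERT INTO doRegistry VALUES ('uofm:2939588')")
print(table_counts(conn, ["doRegistry", "datastreamPaths"]))
```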

So if I access the MODS datastream I see fedora log a message

INFO 2018-05-30 13:25:58.342 [http-bio-8080-exec-12] (DefaultManagement) Completed getDatastream(pid: uofm:2939567, datastreamID: MODS, asOfDateTime: null)

but if I get the DC datastream, there is no message. I think it is really just output from the database.

If I edit the DC datastream via the Admin interface I wonder if it propagates those to the filesystem.

I made sure I was using the same DB connection information as Fedora (from the fedora.fcfg) so I am seeing what Fedora is seeing.

Anyways, this is a Fedora issue, but seemingly one that is going to stop my use of this tool.

Closing...😞

ajs6f commented

Hm, I'm really disappointed to hear that. To be clear, the problem here isn't with the tool-- you have a corrupt repository. Maybe you can try making a modification to the DC datastream (via the repo API) and see if the FOXML is corrected?

No no, totally, it is not this tool's fault.

Sooooooo looking through my datastream directories a lot of my datastreams are hashed with /s where the +s appear in the objectXML.

I took a new Fedora 3.8.1 repo from the islandora_vagrant and, using the /fedora/admin Flash interface, created an object jared:1 and added an XML datastream STUFF.

FOXML

<foxml:datastream ID="STUFF" STATE="A" CONTROL_GROUP="M" VERSIONABLE="true">
   <foxml:datastreamVersion ID="STUFF.0" LABEL="The Stuff" CREATED="2018-05-30T19:53:33.609Z" MIMETYPE="text/xml" SIZE="902">
      <foxml:contentLocation TYPE="INTERNAL_ID" REF="jared:1+STUFF+STUFF.0"/>
   </foxml:datastreamVersion>
</foxml:datastream>

> php -r 'print hash("md5", "jared:1+STUFF+STUFF.0");'
ea9bf3a4d5ebaaa0d4a641c8b6774ab3

> ls -1 /usr/local/fedora/data/datastreamStore/ea
info%3Afedora%2Fislandora%3AbookCModel%2FDS-COMPOSITE-MODEL%2FDS-COMPOSITE-MODEL.0
info%3Afedora%2Fislandora%3Anewspaper%5Fcollection%2FTN%2FTN.0

> php -r 'print hash("md5", "info:fedora/jared:1+STUFF+STUFF.0");'
aaa7c826c3b48d49f1a855a58521b49f

> ls -1 /usr/local/fedora/data/datastreamStore/aa
ls: cannot access /usr/local/fedora/data/datastreamStore/aa: No such file or directory

> php -r 'print hash("md5", "info:fedora/jared:1/STUFF/STUFF.0");'
1f5f2cc14f708ddf1cb20a0341a45a59

> ls -1 /usr/local/fedora/data/datastreamStore/1f
info%3Afedora%2Fislandora%3A5%2FTN%2FTN.0
info%3Afedora%2Fjared%3A1%2FSTUFF%2FSTUFF.0
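
The manual checks above can be scripted. This is a hedged Python sketch, not part of the extractor: it tries the same three id forms against a datastreamStore directory and reports the first one that exists. `candidate_ids` and `locate` are hypothetical names, the two-character MD5 directory is inferred from the listings, and `quote` only approximates the store's filename encoding. The demo builds a throwaway directory that mimics the `1f` listing rather than touching a real store.

```python
import hashlib
import os
import tempfile
from urllib.parse import quote


def candidate_ids(pid, dsid, version):
    """The three id forms tried by hand in this thread."""
    return [
        f"{pid}+{dsid}+{version}",
        f"info:fedora/{pid}+{dsid}+{version}",
        f"info:fedora/{pid}/{dsid}/{version}",
    ]


def locate(store_root, pid, dsid, version):
    """Return (id, path) for the first candidate found on disk, else None."""
    for blob_id in candidate_ids(pid, dsid, version):
        subdir = hashlib.md5(blob_id.encode("utf-8")).hexdigest()[:2]
        path = os.path.join(store_root, subdir, quote(blob_id, safe=""))
        if os.path.isfile(path):
            return blob_id, path
    return None


# Demo against a temporary directory laid out like the 1f listing above.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "1f"))
open(os.path.join(root, "1f", "info%3Afedora%2Fjared%3A1%2FSTUFF%2FSTUFF.0"), "w").close()
print(locate(root, "jared:1", "STUFF", "STUFF.0"))
```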

My guess for why this did not reveal itself before is that most of my older objects have inline XML for their RELS-EXT, RELS-INT, MODS and DC datastreams. These newer objects have managed datastreams for MODS and DC.

I might be totally confused here but when persisting a datastream it appears that it takes the internal ID (the one used in the FOXML file) and does this.

https://github.com/fcrepo3/fcrepo/blob/master/fcrepo-server/src/main/java/org/fcrepo/server/storage/lowlevel/akubra/AkubraLowlevelStorage.java#L581-L594

Which converts + to /
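
As a sketch of that conversion in Python (based only on what the directory listings show, not a port of the linked Java): split the internal id on `+` and rejoin it under the `info:fedora/` prefix with `/`. The name `to_blob_id` is made up, treating a token without `+` as a bare PID is an assumption, and any extra encoding the real method applies to the datastream parts is left out.

```python
def to_blob_id(token):
    """Convert a FOXML internal id ('pid+dsid+version') to the form seen on
    disk ('info:fedora/pid/dsid/version').

    Assumption: a token without '+' is a bare PID (an object, not a datastream).
    """
    pid, sep, rest = token.partition("+")
    if not sep:
        return f"info:fedora/{pid}"
    parts = rest.split("+")
    if len(parts) != 2:
        raise ValueError(f"malformed datastream token: {token!r}")
    dsid, version = parts
    return f"info:fedora/{pid}/{dsid}/{version}"


print(to_blob_id("uofm:2939588+DC+DC.0"))
```

Applied to the REF values quoted earlier, this yields the ids whose hashes matched the directories where the files were actually found.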

This is Akubra, so perhaps the type of low level storage is important?

I can try to work that function into the code, but it is a private static method in that class. Do you have a suggestion for the best method for adding it? Should I create a new class with just that getBlobId function?