omeka-s-modules/CSVImport

Resources import doubles media

Closed this issue · 12 comments

When importing a CSV as resources, the final row (in this case a media) keeps getting imported twice.

This csv file:

this is what,dcterms:title,description,item identifier,item set id,creator,visibility status,date,media url,html media url
item set,Birbs,Lots of birbs,,,,public,,,
item,Barn Owl,The barn owl (Tyto alba) is the most widely distributed species of owl and one of the most widespread of all birds,,Birbs,,public,,,
media,Tyto alba,"A Barn Owl at British Wildlife Centre, Surrey, England.",Barn Owl,,snowmanradio,public,"23 July 2011, 16:03",https://upload.wikimedia.org/wikipedia/commons/c/c6/Tyto_alba_-British_Wildlife_Centre%2C_Surrey%2C_England-8a_%281%29.jpg,

had this result:
[Screenshot of the import result]

This happens on both danielkm and ui-improvements.

I tinkered with the CSV and was able to successfully import only 1 media item by adding a newline at the end. Could you try the import again with this version of the file?
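For anyone wanting to try the same workaround without editing the file by hand, here is a minimal sketch in Python (not part of the module, which is PHP). The function name `ensure_trailing_newline` is made up for illustration; it only appends a line break when the text does not already end with one.

```python
def ensure_trailing_newline(text: str) -> str:
    """Return text unchanged if it is empty or already ends with a line break;
    otherwise append a single newline."""
    if text == "" or text.endswith(("\n", "\r")):
        return text
    return text + "\n"


# Hypothetical two-line CSV whose last row has no terminating newline.
original = "this is what,dcterms:title\nmedia,Tyto alba"
patched = ensure_trailing_newline(original)
```

This only works around the symptom; it does not address whatever in the import logic causes the last row to be read twice.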

Came through with only one of everything.

Does the problem with the original file happen only when used as a "resources" import, or does it also happen if used as an "items" import?

Importing as item (ignoring any of the resources mappings) works as expected - two items and one item with a media.

So 3 items total?

yes

OK, so the problem is almost certainly in the module's logic specific to mapping resources, or something like that, not in the parsing of the CSV itself.

Probably fixed by d064d9a too, like #130.

Okay, this error seems to be a bad interaction between the SplFileObject iterator set with our flags and the LimitIterator: setting an offset past the end of the file keeps returning the last row. We're seeing it on Resource imports only because they explicitly step over the file one row at a time (to allow back-references to previously-imported resources).
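To illustrate why Resource imports go row by row, here is a rough sketch in Python rather than the module's PHP. It uses a trimmed version of the CSV from the report and assumes, purely for illustration, that the "item identifier" and "item set id" columns refer to earlier rows by title; the numeric ids and the `import_rows` helper are hypothetical.

```python
import csv
import io

# Trimmed copy of the reported CSV: only the columns needed for back-references.
CSV_TEXT = """this is what,dcterms:title,item identifier,item set id
item set,Birbs,,
item,Barn Owl,,Birbs
media,Tyto alba,Barn Owl,
"""


def import_rows(csv_text):
    """Import rows one at a time so later rows can refer to earlier ones by title."""
    imported = {}  # title -> hypothetical id assigned when the row was imported
    results = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for next_id, row in enumerate(reader, start=1):
        # Resolve the back-reference against rows that were already imported.
        parent_title = row["item identifier"] or row["item set id"]
        parent_id = imported.get(parent_title) if parent_title else None
        imported[row["dcterms:title"]] = next_id
        results.append((row["this is what"], row["dcterms:title"], parent_id))
    return results


results = import_rows(CSV_TEXT)
```

The media row can only find its parent item because the item row was fully imported first, which is why this import type cannot simply hand the whole file to a batch in one go.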

I believe the READ_AHEAD/SKIP_EMPTY/DROP_NEW_LINE combo is a factor here too: a plain SplFileObject works fine (or rather, actually works even worse: it seems to block access to the final row). Unfortunately, SplFileObject does some odd things that don't play nicely with the composable iterators.

We can interpose our own iterator between the LimitIterator and the file... but I'd prefer not to do that if we can avoid it.

Turns out this was just a simple off-by-one error. If the underlying iterators were working properly it would probably have been invisible, but the fix should prevent the weirdness from ever happening, since we won't ever ask for offsets past the end of the file.
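A toy model of the interaction, sketched in Python rather than PHP: `StickyRows` is a made-up stand-in that imitates the reported symptom (an offset past the end returns the last row again), and the two loops are hypothetical, since the thread does not show where the module's off-by-one actually was.

```python
class StickyRows:
    """Toy reader imitating the reported symptom: asking for an offset past
    the end returns the last row again instead of nothing."""

    def __init__(self, rows):
        self.rows = list(rows)

    def row_at(self, offset):
        # Clamp the offset to the last valid index (the "sticky" behaviour).
        return self.rows[min(offset, len(self.rows) - 1)]


def import_off_by_one(reader, row_count):
    # Inclusive bound: also asks for offset == row_count, one past the last row.
    return [reader.row_at(offset) for offset in range(row_count + 1)]


def import_bounded(reader, row_count):
    # Exclusive bound: never asks for an offset past the end of the data.
    return [reader.row_at(offset) for offset in range(row_count)]


reader = StickyRows(["item set", "item", "media"])
doubled = import_off_by_one(reader, 3)
correct = import_bounded(reader, 3)
```

With a well-behaved reader the extra request would come back empty and go unnoticed; with the sticky one it shows up as a duplicated final row, which matches what was reported.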