Sotera/DatawakeDepot

Extractor error

bwhiteman opened this issue · 5 comments

When creating a trail with a website that is part of another trail, none of the extracted data is persisted.

The following error occurs in the depot:
{ [MongoError: E11000 duplicate key error index: datawake.DwUrlExtraction.$id dup key: { : 1 }]
name: 'MongoError',
message: 'E11000 duplicate key error index: datawake.DwUrlExtraction.$id dup key: { : 1 }',
driver: true,
code: 11000,
index: 0,
errmsg: 'E11000 duplicate key error index: datawake.DwUrlExtraction.$id dup key: { : 1 }',
getOperation: [Function],
toJSON: [Function],
toString: [Function] }
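As a minimal sketch of how calling code could recognize this failure, the check below keys off the `code: 11000` field shown in the log. The function name `isDuplicateKeyError` and the `sampleError` object are hypothetical and are not taken from the depot code; the sample only copies fields from the error above.

```javascript
// Error object shaped like the one logged above (fields copied from the log).
const sampleError = {
  name: "MongoError",
  code: 11000,
  message:
    "E11000 duplicate key error index: datawake.DwUrlExtraction.$id dup key: { : 1 }",
};

// Hypothetical helper: true when the error carries the duplicate-key code 11000.
function isDuplicateKeyError(err) {
  return err != null && err.code === 11000;
}

console.log(isDuplicateKeyError(sampleError)); // true
console.log(isDuplicateKeyError(new Error("connection reset"))); // false
```

A check like this would only separate duplicate-key failures from other save errors; it does not explain why the key is duplicated.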

I believe this is because of the dw-url-extraction model. The model properties are: value, occurrences, extractorTypes, extractor, requester, created.

If we extract "John Smith" from two different URLs in a trail, all of the above properties would be identical, which would cause the error (the created field is populated on save, so it doesn't count). I had thought that the "dwTrailUrlId" from the "trailUrl" relation would make the record unique, but apparently it does not.

We need to investigate how to get the trailUrl included so that the insert is unique.

I believe we need to add this to the DWUrlExtraction model. It will need to be done on a clean database (or at least one with no extractions). I haven't tested it:

"indexes": {
  "trailUrl_extraction_index": {
    "keys": {
      "id": 1,
      "dwTrailUrlId": 1
    },
    "options": {
      "unique": true
    }
  }
}
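To illustrate what a compound unique key on `id` and `dwTrailUrlId` would change, here is a rough in-memory sketch. It is not MongoDB or LoopBack code: the `UniqueIndex` class, the record shapes, and the `url-A`/`url-B` identifiers are all made up for illustration.

```javascript
// In-memory stand-in for a unique index; not MongoDB or LoopBack code.
class UniqueIndex {
  constructor(fields) {
    this.fields = fields; // names of the record fields that form the key
    this.seen = new Set(); // serialized keys already stored
  }

  // Builds the index key from the configured fields of a record.
  keyFor(record) {
    return JSON.stringify(this.fields.map((f) => record[f]));
  }

  // Returns true if the record was accepted, false if its key already exists.
  insert(record) {
    const key = this.keyFor(record);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

// Two extractions of the same value from different trail URLs (made-up IDs).
const a = { id: 1, value: "John Smith", dwTrailUrlId: "url-A" };
const b = { id: 1, value: "John Smith", dwTrailUrlId: "url-B" };

// Key on id alone: the second record collides, as in the E11000 error.
const idOnly = new UniqueIndex(["id"]);
const idOnlyResults = [idOnly.insert(a), idOnly.insert(b)];

// Key on id plus dwTrailUrlId: both records are accepted.
const compound = new UniqueIndex(["id", "dwTrailUrlId"]);
const compoundResults = [compound.insert(a), compound.insert(b)];

console.log(idOnlyResults); // [ true, false ]
console.log(compoundResults); // [ true, true ]
```

The sketch models the compound key in isolation. In a real database, an existing unique index on the id alone would still reject the second record, because adding a second unique index does not relax the first.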

Do we see a similar problem if we hit the same URL multiple times within a trail?

Not that I've ever noticed, but that doesn't mean much.

I went to a URL that is already contained in another trail. I got extractions just fine and they were persisted. Closing as not an issue (at least not anymore).