pkiraly/metadata-qa-api

Make FieldExtractor independent from recordId

Closed this issue · 12 comments

As @atiro pointed out in #66 the FieldExtractor depends on an existence of recordId field. This is a bad dependency the class should be independent from any concrete field.

atiro commented

Hi @pkiraly have just tried this. Couple of comments (sorry!):

  • It doesn't seem to keep the field name in the header as specified in the schema, it's now always called recordId (maybe it could be extract:[fieldname] to follow the pattern of other named columns?)
  • Multiple fields set to extractable are joined together with commas but that means there are more values than columns (as only the first field is named in the header (as recordId))
  • Field values containing commas are not string quoted so will also create the same problem with too many values than columns
  • If the field is empty in the record, it's showing as null in the CSV, should just be empty

@atiro I made some changes to fix these errors. I am not sure about he second one, but I hope that fixing the other 3 solves this one as well. Could you please take another try?

atiro commented

Thanks @pkiraly, just tried and that's resolved the first, second and last issues, CSV quoting is not there though (I can log that as a new issue as appreciate it's a new feature). I think looking at the commit you are just handling it in the header and not the row value ?

@atiro Many thanks for the feedback! Could you please copy some lines from which shed light on which API metod do you call?

e.g.:

CSVIterator iterator = new CSVIterator(new CSVReaderHeaderAware(new FileReader(fileName)));
List<List<String>> result = new ArrayList<>();
while (iterator.hasNext()) {
  String line = CsvReader.toCsv(iterator.next());
  result.add(facade.measureAsList(line)); <--- this is the method call what I am interested in
                    ^^^^^^^^^^^^^
}
assertEquals(2, result.size());
assertEquals(List.of("https://neurovault.org/images/384958/"), result.get(0));
assertEquals(List.of("https://neurovault.org/images/93390/"), result.get(1));
atiro commented

As an example @pkiraly , with this as input:

{"title": [], "objectType": "photograph", "partTypes": [{"text": "photograph", "id": "AAT46300"}], "contentWarning": [{"apprise": "", "note": ""}], "placesOfOrigin": [], "productionDates": [{"date": {"text": "1890-1911", "earliest": "01/01/1890", "latest": "31/12/1911"}, "association": {"text": "made", "id": "x28654"}, "note": ""}], "associatedObjects": [], "artistMakerPerson": [{"name": {"text": "Hurry, Leslie", "id": "A9404"}, "association": {"text": "designer", "id": "AAT25681"}, "note": ""}], "artistMakerOrganisation": [], "artistMakerPeople": [{"name": {"text": "Hurry, Leslie", "id": "A9404"}, "association": {"text": "designer", "id": "AAT25681"}, "note": ""}], "materialsTechniques": "", "creditLine": "", "accessionNumber": "908-1911", "summaryDescription": "", "physicalDescription": "", "dimensions": [], "dimensionsNote": "", "materials": [], "techniques": [{"text": "photography", "id": "AAT54225"}], "marksAndInscriptions": [{"content": "'H. W. Fincham 1905'", "inscriber": {"name": {"text": "", "id": ""}, "association": {"text": "", "id": ""}}, "date": {"text": "", "earliest": "", "latest": ""}, "description": "", "interpretation": "", "language": "", "medium": "", "method": "", "position": "", "script": "", "translation": "", "transliteration": "", "type": "", "note": "Signed and dated in pencil"}], "objectHistoryNote": "", "historicalContextNote": "", "descriptiveLine": "Photograph by H. W. Fincham, depicting houses adjoining the Bay & Say factory, Dedham, Essex, 1905", "bibliographicReferences": [], "productionNote": "", "productionType": {"text": "", "id": ""}, "contentDescription": "", "contentPlace": [], "associatedPlace": [], "contentPerson": [], "associatedPerson": [], "contentOrganisation": [], "associatedOrganisation": [], "contentPeople": [], "associatedPeople": [], "contentEvent": [], "associatedEvent": [], "contentOther": [], "contentConcept": [], "contentLiteraryRefs": [], "categories": [], "styles": [], "collectionCode": {"text": "PDP", "id": "THES48595"}, "galleryLabels": [], "partNumbers": ["908-1911"], "accessionNumberPrefix": [""], "accessionNumberNum": ["908"], "accessionNumberSubExtension": [""], "accessionYear": ["1911"], "otherNumbers": [], "copyNumber": "", "editionNumber": "", "imageResolution": "low", "aspects": ["WHOLE"], "systemNumber": "O1299410", "recordModificationDate": "15/07/2020", "recordCreationDate": "26/08/2014", "galleryLocation": [{"current": {"text": "511M", "id": "THES49494"}, "free": "", "case": "DU2", "shelf": "4", "box": "", "date": {"text": "", "earliest": "", "latest": ""}}], "imagesAll": [], "assetsAll": []}

and the following schema:

format: json
fields:
- name: artistMakerPerson
path: $.['artistMakerPerson'][].name.text
extractable: true
- name: artistMakerOrganisation
path: $.['artistMakerOrganisation'][
].name.text
- name: artistMakerPeople
path: $.['artistMakerPeople'][].name.text
- name: title
path: $.['title'][
].title
extractable: true

I get this as the CSV from calling measure (with headers):

[artistMakerPerson, title, completeness:TOTAL, existence:artistMakerPerson, existence:artistMakerOrganisation, existence:artistMakerPeople, existence:title, cardinality:artistMakerPerson, cardinality:artistMakerOrganisation, cardinality:artistMakerPeople, cardinality:title]
Hurry, Leslie,,0.5,1,0,1,0,1,0,1,0

(measureAsList will be correct, was just the output from measure needs quoting)

@atiro I turned this into a test and fixed it. Still it needs to improve in a larger refactoring, which comes soon - but now it produces quoted strings if the value contains one or more comma.

@pkiraly when will you release this on sonatype? I'm blocked by this issue and considering building from source, but I'd prefer not to

@pkiraly when will you release this on sonatype? I'm blocked by this issue and considering building from source, but I'd prefer not to

@mielvds Is it OK to create a snapshot jar in Sonatype? Or do you mean a proper, not snapshot version?

Snapshot is perfect! It just so I can sync my console application with the latest fixes

It works! Issue can be closed I believe

@mielvds Thanks for the feedback!