pkiraly/metadata-qa-api

Extending Problem Catalogue - Limits on complexity?


atiro commented

I'm wondering how complex the problem catalogue rule code should get, or whether it's better to leave some calculations to a later stage of processing. For example, identifying:

  • object types that appear in both singular and plural form,
  • object types that appear in both upper and lower case form,
  • object types that are aggregations of other object types (e.g. "watercolour and ink drawing").

These are relatively simple to do in Pandas (well, except the last one) by reading the CSV output from MQAF with the relevant extracted field. They could also be added to MQAF as problem catalogue rules, but that feels like writing a lot of code to replicate the kind of large-scale data analysis that Pandas already handles very well.
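To make the Pandas side concrete, here is a rough sketch of the first two checks. It is only an illustration: the function names are made up, the plural detection is a deliberately crude "strip one trailing s" heuristic, and the input is assumed to be a single column of object type strings, loaded for example with something like `pd.read_csv("mqaf-output.csv")["object_type"]` (both the file name and the column name are hypothetical).

```python
from collections import defaultdict

import pandas as pd


def variant_groups(values, key):
    """Group distinct values sharing the same key; keep groups with 2+ forms."""
    distinct = pd.Series(values).dropna().astype(str).str.strip().unique()
    groups = defaultdict(set)
    for value in distinct:
        groups[key(value)].add(value)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


def naive_singular(value):
    """Crude heuristic (an assumption, not real stemming): drop one trailing 's'."""
    if value.endswith("s") and not value.endswith("ss"):
        return value[:-1]
    return value


def case_variants(values):
    """Object types that differ only by letter case."""
    return variant_groups(values, str.lower)


def plural_variants(values):
    """Object types that appear in both singular and plural form."""
    # Lowercase first so that pure case variants are not reported again here.
    lowered = pd.Series(values).dropna().astype(str).str.lower()
    return variant_groups(lowered, naive_singular)


# Toy data standing in for the extracted field.
types = pd.Series(
    ["print", "Print", "prints", "drawing", "glass",
     "watercolour and ink drawing", None]
)
print(case_variants(types))
print(plural_variants(types))
```

Both checks here only look at the distinct values of the field rather than at every record, which may be relevant to where the line gets drawn.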

I'm trying to decide where the line should be drawn. Maybe the cut-off is whether a rule requires the entire dataset to be in memory: if it does, is that where something like Pandas should come in?