OpenRefine?
arthurpsmith opened this issue · 5 comments
OpenRefine - https://github.com/OpenRefine/OpenRefine - seems to me to address a lot of the use cases that have been listed so far. It allows easy transformation of your data and reconciliation (by default right now against Wikidata). I've used it, for example, to reconcile countries from address records against standard country lists, and then to match up organizational records by name and standardized country. I know others have done more detailed location matching with it (city and state, for instance). You can set it to auto-match when it has high confidence and leave less well-matched entries for human review. I am sure there are a lot of your use cases it doesn't address easily, but I think it provides an excellent starting point at least - and it's also an open-source GitHub project...
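The same reconciliation can also be scripted outside the GUI against a service's API. Here is a minimal sketch against a Wikidata reconciliation endpoint; the endpoint URL, the country type Q6256, and the score cutoff are assumptions to adapt, not anything baked into OpenRefine:

```python
# Minimal sketch of querying an OpenRefine-style reconciliation service
# outside the GUI. The endpoint URL and score threshold are assumptions;
# substitute whatever service and cutoff fit your data.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://wikidata.reconci.link/en/api"  # assumed Wikidata recon endpoint

def reconcile(values, type_id=None, limit=3):
    """Send a batch of strings to the reconciliation endpoint and return candidates."""
    queries = {}
    for i, value in enumerate(values):
        q = {"query": value, "limit": limit}
        if type_id:
            q["type"] = type_id
        queries[f"q{i}"] = q
    data = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode()
    with urllib.request.urlopen(urllib.request.Request(ENDPOINT, data=data)) as resp:
        results = json.load(resp)
    return {values[int(key[1:])]: body.get("result", []) for key, body in results.items()}

# Auto-match high-confidence candidates, leave the rest for human review.
for name, candidates in reconcile(["Germany", "Republic of Korea"], type_id="Q6256").items():
    best = candidates[0] if candidates else None
    if best and (best.get("match") or best.get("score", 0) >= 99):  # cutoff is arbitrary
        print(f"{name} -> {best['id']} ({best['name']})")
    else:
        print(f"{name} -> needs review")
```

The `queries`/`result` JSON format is the same one OpenRefine itself uses when talking to a reconciliation service, so endpoints that work in the GUI can be reused from scripts.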
OpenRefine is a useful tool. It would be nice to see some more support behind its RDF extension, or maybe there is a better, newer alternative?
Karma and WebKarma are more powerful in many ways but more difficult to use.
> It would be nice to see some more support behind its RDF extension, or maybe there is a better, newer alternative?
Indeed there are, though mostly because the RDF extension has been neglected for so long. There are many reconciliation scripts that work alongside Refine (not natively, and not via a GUI, so typically Python or Java code), and some, like codeforkJeff's VIAF recon service, are hosted, so no local installation of code is necessary.
I attempted to gather references to all the local Python-based recon scripts I've used into one repo and, to make dependency hell slightly less infuriating, created a conda environment for it. The repo includes a walkthrough of targeted vocabularies that is hopefully useful to people.
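As a rough illustration of the pattern most of those local scripts follow (the vocabulary file, labels, and threshold below are hypothetical; real scripts target VIAF, LCSH, FAST, and the like):

```python
# Sketch of a local recon script: fuzzy-match incoming labels against a target
# vocabulary and only auto-accept above a confidence threshold. The CSV path,
# sample labels, and cutoff are hypothetical.
import csv
from difflib import SequenceMatcher

THRESHOLD = 0.93  # arbitrary cutoff for auto-accepting a match

def load_vocab(path):
    """Load a two-column CSV of (identifier, preferred label)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[0], row[1]) for row in csv.reader(f)]

def best_match(label, vocab):
    """Return (identifier, preferred label, score) for the closest vocabulary term."""
    scored = [
        (ident, pref, SequenceMatcher(None, label.lower(), pref.lower()).ratio())
        for ident, pref in vocab
    ]
    return max(scored, key=lambda t: t[2])

vocab = load_vocab("target_vocabulary.csv")  # hypothetical file
for raw in ["United States of Amerca", "Deutschland"]:
    ident, pref, score = best_match(raw, vocab)
    status = "auto-match" if score >= THRESHOLD else "needs review"
    print(f"{raw!r} -> {ident} ({pref}) score={score:.2f} [{status}]")
```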
OpenRefine needs to be situated in a few contexts to make sense of it for this work:
- RDF in / RDF out is not easy due to the RDF extension issues raised above (and is not trivial regardless, because OpenRefine expects a tabular format as its working data format)
- Setting up reconciliation services is very easy, but setting them up to scale to the level of data we're discussing is hard + not trivial in cost, especially once you add the consideration of where the data we are matching against lives
- We don't have shared matching algorithms or agreed confidence levels for considering a candidate a match, nor provenance expectations and storage modeling for this work (this is something we would produce as part of this work, to share across tools / beyond OpenRefine; see the rough sketch after this list)
- OpenRefine is focused on GUI-based data work, and not all reconciliation or entity resolution will be happening in a GUI (I don't want to transform + perform entity resolution for all entities in all 10 million MARC records in my library catalog in OpenRefine)
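On the shared-expectations point above, a very rough sketch of the kind of match record that requirements work might converge on, capturing the decision, the confidence, and the provenance of how it was made. Field names and values are purely illustrative, not a proposed model:

```python
# Illustrative only: one possible shape for a tool-agnostic match assertion
# with confidence and provenance. Nothing here is an agreed standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MatchAssertion:
    source_record: str   # identifier of the record being reconciled (e.g. a catalog record ID)
    target_entity: str   # URI of the matched entity (e.g. a Wikidata or VIAF URI)
    score: float         # confidence from the matcher; the scale itself would need agreement
    algorithm: str       # which matching algorithm produced the score
    decided_by: str      # "auto" or the reviewer who confirmed the match
    decided_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

assertion = MatchAssertion(
    source_record="catalog:record/123456",                 # hypothetical identifier
    target_entity="http://www.wikidata.org/entity/Q183",
    score=0.97,
    algorithm="string similarity on preferred label",
    decided_by="auto",
)
print(assertion)
```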
We can leave this ticket open for further discussion of OpenRefine if people are interested, but the use case work is to gather requirements so that we don't start from a position of 'this is the tool to be used', but from a position of 'these are our requirements, here are the shared needs / workflow requirements, and these possible tools do / don't fit because of XYZ.'
Archiving this repository (it has been inactive since 2017).