UB-Mannheim/bbw

Add option for selecting a language to reconcile with

hay opened this issue · 4 comments

hay commented

I tried using this tool to reconcile a list of about 100 church denominations (a gist can be found here). Unfortunately, the results were pretty mediocre (only around 5 got matched) because the list is in Dutch while the matching is done using only English labels.

I think it would be a very useful addition to make the language code configurable. This is very easy for both the OpenRefine reconciliation endpoint and the WD query service. Also see my wdreconcile tool for some inspiration on how something like that could be done.
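
To illustrate the query-service side, here is a minimal sketch of an exact label lookup with a configurable language tag. The helper names `build_label_query` and `build_query_url` are made up for this example and are not part of bbw or wdreconcile; the sketch only builds the request and does not send it.

```python
import re
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_label_query(label: str, lang: str = "en", limit: int = 5) -> str:
    """Return a SPARQL query for items whose label in `lang` equals `label`.

    Hypothetical helper for illustration only.
    """
    # Accept only simple language tags such as "nl" or "pt-br".
    if not re.fullmatch(r"[A-Za-z]{2,3}(-[A-Za-z0-9]+)*", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    # Escape backslashes and double quotes so the label stays a valid literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE { "
        f'?item rdfs:label "{escaped}"@{lang} . '
        f"}} LIMIT {limit}"
    )


def build_query_url(label: str, lang: str = "en") -> str:
    """Return the GET URL that would run the query and ask for JSON results."""
    params = {"query": build_label_query(label, lang), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # The only change needed for a Dutch list is the language tag.
    print(build_label_query("Rooms-Katholieke Kerk", lang="nl"))
```

An exact-match query like this is stricter than a fuzzy reconciliation search, so it is only meant to show where the language code plugs in.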

Hi @hay, did you happen to see and try the lowercase language codes for labels, descriptions, aliases, and sitelinks?
For instance, the syntax Lnl selects the label in language nl (Dutch).
We (the OpenRefine Wikidata recon service maintainers) document this for reconciling against Wikidata with OpenRefine at https://wikidata.reconci.link/

@hay, thank you for testing and for your good example.

I have added an automatic language detector using the langid library. It should easily detect Dutch. Alternatively, it's possible to specify 'language="lang-code"' in the annotate and contextual_matching functions.

Note that your example is a one-column table. The important feature of bbw is contextual matching, which needs at least two cells in a row. We augment a one-column table by copying the column.
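
As a rough illustration of that augmentation step (not the actual bbw code; `augment_single_column` is a hypothetical name and the sample cells are made up), duplicating the single column could look like this:

```python
def augment_single_column(rows):
    """Return a copy of `rows` in which every one-cell row is duplicated
    into two identical cells; rows with more cells are left unchanged.

    Hypothetical helper for illustration only.
    """
    augmented = []
    for row in rows:
        cells = list(row)
        # A single cell gives contextual matching nothing to compare against,
        # so the same value is reused as its own "context".
        augmented.append(cells * 2 if len(cells) == 1 else cells)
    return augmented


if __name__ == "__main__":
    table = [["Doopsgezind"], ["Luthers"]]
    for row in augment_single_column(table):
        print(row)
```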

@shigapov I wonder if some of that info could be put into the docs or README.md? That seems useful to know. Hmm, maybe start a new /doc folder and begin putting some .md files in there, just as a starting point for users who could also contribute back with PRs!

hay commented

Thanks, I tried running my list again on the same codebase and the language detection works pretty well. Thanks!