internetarchive/openlibrary

Index normalized author name in solr

anandology opened this issue · 31 comments

Imagine the case where the author name has special accent characters like "Ghaṭṭi Añjanēyaśarma". Most of the time, the user won't be able to enter the accent characters and autocomplete will fail.

The search engine should index the accent-stripped version of the author name along with the real name to avoid such issues.
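To make the request concrete, here is a minimal Python sketch of what "index the accent-stripped version along with the real name" could look like. It is only an illustration: the field names `name` and `name_folded` are hypothetical and not taken from the Open Library schema, and the stripping is done with the standard `unicodedata` module rather than inside Solr.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Return the name with combining marks removed (a rough accent-stripped form)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def author_index_fields(name: str) -> dict:
    """Build the fields to index: the real name plus its accent-stripped form."""
    # "name" and "name_folded" are hypothetical field names used for illustration.
    return {"name": name, "name_folded": strip_accents(name)}


doc = author_index_fields("Ghaṭṭi Añjanēyaśarma")
```

With both fields indexed, a user typing plain "Ghatti" could match on the folded field while the display still uses the real name.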

👍

Looks like I already reported this bug 3 years back, but it's not fixed yet.

https://bugs.launchpad.net/openlibrary/+bug/540866

Edward had some suggestions about how it can be fixed.

So it's a matter of configuration? Edward's solution was using http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory:

solr.ASCIIFoldingFilterFactory

Creates org.apache.lucene.analysis.ASCIIFoldingFilter.

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

<filter class="solr.ASCIIFoldingFilterFactory"/>
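For readers unfamiliar with ASCII folding, the following Python sketch approximates the idea outside Solr. It is not the filter's implementation: the `FALLBACK` table is a small, hand-picked subset for letters such as "ø" and "ß" that have no Unicode decomposition, while the real filter uses a much larger mapping.

```python
import unicodedata

# Illustrative subset only: letters that NFKD cannot split into base + accent.
FALLBACK = {
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
}


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding: strip accents, then apply the fallback table."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Characters without a known ASCII equivalent pass through unchanged.
    return "".join(FALLBACK.get(c, c) for c in no_marks)
```

The fallback table is the reason plain decomposition is not enough on its own: `unicodedata.normalize("NFKD", "ø")` returns "ø" unchanged.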

I tried that, but it didn't seem to work. Requires more exploration.


Working on moving to solr with single core and improved schema. Will fix that after that is done. Targeting this for May.

I would recommend something more sophisticated like the NFKC_Casefold option of:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUNormalizer2FilterFactory

so that we handle Unicode normalization as well. I know I've seen both composed and decomposed forms in OpenLibrary.
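As a rough illustration of the composed versus decomposed problem, the Python sketch below compares two encodings of the same name. It only approximates NFKC_Casefold by combining `unicodedata.normalize("NFKC", ...)` with `str.casefold()`; the ICU filter's exact behaviour may differ, and `normalize_key` is a name made up for this example.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Approximate NFKC_Casefold: normalize, casefold, then normalize again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Casefolding can produce text that is no longer normalized, so run NFKC again.
    return unicodedata.normalize("NFKC", folded)


composed = "Dvo\u0159\u00e1k"       # ř and á stored as single code points
decomposed = "Dvor\u030ca\u0301k"   # r + combining caron, a + combining acute

same_before = composed == decomposed
same_after = normalize_key(composed) == normalize_key(decomposed)
```

Here `same_before` is false and `same_after` is true, so the two stored forms would match each other. Note that normalization of this kind keeps the diacritics, so a query for plain "dvorak" would still not match.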

This tokenizer probably deserves investigation as well: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

Here are some very basic names which aren't found: Antonin Dvořak, Antonin Dvořák, Antonín Dvořák, Antonín Dvorák, Antonín Dvořak

Amongst other problems, not having them show up in search makes them very difficult to merge.

This is a duplicate of issue #11.

OpenLibrary is currently stuck on Solr v1.4.0 which is 4+ years old. Many of the useful diacritic folding capabilities were introduced with Solr 3.1 in 2011. Is there a reason not to move to a more modern version?

Time, I bet ;)

On Fri, Oct 11, 2013 at 5:04 AM, Tom Morris notifications@github.com wrote:

OpenLibrary is currently stuck on Solr v1.4.0 which is 4+ years old. Many of the useful diacritic folding capabilities were introduced with Solr 3.1 in 2011. Is there a reason not to move to a more modern version?

In progress. I've already set up a node with Solr 3.1 and an improved setup to handle searching for editions, authors and works. It will go live in a month or so.

@Gio I know you made some Solr changes recently. Was diacritic folding and/or unicode normalization part of that work or is this still open?

@anandology there still seem to be two very similar records at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A and at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A
Neither of them is found yet by an author search for "Ghatti Anjaneyasarma".

@LeadSongDog Anand (anandology) isn't involved any more. As I understand it, Gio (@gdamdam) is the current dev. Unfortunately when I attempted to ping him for status back in January, I inadvertently used the wrong username.

@gdamdam Any update on Solr diacritic folding?

We've moved the full-text search engine to an Internet-Archive-based Elastic Search cluster. A decision needs to be made about the OL metadata search engine. Keep SOLR? Also move to Elastic Search?

A little context on the move to Elastic Search: The SOLR used for searching inside books was found to be continuously corrupting. Repaired data re-corrupted after a few weeks for no ascertainable reason. We weighed between upgrading SOLR and moving to ES, which has much more support within the Archive. We chose the latter.

@bfalling If a switch to Elasticsearch is a blocker for this task, has any progress been made on advertising the potential change (e.g. to ol-tech or ol-discuss), soliciting feedback, preparing downstream consumers for the change?

This bug represents a significant usability issue and was first reported in 2010. It'd be nice to make some progress on it.

Regarding operating an ES instance and a solr instance, I agree that it is somewhat indefensible to have OL and IA on completely different search indices and databases. @tfmorris one thing we've started to do is write back openlibrary_work and openlibrary_edition IDs into their corresponding archive.org items. This allows us to do more querying against Internet Archive Elastic Search. OL still needs solr (or its own ES) in the interim because there are many works and editions for which there are no corresponding archive.org items, and IA is reluctant to store metadata in ES for works/editions which are not digitized.

One of the current challenges is that solr takes a while to update and it's becoming increasingly difficult to keep our tiny solr instance sync'd with IA's borrow availability data. We've been switching Open Library to use a special Archive.org availability API to get this info (instead of trying to write back to solr). One downside is we can't easily query Open Library for available works.

In the next year or so I'd like to see tighter integration between IA and OL in terms of moving metadata away from OL's postgres and solr instance into some official shared infrastructure which both services can agree upon. This direction is a very early stage idea, but it's worth bringing up in case there are strong opinions which may help us avoid "gotchas".

As pointed out in #599, the ICU Normalizer, mentioned in my Aug 2013 note, isn't powerful enough and we actually want ICU Folding.

@cdrini is this related to #599?

I've confirmed that the ICUFoldingFilter correctly handles this case. For
q=type:author AND name:Ghatti Anjaneyasarma it returns both records:

      [
        {
          "name": ["Ghaṭṭi Añjanēyaśarma"],
          "key": "/authors/OL6A"
        },
        {
          "name": ["Ghaṭṭi Āñjanēyaśarma"],
          "key": "/authors/OL179948A"
        }
      ]

One nuance is that the giant concatenated text field doesn't use ICUFoldingFilter because it's a mishmash of different types of text, but I think we probably want to move away from it anyway because it doubles the amount of data that we need to index.
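For anyone who wants to repeat the check, this is a small Python sketch that builds the request URL for the query quoted above. The endpoint `http://localhost:8983/solr/select` is an assumption for a local test instance (the real host and core are not given in this thread), and `author_query_url` is a helper name invented for the example.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for a local test instance; adjust to the actual Solr core.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def author_query_url(name: str) -> str:
    """Build a Solr select URL for an author-name query, asking for JSON output."""
    params = {"q": f"type:author AND name:{name}", "wt": "json"}
    # urlencode percent-encodes the colons and turns spaces into "+".
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = author_query_url("Ghatti Anjaneyasarma")
```

The sketch only constructs the URL; sending it and parsing the response is left out so it runs without a Solr server.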

Searching finds each of those two keys individually, with no indication of the other. Surely we would want to display some indication of the nearly-identical spelling's existence to facilitate their merger.
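One way such an indication could be computed is to group author records by a folded form of the name and flag groups with more than one key. The Python sketch below is an assumption-laden illustration, not existing Open Library code: `fold` and `merge_candidates` are hypothetical helpers, and the folding is done with `unicodedata` rather than in Solr. The two records are the ones mentioned in this thread.

```python
import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Strip combining marks and casefold, giving a comparison key."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def merge_candidates(authors: list) -> dict:
    """Map each folded name to its author keys, keeping only names shared by 2+ records."""
    groups = defaultdict(list)
    for author in authors:
        groups[fold(author["name"])].append(author["key"])
    return {name: keys for name, keys in groups.items() if len(keys) > 1}


authors = [
    {"name": "Ghaṭṭi Añjanēyaśarma", "key": "/authors/OL6A"},
    {"name": "Ghaṭṭi Āñjanēyaśarma", "key": "/authors/OL179948A"},
]
candidates = merge_candidates(authors)
```

Both records fold to the same key here, so they would be surfaced together as a possible merge.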

Searching finds each of those two keys individually, with no indication of the other. Surely we would want to display some indication of the nearly-identical spelling's existence to facilitate their merger.

You've just restated the original request.

Making this a sub-task of #789

Making this a sub-task of #789

Why? They have nothing to do with each other. This requires a change to the indexing schema while #789 is about update frequency and accuracy.

Mostly because #789 refers to

...missing expected data fields

and I would expect a normalized author name to be in Solr. But I can remove the sub-task if you feel it's inappropriate.

No longer a sub-task of #789.

#11 was closed in favour of this. Consider that
https://openlibrary.org/search?q=%22tiananmen%22&mode=everything finds 447 hits, but
https://openlibrary.org/search?q=%22tian%27anmen%22&mode=everything finds only 45 and
https://openlibrary.org/search?q=%22tiananmin%22&mode=everything finds zero.
Some sort of soundex or metaphone normalization has to be indexed

[update: spun off as #2752]

Some sort of soundex or metaphone normalization has to be indexed

@LeadSongDog That's a different feature request. Let's keep the discussion here focused on the original request, i.e. diacritic folding and Unicode normalization.

The fix for this is in tfmorris@c7026ff and is straightforward, but it requires a Solr config change by OL staff, which is unlikely to ever happen, so I'm unassigning myself.