Index normalized author name in solr

Question

anandology opened this issue 12 years ago · 31 comments

Imagine the case where the author name has accented characters, like "Ghaṭṭi Añjanēyaśarma". Most of the time, the user won't be able to enter the accented characters and autocomplete will fail.

The search engine should index the accent-stripped version of the author name along with the real name to avoid such issues.
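As a rough illustration of the request (not the Open Library implementation), the accent-stripped form can be approximated with Python's standard `unicodedata` module. The helper names `strip_accents` and `index_forms` are hypothetical and exist only for this sketch.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Return `name` with combining marks removed (a rough accent-stripped form)."""
    # NFKD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep every other character.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(name: str) -> list[str]:
    """Forms to index: the real name and, when it differs, the stripped one."""
    stripped = strip_accents(name)
    return [name] if stripped == name else [name, stripped]


print(index_forms("Ghaṭṭi Añjanēyaśarma"))
```

Indexing both forms lets a query typed without diacritics match, while the original spelling stays available for display.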

bencomp commented 12 years ago

👍

Answer 1 · 2013-03-28T07:09:55.000Z

Looks like I already reported this bug 3 years ago, but it has not been fixed yet.

https://bugs.launchpad.net/openlibrary/+bug/540866

Edward had some suggestions about how it can be fixed.

Answer 2 · 2013-03-28T07:58:28.000Z

So it's a matter of configuration? Edward's solution was using http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory:

solr.ASCIIFoldingFilterFactory

Creates org.apache.lucene.analysis.ASCIIFoldingFilter.

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

<filter class="solr.ASCIIFoldingFilterFactory"/>
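To make "into their ASCII equivalents, if one exists" concrete, here is a small Python sketch of the idea. It is not the Lucene filter: the `EXTRA_FOLDS` table is a tiny illustrative subset invented for this example, and `ascii_fold` is a hypothetical name. It shows that some letters (ø, ł, ß) have no Unicode decomposition and need an explicit mapping, while characters with no ASCII equivalent pass through unchanged.

```python
import unicodedata

# Illustrative subset only (an assumption for this sketch); a real folding
# filter ships a far larger mapping table.
EXTRA_FOLDS = str.maketrans({
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "ß": "ss",
    "æ": "ae", "Æ": "AE",
})


def ascii_fold(text: str) -> str:
    """Fold characters to ASCII where an equivalent is known, else keep them."""
    # Letters without a canonical decomposition are handled by the table.
    text = text.translate(EXTRA_FOLDS)
    # Everything else: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("Łódź", "Søren", "Straße", "λ"):
    print(word, "->", ascii_fold(word))
```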

Answer 3 · 2013-03-28T08:16:57.000Z

I tried that, but it didn't seem to work. It requires more exploration.

Answer 4 · 2013-05-01T18:03:03.000Z

Working on moving to solr with single core and improved schema. Will fix that after that is done. Targeting this for May.

Answer 5 · 2013-08-30T15:08:18.000Z

I would recommend something more sophisticated like the NFKC_Casefold option of:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUNormalizer2FilterFactory

so that we handle Unicode normalization as well. I know I've seen both composed and decomposed forms in OpenLibrary.
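The composed versus decomposed problem can be shown in a few lines of Python. This is only an approximation of what NFKC_Casefold does, built from the standard library (`unicodedata.normalize` plus `str.casefold`); the function name `nfkc_casefold` is made up for the sketch.

```python
import unicodedata


def nfkc_casefold(text: str) -> str:
    """Approximate NFKC_Casefold: compatibility-normalize, then case-fold."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can leave text denormalized, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


composed = "Dvo\u0159\u00e1k"      # ř and á as single code points
decomposed = "Dvor\u030ca\u0301k"  # r + combining caron, a + combining acute

# The two strings render identically but compare unequal as raw text.
print(composed == decomposed)
# After normalization they compare equal.
print(nfkc_casefold(composed) == nfkc_casefold(decomposed))
```

Note that this step only unifies encoding and case. The diacritics themselves are still present afterwards.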

This tokenizer probably deserves investigation as well: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

Here are some very basic names which aren't found: Antonin Dvořak, Antonin Dvořák, Antonín Dvořák, Antonín Dvorák, Antonín Dvořak

Amongst other problems, not having them show up in search makes them very difficult to merge.

Answer 6 · 2013-09-01T18:31:11.000Z

This is a duplicate of issue #11.

Answer 7 · 2013-10-10T23:34:04.000Z

OpenLibrary is currently stuck on Solr v1.4.0 which is 4+ years old. Many of the useful diacritic folding capabilities were introduced with Solr 3.1 in 2011. Is there a reason not to move to a more modern version?

Answer 8 · 2013-10-10T23:42:12.000Z

Time, I bet ;)

Answer 9 · 2013-10-11T01:22:09.000Z

In progress. I've already set up a node with solr 3.1 and improved setup to handle searching for editions, authors and works. It will go live in a month or so.

Answer 10 · 2016-01-30T16:26:02.000Z

@Gio I know you made some Solr changes recently. Was diacritic folding and/or unicode normalization part of that work or is this still open?

Answer 11 · 2016-07-11T19:14:58.000Z

@anandology there still seem to be two very similar records at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A and at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A
Neither of them is found yet by an author search for "Ghatti Anjaneyasarma"

Answer 12 · 2016-07-11T21:41:07.000Z

@LeadSongDog Anand (anandology) isn't involved any more. As I understand it, Gio (@gdamdam) is the current dev. Unfortunately when I attempted to ping him for status back in January, I inadvertently used the wrong username.

@gdamdam Any update on Solr diacritic folding?

Answer 13 · 2016-09-22T17:54:42.000Z

We've moved the full-text search engine to an Internet-Archive-based Elastic Search cluster. A decision needs to be made about the OL metadata search engine. Keep SOLR? Also move to Elastic Search?

Answer 14 · 2016-10-05T06:28:28.000Z

I don't think it makes sense to have two different search technologies, but then it didn't make sense to move to ES just because that's what IA wanted. We know the last transition broke things which depended on the search query language, so a little more due diligence, public notice, and discussion should be done this time to at least notify users that their apps are about to break, well in advance of any migration.

Answer 15 · 2016-10-05T06:53:26.000Z

A little context on the move to Elastic Search: The SOLR used for searching inside books was found to be continuously corrupting. Repaired data re-corrupted after a few weeks for no ascertainable reason. We weighed between upgrading SOLR and moving to ES, which has much more support within the Archive. We chose the latter.

Answer 16 · 2016-10-06T05:05:59.000Z

All true, but the most relevant things for me were the lack of advance notice, public discussion, or any input from the community. It could have been entirely the correct decision, but arrived at in completely the wrong way. I'm suggesting not repeating the mistake.

Answer 17 · 2017-04-05T01:41:47.000Z

@bfalling If a switch to Elasticsearch is a blocker for this task, has any progress been made on advertising the potential change (e.g. to ol-tech or ol-discuss), soliciting feedback, preparing downstream consumers for the change?

This bug represents a significant usability issue and was first reported in 2010. It'd be nice to make some progress on it.

Answer 18 · 2017-10-18T02:24:41.000Z

Regarding operating an ES instance and a solr instance, I agree that it is somewhat indefensible to have OL and IA on completely different search indices and databases. @tfmorris one thing we've started to do is write back openlibrary_work and openlibrary_edition IDs into their corresponding archive.org items. This allows us to do more querying against Internet Archive Elastic Search. OL still needs solr (or its own ES) in the interim because there are many works and editions for which there are no corresponding archive.org items and IA is reluctant to store metadata in ES for works/editions which are not digitized.

One of the current challenges is that solr takes a while to update and it's becoming increasingly difficult to keep our tiny solr instance sync'd with IA's borrow availability data. We've been switching Open Library to use a special Archive.org availability API to get this info (instead of trying to write back to solr). One downside is we can't easily query Open Library for available works.

In the next year or so I'd like to see tighter integration between IA and OL in terms of moving metadata away from OL's postgres and solr instance into some official shared infrastructure which both services can agree upon. This direction is a very early stage idea, but it's worth bringing up in case there are strong opinions which may help us avoid "gotchas".

Answer 19 · 2017-11-08T23:19:57.000Z

As pointed out in #599, the ICU Normalizer, mentioned in my Aug 2013 note, isn't powerful enough and we actually want ICU Folding.
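The difference between normalizing and folding can be sketched in Python. Both functions below are stand-ins written for this example (they are not the ICU filters): `normalize_only` unifies encoding and case but keeps diacritics, while `fold` also removes combining marks.

```python
import unicodedata


def normalize_only(text: str) -> str:
    """Stand-in for normalization: unify encoding and case, keep diacritics."""
    return unicodedata.normalize("NFKC", text).casefold()


def fold(text: str) -> str:
    """Stand-in for folding: decompose, case-fold, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text).casefold()
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


typed = "Antonin Dvorak"    # what a user is likely to type
stored = "Antonín Dvořák"   # what the record contains

# Normalization alone does not make the two match.
print(normalize_only(typed) == normalize_only(stored))
# Folding does.
print(fold(typed) == fold(stored))
```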

Answer 20 · 2018-03-13T07:27:09.000Z

@cdrini is this related to #599?

Answer 21 · 2019-08-04T21:10:19.000Z

I've confirmed that the ICUFoldingFilter correctly handles this case, and the query q=type:author AND name:Ghatti Anjaneyasarma returns both records:

      {
        "name":["Ghaṭṭi Añjanēyaśarma"],
        "key":"/authors/OL6A"},
      {
        "name":["Ghaṭṭi Āñjanēyaśarma"],
        "key":"/authors/OL179948A"}]

One nuance is that the giant concatenated text field doesn't use ICUFoldingFilter because it's a mishmash of different types of text, but I think we probably want to move away from it anyway because it doubles the amount of data that we need to index.
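The behaviour of the author query above (one unaccented query returning both records) can be mimicked with a toy in-memory index in Python. This is a sketch under simplifying assumptions: it matches on the whole folded name rather than on tokens as Solr would, and `fold`, `index`, and `search` are names invented for the example.

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Rough folding: decompose, case-fold, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text).casefold()
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The two author records from the query result above.
authors = [
    {"key": "/authors/OL6A", "name": "Ghaṭṭi Añjanēyaśarma"},
    {"key": "/authors/OL179948A", "name": "Ghaṭṭi Āñjanēyaśarma"},
]

# Index every author under the folded form of the name.
index = defaultdict(list)
for author in authors:
    index[fold(author["name"])].append(author["key"])


def search(query: str) -> list:
    """Fold the query the same way the names were folded at index time."""
    return index.get(fold(query), [])


print(search("Ghatti Anjaneyasarma"))
```

The key design point is symmetry: the same folding has to run at index time and at query time, otherwise the folded keys never line up.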

Answer 22 · 2019-10-25T17:52:30.000Z

Searching finds each of those two keys individually, with no indication of the other. Surely we would want to display some indication of the nearly-identical spelling's existence to facilitate their merger.

Answer 23 · 2019-10-28T01:33:45.000Z

> Searching finds each of those two keys individually, with no indication of the other. Surely we would want to display some indication of the nearly-identical spelling's existence to facilitate their merger.

You've just restated the original request.

Answer 24 · 2019-11-12T20:58:06.000Z

Making this a sub-task of #789

Answer 25 · 2019-11-12T22:18:02.000Z

> Making this a sub-task of #789

Why? They have nothing to do with each other. This requires a change to the indexing schema while #789 is about update frequency and accuracy.

Answer 26 · 2019-11-12T22:23:32.000Z

Mostly because #789 refers to

> ...missing expected data fields

and I would expect a normalized author name to be in Solr. But I can remove the sub-task if you feel it's inappropriate.

Answer 27 · 2019-11-14T02:07:32.000Z

No longer a sub-task of #789.

Answer 28 · 2019-12-09T16:03:21.000Z

#11 was closed in favour of this. Consider that
https://openlibrary.org/search?q=%22tiananmen%22&mode=everything finds 447 hits, but
https://openlibrary.org/search?q=%22tian%27anmen%22&mode=everything finds only 45 and
https://openlibrary.org/search?q=%22tiananmin%22&mode=everything finds zero.
Some sort of soundex or metaphone normalization has to be indexed

[update: spun off as #2752]

Answer 29 · 2019-12-17T21:10:12.000Z

> Some sort of soundex or metaphone normalization has to be indexed

@LeadSongDog That's a different feature request. Let's keep the discussion here focused on the original request, i.e. diacritic folding and Unicode normalization.

Answer 30 · 2020-04-30T21:34:20.000Z

The fix for this is in tfmorris@c7026ff and is straightforward, but it requires a Solr config change by OL staff, which is unlikely to ever happen, so I'm unassigning myself.