Full re-index of solr data on prod

Question

Full re-index of solr data on prod

cdrini opened this issue 6 years ago · 14 comments

Answer 1 · 2019-02-16T23:06:20.000Z

These are the other types on production solr. Need to investigate why they're there/if they should be there. Also need to investigate why there wasn't any stats data there (as I previously thought).

type: subject (1510685)
Sample:

<doc>
  <str name="key">/subjects/org:conseil_national_économique_(france)</str>
  <str name="name">Conseil national économique (France)</str>
  <str name="subject_type">org</str>
  <arr name="text">
    <str>Conseil national économique (France)</str>
    <str>/subjects/org:conseil_national_économique_(france)</str>
  </arr>
  <str name="type">subject</str>
  <int name="work_count">1</int>
</doc>

type: edition (3419)
Sample:

<doc>
  <arr name="author_key">
    <str>OL6941607A</str>
  </arr>
  <arr name="author_name">
    <str>Carlos Arturo Jiménez</str>
  </arr>
  <bool name="has_fulltext">false</bool>
  <str name="key">/books/OL25648663M</str>
  <int name="last_modified_i">1419832732</int>
  <arr name="seed">
    <str>/books/OL25648663M</str>
    <str>/works/OL15935579W</str>
    <str>/subjects/politics_and_government</str>
    <str>/subjects/presidents</str>
    <str>/subjects/frente_sandinista_de_liberación_nacional</str>
    <str>/subjects/assassination_attempts</str>
    <str>/subjects/person:daniel_ortega</str>
    <str>/subjects/person:carlos_arturo_jiménez</str>
    <str>/subjects/place:nicaragua</str>
    <str>/subjects/time:1979-1990</str>
    <str>/authors/OL6941607A</str>
  </arr>
  <arr name="text">
    <str>Nosotros no le decíamos presidente</str>
    <str>Carlos Arturo Jiménez</str>
    <str>/books/OL25648663M</str>
    <str>OL6941607A</str>
  </arr>
  <str name="title">Nosotros no le decíamos presidente</str>
  <str name="title_suggest">Nosotros no le decíamos presidente</str>
  <str name="type">edition</str>
</doc>
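
Per-type counts like the ones above can be obtained from Solr with a facet on the `type` field. The following is a minimal sketch, not the query that was actually run: the base URL is a placeholder, the helper names are hypothetical, and the canned response only reuses the two counts quoted in this comment so that the example runs without a live Solr.

```python
import json
from urllib.parse import urlencode


def type_facet_url(solr_base):
    """Build a select URL that returns no documents, only per-type counts."""
    params = {
        "q": "*:*",             # match every document
        "rows": 0,              # we only want the facet counts
        "facet": "true",
        "facet.field": "type",  # the field shown in the samples above
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"


def parse_type_counts(response_body):
    """Turn Solr's flat [value, count, value, count, ...] facet list into a dict."""
    flat = json.loads(response_body)["facet_counts"]["facet_fields"]["type"]
    return dict(zip(flat[0::2], flat[1::2]))


# Canned response (assumption: only the two types quoted above), so this runs offline.
sample = json.dumps(
    {"facet_counts": {"facet_fields": {"type": ["subject", 1510685, "edition", 3419]}}}
)

# "http://localhost:8983/solr" is a placeholder, not the production host.
print(type_facet_url("http://localhost:8983/solr"))
print(parse_type_counts(sample))
```

Setting `rows` to 0 keeps the response small, since only the facet block is needed to see which document types exist and how many of each there are.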

Answer 2 · 2019-07-21T20:17:04.000Z

There is a set of official Docker images for Solr that we may want to consider using:
https://hub.docker.com/_/solr

Answer 3 · 2019-07-22T15:00:28.000Z

Unfortunately none of them support our current version of solr :/

Answer 4 · 2019-07-22T15:34:17.000Z

That's because Solr 3.6 is so ancient that it hasn't been supported for years. Given that Solr can only read indexes written by the immediately preceding major release before a complete reindex is required, and that we're planning a reindex anyway, this seems like the perfect opportunity to upgrade to a more modern (and supported) version.

As far as I know we have a pretty vanilla installation and schema and don't make use of any exotic features which are likely to be version-dependent. The currently supported Solr releases are 7.7 and 8.1.
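
The compatibility rule mentioned above (an index is readable only by the same or the next major release) can be written as a one-line check. This is a sketch of that rule as stated in this comment, with a hypothetical function name; it ignores every other upgrade consideration such as schema or config changes.

```python
def needs_full_reindex(index_major, target_major):
    """True when the target release cannot read the index under the one-major-release rule."""
    if target_major < index_major:
        raise ValueError("downgrades are not covered by this sketch")
    return target_major - index_major > 1


# Our index was written by Solr 3.x; check a few candidate targets.
for target in (4, 7, 8):
    print(f"3.x -> {target}.x needs full reindex: {needs_full_reindex(3, target)}")
```

Under this rule, moving from 3.x to either 7.x or 8.x requires a full reindex regardless, which is why the upgrade and the planned reindex line up.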

Answer 5 · 2019-07-22T15:42:09.000Z

> As far as I know we have a pretty vanilla installation...

Are you willing to bet on that assumption, though? :P

Doing them together increases the risk that the reindex will have a bug and be unusable. I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having. The next step is updating the schema to better support diacritics, etc. After that comes updating the Solr version (which would require an audit of everywhere the Solr API is used in our code to make sure the APIs in the latest version are still the same).

The full reindex is mostly automated, so takes an ~fixed amount of time. Adding new features will take developer time (which is more valuable) and has more uncertainty about how long it will take to add/guarantee those features.

Answer 6 · 2019-07-22T16:34:35.000Z

> > As far as I know we have a pretty vanilla installation...

> Are you willing to bet on that assumption, though? :P

I'm certainly willing to test the hypothesis. Based on my review of the 5 (!) major version upgrade notes and spot-checking the upgrade notes for dozens of point releases in between, I judge the risk to be small. Facets are probably the most volatile API-visible feature, but even there I didn't see anything that should impact us. Many of the changes affect clusters, replication, and other features that we don't use.

Another advantage of using a more modern version is that we get to take advantage of 7 years of performance improvements.

> I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having.

Fixing the search infrastructure is a high priority, but it's valuable to keep the historical perspective in mind. Many of these problems have existed for 5+ years. Another few weeks isn't going to make or break users' perceptions of search quality on OpenLibrary.

> The full reindex is mostly automated, so takes an ~fixed amount of time.

It needs to be fully automated and as lightweight as possible (preferably network independent) with no private side channel information required so that we can iterate on search improvements.

> Adding new features will take developer time

True, but we've already invested the time for the main features that we want. Testing time is also significant and the more iterations we break this into, the greater the testing time required.

BTW, I'm not trying to talk anyone else into testing this. I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

Answer 7 · 2019-07-22T19:45:17.000Z

I still think lumping everything together is risky. Right now, we have 2 big changes: a full reindex (with lots of new code), and switching our production env to use docker (lots of room for strange errors). Hooking this up to production is crucial to fully testing this. This is essentially a refactor: we want to maintain the ~same functionality, but with changes to how the code/env works. The more changes we pile on, the harder it will be to know what is causing a bug if a bug appears.

To ~quote Martin Fowler:

How to refactor without doing more harm than good:

Don't add functionality at the same time.

Make sure your code has tests before refactoring. Run the tests frequently so you know quickly if your changes have broken something.

Take short, deliberate steps. Refactoring often involves making many localized changes that result in a larger-scale change. If you keep your steps small, and test after each step, you will avoid prolonged debugging.

So doing this in 3 stages has the benefits of:

Lower risk of bugs since fewer changes
Easier to debug since fewer changes
Gets improvements out faster
User experience improvements increase developer morale
Less likely to get blocked since there is less uncertainty in a smaller set of requirements

Doing this in 1 stage:

Have to perform only 1 instead of 3 full reindices
Larger but later site improvement
Less possible overlap (making changes which are no longer relevant with a different version of solr)

So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

Answer 8 · 2019-12-09T16:35:14.000Z

Gall's Law:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

In other words, baby steps, please

Answer 9 · 2020-03-04T21:28:44.000Z

I reported on the results of my Solr 8.1 experiments many months ago but didn't update this issue, so to close the loop, re:

> I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

#2246 includes all the necessary (very minimal) schema updates to support a modern Solr as well as the multicore changes required since there's no such thing as single core Solr any more. The commits should be easily identifiable from the commit messages, but I'm happy to break them out into a separate branch if that makes things easier.

Answer 10 · 2020-03-04T21:34:06.000Z

> So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

This opinion is 9 months old, so hopefully it has changed, but I think a key factor which might be being overlooked is the testing cycle. Even the "minimal" reindex is a complete reboot which will require extensive human testing to confirm that things are working as expected. It's very likely that bug fixes will, themselves, require additional complete rebuilds. Given this, I think it makes sense to bundle a reasonable amount of functionality into these heavyweight rebuilds.

Answer 11 · 2020-03-11T01:38:28.000Z

This is deployed to prod ol-web3; monitoring for issues.

Answer 12 · 2020-03-30T20:14:46.000Z

Monitoring is going well; next month we will do another re-index + deploy. There are hints that there might be some perf issues; we need to add more graphite logging to check. This issue is done though. More issues need to be created for those other things.

Answer 13 · 2020-03-30T21:05:44.000Z

@cdrini Could you describe what "monitoring" means in this context and how the new index was validated to be correct and complete?

I've got to say that I'm finding this whole process quite opaque.

Answer 14 · 2020-04-02T23:41:53.000Z

The correctness of the new index was tested mostly here: #2222; and it was connected to 1 of our web nodes for ~3 weeks. The biggest risk of error at this point is mostly performance (which is what led to c702875, and I did notice some more peculiarities in performance even after this, but we'll get more information as it goes).

I closed this issue because a full re-index is running on production; the initial checklist on the issue had a number of issues, but I consider it done once it went to production and ran hooked to prod successfully for weeks. I need to create an issue for the next small steps (which involve removing the "old" solr entirely).
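
The thread does not spell out which checks #2222 covered. Purely as an illustration of one coarse completeness check, the sketch below compares per-type document counts between an old and a new index and flags large differences. The function name, the 1% tolerance, and the "new" counts are all hypothetical; only the "old" counts come from the figures quoted earlier in this thread.

```python
def count_mismatches(old_counts, new_counts, tolerance=0.01):
    """Return {type: (old, new)} for types whose counts differ by more than `tolerance` (relative)."""
    mismatches = {}
    for doc_type in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(doc_type, 0)
        new = new_counts.get(doc_type, 0)
        baseline = max(old, new)
        # Skip types that are empty on both sides; otherwise compare relative difference.
        if baseline and abs(old - new) / baseline > tolerance:
            mismatches[doc_type] = (old, new)
    return mismatches


old = {"subject": 1510685, "edition": 3419}  # counts quoted earlier in the thread
new = {"subject": 1510000}                   # hypothetical values for illustration only

print(count_mismatches(old, new))
```

A check like this only catches missing or unexpected document types and gross count drift; it says nothing about whether individual documents were indexed correctly.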
