internetarchive/openlibrary

Full re-index of solr data on prod

cdrini opened this issue · 14 comments

This will be an important step toward having a more reliable solr environment. Being able to create an identical solr environment locally will get rid of a lot of confusion. It would also give us a path forward on #178 and #599, since we can spin up a new solr, re-index it with the new settings, and then swap it with the old solr without any downtime.

Subtasks

  • #1055 Create docker image for solr
  • Determine data on production solr
    • Why are there type: subject? This looks like it's used for /search/subjects, so these need to be included.
    • Why are there type: edition? This looks like residuals of dead code for /search/editions (which does appear to work for the measly ~3.5K editions stored in solr)
    • Why isn't there any stats related data? /solr/process_stats.py looks like dead code.
    • Ensure dev's config file is the same as prod's. -> Copied from prod into solrbuilder, so they will be identical.
  • Create test solr on server.openjournal.foundation #2222
  • Create Docker-based solr for production use
    • Create solr environment on prod somewhere
    • Pause both solrupdaters
    • Copy OJF solr data to new prod environment
    • Link production to new solr endpoint
    • Destroy old solr endpoint

Notes/Comments

  • I believe solr is storing view statistics in addition to the works/authors themselves
    • @mekarpeles Can you run this query on production solr: NOT(type:work) AND NOT(type:author)?
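
A minimal sketch of how the query above could be issued so that it returns only per-type counts rather than documents. The base URL and the function name are assumptions for illustration, not the production endpoint; the parameters (`q`, `rows`, `facet`, `facet.field`, `wt`) are standard Solr select parameters.

```python
from urllib.parse import urlencode

# Assumption: a placeholder base URL, not the real production solr.
SOLR_BASE_URL = "http://localhost:8983/solr"


def build_other_types_url(base_url=SOLR_BASE_URL):
    """Build a select URL counting docs that are neither works nor authors."""
    params = [
        ("q", "NOT(type:work) AND NOT(type:author)"),  # query from the comment above
        ("rows", "0"),            # we only want counts, not documents
        ("facet", "true"),        # turn on faceting
        ("facet.field", "type"),  # one count per value of the type field
        ("wt", "json"),           # JSON response format
    ]
    return f"{base_url}/select?{urlencode(params)}"


url = build_other_types_url()
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should give the counts for each remaining `type` value in the facet section of the response.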

These are the other types on production solr. Need to investigate why they're there and whether they should be there. Also need to investigate why there isn't any stats data there (I previously thought there was).

type: subject (1510685)
Sample:

<doc>
  <str name="key">/subjects/org:conseil_national_économique_(france)</str>
  <str name="name">Conseil national économique (France)</str>
  <str name="subject_type">org</str>
  <arr name="text">
    <str>Conseil national économique (France)</str>
    <str>/subjects/org:conseil_national_économique_(france)</str>
  </arr>
  <str name="type">subject</str>
  <int name="work_count">1</int>
</doc>

type: edition (3419)
Sample:

<doc>
  <arr name="author_key">
    <str>OL6941607A</str>
  </arr>
  <arr name="author_name">
    <str>Carlos Arturo Jiménez</str>
  </arr>
  <bool name="has_fulltext">false</bool>
  <str name="key">/books/OL25648663M</str>
  <int name="last_modified_i">1419832732</int>
  <arr name="seed">
    <str>/books/OL25648663M</str>
    <str>/works/OL15935579W</str>
    <str>/subjects/politics_and_government</str>
    <str>/subjects/presidents</str>
    <str>/subjects/frente_sandinista_de_liberación_nacional</str>
    <str>/subjects/assassination_attempts</str>
    <str>/subjects/person:daniel_ortega</str>
    <str>/subjects/person:carlos_arturo_jiménez</str>
    <str>/subjects/place:nicaragua</str>
    <str>/subjects/time:1979-1990</str>
    <str>/authors/OL6941607A</str>
  </arr>
  <arr name="text">
    <str>Nosotros no le decíamos presidente</str>
    <str>Carlos Arturo Jiménez</str>
    <str>/books/OL25648663M</str>
    <str>OL6941607A</str>
  </arr>
  <str name="title">Nosotros no le decíamos presidente</str>
  <str name="title_suggest">Nosotros no le decíamos presidente</str>
  <str name="type">edition</str>
</doc>
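
For anyone poking at these samples locally, here is a rough sketch of turning one `<doc>` element from a Solr XML response into a plain Python dict. The function names are hypothetical, and only the element types that appear in the samples above (`str`, `int`, `bool`, `arr`) are handled.

```python
import xml.etree.ElementTree as ET

# The type: subject sample from above, verbatim.
SAMPLE = """<doc>
  <str name="key">/subjects/org:conseil_national_économique_(france)</str>
  <str name="name">Conseil national économique (France)</str>
  <str name="subject_type">org</str>
  <arr name="text">
    <str>Conseil national économique (France)</str>
    <str>/subjects/org:conseil_national_économique_(france)</str>
  </arr>
  <str name="type">subject</str>
  <int name="work_count">1</int>
</doc>"""


def _convert(el):
    """Convert a scalar Solr XML element to a Python value."""
    text = el.text or ""
    if el.tag == "int":
        return int(text)
    if el.tag == "bool":
        return text == "true"
    return text  # str and anything unrecognised stay as text


def parse_solr_doc(xml_text):
    """Parse a single <doc> element into a dict keyed by field name."""
    root = ET.fromstring(xml_text)
    doc = {}
    for field in root:
        if field.tag == "arr":
            # Multi-valued field: convert each child element.
            doc[field.get("name")] = [_convert(child) for child in field]
        else:
            doc[field.get("name")] = _convert(field)
    return doc


doc = parse_solr_doc(SAMPLE)
print(doc["type"], doc["work_count"], doc["text"])
```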

There are a set of official Docker images for Solr that we may want to consider using:
https://hub.docker.com/_/solr

Unfortunately none of them support our current version of solr :/

That's because Solr 3.6 is so ancient it hasn't been supported for years. Given that Solr can only read indexes written by the previous major release before a complete reindex is required, and we're planning a reindex anyway, it seems like the perfect opportunity to upgrade to a more modern (and supported) version.

As far as I know we have a pretty vanilla installation and schema and don't make use of any exotic features which are likely to be version dependent. The current supported Solr releases are 7.7 and 8.1.
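
To make the compatibility rule above concrete, here is a tiny sketch that encodes it as stated in this comment (an index is readable only by the same or the next major version). The function name is made up for illustration, and this is a simplification rather than Solr's actual upgrade logic.

```python
def needs_full_reindex(index_major: int, target_major: int) -> bool:
    """Return True if moving from index_major to target_major requires a reindex.

    Encodes the "one major release back" rule described above; downgrades
    are rejected because the rule says nothing about them.
    """
    if target_major < index_major:
        raise ValueError("downgrades are not covered by this sketch")
    return target_major - index_major > 1


# Our current 3.x index against the two supported release lines.
print(needs_full_reindex(3, 7))
print(needs_full_reindex(3, 8))
```

Under that rule, going from 3.6 to either 7.7 or 8.1 requires a full reindex, which is why bundling the upgrade with the planned reindex is attractive.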

As far as I know we have a pretty vanilla installation...

Are you willing to bet on that assumption, though? :P

Doing them together increases the risk that the reindex will have a bug and be unusable. I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having. The next step is updating the schema to better support diacritics/etc. After that comes updating the solr version (which would require an audit of everywhere the solr API is used in our code to make sure the APIs in the latest version are still the same).

The full reindex is mostly automated, so takes an ~fixed amount of time. Adding new features will take developer time (which is more valuable) and has more uncertainty about how long it will take to add/guarantee those features.

As far as I know we have a pretty vanilla installation...

Are you willing to bet on that assumption, though? :P

I'm certainly willing to test the hypothesis. Based on my review of the 5 (!) major version upgrade notes and spot checking the upgrade notes for dozens of point releases in between, I judge the risk to be small. Facets are probably the most volatile API visible feature, but even there I didn't see anything that should impact us. A lot of the things affect clusters, replication, and other features that we don't use.

Another advantage of using a more modern version is that we get to take advantage of 7 years of performance improvements.

I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having.

Fixing the search infrastructure is a high priority, but it's valuable to keep the historical perspective in mind. Many of these problems have existed for 5+ years. Another few weeks isn't going to make or break users' perceptions of search quality on OpenLibrary.

The full reindex is mostly automated, so takes an ~fixed amount of time.

It needs to be fully automated and as lightweight as possible (preferably network independent) with no private side channel information required so that we can iterate on search improvements.

Adding new features will take developer time

True, but we've already invested the time for the main features that we want. Testing time is also significant and the more iterations we break this into, the greater the testing time required.

BTW, I'm not trying to talk anyone else into testing this. I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

I still think lumping everything together is risky. Right now, we have 2 big changes: a full reindex (with lots of new code), and switching our production env to use docker (lots of room for strange errors). Hooking this up to production is crucial to fully testing this. This is essentially a refactor: we want to maintain the ~same functionality, but with changes to how the code/env works. The more changes we pile on, the harder it will be to know what is causing a bug if one appears.

To ~quote Martin Fowler:

How to refactor without doing more harm than good:

  • Don't add functionality at the same time.
  • Make sure your code has tests before refactoring. Run the tests frequently so you know quickly if your changes have broken something.
  • Take short, deliberate steps. Refactoring often involves making many localized changes that result in a larger-scale change. If you keep your steps small, and test after each step, you will avoid prolonged debugging.

So doing this in 3 stages has the benefits of:

  • Lower risk of bugs since fewer changes
  • Easier to debug since fewer changes
  • Gets improvements out faster
  • User experience improvements increase developer morale
  • Less likely to get blocked since there is less uncertainty in a smaller set of requirements

Doing this in 1 stage:

  • Have to perform only 1 instead of 3 full reindices
  • Larger but later site improvement
  • Less possible overlap (making changes which are no longer relevant with a different version of solr)

So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

Gall's Law:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

In other words, baby steps, please

I reported on the results of my Solr 8.1 experiments many months ago but didn't update this issue, so to close the loop, re:

I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

#2246 includes all the necessary (very minimal) schema updates to support a modern Solr as well as the multicore changes required since there's no such thing as single core Solr any more. The commits should be easily identifiable from the commit messages, but I'm happy to break them out into a separate branch if that makes things easier.

So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

This opinion is 9 months old, so hopefully it has changed, but I think a key factor which might be being overlooked is the testing cycle. Even the "minimal" reindex is a complete reboot which will require extensive human testing to confirm that things are working as expected. It's very likely that bug fixes will, themselves, require additional complete rebuilds. Given this, I think it makes sense to bundle a reasonable amount of functionality into these heavyweight rebuilds.

This is deployed to prod ol-web3; monitoring for issues.

Monitoring is going well; next month we will do another re-index + deploy. There are hints that there might be some perf issues; we need to add more graphite logging to check. This issue is done, though. More issues need to be created for those other things.
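
As a rough illustration of what extra graphite logging could look like, here is a sketch that formats a timing in Graphite's plaintext protocol (`<path> <value> <timestamp>` terminated by a newline, sent to Carbon's plaintext port, 2003 by default). The metric name and helper functions are hypothetical and not taken from the openlibrary codebase.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(line, host, port=2003):
    """Send a preformatted line to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


# Time a (placeholder) solr request and format the result in milliseconds.
start = time.perf_counter()
# ... the solr query would run here ...
elapsed_ms = (time.perf_counter() - start) * 1000
line = format_graphite_line("ol.solr.query_ms", round(elapsed_ms, 2))  # hypothetical metric name
print(line, end="")
```

`send_to_graphite` is deliberately not called here, since the host would be deployment-specific.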

@cdrini Could you describe what "monitoring" means in this context and how the new index was validated to be correct and complete?

I've got to say that I'm finding this whole process quite opaque.

The correctness of the new index was tested mostly here: #2222; it was then connected to 1 of our web nodes for ~3 weeks. The biggest risk of error at this point is mostly performance (which is what led to c702875; I did notice some more peculiarities in performance even after this, but we'll get more information as it goes).

I closed this issue because a full re-index is running on production; the initial checklist on the issue had a number of items, but I consider it done now that it has gone to production and run hooked up to prod successfully for weeks. I need to create an issue for the next small steps (which involve removing the "old" solr entirely).