nrc/rustaceans.org

search could/should be loosened

pnkfelix opened this issue · 6 comments

Based on observations/reverse engineering, it looks like the search field on rustaceans.org attempts to find records with a (case-folded) text match for every word in the search.

It also does not seem to attempt to include the Notes record fields in its search.

For example, searching for Felix Klock currently yields zero results. However, searching for either Felix or Klock yields my record. (And if you look at my record, you can see I have put "Felix Klock" into its Notes field to try to work around this.)

While the results returned by the current search algorithm should be prioritized, it would be good to also include the results of a looser search, especially when the current search algorithm yields zero results. As an example of a looser search algorithm, we could look for records with any of the requested words, and take the union of the resulting sets of records.
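To make the suggestion concrete, here is a minimal sketch of the two-tier idea in Rust. It is not the site's actual implementation: the `Record` struct, its field names, and the `search` function are hypothetical, and it assumes the Notes field is searched along with the others. Records matching every word come first, followed by records matching at least one word.

```rust
/// Hypothetical record shape; the real rustaceans.org schema may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub username: String,
    pub name: String,
    pub notes: String,
}

/// Case-folded text of every field we want to search (assumed to include notes).
fn haystack(r: &Record) -> String {
    format!("{} {} {}", r.username, r.name, r.notes).to_lowercase()
}

/// Strict matches (every word found) first, then loose matches (any word found).
pub fn search<'a>(records: &'a [Record], query: &str) -> Vec<&'a Record> {
    let terms: Vec<String> = query
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    if terms.is_empty() {
        return Vec::new();
    }

    let mut strict = Vec::new();
    let mut loose = Vec::new();
    for r in records {
        let hay = haystack(r);
        // Count how many query words appear somewhere in the record text.
        let hits = terms.iter().filter(|t| hay.contains(t.as_str())).count();
        if hits == terms.len() {
            strict.push(r);
        } else if hits > 0 {
            loose.push(r);
        }
    }

    // The union of the per-word result sets, with full matches ranked first.
    strict.extend(loose);
    strict
}
```

With this shape, a two-word query such as a first name plus a surname would still return a record that contains only one of the two words, just ranked below any record that contains both.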

nrc commented

One thing is that the notes field is excluded from search. That can easily be fixed (I excluded it because I thought it would bring up a whole bunch of bad results, but people don't seem to be using the notes section too heavily, so it seems that won't be an issue) and I should do that.

The other thing here is having a smarter search algorithm - the current tactic is really dumb. Obviously fixing this will take some effort (the right way to do this is to use a more search-oriented backend, e.g., ElasticSearch rather than SQLite, but that is more effort than I want to get into). I wonder if we could tweak search without too much effort, e.g., by just splitting search strings on spaces.

@nick29581 I don't think something like ElasticSearch is necessary. Postgres has some pretty decent string searching capabilities and will likely be much easier to set up. Also, much easier to move from SQLite to Postgres than to ElasticSearch.

Somehow I think that using the same engine that Wikipedia does is overkill for rustaceans.org 😁

Can't we just load all the JSON files into a HashMap and skip the database altogether? Even SQLite seems overkill for a read-only site with < 100 users.

Hash indexes and prefix matching aren't hard to implement manually. Even a dumb linear search would work fine for a few orders of magnitude beyond what we have now.
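As a rough illustration of how little code that would take, here is a sketch of an in-memory directory with a hash index on username and a linear prefix scan. The `Entry` and `Directory` types are hypothetical, and loading the JSON files from disk is deliberately left out; the constructor just takes entries that have already been parsed.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down entry; the real JSON files carry more fields.
#[derive(Debug, Clone)]
pub struct Entry {
    pub username: String,
    pub name: String,
}

/// Everything lives in one HashMap keyed by lowercased username.
pub struct Directory {
    by_username: HashMap<String, Entry>,
}

impl Directory {
    pub fn new(entries: Vec<Entry>) -> Self {
        let by_username = entries
            .into_iter()
            .map(|e| (e.username.to_lowercase(), e))
            .collect();
        Directory { by_username }
    }

    /// Exact, case-insensitive lookup through the hash index.
    pub fn get(&self, username: &str) -> Option<&Entry> {
        self.by_username.get(&username.to_lowercase())
    }

    /// Dumb linear scan: keep entries whose username, or any word of
    /// their name, starts with the given prefix.
    pub fn prefix_search(&self, prefix: &str) -> Vec<&Entry> {
        let prefix = prefix.to_lowercase();
        let mut found: Vec<&Entry> = self
            .by_username
            .values()
            .filter(|e| {
                e.username.to_lowercase().starts_with(prefix.as_str())
                    || e.name
                        .to_lowercase()
                        .split_whitespace()
                        .any(|w| w.starts_with(prefix.as_str()))
            })
            .collect();
        // HashMap iteration order is unspecified, so sort for stable output.
        found.sort_by(|a, b| a.username.cmp(&b.username));
        found
    }
}
```

The scan is O(n) per query, which should be fine at the scale discussed here; a sorted index would only start to matter at much larger sizes.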

nrc commented

We could, but that would make the backend a lot more stateful than it is at the moment. Currently, there is no in-memory state, which is nice, but not essential.

What kind of statefulness are you thinking of?

I see that the GitHub daemon, after merging an entry, inserts the new data into the database itself. With a database-less system, we can instead have the daemon update the data on disk (git pull, maybe), then ask the HTTP server to reload everything from there (POST /reload).
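For what the reload step could look like on the server side, here is a sketch using only the standard library. The `Store` type, the `Snapshot` alias, and the `load` callback are all hypothetical; the HTTP wiring for `POST /reload` and the actual reading of files after `git pull` are not shown. The idea is to build the new map completely, then swap it in, so a failed load leaves the old data in place.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical in-memory data set: username -> raw JSON text.
pub type Snapshot = HashMap<String, String>;

/// Shared state held by the HTTP server between requests.
pub struct Store {
    current: RwLock<Arc<Snapshot>>,
}

impl Store {
    pub fn new(initial: Snapshot) -> Self {
        Store {
            current: RwLock::new(Arc::new(initial)),
        }
    }

    /// Cheap handle to the data as of now; unaffected by later reloads.
    pub fn snapshot(&self) -> Arc<Snapshot> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// What a `POST /reload` handler might call. `load` stands in for
    /// "read every JSON file from disk"; if it fails, nothing is replaced.
    pub fn reload<F, E>(&self, load: F) -> Result<usize, E>
    where
        F: FnOnce() -> Result<Snapshot, E>,
    {
        // Build the new data set fully before touching the lock.
        let fresh = Arc::new(load()?);
        let count = fresh.len();
        *self.current.write().unwrap() = fresh;
        Ok(count)
    }
}
```

Requests that grabbed a snapshot before the swap keep reading the old map until they finish, so readers never see a half-loaded state.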

At this point the changes would probably amount to a rewrite though, so I'm not so sure.

nrc commented

I guess I was imagining that on startup the daemon would read everything from disk into memory and would keep running forever. Then the in-memory hashmap is preserved between accesses of the backend. I don't suppose that would be too bad, since it is unlikely the program/hashtable would get corrupted, and it would mean we never need to worry about reconstructing the DB. But yeah, it would be pretty much a rewrite.