KSP-SpaceDock/SpaceDock

[Feature] Show similar mods on mod page

Motivation

Currently users can only find mods through the featured list, creation/update time, overall popularity, and a text search that is currently rather poor. These features are only available on the mod listing pages made specifically for them. If a user opens a mod page from off-site, there is no easy path to finding more mods they might like.

image

Suggestion

We could add a Similar Mods list at the bottom of the mod page that would show a few (6? 12? unlimited paginated?) mods ranked by how similar they are to the main mod. Visually, it should be a pretty simple matter to re-use the existing mod box styling and functionality, kind of like:

image

Data model

I imagine implementing this with a new ModsSimilarity table to store similarities:

| Column | Purpose |
| --- | --- |
| main_mod_id | Stores Mod.id of one of the mods being compared |
| other_mod_id | Stores Mod.id of the other mod being compared |
| similarity | A number that is larger for more similar mods and smaller for less similar mods |

An index on (main_mod_id, similarity DESC) would let us quickly fetch the mods most similar to a given mod by reading other_mod_id from the rows returned. We would have to create two rows per pair of mods under this model, with the id values swapped in the two *_mod_id columns, but I think that may be the least bad approach anyway.
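As a rough sketch of that table and index, here is a version using Python's built-in sqlite3 module as a stand-in for the real database. The table name `mods_similarity`, the index name, and the helpers `store_pair` and `most_similar` are made up for illustration; the real thing would presumably be a model alongside the existing ones.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mods_similarity (
        main_mod_id  INTEGER NOT NULL,
        other_mod_id INTEGER NOT NULL,
        similarity   REAL    NOT NULL,
        PRIMARY KEY (main_mod_id, other_mod_id)
    );
    CREATE INDEX ix_main_similarity
        ON mods_similarity (main_mod_id, similarity DESC);
""")


def store_pair(conn, mod_a, mod_b, similarity):
    """Insert both directions of a pair so either mod can be the lookup key."""
    conn.executemany(
        "INSERT OR REPLACE INTO mods_similarity VALUES (?, ?, ?)",
        [(mod_a, mod_b, similarity), (mod_b, mod_a, similarity)],
    )


def most_similar(conn, mod_id, limit=6):
    """Return the ids of the mods most similar to mod_id, best first."""
    rows = conn.execute(
        "SELECT other_mod_id FROM mods_similarity"
        " WHERE main_mod_id = ? ORDER BY similarity DESC LIMIT ?",
        (mod_id, limit),
    )
    return [row[0] for row in rows]


# Made-up mod ids and similarity values, purely for demonstration.
store_pair(conn, 1, 2, 0.8)
store_pair(conn, 1, 3, 0.3)
store_pair(conn, 2, 3, 0.5)
```

With those sample rows, `most_similar(conn, 1)` returns mod 2 ahead of mod 3, and the lookup never has to consult the swapped column.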

With 2913 mods currently in the db (counting deleted ones because I don't have an easy way to exclude them), there would be up to 2913² = 8,485,569 rows in the table (8,482,656 if we skip comparing each mod with itself).

Calculating similarity values

We would probably base the similarity on a weighted sum of comparisons of these columns:

  • Mod.game_id - 1 if same, 0 if different
  • [Mod.user_id, SharedAuthor.user_id] (the authors) - 1 if all authors are same, 0 if all authors are different, fractions for partial matches
  • Mod.name
  • Mod.short_description
  • Mod.description
  • Mod.default_version.changelog (most recent changelog, maybe)
  • Mod.background (image files, maybe)
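A minimal sketch of what the weighted sum could look like, assuming each column comparison has already been reduced to a score between 0 and 1. The weights and the `combined_similarity` name are placeholders; nothing above settles on specific values.

```python
# Hypothetical weights (assumption); columns left out contribute nothing.
WEIGHTS = {
    "game_id": 2,
    "authors": 3,
    "name": 3,
    "short_description": 1,
    "description": 1,
}


def combined_similarity(scores):
    """Weighted average of per-column scores, each expected to be in [0, 1].

    Columns missing from `scores` are treated as 0, so the result is
    also in [0, 1].
    """
    total_weight = sum(WEIGHTS.values())
    weighted = sum(w * scores.get(column, 0.0) for column, w in WEIGHTS.items())
    return weighted / total_weight
```

Dividing by the total weight keeps the result in the same 0 to 1 range as the inputs, which would make it easier to tune the weights later without changing what the stored numbers mean.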

Ideally we would delegate the comparison of the string columns to a machine learning library with an interface like:

def get_string_similarity(s1: str, s2: str) -> float:
    """ Compare the strings with AI """

There are many such open source libraries, including for Python, but so far I have not found one that would make it that easy. They generally would require us to:

  • Maintain a lexicon of known words, which would probably have to be stored in its own new table to keep it consistent between runs
  • Tokenize the input strings into words and then into numbers using the lexicon
  • Provide training data, effectively a long list of pairs of strings and our interpretation of the "correct" similarity values
  • Store the trained neural network weights somewhere
  • Load the trained data when we want to compare strings

So rather than having "an AI" do the hard work for us, we would have to tell it that "probe" and "satellite" are similar but "future" and "SPH" are not, etc., and then micromanage its memory for it and fiddle with it until its comparisons looked acceptable. At that point we might be better off writing our own simpler ad hoc heuristic logic.

It would be nice if we could detect when the user clicks a similar mod link and use that to update the comparison of the mods, since in that case a human is confirming the similarity. I'm not sure how we would do that.

Batching the calculations

To get started, we would need to compare every mod with every other mod (O(N²) in the number of mods). Then as mods were created and edited and updated, we would have to re-compare the changed mod with all the other mods (O(N)). This probably isn't something we could run in the foreground on any page. Ideally we would add mods that need re-comparison to a queue and then have a background task perform the comparisons and update the db.
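The queue idea could be sketched roughly like this. `RecomparisonQueue` and its methods are hypothetical names, and `compare` and `store` are callables the caller would supply; a real deployment would more likely hand this to whatever background task system the site already runs rather than an in-process class.

```python
from collections import deque


class RecomparisonQueue:
    """FIFO of mod ids awaiting re-comparison; duplicate requests are ignored."""

    def __init__(self):
        self._pending = deque()
        self._queued = set()

    def mark_dirty(self, mod_id):
        """Call when a mod is created, edited, or updated."""
        if mod_id not in self._queued:
            self._queued.add(mod_id)
            self._pending.append(mod_id)

    def process(self, all_mod_ids, compare, store, batch_size=10):
        """Re-compare up to batch_size queued mods against every other mod.

        Each processed mod costs O(N) comparisons. Returns the number of
        mods taken off the queue.
        """
        done = 0
        while self._pending and done < batch_size:
            mod_id = self._pending.popleft()
            self._queued.discard(mod_id)
            for other_id in all_mod_ids:
                if other_id != mod_id:
                    store(mod_id, other_id, compare(mod_id, other_id))
            done += 1
        return done
```

The initial O(N²) pass would then just be a matter of marking every mod dirty once and letting the background task drain the queue in batches.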

Specific idea for how to calculate the author comparison for author lists A and B:

similarity(A, B) = |A ∩ B| / |A ∪ B|

This would have the desired properties that author lists with no intersection would be assigned a value of 0, and author lists that match completely would be assigned 1. Two 2-author mods with 1 in common would be assigned ⅓ (one in both divided by three total). If both of those mods add the same new author, it would become 0.5 (two in both divided by four total).
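The properties described above match the Jaccard index of the two author sets (size of the intersection divided by size of the union). A minimal sketch, where `author_similarity` is a made-up name and returning 0 for two empty lists is my own assumption:

```python
def author_similarity(authors_a, authors_b) -> float:
    """Jaccard index: authors in both lists divided by authors in either list."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    if not union:
        # Neither mod has any authors; treated as "not similar" (assumption).
        return 0.0
    return len(a & b) / len(union)


# The two worked examples from the paragraph above, with made-up user names:
one_shared = author_similarity(["alice", "bob"], ["alice", "carol"])
two_shared = author_similarity(["alice", "bob", "dave"], ["alice", "carol", "dave"])
```

Here `one_shared` is one in both divided by three total, and `two_shared` is two in both divided by four total.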

This could also be adapted as a very simple algorithm to compare description strings, substituting words for authors and dropping a list of known meaning-free words like "the" and "a". This would not handle synonyms, but maybe mod authors use identical words often enough in practice for that to not matter. Might have to try it to find out.
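Adapting the same set comparison to descriptions might look something like this. The stop word list is a tiny illustrative sample rather than a real one, and `meaningful_words` and `description_similarity` are hypothetical names.

```python
import re

# Small sample of meaning-free words (assumption); a real list would be longer.
STOP_WORDS = frozenset(
    {"a", "an", "the", "and", "or", "of", "to", "for", "in", "is", "this", "that", "with"}
)


def meaningful_words(text):
    """Lower-cased set of the words in text, minus the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def description_similarity(text_a, text_b):
    """Shared meaningful words divided by all meaningful words in either text."""
    a, b = meaningful_words(text_a), meaningful_words(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, "Adds solar panels" and "Adds nuclear reactors" share one word out of five, giving 0.2, while "solar" and "photovoltaic" would count as a complete mismatch since synonyms are not handled.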

After playing with this a bit, I found that long descriptions tend to use a lot of words that don't convey meaning, and a lot of the "meaning" isn't describing the mod (e.g., installation instructions, maintainer history, etc.). Comparing a shorter description with a longer description looks close to hopeless due to the disparate number of "extra" words used to make similar points.

A variant would be to count matches as worth the word's length instead of 1, so longer words are worth more than shorter words, on the assumption that these are likely to be more meaningful. Unfortunately this seems to make the similarities of similar mods even lower (thanks to all those long non-matching words).

| Mod1 | Mod2 | Per-word weighting | Per-letter weighting |
| --- | --- | --- | --- |
| ReStock | ReStockPlus | 0.050 | 0.035 |
| NearFutureSolar | NearFutureElectrical | 0 | 0 |
| NearFuturePropulsion | CryoEngines | 0.123 | 0.091 |
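For reference, the per-letter variant can be sketched as the same ratio with each word counted by its length instead of as 1. This is an illustration of the idea only, not the code that produced the numbers above, and `per_letter_similarity` is a made-up name.

```python
def per_letter_similarity(words_a, words_b):
    """Like the per-word ratio, but each word is worth its length in letters."""
    a, b = set(words_a), set(words_b)
    union_weight = sum(len(w) for w in a | b)
    if union_weight == 0:
        return 0.0
    return sum(len(w) for w in a & b) / union_weight


# Toy example: one short shared word, four longer non-matching words.
toy = per_letter_similarity(
    {"adds", "solar", "panels"},
    {"adds", "nuclear", "reactors"},
)
```

In the toy example the shared word is worth 4 letters out of 30, about 0.133, versus 1 word out of 5 (0.2) under per-word weighting, which shows the same downward pull from long non-matching words.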

Dialing back the ambition significantly, maybe we should settle for:

  • A "More by (Author)" section at the bottom of the mod page
  • Detecting when mods mention other mods (compare names to descriptions?)
  • Tokenizing words based on capitalization so "ModuleManager" matches "module manager"
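The last two bullets could be sketched together as follows. `split_tokens` and `mentions` are hypothetical helpers, and the regular expressions are one guess at what "based on capitalization" should mean, including a rule for acronyms followed by a capitalized word.

```python
import re


def split_tokens(text):
    """Lower-cased tokens, split on non-alphanumerics and on case boundaries."""
    # Lower-to-upper boundary: "ModuleManager" -> "Module Manager"
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Acronym followed by a capitalized word: "KSPInterstellar" -> "KSP Interstellar"
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", spaced)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def mentions(description, mod_name):
    """True if the tokens of mod_name appear consecutively in description."""
    haystack = split_tokens(description)
    needle = split_tokens(mod_name)
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))
```

With this, `mentions("Requires module manager to work.", "ModuleManager")` is true. One likely weakness is mods with short, common-word names, which would probably produce false positives.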

The descriptions for Scatterer and EVE Redux have almost nothing in common. 😭

My prototype is shaping up, and this might work. A few more notes:

  • Mods for different games should have a similarity of 0
  • Mods with similarity of 0 shouldn't be stored in the table
  • Maybe for each mod we could store only the similarities of the 18 most similar mods in the table (3 rows of 6, 1 row visible by default)? The rest are pretty much useless and unlikely to ever be needed. This would reduce the number of new rows from 8,485,569 to 52,434. It would require having all of the comparisons for one mod in memory at one time so we could sort them, but my prototype essentially does that right now via the API, and it seems OK. Though it might be challenging to maintain that data model since it would require us to delete rows selectively.
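A rough sketch of how the notes above might combine, keeping only the best 18 non-zero similarities for one mod. `top_similar` is a made-up name, and `compare` stands in for whatever comparison function the prototype ends up with.

```python
import heapq


def top_similar(main_mod_id, candidates, compare, keep=18):
    """Return up to `keep` (other_mod_id, similarity) pairs, best first.

    Pairs with a similarity of 0 are dropped, so they would never be stored.
    """
    scored = (
        (other_id, compare(main_mod_id, other_id))
        for other_id in candidates
        if other_id != main_mod_id
    )
    nonzero = ((other_id, s) for other_id, s in scored if s > 0)
    return heapq.nlargest(keep, nonzero, key=lambda pair: pair[1])


# Row count check from the note above: 18 rows for each of 2913 mods.
rows_needed = 2913 * 18
```

Since `heapq.nlargest` consumes the generator lazily, it may not be necessary to hold every comparison for a mod in memory at once. The selective deletion problem remains: when one mod's top 18 changes, the old rows for that `main_mod_id` would need replacing, possibly by deleting all of them and re-inserting in one transaction.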