timshannon/bolthold

Handle large indexes better

timshannon opened this issue · 1 comments

See #105

The cost of inserting into an index grows as the number of values an index key refers to grows. This shouldn't be the case. The cost should be the same if your inserting your 1st value or your millionth.

Indexes will need to be able to split into separate key ranges as they grow.

This will need a new index storage format, so I'll need to continue to support the old format and use the new one going forward if possible.

In my own implementation of indexing (which also supports multi-column indexing), for unique indexes I use this format:

key: table\x00s<INDEXINDEX>\x00<COL1>\x00<COL2> 
value: primary_id_value  ,   which will point to table\x00p\x00<ID>, where the value will be the actual row / struct

PS. COLs are serialized in fixed width format so everything is nicely sorted lexicographically. Mostly as raw []bytes, or hex strings. Strings require some extra care due to their variable length size.

I use this form, because it is trivial to detect unique constraint collisions without doing a range scan, and automatically will be detected by badger on txn.Commit that two concurrent transactions tried to insert same secondary unique key. Encoding the primary key id value in the key itself (similar to what is above), would be actually more efficient on lookups when using badger (there should be no difference when using boltdb), but would require a way more elaborate uniquness detection (possibly 2 phase commit even).

For non-unique indices I use this format:

key: table\x00s<INDEXINDEX>\x00<COL1>\x00<COL2>\x00<PrimaryID> 
value: ""

(For non-unique indices I even disable the value fetching in badger iterators, as value is irrelevant)

When I do a point scan, I do a iteration from table\x00s<INDEX>\x00<COL1>\x00<COL2>\x000000000000000000 to table\x00s<INDEXINDEX>\x00<COL1>\x00<COL2>\x00ffffffffffffffff , and decode the keys on the fly.

This also allows me to utilize prefix-scans to do range scans using partial column information over the index (i.e. leading equalities, and optional one trailing range).

And there is no limit of how many items are there for a given secondary index key, it will scale, and is efficient to add or remove items from it. The long prefix is also not an issue, due to key prefix compression in badger (and most other LSM KV stores).

INDEXINDEX is an internal number of the index used. But using a hash of the index specification could be a bit more safe in regards to upgrades. (Or store it in some metadata in the DB itself as a mapping, and cache in memory on startup).