/blurrily

Millisecond fuzzy string matching for Ruby

Primary LanguageCMIT LicenseMIT

Blurrily — Millisecond fuzzy string matching

Gem Version Build Status Dependency Status Code Climate Coverage Status

Show me photos of Marakech !

Here are some photos of Marrakesh, Morroco. Did you mean Martanesh, Albania, Marakkanam, India, or Marasheshty, Romania?

Blurrily finds misspelled, prefix, or partial needles in a haystack of strings, quickly. It scales well: its response time is typically 1-2ms on user-input datasets and 75-100ms on pathological datasets (more).

Blurrily is compatible and tested with all MRI Rubies from 2.2.0 to 2.4.4. It is tested on Linux 2.6 (32bit and 64bit) and MacOS X 10.8.

Blurrily uses a tweaked trigram-based approach to find good matches. If you're using ActiveRecord and looking for a lightweight (albeit much slower), in-process, Rails-friendly version of this, check out fuzzily, a Ruby gem to perform fuzzy text searching in ActiveRecord.

Installation

Add this line to your application's Gemfile:

gem 'blurrily'

Or install it yourself as:

$ gem install blurrily

Usage

Create the in-memory database:

> map = Blurrily::Map.new

Store a needle with a reference:

> map.put('London', 1337)

Recover a reference form the haystack:

> map.find('lonndon')
#=> [1337]

Save the database to disk:

> map.save('/var/db/data.trigrams')

Load a previously saved database:

> map = Blurrily::Map.load('/var/db/data.trigrams')

Caveats

Diacritics, non-latin languages

Blurrily forms trigrams from the 26 latin letters and a stop character (used to model start-of-string and separation between words in multi-word strings).

This means that case and diacritrics are completely ignored by Blurrily. For instance, Puy-de-Dôme is strictly equivalent to puy de dome.

It also means that any non-latin input will probably result in garbage data and garbage results (although it won't crash).

Multi-word needles and edge stickyness.

Multi-word needles (say, New York) are supported.

The engine always favours matches that begin and end similarly to the needle, with a bias to the beginning of the strings.

This is because internally, the string New York is turned into this sequence of trigrams: **n, *ne, new, ew*, w*y, *yo, yor, ork, rk*.

Production notes

Memory usage

Blurrily does not store your original strings but rather a flat map of references and weights for each trigram in your input strings.

In practice any database will use up a base 560KB for the index header, plus 128 bits per trigram.

As a rule of thumb idea memory usages is 40MB + 8 times the size of your input data, and 50% extra on top during bulk imports (lots of writes to the database).

For instance, /usr/share/dict/words is a list of 235k English words, and weighs 2.5MB. Importing the whole list uses up 75MB of memory, 51MB of which are the database.

Note that once a database has been written to disk and loaded from disk, memory usage is minimal (560KB per database) as the database file is memory mapped. For performance you do need as much free memory as the database size.

Disk usage

Disk usage is almost exactly like memory usage, since database files are nothing more than a memory dump.

In the /usr/share/dict/words example, on-disk size is 51MB. For the whole list of Geonames places, on-disk size is 1.1GB.

Read v write

Writing to blurrily (with #put) is fairly expensive—it's a search engine after all, optimized for intensive reads.

Supporting writes means the engine needs to keep a hash table of all references around, typically weighing 50% of your total input. This is build lazily while writing however; so if you load a database from disk and only ever read, you will not incur the memory penalty.

Saving & backing up

Blurrily saves atomically (writing to a separate file, then using rename(2) to overwrite the old file), meaning you should never lose data.

The server does this for you every 60 seconds and when quitting. If using Blurrily::Map directly, remember that a map loaded from disk is more memory efficient that a map in memory, so if your workload is read-heavy, you should .load after each #save.

Backing up comes with a caveat: database files are only portable across architectures if endianness and pointer size are the same (tested between darwin-x86_64 and linux-amd64).

Database files are very compressible; bzip2 typically shrinks them to 20% of their original size.

Benchmarks

Blurrily is wicked fast, often 100x faster than it's ancestor, fuzzily. This is because it's a close-to- the-metal, single-purpose index using almost exclusively libc primitives. On the inside the only expensive operations it performs are

  • memcpy(2) lots of data around (selection);
  • mergesort(3) to aggregate/count similar entries (reduction);
  • qsort(3) to order by counts (sort).

It tends to be faster with large datasets on BSD than on Linux because the former has fast quicksort and mergesort, wheras the latter only has qsort, a slower, catch-all sorter. In complexity terms this is because FIND tends to be O(n) on BSD and O(n ln n) on Linux.

Enough talk, here are the graphs. The LOAD and PUT operations are O(1) and take respectively ~10ms and ~100µs on any platform, so they aren't graphed here.

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request