Native Julia implementation of CPMerge (SimString) algorithm

Primary language: Julia. License: MIT.

SimString

A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching. This package is particularly useful for natural language processing tasks that require retrieving strings or texts from very large corpora (large collections of text). Currently, the package supports both character-based and word-based N-gram feature generation, and there are plans to open it up to custom, user-defined feature generation methods.
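To make the two feature types concrete, here is a minimal sketch of character-based and word-based N-gram generation in plain Julia. The function names `char_ngrams` and `word_ngrams` are hypothetical and are not part of this package's API; the package's own feature extractors may pad or number the N-grams differently.

```julia
# Illustrative helpers only; these are NOT SimString.jl's API.

# Character n-grams: slide a window of n characters over the string.
# collect() yields a Vector{Char}, so indexing stays safe for Unicode input.
function char_ngrams(s::AbstractString, n::Integer)
    chars = collect(s)
    length(chars) < n && return String[]
    return [String(chars[i:i+n-1]) for i in 1:length(chars)-n+1]
end

# Word n-grams: split on whitespace, then slide a window of n words.
function word_ngrams(s::AbstractString, n::Integer)
    words = split(s)
    length(words) < n && return String[]
    return [join(words[i:i+n-1], " ") for i in 1:length(words)-n+1]
end

println(char_ngrams("hello", 2))
println(word_ngrams("the quick brown fox", 2))
```

Under these assumptions, an input shorter than `n` simply yields no features.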

Features

  • Fast algorithm for string matching
  • 100% exact retrieval
  • Unicode support
  • Support for building databases directly from text files
  • MeCab-based tokenizer support
  • Support for persistent databases like MongoDB

Supported String Similarity Measures

  • Dice coefficient
  • Jaccard coefficient
  • Cosine coefficient
  • Overlap coefficient
  • Exact match
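For reference, the measures listed above can be sketched with their standard set-based definitions over N-gram features. This is an illustration in plain Julia, not the package's implementation: the function names are hypothetical, features are plain sets of character bigrams (so repeated bigrams collapse), and both feature sets are assumed to be non-empty.

```julia
# Illustrative only; function names are hypothetical, not SimString.jl's API.

# Set of character bigrams of a string (duplicates collapse in a Set).
function bigrams(s::AbstractString)
    chars = collect(s)
    return Set(String(chars[i:i+1]) for i in 1:length(chars)-1)
end

# Standard set-based similarity measures; x and y are assumed non-empty.
dice(x::Set, y::Set)    = 2 * length(intersect(x, y)) / (length(x) + length(y))
jaccard(x::Set, y::Set) = length(intersect(x, y)) / length(union(x, y))
cosine(x::Set, y::Set)  = length(intersect(x, y)) / sqrt(length(x) * length(y))
overlap(x::Set, y::Set) = length(intersect(x, y)) / min(length(x), length(y))
exact(x::Set, y::Set)   = x == y

x = bigrams("night")  # ni, ig, gh, ht
y = bigrams("nacht")  # na, ac, ch, ht

println(dice(x, y))     # one shared bigram out of 4 + 4
println(jaccard(x, y))  # one shared bigram out of 7 distinct
println(cosine(x, y))
println(overlap(x, y))
println(exact(x, y))
```

All four coefficients range from 0 (no shared features) to 1 (identical feature sets), which is what lets a search apply a single similarity threshold regardless of the measure chosen.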

Installation

You can install the latest stable version of this package from the Julia registries. First enter Julia's package manager by typing ] at the REPL, then run:

pkg> add SimString

The few (and selected) brave ones can grab the current experimental features by adding the main branch to their environment after invoking the package manager with ]:

pkg> add SimString#main

You are now good to go, with bleeding-edge features and the occasional breakage!

To revert to the stable registered version, run:

pkg> free SimString