Add ability to change lunr tokenizer

Question


saminzadeh opened this issue 7 years ago · 6 comments

saminzadeh commented 7 years ago

To make search more powerful, it would be nice to be able to change lunr.tokenizer.separator.

https://github.com/olivernn/lunr.js/blob/master/lib/tokenizer.js#L69-L76

Out of the box, if you have the string:
this.test

The query test returns nothing, but this.t returns this.test, because the string is not split on the dot and is indexed as a single token.

Changing the regex used for tokenization would solve problems like this.
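To illustrate the behaviour, here is a minimal sketch that approximates tokenization with a plain String.prototype.split. It assumes Lunr's default separator is /[\s\-]+/ (whitespace and hyphens) and that input is trimmed and lowercased first; tokenize is a hypothetical stand-in, not Lunr's actual function.

```javascript
// Hypothetical stand-in for a separator-based tokenizer (not Lunr's real implementation).
function tokenize(text, separator) {
  return text.trim().toLowerCase().split(separator).filter(Boolean);
}

// Assumed default: split on whitespace and hyphens only.
const defaultSeparator = /[\s\-]+/;
// Custom separator that also splits on dots.
const dotSeparator = /[\s\-.]+/;

// One token, so the query "test" has nothing to match against.
console.log(tokenize('this.test', defaultSeparator));
// Two tokens, so "test" becomes searchable on its own.
console.log(tokenize('this.test', dotSeparator));
```

With the dot added to the character class, this.test is indexed as the two tokens this and test.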

Answer 1 · 2018-09-06T21:14:48.000Z

@saminzadeh great idea! Unfortunately, ItemsJS uses Lunr v1.0.0 and Lunr is now at v2.3.3. Hopefully Lunr exposes a public function or constructor for changing lunr.tokenizer.separator. Ideally a simple change against v1.0.0 is possible; otherwise ItemsJS should be upgraded to the latest Lunr sooner or later. I can hopefully look into it when I find more free time.

Answer 2 · 2018-09-07T13:22:42.000Z

Sounds good, might take a stab when I get a chance as well.

For now, I did this, and it seems to work because it changes the global lunr instance.

import lunr from 'lunr';
import itemsjs from 'itemsjs';

// Split tokens on whitespace, hyphens, square brackets and colons.
lunr.tokenizer.separator = /[\s\-[\]:]+/g;

// itemsjs init here
const index = itemsjs(data, configuration);

Answer 3 · 2018-09-07T14:09:12.000Z

Nice hack! Integration should be easier than I thought.

I am wondering how to test separators easily. I've tested it with:

var paragraph = 'The quick brown fox jumped over the lazy dog. It barked.';
var regex = /[\s\-[\]:]+/g;
var found = paragraph.split(regex);

console.log(found);
// ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog.", "It", "barked."]

Testing with the lunr.tokenizer function itself seems more complicated.

I am asking because, if we implement this feature for developers, it could be exposed either as a separator/regex option or as one of a list of predefined analyzers (like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html). The second seems nice, but the first (your suggestion) is the simplest to start with and very flexible.
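The two options could look roughly like this. This is only a sketch: the separator and analyzer option names, the ANALYZERS table, and resolveSeparator are all hypothetical and not part of ItemsJS or Lunr.

```javascript
// Hypothetical table of predefined analyzers, each mapped to a separator regex.
const ANALYZERS = new Map([
  ['standard', /[\s\-]+/],    // whitespace and hyphens
  ['code', /[\s\-.[\]:]+/],   // also split on dots, square brackets and colons
]);

// Hypothetical helper: an explicit regex wins, then a named analyzer, then the fallback.
function resolveSeparator(options = {}) {
  if (options.separator instanceof RegExp) return options.separator;
  if (options.analyzer !== undefined) {
    if (!ANALYZERS.has(options.analyzer)) {
      throw new Error(`Unknown analyzer: ${options.analyzer}`);
    }
    return ANALYZERS.get(options.analyzer);
  }
  return ANALYZERS.get('standard');
}

// Option 1: pass a raw regex (most flexible).
console.log('this.test'.split(resolveSeparator({ separator: /[\s.]+/ })));
// Option 2: pick a predefined analyzer by name.
console.log('this.test'.split(resolveSeparator({ analyzer: 'code' })));
```

Supporting the raw regex first keeps the surface small, and named analyzers could be layered on later without breaking it.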

Answer 4 · 2018-09-07T15:41:00.000Z

> I am wondering because if we implement this feature for devs then it could be as separator / regex option or as one of list of predefined analyzers (i.e. like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html).

Hmm, yes, that could be nice. But I agree, we would probably want to iron that out a bit more before it is introduced into the API.

> Second seems nice but first (your suggestion) the simplest to start with and very flexible

Yes, I think this could be the best starting point. Just having access to the lunr object via itemsjs would be the most flexible for advanced users and would ensure the correct dependency is used.

Something like this:

import lunr from 'itemsjs/lunr';

lunr.tokenizer.separator = /[\s\-[\]:]+/g;

or

import ItemsJS from 'itemsjs';
const index = ItemsJS(data, config);
index.lunr.tokenizer.separator = /[\s\-[\]:]+/g;

index.search({...})

Answer 5 · 2018-09-07T17:06:02.000Z

Makes sense!

Answer 6 · 2021-04-09T10:54:08.000Z

@saminzadeh I've introduced simple full-text integration with all external search engines in the latest version. See https://github.com/itemsapi/itemsjs/blob/master/docs/lunr2-integration.md or the README.