Add ability to change lunr tokenizer
saminzadeh opened this issue · 6 comments
In order to make search more powerful, adding the ability to change the lunr.tokenizer.separator would be nice.
https://github.com/olivernn/lunr.js/blob/master/lib/tokenizer.js#L69-L76
Out of the box, if you have the string:
this.test
The query test will return empty, but this.t will return this.test.
By changing the regex used for tokenization, you could solve problems like this
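A minimal sketch of the problem, using plain String.prototype.split rather than Lunr itself. It assumes the default separator matches only whitespace and hyphens; dotSeparator is a hypothetical alternative, not something Lunr ships.

```javascript
// Assumed default separator: whitespace and hyphens only.
const defaultSeparator = /[\s\-]+/;
// Hypothetical alternative that also splits on dots.
const dotSeparator = /[\s\-.]+/;

const text = 'this.test';

// With the assumed default, the dot is not a boundary, so there is one token.
const defaultTokens = text.split(defaultSeparator);
// With the dot added, "this" and "test" become separate tokens.
const dotTokens = text.split(dotSeparator);

console.log(defaultTokens); // ["this.test"]
console.log(dotTokens); // ["this", "test"]
```

Because "test" only exists as its own token in the second case, only the second separator would let a query for test match.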
@saminzadeh great idea! Unfortunately, ItemsJS is using Lunr v1.0.0 and Lunr is now at v2.3.3. Hopefully Lunr also exposes a public function or constructor for changing lunr.tokenizer.separator. Ideally a simple change against v1.0.0 is possible; otherwise ItemsJS should be upgraded to the latest Lunr sooner or later. I can hopefully look into it when I find more free time.
Sounds good, might take a stab when I get a chance as well.
For now, I did this and it seems to work since it changes the lunr global instance.
import lunr from 'lunr';
import itemsjs from 'itemsjs';
lunr.tokenizer.separator = /[\s\-[\]:]+/g;
// itemsjs init here
const index = itemsjs(data, configuration);
Nice hack! Integration should be easier than I thought
I am wondering how to test separators easily.
I've tested it out with:
var paragraph = 'The quick brown fox jumped over the lazy dog. It barked.';
var regex = /[\s\-[\]:]+/g;
var found = paragraph.split(regex);
console.log(found);
// ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog.", "It", "barked."]
Testing with the Lunr function lunr.tokenizer itself seems more complicated when it comes to checking results
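One way to compare candidate separators side by side without going through Lunr is a small helper like the one below. This is only a sketch: previewTokens is a hypothetical name, and it approximates tokenization by trimming, lowercasing, and splitting, which may differ from what lunr.tokenizer actually does in a given version.

```javascript
// Hypothetical helper, not part of Lunr or ItemsJS.
// Approximates tokenization: trim, lowercase, split, drop empty tokens.
function previewTokens(text, separator) {
  return String(text)
    .trim()
    .toLowerCase()
    .split(separator)
    .filter((token) => token.length > 0);
}

const sample = 'The quick-brown fox: it barked.';

// Candidate separators to compare; the names are illustrative only.
const candidates = {
  whitespaceAndHyphen: /[\s\-]+/,
  fromThisThread: /[\s\-[\]:]+/,
  withPunctuation: /[\s\-[\]:.]+/,
};

const results = {};
for (const [name, separator] of Object.entries(candidates)) {
  results[name] = previewTokens(sample, separator);
  console.log(name, results[name]);
}
// whitespaceAndHyphen ["the", "quick", "brown", "fox:", "it", "barked."]
// fromThisThread      ["the", "quick", "brown", "fox", "it", "barked."]
// withPunctuation     ["the", "quick", "brown", "fox", "it", "barked"]
```

The regexes are written without the g flag here because split does not need it.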
I am wondering because if we implement this feature for devs then it could be as separator / regex option or as one of list of predefined analyzers (i.e. like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html). Second seems nice but first (your suggestion) the simplest to start with and very flexible
> I am wondering because if we implement this feature for devs then it could be as separator / regex option or as one of list of predefined analyzers (i.e. like in https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analyzer.html).
Hmm, yes that could be nice. But I agree, would probably want to iron that out a bit more before it is introduced into the API interface.
> Second seems nice but first (your suggestion) the simplest to start with and very flexible
Yes I think this could be the best starting point. Just having access to the lunr object via itemsjs would be the most flexible for advanced users and will ensure the correct dependency
Something like this:
import lunr from 'itemsjs/lunr';
lunr.tokenizer.separator = /[\s\-[\]:]+/g;
or
import ItemsJS from 'itemsjs';
const index = ItemsJS(data, config);
index.lunr.tokenizer.separator = /[\s\-[\]:]+/g;
index.search({...})
Makes sense!
@saminzadeh I've introduced a simple full-text integration with all external search engines in the latest version. You can see it here -> https://github.com/itemsapi/itemsjs/blob/master/docs/lunr2-integration.md or in the README