Search does not support non-English languages
taills opened this issue · 12 comments
mdBook uses elasticlunr.js for offline searching under the hood, and according to weixsong/elasticlunr.js#53 there seems to be no plan to support searching in other languages.
It is a good tool; I love it.
More than a million of us have the same issue. Could you help with it?
We need Chinese search support.
Or could we search Chinese content with Google? How would that be done?
Yes, it is highly possible to add searching in Chinese characters.
- elasticlunr is the search engine used by mdBook.
- Go to the official elasticlunr documentation and read the section "Other Languages". With just 3 more lines of code, elasticlunr can be used with other languages.
- The Chinese language support for lunr-languages has been submitted as a PR but not yet merged.
- Alternatively, as suggested in the comments on MihaiValentin/lunr-languages#32, the Japanese language support can be used as a workaround. That's because this line covers `一-龠`, a range that includes most Chinese characters in the Unicode table.
Any new progress for this issue?
> Any new progress for this issue?
There is a PR #1496 working on it but needs help.
Replacing elasticlunr.js with https://github.com/ajitid/fzf-for-js may allow this issue to be resolved.
Looking forward to new progress on this MR
@ehuss Would you please tell me if this feature would be accepted? I don't think I'm able to find out any more issues by prototyping.
In case you're too busy to read the whole comment, I will highlight some key issues for you. All modifications are feature-gated.
- I intend to import lunr-languages's Chinese extension and a WebAssembly segmenter, which leads to:
  - extra static file dependencies: not only several `.js` files but also a `.wasm`
  - usage of ES6 modules and `async`
- I have created a custom `Language` trait implementation, which is sort of irrelevant to mdBook itself. I want to grant users the ability to include a custom dictionary. My plan is to add a subsection to `book.toml` such as `[output.html.zh]` and add a field `additional-dict`, just like `additional-js`. I don't know if you will be comfortable with this.
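To make the proposal concrete, the hypothetical `book.toml` section might look like this. Note that both the `[output.html.zh]` table and the `additional-dict` field are proposed here, not existing mdBook options, and the dictionary path is made up:

```toml
[output.html.zh]
# Hypothetical field, analogous to additional-js: extra user-dictionary
# files for the Chinese segmenter.
additional-dict = ["dict/userdict.txt"]
```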
All mentioned modifications except the custom dictionary are available to check in my fork. I would appreciate it a lot if you could tell me about your attitudes toward this feature and/or the issues I listed.
Importing lunr.zh.js (with slight modifications to make it compatible with elasticlunr) and the relevant extensions never works, because mdBook uses a pre-generated search index from elasticlunr-rs when building the book. The correct way to solve this is PR #1496. However, besides some flaws pointed out by the core maintainers, the original PR does not use a preferable approach; for example, it does not use an appropriate segmenter. The PR also seems to no longer be updated. I've figured out a possibly more elegant solution: use either Intl.Segmenter or jieba-wasm as a Chinese segmenter. I'd love to work on this issue, but as per the Contribution Guide, I'm not sure whether this issue has any attention from the maintainers, and I won't bother making a pull request if it does not.
Apart from all these, there are still some details we need to discuss:
- Whether to use `Intl.Segmenter` or `jieba-wasm`. I prefer the latter, since `Intl.Segmenter` is not supported by Firefox while WebAssembly is fully supported by almost all browsers, and using `jieba` ensures consistency between the generated index and the segments produced in the browser. However, `jieba-wasm` requires at least two extra file dependencies, `jieba_rs_wasm.js` and `jieba_rs_wasm_bg.wasm`, and I'm not sure the maintainers will be happy with that even if we can control them with feature flags.
- `elasticlunr-rs`'s Chinese support is incomplete, in that its stop-word filter is inconsistent with the one from lunr-languages, and there is no sign of that being fixed. We could implement the `Language` trait ourselves in mdBook, but again I'm not sure this is appropriate, as it is somewhat irrelevant to mdBook itself. If the maintainers prefer not to, we would have to make a PR to `elasticlunr-rs` first and then wait for it to be merged.
- The search results are kind of odd (showing apparently poorly matching results) because the segmenter splits some idiom-like phrases into individual words, e.g. "换而言之" -> ["换", "而言", "之"]; this also happens with uncommon terms. This could be solved either by allowing users to add a custom dictionary, or by not using any segmenter at all, as I realize most users would just be searching for keywords, in which case matching the whole term is more reasonable. If we accepted the latter solution, we wouldn't need to consider everything listed above at all :P That said, we shall not consider using no segmenter, because `elasticlunr` depends heavily on a tokenizer to work; otherwise we would have to build our own searcher or switch to another one, which I don't consider a good trade-off.
- Search results in the result list are not highlighted. To be exact, it seems that only results with a 'space' character (i.e. space, tab, \n, etc.) ahead of them are highlighted. ~~Guess we have somewhere in `searcher.js` to modify, particularly the `makeTeaser()` function.~~ Solved this by making changes to `searcher.js`.
- ...and more, if any
I'll be keeping an eye on this comment to see if anyone is interested.
Update
I created a fork as a proof of concept, and I found some problems that I didn't notice when I posted this comment. `jieba-wasm` somehow requires the `async` mechanism to work. As a consequence, I was forced to change the loading method of `searcher.js` from a regular script to an `import()` call in index.hbs and lunr.zh.js. Eventually it worked well, and ECMAScript modules are well supported, so I don't think that's a big issue, but I note it here since mdBook did not do this anywhere before.
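The loading change described above boils down to swapping a classic script include for a dynamic `import()`, which returns a Promise and therefore tolerates `async` initialisation such as WebAssembly instantiation. A minimal sketch, using an inline `data:` module as a stand-in for the real `searcher.js`:

```javascript
// import() works from both classic scripts and modules, and resolves to the
// module's namespace object once any top-level async work has finished.
import('data:text/javascript,export const ready = true;')
    .then(mod => console.log(mod.ready));  // true
```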
Anyone willing to test it may build the forked repository with the `zh` feature and use the product as usual. It produces the expected result on my machine.
Update 2
Listed some more problems.
How do I use it? Can you give an example of the `book.toml`? I forked your project and built it, but it doesn't work.
Any new progress on this issue?
> just 3 more lines of code
It was a lie.
And a kind of long journey...
But anyway, here I am with my instructions for adding non-English search with little bloodshed:
- First of all, you need to add lunr.stemmer.support.js and lunr.YOURLANG.js. You can do this in multiple ways:
  1.1 Create `head.hbs` in the `theme` folder and add an html `<script>` tag
  1.2 Add the scripts via the `additional-js` key in `book.toml`
  1.3 Or just append these scripts to the overridden file in step 4
- Additionally, I advise adding lunr.multi.js too.
- Then, you need to override `searcher.js` by putting a copy of the original file into the `src` folder.
- The best part. Find the line `searchindex = elasticlunr.Index.load(config.index);` and replace it with:
```js
searchindex = elasticlunr(function () {
    // Add (multi)language support
    this.use(elasticlunr.multiLanguage('en', 'ru'));
    // Fields to index
    this.addField('title');
    this.addField('body');
    this.addField('breadcrumbs');
    // Identify documents by this field
    this.setRef('id');
    // Re-add all documents stored in the prebuilt index
    for (let key in config.index.documentStore.docs) {
        this.addDoc(config.index.documentStore.docs[key]);
    }
});
```
And search will be fine.
But one more word: when mdBook builds the search index, it completely ignores the language settings in `book.toml`. It uses only English characters when creating the index, even though elasticlunr-rs supports other languages. Because of this behaviour, all attempts to add an additional language will fail. I do not write Rust code and can't create a PR, but I hope this information will help someone.
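For completeness, the `additional-js` option (1.2 above) would look something like this in `book.toml` (the file names and paths are placeholders; put the scripts wherever you keep them in your book):

```toml
[output.html]
# Paths are relative to the book root.
additional-js = [
    "theme/lunr.stemmer.support.js",
    "theme/lunr.ru.js",
    "theme/lunr.multi.js",
]
```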