Search does not support non-English languages
taills opened this issue · 12 comments
mdBook uses elasticlunr.js for offline searching under the hood, and according to weixsong/elasticlunr.js#53 there seems to be no plan to support searching in other languages.
It is a good tool; I love it.
More than a million of us have the same issue. Could you help with it?
We need Chinese search support.
Or could we search Chinese content with Google? How would that be done?
Yes, it is highly possible to add searching in Chinese characters.
- elasticlunr is the search engine used by mdBook.
- Go to the official elasticlunr documentation and read the section "Other Languages". With just 3 more lines of code, elasticlunr can be used with other languages.
- The Chinese language support for lunr-languages has been submitted as a PR but not yet merged.
- Alternatively, as suggested in the comments on MihaiValentin/lunr-languages#32, the Japanese language support can be used as a workaround. That's because this line covers `一-龠`, a range that includes most Chinese characters in the Unicode table.
Any new progress for this issue?
> Any new progress for this issue?
There is a PR #1496 working on it but needs help.
Replacing elasticlunr.js with https://github.com/ajitid/fzf-for-js may allow this issue to be resolved.
Looking forward to new progress on this MR
@ehuss Would you please tell me if this feature would be accepted? I don't think I'm able to find out any more issues by prototyping.
In case you're too busy to read the whole comment, I will highlight some key issues for you. All modifications are feature-gated.
- I intend to import lunr-languages's Chinese extension and a WebAssembly segmenter, which leads to:
  - extra static file dependencies: not only several `.js` files but also a `.wasm`
  - usage of ES6 modules and `async`
- I have created a custom `Language` trait implementation, which is sort of irrelevant to mdBook itself. I want to grant users the ability to include a custom dictionary. My plan is to add a subsection to `book.toml` such as `[output.html.zh]` and add a field `additional-dict`, just like `additional-js`. I don't know if you will be comfortable with this.
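To make the proposal concrete, the hypothetical `book.toml` section might look like this. Note that both the `[output.html.zh]` table and the `additional-dict` field are proposed here, not existing mdBook options, and the dictionary path is made up:

```toml
[output.html.zh]
# Hypothetical field, analogous to additional-js: extra user-dictionary
# files for the Chinese segmenter.
additional-dict = ["dict/userdict.txt"]
```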
All mentioned modifications except the custom dictionary are available to check in my fork. I would appreciate it a lot if you could tell me about your attitudes toward this feature and/or the issues I listed.
Importing lunr.zh.js (with slight modifications to make it compatible with elasticlunr) and the relevant extensions never works, because mdBook uses a pre-generated search index from elasticlunr-rs when building the book. The correct way to solve this is PR #1496. However, besides some flaws pointed out by the core maintainers, the original PR does not use a preferable approach; for example, it does not use an appropriate segmenter. The PR also seems to no longer be updated. I've figured out a possibly more elegant solution: use either Intl.Segmenter or jieba-wasm as a Chinese segmenter. I'd love to work on this issue, but as per the Contribution Guide, I'm not sure whether this issue has any attention from the maintainers, and I won't bother making a pull request if it does not.
Apart from all these, there are still some details we need to discuss:
- Whether to use `Intl.Segmenter` or `jieba-wasm`. I prefer the latter, since `Intl.Segmenter` is not supported by Firefox while WebAssembly is fully supported by almost all browsers, and using `jieba` ensures consistency between the generated index and the segments produced in the browser. However, `jieba-wasm` requires at least two extra file dependencies, `jieba_rs_wasm.js` and `jieba_rs_wasm_bg.wasm`, and I'm not sure the maintainers will be happy with that even if we can control them with feature flags.
- `elasticlunr-rs`'s Chinese support is incomplete, in that its stop-word filter is inconsistent with the one from lunr-languages, and there is no sign of that being fixed. We could implement the `Language` trait ourselves in mdBook, but again I'm not sure this is appropriate, as it is somewhat irrelevant to mdBook itself. If the maintainers prefer not to, we would have to make a PR to `elasticlunr-rs` first and then wait for it to be merged.
- The search results are kind of odd (showing apparently poorly matching results) because the segmenter splits some idiom-like phrases into individual words, e.g. "换而言之" -> ["换", "而言", "之"]; this also happens with uncommon terms. This could be solved either by allowing users to add a custom dictionary, or by not using any segmenter at all, as I realize most users would just be searching for keywords, in which case matching the whole term is more reasonable. If we accepted the latter solution, we wouldn't need to consider everything listed above at all :P That said, we shall not consider using no segmenter, because `elasticlunr` depends heavily on a tokenizer to work; otherwise we would have to build our own searcher or switch to another one, which I don't consider a good trade-off.
- Search results in the result list are not highlighted. To be exact, it seems that only results with a 'space' character (i.e. space, tab, \n, etc.) ahead of them are highlighted. ~~Guess we have somewhere in `searcher.js` to modify, particularly the `makeTeaser()` function.~~ Solved this by making changes to `searcher.js`.
- ...and more, if any
I'll be keeping an eye on this comment to see if anyone is interested.
Update
I created a fork as a proof of concept, and I found some problems that I didn't notice when I posted this comment. `jieba-wasm` somehow requires the `async` mechanism to work. As a consequence, I was forced to change the loading method of `searcher.js` from a regular script to an `import()` call in index.hbs and lunr.zh.js. Eventually it worked well, and ECMAScript modules are well supported, so I don't think that's a big issue, but I note it here since mdBook did not do this anywhere before.
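The loading change described above boils down to swapping a classic script include for a dynamic `import()`, which returns a Promise and therefore tolerates `async` initialisation such as WebAssembly instantiation. A minimal sketch, using an inline `data:` module as a stand-in for the real `searcher.js`:

```javascript
// import() works from both classic scripts and modules, and resolves to the
// module's namespace object once any top-level async work has finished.
import('data:text/javascript,export const ready = true;')
    .then(mod => console.log(mod.ready));  // true
```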
Anyone willing to test it may build the forked repository with the `zh` feature and use the product as usual. It produces the expected result on my machine.
Update 2
Listed some more problems.
How do I use it? Can you give an example of the `book.toml`? I forked your project and built it, but it doesn't work.
Any new progress on this issue?
> just 3 more lines of code
It was a lie.
And a kind of long journey...
But anyway, here I am with my instructions for adding non-English search with little bloodshed:
- First of all, you need to add lunr.stemmer.support.js and lunr.YOURLANG.js. You can do this in multiple ways:
  1.1 Create `head.hbs` in the `theme` folder and add an html `<script>` tag
  1.2 Add the scripts via the `additional-js` key in `book.toml`
  1.3 Or just append these scripts to the overridden file in step 4
- Additionally, I advise adding lunr.multi.js too.
- Then, you need to override `searcher.js` by putting a copy of the original file into the `src` folder.
- The best part. Find the line `searchindex = elasticlunr.Index.load(config.index);` and replace it with:
```js
searchindex = elasticlunr(function () {
    // Add (multi)language support
    this.use(elasticlunr.multiLanguage('en', 'ru'));
    // Fields to index
    this.addField('title');
    this.addField('body');
    this.addField('breadcrumbs');
    // Identify documents by this field
    this.setRef('id');
    // Re-add all documents stored in the prebuilt index
    for (let key in config.index.documentStore.docs) {
        this.addDoc(config.index.documentStore.docs[key]);
    }
});
```
And search will be fine.
But one more word: when mdBook builds the search index, it completely ignores the language settings in `book.toml`. It uses only English characters when creating the index, even though elasticlunr-rs supports other languages. Because of this behaviour, all attempts to add an additional language will fail. I do not write Rust code and can't create a PR, but I hope this information will help someone.
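For completeness, the `additional-js` option (1.2 above) would look something like this in `book.toml` (the file names and paths are placeholders; put the scripts wherever you keep them in your book):

```toml
[output.html]
# Paths are relative to the book root.
additional-js = [
    "theme/lunr.stemmer.support.js",
    "theme/lunr.ru.js",
    "theme/lunr.multi.js",
]
```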