Ability to configure maxTrigramCount
ngirard opened this issue · 5 comments
I wanted to see how Zoekt behaves with web pages, especially those containing source code blocks within <code> tags.
I downloaded and indexed a sample web page, and couldn't find it within Zoekt's query results.
Steps to reproduce:
- mkdir -p ~/sandboxes/www/blog.burntsushi.net
- Save https://blog.burntsushi.net/transducers/ as ~/sandboxes/www/blog.burntsushi.net/transducers.html using Firefox
- $GOPATH/bin/zoekt-index ~/sandboxes/www
- Visit http://localhost:6070/search?q=set.fst&num=50
I expected to see transducers.html within the results, as the page does contain set.fst, but the query returned nothing.
Querying other terms gave the same result.
When you search for transducer and click the result, you'll see:
NOT-INDEXED: document size 198439 larger than limit 131072
There is a flag to control the maximum file size.
Oh, good catch, thanks!
Unfortunately, doing
$GOPATH/bin/zoekt-index -file_limit 5242880 ~/sandboxes/www
leads to another error:
NOT-INDEXED: number of trigrams exceeds 20000
If I understand correctly, maxTrigramCount is declared as a const and cannot be changed via the command line.
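To illustrate the request, here is a minimal sketch of a trigram limit that defaults to a constant but can be overridden from the command line. It is not Zoekt's actual code: the flag name max_trigram_count, the helper functions, and the choice to count distinct 3-rune sequences are all assumptions made for this example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// defaultMaxTrigramCount mirrors the limit quoted in the error message above.
const defaultMaxTrigramCount = 20000

// countDistinctTrigrams returns the number of distinct 3-rune sequences in
// content. This is an illustrative approximation, not Zoekt's implementation.
func countDistinctTrigrams(content []byte) int {
	runes := []rune(string(content))
	seen := make(map[[3]rune]struct{})
	for i := 0; i+3 <= len(runes); i++ {
		seen[[3]rune{runes[i], runes[i+1], runes[i+2]}] = struct{}{}
	}
	return len(seen)
}

// checkTrigrams returns an error when content holds more distinct trigrams
// than limit allows.
func checkTrigrams(content []byte, limit int) error {
	if n := countDistinctTrigrams(content); n > limit {
		return fmt.Errorf("number of trigrams exceeds %d", limit)
	}
	return nil
}

func main() {
	// Hypothetical flag name: the limit becomes a runtime setting whose
	// default is the former hard-coded constant.
	limit := flag.Int("max_trigram_count", defaultMaxTrigramCount,
		"maximum number of distinct trigrams per document")
	flag.Parse()

	// Each remaining argument is treated as a file to check.
	for _, path := range flag.Args() {
		content, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if err := checkTrigrams(content, *limit); err != nil {
			fmt.Printf("%s: NOT-INDEXED: %v\n", path, err)
		} else {
			fmt.Printf("%s: ok\n", path)
		}
	}
}
```

Keeping the constant as the flag's default means existing invocations behave as before, while a large page such as the saved transducers.html could be admitted by passing a higher value.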
fixed in 8a675eb1298df7f61916323717ab57c122678e09
Great, thanks!