google/zoekt

Ability to configure maxTrigramCount

ngirard opened this issue · 5 comments

I wanted to see how Zoekt behaves with web pages, especially those containing source code blocks within <code> tags.

I downloaded and indexed a sample web page, and couldn't find it within Zoekt's query results.

Steps to reproduce:

  1. mkdir -p ~/sandboxes/www/blog.burntsushi.net
  2. Save https://blog.burntsushi.net/transducers/ as ~/sandboxes/www/blog.burntsushi.net/transducers.html using Firefox
  3. $GOPATH/bin/zoekt-index ~/sandboxes/www
  4. Visit http://localhost:6070/search?q=set.fst&num=50

I expected to see transducers.html in the results, since the page does contain set.fst, but the query returned nothing.

Querying other terms gave the same result.

When you search for transducer and click the result, you'll see:

NOT-INDEXED: document size 198439 larger than limit 131072

There is a flag to control the maximum file size.

Oh, good catch, thanks!

Unfortunately, doing

$GOPATH/bin/zoekt-index -file_limit 5242880 ~/sandboxes/www

leads to another error:

NOT-INDEXED: number of trigrams exceeds 20000

If I understand correctly, maxTrigramCount is declared as a const and cannot be changed from the command line.
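For anyone wanting to check how far over the limit a page is, here is a rough sketch of a trigram counter. It is not Zoekt's implementation: it assumes the limit applies to the number of distinct trigrams, that a trigram is three consecutive Unicode code points, and the name countDistinctTrigrams is made up for illustration.

```go
package main

import "fmt"

// countDistinctTrigrams returns the number of distinct sequences of three
// consecutive runes in content. Hypothetical helper, not part of Zoekt.
func countDistinctTrigrams(content string) int {
	runes := []rune(content)
	seen := make(map[[3]rune]struct{})
	for i := 0; i+3 <= len(runes); i++ {
		seen[[3]rune{runes[i], runes[i+1], runes[i+2]}] = struct{}{}
	}
	return len(seen)
}

func main() {
	// 20000 is the value reported in the error message above.
	const limit = 20000
	n := countDistinctTrigrams("set.fst set.fst")
	fmt.Printf("distinct trigrams: %d, over limit: %v\n", n, n > limit)
}
```

Feeding it the contents of transducers.html instead of the sample string would show whether the page exceeds 20000 under these assumptions.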

Fixed in 8a675eb1298df7f61916323717ab57c122678e09

Great, thanks!