phiresky/ripgrep-all

Files with long paths are skipped

vejkse opened this issue · 5 comments

Files whose path’s length is 287 bytes or more are not cached. For instance, if I put the file exampledir/short.pdf in nested directories like this:

/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf

and execute rga hello short.pdf in the innermost directory, I get the result followed by an error:

Page 1: hello world
short.pdf: preprocessor command failed: '"/usr/bin/rga-preproc" "short.pdf"':
-------------------------------------------------------------------------------
/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf adapter: poppler
/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf.txt.asciipagebreaks adapter: postprocpagebreaks
/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf.txt adapter: postprocprefix
Error: copying adapter output to stdout {}

Caused by:
    0: could not write to cache
    1: unsupported size of key/DB name/data
-------------------------------------------------------------------------------

Worse, if I just do rga hello (without the name of the file, which is the normal way I would use rga for a whole directory), I just get the error without even the results. If I shorten the path by just one (ASCII) character, everything’s fine.

With non-ASCII characters, it can happen with paths that are shorter in terms of characters but have the same number of bytes, like these Cyrillic and Hangul paths:

/tmp/йфяцычувскйфяцычувскйфяцычувскйфяцычувс/йфяцычувскйфяцычувскйфяцычувскйфяцычувс/йфяцычувскйфяцычувскйфяцычувскйфяцычувс/йфяцычувскйфяцычув/short.pdf
/tmp/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하/short.pdf
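To illustrate why the limit is hit at different character counts, here is a minimal sketch comparing the character count and the UTF-8 byte length of a path. The helper name `char_and_byte_len` and the short sample paths are made up for illustration; they are not part of rga.

```rust
/// Returns (number of Unicode scalar values, number of UTF-8 bytes) of a path.
/// Hypothetical helper for illustration only.
fn char_and_byte_len(path: &str) -> (usize, usize) {
    (path.chars().count(), path.len())
}

fn main() {
    // Same number of characters in each path, different number of bytes:
    // ASCII is 1 byte per character, Cyrillic 2, Hangul syllables 3.
    for p in ["/tmp/abc/short.pdf", "/tmp/йфя/short.pdf", "/tmp/가나다/short.pdf"] {
        let (chars, bytes) = char_and_byte_len(p);
        println!("{p}: {chars} chars, {bytes} bytes");
    }
}
```

Since the limit applies to bytes, a Cyrillic path reaches it with roughly half as many characters as an ASCII one, and a Hangul path with roughly a third.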

Here’s a zip file with those paths: longpaths.zip

I’m using the latest commit (5fa7776).

Looks like it's caused by LMDB having a max key size of about 512 bytes (511 by default), see mozilla/rkv#49.

This apparently can't be changed, so the fix would be one of:

  • switch away from rkv/lmdb altogether (e.g. to sqlite which might be nicer anyways)

  • hash the full key and store the hash

    this has the disadvantage that partial cache cleaning (e.g. purging the cache of a specific path) becomes impossible

  • hash the path partially when it becomes long

    this would probably be the quickest solution, just kinda ugly
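For reference, the third option could look roughly like the sketch below. This is not rga's actual code: the function names, the `#` separator, and the 200-byte budget are assumptions for illustration, and FNV-1a is used only because it needs no external crate. A real implementation would probably want a cryptographic hash to make collisions unlikely.

```rust
/// Hypothetical byte budget for the path part of the cache key.
const MAX_PATH_KEY_BYTES: usize = 200;

/// 64-bit FNV-1a hash: deterministic, but not collision resistant.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in data {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Keeps short paths as-is. Long paths are cut to a readable prefix followed
/// by a hash of the full path. Assumes `max_bytes` is at least 17, the length
/// of the "#" + 16 hex digit suffix.
fn shorten_path_key(path: &str, max_bytes: usize) -> String {
    if path.len() <= max_bytes {
        return path.to_string();
    }
    let suffix = format!("#{:016x}", fnv1a_64(path.as_bytes()));
    let mut cut = max_bytes.saturating_sub(suffix.len());
    // Never cut in the middle of a multi-byte UTF-8 character.
    while !path.is_char_boundary(cut) {
        cut -= 1;
    }
    format!("{}{}", &path[..cut], suffix)
}

fn main() {
    let long_path = format!("/tmp/{}/short.pdf", "가나다라마".repeat(30));
    let key = shorten_path_key(&long_path, MAX_PATH_KEY_BYTES);
    println!("{} bytes -> {} bytes", long_path.len(), key.len());
}
```

Short paths stay searchable by prefix as before; only the overlong ones lose that, which is the "kinda ugly" part.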

If I understand correctly, the key is not just the path. Isn't another solution then to hash only the path, so that it's still possible to search for (the hash of) the path?
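A rough sketch of that idea, under some assumptions: the key is taken to be "path plus some other bytes" (the exact key layout in rga isn't stated here, so `rest` is a placeholder), a `BTreeMap` stands in for the real cache, and FNV-1a stands in for whatever hash would actually be used. All names are hypothetical.

```rust
use std::collections::BTreeMap;

/// 64-bit FNV-1a hash, used here only as a dependency-free stand-in.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in data {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Builds a key as: 8-byte hash of the path, then the rest of the key as-is.
/// The path part has a fixed size no matter how long the path is.
fn cache_key(path: &str, rest: &[u8]) -> Vec<u8> {
    let mut key = fnv1a_64(path.as_bytes()).to_be_bytes().to_vec();
    key.extend_from_slice(rest);
    key
}

/// Purges every entry whose key starts with the hash of `path`.
/// Returns the number of removed entries.
fn purge_path(cache: &mut BTreeMap<Vec<u8>, Vec<u8>>, path: &str) -> usize {
    let prefix = fnv1a_64(path.as_bytes()).to_be_bytes();
    let before = cache.len();
    cache.retain(|key, _| !key.starts_with(&prefix));
    before - cache.len()
}

fn main() {
    let mut cache = BTreeMap::new();
    cache.insert(cache_key("/tmp/long/a", b"poppler"), b"text 1".to_vec());
    cache.insert(cache_key("/tmp/long/a", b"other"), b"text 2".to_vec());
    cache.insert(cache_key("/tmp/long/b", b"poppler"), b"text 3".to_vec());
    let removed = purge_path(&mut cache, "/tmp/long/a");
    println!("removed {removed}, {} left", cache.len());
}
```

Purging a specific file still works this way, but purging everything under a directory would not, since the hash of a directory path is not a prefix of the hashes of the files inside it.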

This should be fixed in 1.0.0-alpha.4 since the cache is now sqlite