Files with long paths are skipped
vejkse opened this issue · 5 comments
Files whose path is 287 bytes or longer are not cached. For instance, if I put the file exampledir/short.pdf in nested directories like this:
/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf
and execute rga hello short.pdf in the innermost directory, I get the result followed by an error:
Page 1: hello world
short.pdf: preprocessor command failed: '"/usr/bin/rga-preproc" "short.pdf"':
-------------------------------------------------------------------------------
/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf adapter: poppler
/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf.txt.asciipagebreaks adapter: postprocpagebreaks
/tmp/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/012345678901234567890123456789012345678/01234567890123456789012345678901/short.pdf.txt adapter: postprocprefix
Error: copying adapter output to stdout {}
Caused by:
0: could not write to cache
1: unsupported size of key/DB name/data
-------------------------------------------------------------------------------
Worse, if I just run rga hello (without the file name, which is how I would normally use rga for a whole directory), I get only the error, without any results. If I shorten the path by just one (ASCII) character, everything’s fine.
With non-ASCII characters, it can happen with paths that are shorter in terms of characters but have the same number of bytes, like these Cyrillic and Hangul paths:
/tmp/йфяцычувскйфяцычувскйфяцычувскйфяцычувс/йфяцычувскйфяцычувскйфяцычувскйфяцычувс/йфяцычувскйфяцычувскйфяцычувскйфяцычувс/йфяцычувскйфяцычув/short.pdf
/tmp/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하가나다라마바사아자하/가나다라마바사아자하/short.pdf
Here’s a zip file with those paths: longpaths.zip
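To make the byte-versus-character point concrete, here is a small Python sketch (not part of rga; the helper name utf8_len is made up for illustration). It rebuilds the ASCII path from the report and compares character counts with UTF-8 byte counts, assuming the paths are stored as UTF-8.

```python
def utf8_len(path: str) -> int:
    """Number of bytes the path occupies when encoded as UTF-8."""
    return len(path.encode("utf-8"))

# Rebuild the ASCII path from the report: six 39-character directories,
# one 32-character directory, then the file name.
digits = "0123456789" * 4
ascii_path = "/tmp/" + (digits[:39] + "/") * 6 + digits[:32] + "/short.pdf"

# For pure ASCII, characters and bytes are the same number.
print(len(ascii_path), utf8_len(ascii_path))

# Cyrillic letters take 2 bytes each in UTF-8, Hangul takes 3 bytes per
# code point, so far fewer characters are needed to reach the same size.
cyrillic_dir = "йфяцычувск"
hangul_dir = "가나다라마바사아자하"
print(len(cyrillic_dir), utf8_len(cyrillic_dir))
print(len(hangul_dir), utf8_len(hangul_dir))
```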
Looks like it's caused by LMDB having a max key size of about 512 bytes (511 in a default build); see mozilla/rkv#49
This apparently can't be changed. So the fix would be one of:
- switch away from rkv/lmdb altogether (e.g. to sqlite which might be nicer anyways)
- hash the full key and store the hash
  this has the disadvantage that partial cache cleaning (e.g. purging the cache of a specific path) becomes impossible
- hash the path partially when it becomes long
  this would probably be the quickest solution, just kinda ugly
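A rough Python sketch of the third option, for illustration only. This is not rga's code: shorten_key and MAX_KEY_BYTES are hypothetical names, and the 511-byte limit is an assumption taken from LMDB's default build. Paths that fit are stored as-is; longer ones keep a readable prefix and get a SHA-256 of the full path appended, so the key never exceeds the limit.

```python
import hashlib

MAX_KEY_BYTES = 511  # assumed limit; a real store should be asked for its own

def shorten_key(path: str, max_len: int = MAX_KEY_BYTES) -> bytes:
    """Return the path bytes if they fit, else a prefix plus a hash of the full path.

    Assumes max_len is larger than 65 (64 hex digits plus one separator byte).
    """
    raw = path.encode("utf-8")
    if len(raw) <= max_len:
        return raw
    digest = hashlib.sha256(raw).hexdigest().encode("ascii")  # 64 bytes
    keep = max_len - len(digest) - 1  # leave one byte for the separator
    return raw[:keep] + b"#" + digest

long_path = "/tmp/" + "a" * 600 + "/short.pdf"
print(len(shorten_key("/tmp/short.pdf")), len(shorten_key(long_path)))
```

The kept prefix means purging by a leading directory still works for most entries, which is the advantage over hashing the whole key. The cost is that the key format now has two shapes.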
If I understand correctly, the key is not just the path. Isn’t another solution, then, to hash only the path, so that it’s still possible to search for the (hash of the) path?
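A minimal Python sketch of that idea, under stated assumptions: the real key layout is not shown in this thread, so cache_key, purge_path and the "adapter=..." fields below are hypothetical, and a plain dict stands in for the key-value store. Only the path is replaced by a fixed-size hash; the other key fields follow it unchanged, so all entries for one path share a common key prefix.

```python
import hashlib

def path_hash(path: str) -> bytes:
    """Fixed-size (32-byte) SHA-256 of the UTF-8 path."""
    return hashlib.sha256(path.encode("utf-8")).digest()

def cache_key(path: str, rest: bytes) -> bytes:
    """Hash of the path, followed by the remaining key fields unchanged."""
    return path_hash(path) + rest

def purge_path(cache: dict, path: str) -> int:
    """Delete every entry whose key starts with the hash of `path`; return the count."""
    prefix = path_hash(path)
    doomed = [key for key in cache if key.startswith(prefix)]
    for key in doomed:
        del cache[key]
    return len(doomed)

cache = {
    cache_key("/tmp/a/short.pdf", b"adapter=poppler"): b"Page 1: hello world",
    cache_key("/tmp/a/short.pdf", b"adapter=postprocprefix"): b"...",
    cache_key("/tmp/b/other.pdf", b"adapter=poppler"): b"...",
}
print(purge_path(cache, "/tmp/a/short.pdf"))  # prints 2
```

Purging works for an exact path this way, but not for a whole directory, since the hashes of two paths in the same directory have nothing in common.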
This should be fixed in 1.0.0-alpha.4 since the cache is now sqlite
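For illustration, a small Python sketch of why a sqlite-backed cache sidesteps the problem. The table layout here is made up for the example and is not rga's actual schema. SQLite's default limit on a string or BLOB value is far above any realistic path length, so a long key can be stored directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per cache entry, keyed by the raw key bytes.
conn.execute("CREATE TABLE cache (key BLOB PRIMARY KEY, value BLOB NOT NULL)")

# A key well beyond the size that failed with the previous store.
long_key = ("/tmp/" + "0123456789" * 100 + "/short.pdf").encode("utf-8")
conn.execute(
    "INSERT INTO cache (key, value) VALUES (?, ?)",
    (long_key, b"Page 1: hello world"),
)

row = conn.execute("SELECT value FROM cache WHERE key = ?", (long_key,)).fetchone()
print(len(long_key), row[0])
```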