mosmeh/indexa

Some files not indexed

Closed this issue · 15 comments

Some files seem to not be getting indexed, and I'm not sure why (is there a way of getting debug info, or a dump of indexed files?)

Example output

$ pwd
/home/caj/files/reps/gap/polecat/partition/src
$ ls
 lib.rs  partition.rs  splitting.rs  tracer.rs
$ ix
<look for these file names>

Various lib.rs files are found, but not the one in this directory, nor any of the other files in this directory.

Thank you for the feedback!

is there a way of getting debug info, or a dump of indexed files?

When the query is empty, indexa shows all the files in the database, so a file that isn't listed there isn't in the database. (But dumping a list of indexed files sounds like a good feature!)

Can you check the following?

  • Can these files be found by other file-finding tools (e.g. find, locate, fd)?
  • Is the directory src itself indexed?
  • Are sibling directories and their contents indexed properly?
  • What if you set dirs in the config to partition or src?

Looking carefully, it seems to just stop so far down the directory structure. For example, /home/caj/files/reps/edf-clean itself is searched, but none of the directories inside it are.

I wondered if that is because that directory contains a git repository, but other directories at a similar level are also not searched.

I think caching/searching is stopping at some point: I ran with -u -t 16 and then ix says 239962 / 239962 files, while -u -t 32 gives 427356 / 427356 files.

it seems to just stop so far down the directory structure

Are you suggesting that the cause is the directory structure being too deep? Can you confirm that by setting dirs to /home/caj/files/reps/, for example?

I ran with -u -t 16 and then ix says 239962 / 239962 files, while -u -t 32 gives 427356 / 427356 files.

Since indexa also indexes inside /proc and other special directories, I guess the difference is coming from pseudo files.

I had a serious poke around. Sometimes path.read_dir from_dir_entry is failing because "Too many open files".

Thank you for the investigation. It looks like we're having the same problem as rust-lang/rust#23715. We have to limit the number of simultaneously open files.

I think it's a different problem. Consider the following code (note the marked line):

use std::fs;
use std::io;

fn main() -> io::Result<()> {
    // Every DirEntry pushed here stays alive until `v` is dropped.
    let mut v: Vec<fs::DirEntry> = Vec::new();

    for _ in 1..10000 {
        for entry in fs::read_dir(".")? {
            v.push(entry?); // <- this line
        }
    }
    println!("Hello, world!");
    Ok(())
}

When I run this code I get a "too many open files" error, which I don't get if I comment out the marked line. The problem is that DirEntry objects keep the open directory "alive". I think you need to store just the filenames (assuming that's all you need), not DirEntry objects (but I am going beyond my comfort zone here).
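As a minimal sketch of the "store just the filenames" idea, the loop above can keep owned paths instead of DirEntry values. The function name collect_paths and its repeats parameter are made up for illustration and are not taken from indexa; the only assumption is that a path is all that needs to survive the loop.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Reads `dir` `repeats` times and keeps only owned paths.
/// Hypothetical helper for illustration; not indexa's code.
fn collect_paths(dir: &Path, repeats: usize) -> io::Result<Vec<PathBuf>> {
    let mut paths = Vec::new();
    for _ in 0..repeats {
        for entry in fs::read_dir(dir)? {
            // `path()` returns an owned PathBuf. The DirEntry itself is a
            // temporary that is dropped at the end of this statement.
            paths.push(entry?.path());
        }
        // The ReadDir iterator is dropped here, and no DirEntry from this
        // pass is still alive, so the directory handle can be closed.
    }
    Ok(paths)
}
```

Calling collect_paths(Path::new("."), 10000) mirrors the loop count of the program above, but only one directory handle should be open at any moment, since nothing in the returned Vec refers back to the directory stream.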

I reported this (rust-lang/rust#77658) and will see what people think; I feel there should at least be a mention in the docs.

I was intentionally avoiding the "storing filenames" approach for performance reasons (reopening files one by one from filenames vs. getting all the children's DirEntrys from read_dir).

No, you were totally correct! I think I can take that approach. Let me try later.
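One possible way to reduce the cost of reopening files later, sketched here under assumptions, is to copy the cheap information out of each DirEntry while iterating and then let the entry drop. The IndexedEntry struct and scan_dir function are hypothetical names, and this is not necessarily what the actual fix does. The standard library documents DirEntry::file_type as needing no extra system call on Windows and most Unix platforms, which is why it is used here rather than a later metadata lookup by path.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical record for illustration; not indexa's actual data structure.
#[derive(Debug)]
struct IndexedEntry {
    path: PathBuf,
    is_dir: bool,
}

/// Lists the direct children of `dir`, keeping only owned data.
fn scan_dir(dir: &Path) -> io::Result<Vec<IndexedEntry>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // `file_type()` does not follow symlinks, so a symlink to a
        // directory is reported as a symlink, not as a directory.
        let is_dir = entry.file_type()?.is_dir();
        out.push(IndexedEntry {
            path: entry.path(),
            is_dir,
        });
        // `entry` is dropped at the end of each iteration, so nothing in
        // `out` keeps the directory handle open.
    }
    Ok(out)
}
```

A recursive indexer could call scan_dir again on every result where is_dir is true, reopening only directories rather than every file.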

Thanks, hopefully it isn't too hard to fix, and doesn't slow things down too much! I wonder if I have more files, or your OS just allows a HUGE number of open files?

I pushed a version that doesn't keep DirEntry objects to fix/too-many-open-files. Can you try it and see if it solves your problem?

I wonder if I have more files, or your OS just allows a HUGE number of open files?

I think I just didn't notice it because it doesn't error out.

That fixes everything!

I do now have another problem: I have a very slow mounted hard drive, and it would be nice to have an "ignore_dir" option for directories I want to skip. That's probably a different issue, though.

I'm glad to hear that!

I created an issue (#2) for the "ignore_dir" option. Please leave comments there if you have any ideas.