simonw/symbex

Work better against huge directories

simonw opened this issue · 5 comments

I tried running symbex '*.*' in a top-level folder and there was a long delay before it started doing anything.

This is because of this code:

symbex/symbex/cli.py

Lines 81 to 83 in b002482

files = [pathlib.Path(f) for f in files]
for directory in directories:
    files.extend(pathlib.Path(directory).rglob("*.py"))

There's no need to wait for the rglob("*.py") to finish running before starting work - it would be more efficient as a generator.
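A minimal sketch of the generator version, assuming the same `files` and `directories` inputs as the snippet above. `iterate_files` is a hypothetical helper name, not necessarily what ended up in symbex:

```python
import pathlib
from typing import Iterable, Iterator


def iterate_files(
    files: Iterable[str], directories: Iterable[str]
) -> Iterator[pathlib.Path]:
    """Yield explicitly listed files first, then walk each directory lazily."""
    for f in files:
        yield pathlib.Path(f)
    for directory in directories:
        # rglob() is itself lazy, so each path is handed to the caller as soon
        # as it is found instead of after the whole tree has been scanned.
        yield from pathlib.Path(directory).rglob("*.py")
```

The caller can then loop over `iterate_files(files, directories)` and start producing output from the first file while the directory walk is still in progress.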

Spotted an extra bug working on this:

  File "/Users/simon/Dropbox/Development/symbex/symbex/cli.py", line 89, in cli
    code = file.read_text("utf-8") if hasattr(file, "read_text") else file.read()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1058, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: 'neo4j.py'
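
The traceback suggests that rglob("*.py") also matched a directory that happens to be named neo4j.py. One possible guard, as a sketch only (`python_files` is a hypothetical helper, not a function in symbex), is to skip anything that is not a regular file:

```python
import pathlib
from typing import Iterator


def python_files(directory: str) -> Iterator[pathlib.Path]:
    """Yield *.py paths under directory, skipping directories named like *.py."""
    for path in pathlib.Path(directory).rglob("*.py"):
        # A directory called "neo4j.py" matches the glob but cannot be read
        # with read_text(), so only regular files are passed on.
        if path.is_file():
            yield path
```

Files inside such a directory are still found, because rglob() continues to recurse into it.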

Now when I run symbex '*.*' in a directory with many subdirectories (like ~/Dropbox/Development on my laptop) it starts outputting results straight away and streams for a long time.

@simonw

I've been watching this repo today, waiting for something like this to come up so that it would be appropriate to point out the existence of https://github.com/spookylukey/pyastgrep/ . It's similar to symbex, but with a less user-friendly and more powerful interface - you use XPath expressions or CSS selectors to locate code. It's also much further ahead in some areas. Basically, you could implement a large amount of symbex's functionality with a bash function:

symbex() { pyastgrep ".//ClassDef[@name=\"$1\"] | .//FunctionDef[@name=\"$1\"]" --heading --context=statement; }

(Except, that wouldn't quite work since I didn't implement spookylukey/pyastgrep#12 yet)

The biggest thing you'd gain at the moment is automatic handling of .gitignore files. This turns out to be quite complex, probably about as complex as the rest of the project put together, and it would be a shame for you to discover that the same way I did! Plus there are probably other things related to file encoding - things you only find when you run on really large amounts of code.

So, can I suggest a collaboration? Probably the best way to do it would be for symbex to use pyastgrep as a library. To date, I haven't been promising API stability when used as a library rather than a CLI tool, but I could easily work with you on that. pyastgrep is already structured with a clean separation of the main layers: locating files to grep, finding results within them, and printing results.
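
To illustrate what that kind of layering looks like, here is a stdlib-only sketch. It is not pyastgrep's API: the function names are hypothetical, and it uses the `ast` module to match symbols by name instead of XPath over lxml.

```python
import ast
import pathlib
from typing import Iterable, Iterator, Tuple

Match = Tuple[pathlib.Path, int, str]  # (file, line number, symbol name)


def locate_files(root: str) -> Iterator[pathlib.Path]:
    """Layer 1: decide which files to search."""
    for path in pathlib.Path(root).rglob("*.py"):
        if path.is_file():
            yield path


def find_matches(paths: Iterable[pathlib.Path], name: str) -> Iterator[Match]:
    """Layer 2: search each file for functions or classes called name."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for path in paths:
        try:
            tree = ast.parse(path.read_text("utf-8"))
        except (SyntaxError, ValueError):
            continue  # skip files that cannot be decoded or parsed
        for node in ast.walk(tree):
            if isinstance(node, kinds) and node.name == name:
                yield path, node.lineno, node.name


def format_matches(matches: Iterable[Match]) -> Iterator[str]:
    """Layer 3: turn matches into printable lines."""
    for path, lineno, name in matches:
        yield f"{path}:{lineno}: {name}"
```

Because each layer only consumes an iterable from the previous one, a caller could swap in a different file locator (for example one that respects .gitignore) without touching the search or output code.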

OK, pyastgrep is awesome - I'm going to link to it from the symbex README as a similar tool with a more powerful set of features.

I'm trying to keep dependencies to an absolute minimum (right now this only uses Click) so I'd rather not depend on all of pyastgrep - mainly because that would pull in a transitive dependency on lxml, and I worry about that for environments like Pyodide (though it looks like they might have solved that).

I hadn't even thought about .gitignore! That's a really gnarly problem.

Have you considered extracting your .gitignore handling code out of pyastgrep into a separate library? I'd absolutely be interested in using that, and my hunch is that there are all sorts of other projects out there that would benefit from a .gitignore handling Python library too.

@simonw I certainly understand the desire to have few dependencies!

I'm also open to extracting the .gitignore stuff as a library. That kind of thing is better done once there are at least two projects that need the functionality, to be sure the API works, but now there are.

I'm building on top of pathspec, but adding a significant amount. It is particularly nasty because .gitignore files can be in any directory: you want to avoid even looking at them if you don't recurse into the directory, and you also need to include the ones in parent directories, without re-doing work. So this imposes a fairly tight structure on the way the directory traversal has to work.
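
A rough sketch of the traversal shape this implies, under heavy simplifying assumptions: it uses `fnmatch` on bare file names instead of pathspec, so negation (`!`), anchored patterns and `**` are not handled, and it does not look at .gitignore files above the starting directory. `read_patterns` and `walk` are hypothetical names, not pyastgrep functions.

```python
import fnmatch
import os
from typing import Iterator, List, Tuple


def read_patterns(directory: str) -> List[str]:
    """Read non-empty, non-comment lines from directory/.gitignore, if any."""
    path = os.path.join(directory, ".gitignore")
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return [line for line in lines if line and not line.startswith("#")]


def walk(directory: str, inherited: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield .py files under directory, applying simplified ignore patterns."""
    # Patterns collected from parent directories are passed down, so each
    # .gitignore is read at most once and only for directories we enter.
    patterns = inherited + tuple(read_patterns(directory))

    def ignored(name: str) -> bool:
        # Simplification: match on the entry name only, ignoring a trailing "/".
        return any(fnmatch.fnmatch(name, p.rstrip("/")) for p in patterns)

    with os.scandir(directory) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if ignored(entry.name):
            continue  # an ignored directory is never opened at all
        if entry.is_dir(follow_symlinks=False):
            yield from walk(entry.path, patterns)
        elif entry.is_file() and entry.name.endswith(".py"):
            yield entry.path
```

The key design point is that the ignore rules travel with the recursion: a pattern defined in a subdirectory's .gitignore only applies below that subdirectory, and pruning happens before descending.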