File Search - Improve performance of file search
chrmarti opened this issue · 2 comments
Part of #55.
We will focus on simple text search. Regex text searches might be more difficult to optimize and are used less often than simple text searches. We will initially explore our options (e.g., refining our own implementation, command line tools, text search libraries/engines).
We will include full-text search libraries/engines in our investigation and see if any of these can do simple text searches while at the same time open the door for also supporting a more natural 'fuzzy' text search at a later time.
The goal is raw search performance with library size (if any) and disk usage of an index (if any) being potential trade-offs.
Subjectively, file search in VS Code is slower than in other editors, and this has been noticed by users. Here are a couple example issues:
Slow search
#14157
#14012
#8680
#63
Slow, cancelling is not discoverable, progress is not helpful
#9853
The first step is
- Admit we have a problem
To nail down the issue, I'm doing a round of performance tests in controlled conditions to better understand the perf characteristics of our current file search, and find the areas that will benefit the most from optimization. I'm doing the same search using other tools for comparison.
Goals
VS Code file search should not be, or feel, significantly slower than other editors. Being faster than some is probably achievable. Results should appear quickly and the UI should remain responsive throughout the duration of the search.
We also have the goal of investigating whether tools that might provide a very fast indexed search, and a more "fuzzy" search that goes beyond literal text, and evaluating the costs associated with it.
Setup
I'm running these tests in two workspaces, workspace A and workspace B. I'm trying to run a search in each that produces about the same number of results, where workspace A is a very large workspace with roughly evenly-distributed results, and workspace B is a very small workspace with densely packed results.
Workspace A is a clone of the chromium repo, by the instructions here. This is workspace of 234,655 files, about 1.8 GB. In this workspace I'm doing a case-insensitive search for the phrase static_library(
, producing 447 results in 282 files, which are well-distributed across the repo. This phrase shows up in BUILD.gn
files which are all over the repo.
Workspace B is a folder containing 4 files, which together have just the phrase static_library("foo")
447 times.
The regex search uses the regex /static_...rary\(/
.
Timing is done using a stopwatch from pressing 'enter' on the query, to the results being fully loaded into the UI and ready for interaction. Except for vscode, into which I injected timing instrumentation, and grep, using the 'time' command. This is pretty adhoc testing on my MBP and Windows desktop with SSDs.
The Grep search is using this command: time find . -type f -print0 | xargs -0 -P 4 fgrep -i -I "static_library(" | wc -l
.
Results
MacOS
- Mid-2015 MacBook Pro
- 2.2 GHz Intel Core i7
- 16 GB 1600 MHz DDR3
- APPLE SSD SM0512G
Tool | Workspace A | Workspace A regex | Workspace B | Notes |
---|---|---|---|---|
Webstorm | 3.5s | 8s | instant | Indexed search* |
grep | 18s | 19s | .02s | Obviously doesn't need to build a UI, etc etc |
git grep | 10s | 9.3 | n/a | Only searches files that git knows about. Multithreaded where normal grep is not. |
Sublime | 28s | 28s | instant | Streams results into a document in lexicographical order |
Atom | 2m50s | 2m47s | instant | Shows results during the search, but scrolls back to the top every time new results are added |
VS Code | 1m49s | 1m49s | .25s | No results appear until the end |
- Webstorm indexes the workspace on launch. If you search before the index is finished building (takes a minute or so for workspace A) it takes about 50s to search. Also has a search "Context" option to search just in string literals, comments, etc. I want that.
Windows
- Intel Xeon E3-1240 v3 @3.40 GHz
- 16 GB RAM
- Crucial 500GB SSD
Tool | Workspace A | Workspace A regex | Workspace B | Notes |
---|---|---|---|---|
Webstorm | 2.5s | 2.8s | instant | Indexed search, see above notes. |
findstr | 18s | 29s | instant | |
Sublime | 29s | 29s | instant | |
VS | 55s | 3m45s | 1s | No progress bar. Apparently doesn't exclude binary files like other tools, as it spent awhile searching a .mp4 file. |
VS Code | 2m18s | 2m18s | instant |
Analysis
VS Code time breakdown
What happened when VS Code spent 1m49s on a search:
- 0s - Search started
- .2s - First candidate file found in the search process
- .5s - First match found in the search process
- 108.289s - Batch of 282 FileMatch objects received in the frontend (447 matches total)
- 108.324s - Final SearchResult.add call, updating TreeView UI begins
- 109.03s - TreeView UI finished updating
The time from starting the search to finding a match is very short, but then we sit on that match for 100+ seconds, because we don't send results to the frontend until we've filled a batch of 512. Since there are fewer than a full batch of results, nothing is visible until the search is over. So we could increase the perceived performance quite a lot by adding a timeout to send intermediate results over if the batch doesn't fill up.
I changed the batch size to 20 and got results immediately, which is much nicer. But scrolling around through the list, it jumped around a lot as results were inserted into the middle. Results are returned as files are finished, which may or may not be in order. We could only show results in order (Sublime does this), or show results as they come in. This is also a problem with a batch size of 512, but there are only 4 batches in our max results limit of 2048, so less of a problem.
On the end of making search faster, we should consider using the find
command (we do this for cmd+p, but not text search) to get a list of files, and grep
to search them. We can use findstr
on Windows. I'm trying to understand how much time the filesystem walking is responsible for vs the search, but it's difficult to say exactly since the two flows are interleaved. I also want to profile the search process more to understand where it's spending time.
We can also try to run our search in more than one process. I saw a near 4x speedup when running 4 processes of grep instead of 3. Depending on disk speed, number of cores, and amount of time we spend just sending information between processes.
Fuzzy search
We also wanted to investigate options for enabling search-engine-style indexed fuzzy search in a workspace. 'Indexed' meaning an index is built for the files in the workspace, on disk or in memory, and a search in a large workspace can complete in seconds or milliseconds. 'Fuzzy' meaning a search based on not literal text or regex, but on word stemming, word proximity, etc. Here's an example use case, based on a true story:
I'm exploring the vscode code base, trying to find the code responsible for file search. I cmd+shift+F
and search search files
. No results. file search
. No results. filesearch
? That's it, there are 50 results with that literal. But it would be useful to find a comment that says // This code searches the text in files
.
There are a bunch of options for this, but any third-party tool here will increase our download size, so we plan to consider exposing search as an extension API, and implementing an optional extension that would provide this advanced search behavior for people who want it, while making sure that our built-in search is still usable and fast.
These results come from running the TextSearch.performance
test a couple times and averaging (where each run averages 3 runs of the same search). It executes the search and returns results to the frontend process, but doesn't do UI things with them.
MacOS | Win | |
---|---|---|
With Buffer.slice (original code) | 100s | 125.5s |
Replacing Buffer.slice with buffer.toString(null, start, end) | 71.1s | 97.05s |
With 4 worker processes (for a total of 5 (very very rough)) | 26.4 | 19.0 |
With code in PR | 17.6 | 16.3 |