File Search - Improve performance of file search

Question

File Search - Improve performance of file search

chrmarti opened this issue 8 years ago · 2 comments

Part of #55.

We will focus on simple text search. Regex text searches might be more difficult to optimize and are used less often than simple text searches. We will initially explore our options (e.g., refining our own implementation, command line tools, text search libraries/engines).

We will include full-text search libraries/engines in our investigation and see if any of these can do simple text searches while at the same time open the door for also supporting a more natural 'fuzzy' text search at a later time.

The goal is raw search performance with library size (if any) and disk usage of an index (if any) being potential trade-offs.

Answer 1 · 2016-11-15T23:20:09.000Z

Subjectively, file search in VS Code is slower than in other editors, and this has been noticed by users. Here are a couple example issues:

Slow search
#14157
#14012
#8680
#63

Slow, cancelling is not discoverable, progress is not helpful
#9853

The first step is

Admit we have a problem

To nail down the issue, I'm doing a round of performance tests in controlled conditions to better understand the perf characteristics of our current file search, and find the areas that will benefit the most from optimization. I'm doing the same search using other tools for comparison.

Goals

VS Code file search should not be, or feel, significantly slower than other editors. Being faster than some is probably achievable. Results should appear quickly and the UI should remain responsive throughout the duration of the search.

We also have the goal of investigating whether tools that might provide a very fast indexed search, and a more "fuzzy" search that goes beyond literal text, and evaluating the costs associated with it.

Setup

I'm running these tests in two workspaces, workspace A and workspace B. I'm trying to run a search in each that produces about the same number of results, where workspace A is a very large workspace with roughly evenly-distributed results, and workspace B is a very small workspace with densely packed results.

Workspace A is a clone of the chromium repo, by the instructions here. This is workspace of 234,655 files, about 1.8 GB. In this workspace I'm doing a case-insensitive search for the phrase static_library(, producing 447 results in 282 files, which are well-distributed across the repo. This phrase shows up in BUILD.gn files which are all over the repo.

Workspace B is a folder containing 4 files, which together have just the phrase static_library("foo") 447 times.

The regex search uses the regex /static_...rary\(/.

Timing is done using a stopwatch from pressing 'enter' on the query, to the results being fully loaded into the UI and ready for interaction. Except for vscode, into which I injected timing instrumentation, and grep, using the 'time' command. This is pretty adhoc testing on my MBP and Windows desktop with SSDs.

The Grep search is using this command: time find . -type f -print0 | xargs -0 -P 4 fgrep -i -I "static_library(" | wc -l.

Results

MacOS
- Mid-2015 MacBook Pro
- 2.2 GHz Intel Core i7
- 16 GB 1600 MHz DDR3
- APPLE SSD SM0512G

Tool	Workspace A	Workspace A regex	Workspace B	Notes
Webstorm	3.5s	8s	instant	Indexed search*
grep	18s	19s	.02s	Obviously doesn't need to build a UI, etc etc
git grep	10s	9.3	n/a	Only searches files that git knows about. Multithreaded where normal grep is not.
Sublime	28s	28s	instant	Streams results into a document in lexicographical order
Atom	2m50s	2m47s	instant	Shows results during the search, but scrolls back to the top every time new results are added
VS Code	1m49s	1m49s	.25s	No results appear until the end

Webstorm indexes the workspace on launch. If you search before the index is finished building (takes a minute or so for workspace A) it takes about 50s to search. Also has a search "Context" option to search just in string literals, comments, etc. I want that.

Windows
- Intel Xeon E3-1240 v3 @3.40 GHz
- 16 GB RAM
- Crucial 500GB SSD

Tool	Workspace A	Workspace A regex	Workspace B	Notes
Webstorm	2.5s	2.8s	instant	Indexed search, see above notes.
findstr	18s	29s	instant
Sublime	29s	29s	instant
VS	55s	3m45s	1s	No progress bar. Apparently doesn't exclude binary files like other tools, as it spent awhile searching a .mp4 file.
VS Code	2m18s	2m18s	instant

Analysis

VS Code time breakdown

What happened when VS Code spent 1m49s on a search:
- 0s - Search started
- .2s - First candidate file found in the search process
- .5s - First match found in the search process
- 108.289s - Batch of 282 FileMatch objects received in the frontend (447 matches total)
- 108.324s - Final SearchResult.add call, updating TreeView UI begins
- 109.03s - TreeView UI finished updating

The time from starting the search to finding a match is very short, but then we sit on that match for 100+ seconds, because we don't send results to the frontend until we've filled a batch of 512. Since there are fewer than a full batch of results, nothing is visible until the search is over. So we could increase the perceived performance quite a lot by adding a timeout to send intermediate results over if the batch doesn't fill up.

I changed the batch size to 20 and got results immediately, which is much nicer. But scrolling around through the list, it jumped around a lot as results were inserted into the middle. Results are returned as files are finished, which may or may not be in order. We could only show results in order (Sublime does this), or show results as they come in. This is also a problem with a batch size of 512, but there are only 4 batches in our max results limit of 2048, so less of a problem.

On the end of making search faster, we should consider using the find command (we do this for cmd+p, but not text search) to get a list of files, and grep to search them. We can use findstr on Windows. I'm trying to understand how much time the filesystem walking is responsible for vs the search, but it's difficult to say exactly since the two flows are interleaved. I also want to profile the search process more to understand where it's spending time.

We can also try to run our search in more than one process. I saw a near 4x speedup when running 4 processes of grep instead of 3. Depending on disk speed, number of cores, and amount of time we spend just sending information between processes.

Fuzzy search

We also wanted to investigate options for enabling search-engine-style indexed fuzzy search in a workspace. 'Indexed' meaning an index is built for the files in the workspace, on disk or in memory, and a search in a large workspace can complete in seconds or milliseconds. 'Fuzzy' meaning a search based on not literal text or regex, but on word stemming, word proximity, etc. Here's an example use case, based on a true story:

I'm exploring the vscode code base, trying to find the code responsible for file search. I cmd+shift+F and search search files. No results. file search. No results. filesearch? That's it, there are 50 results with that literal. But it would be useful to find a comment that says // This code searches the text in files.

There are a bunch of options for this, but any third-party tool here will increase our download size, so we plan to consider exposing search as an extension API, and implementing an optional extension that would provide this advanced search behavior for people who want it, while making sure that our built-in search is still usable and fast.

Answer 2 · 2016-11-23T19:09:46.000Z

These results come from running the TextSearch.performance test a couple times and averaging (where each run averages 3 runs of the same search). It executes the search and returns results to the frontend process, but doesn't do UI things with them.

	MacOS	Win
With Buffer.slice (original code)	100s	125.5s
Replacing Buffer.slice with buffer.toString(null, start, end)	71.1s	97.05s
With 4 worker processes (for a total of 5 (very very rough))	26.4	19.0
With code in PR	17.6	16.3