noib3/nvim-completion

honestly im interested in how this turns out

ms-jpq opened this issue · 14 comments

i've actually thought about writing coq in rust like half a dozen times, I probably would have if i wasn't caught up in a huge rust project for work atm.

i wonder how little / much optimization you can do to beat coq in completion speeds. recently i actually rewrote coq's tokenizer to run in randomized order, both lengthwise and front <-> backwards, so when you have an input that is too long and it exceeds the ingestion limit, you don't have a bias where things in the middle of it never get indexed.
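
Something like this minimal sketch of the idea (my own illustration, not coq's actual tokenizer; the `rand` crate usage and the `sample_tokens` helper are assumptions):

```rust
use rand::seq::SliceRandom;
use rand::thread_rng;

/// When a buffer's tokens exceed the ingestion limit, index a random
/// subset instead of always keeping the first `limit` tokens, so words
/// in the middle of a long input still get indexed across repeated
/// passes instead of being skipped every single time.
fn sample_tokens<'a>(tokens: &[&'a str], limit: usize) -> Vec<&'a str> {
    if tokens.len() <= limit {
        return tokens.to_vec();
    }
    let mut sampled: Vec<&str> = tokens.to_vec();
    sampled.shuffle(&mut thread_rng());
    sampled.truncate(limit);
    sampled
}
```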


honestly i think a straight up rust port of coq with minor adjustments would smash it by a large margin. the python RPC client is written with gevent and stitched together with asyncio, which is, like, really weird.

anyways heres what i've learned by writing coq that can also be applicable here:

  • the 30000 word txt file is gonna really suck the first time you use it: you realize that if your fuzzy search is any good, the dictionary will dominate every other search result, or at least fill up all the spots other sources don't.

  • treesitter is really slow if you dont do it right, but for each tree sitter node you can get its line begin and line end, and with some event handling you can progressively map the buffers by indexing small slices each time (see the first sketch after this list)

  • paths completion is really tricky, to the point where Shougo of deoplete.nvim actually gave up on implementing it for his rewrite (at least i saw him say he gave up, no idea if he redid it or not). you have to account for both windows and posix, and on windows you need to handle both / and \\, while making sure the returned results can be ranked relatively sanely (see the second sketch after this list)

  • honestly, try your best not to expose too much custom api surface to third party lua / vimscript. you will probably regret it. just make them talk to you using the LSP / omnifunc protocol. it's not that bad.

  • figuring out when to invalidate the cache is hard: too lax and you give people garbage results, too strict and you are not returning any results, since the LSPs expect you to have things cached. coq just takes a multiset intersection of the previous symbol under the cursor against the current one, which always gives you a score between 0 and 1, and if you score above a threshold you do not invalidate the cache (see the third sketch after this list)
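
On the treesitter point, a hedged sketch of what "progressively mapping the buffer" could look like: every node contributes a (start_line, end_line) range, and each editor event only indexes a small budget of ranges. All names here are hypothetical, not coq's or nvim-completion's internals:

```rust
use std::collections::VecDeque;

/// Queue of (start_line, end_line) ranges, one per tree-sitter node.
struct IncrementalIndexer {
    pending: VecDeque<(usize, usize)>,
}

impl IncrementalIndexer {
    fn new(node_ranges: impl IntoIterator<Item = (usize, usize)>) -> Self {
        Self { pending: node_ranges.into_iter().collect() }
    }

    /// Called from an event handler: index at most `budget` slices,
    /// so a big buffer gets mapped gradually instead of in one pass.
    fn on_event(&mut self, lines: &[String], budget: usize, index: &mut Vec<String>) {
        for _ in 0..budget {
            let Some((start, end)) = self.pending.pop_front() else { break };
            for line in lines.iter().take(end + 1).skip(start) {
                // Extract words from this small slice only.
                index.extend(
                    line.split(|c: char| !c.is_alphanumeric() && c != '_')
                        .filter(|w| !w.is_empty())
                        .map(|w| w.to_owned()),
                );
            }
        }
    }
}
```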
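For paths, std already treats both `/` and `\` as separators on windows via `std::path::is_separator`, which is one way to split the text before the cursor into a directory to list and a fragment to rank (a hypothetical helper, just to show the shape of the problem):

```rust
use std::path::{is_separator, PathBuf};

/// Split the typed text into (directory to list, fragment to match).
/// `is_separator` accepts both `/` and `\` on windows but only `/`
/// on posix, covering the cross-platform headache mentioned above.
fn split_path_input(input: &str) -> Option<(PathBuf, String)> {
    let idx = input.rfind(is_separator)?;
    let (dir, frag) = input.split_at(idx + 1);
    Some((PathBuf::from(dir), frag.to_string()))
}
```

Entries under `dir` can then be fuzzy-matched against `frag` and ranked alongside the other sources.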
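And for cache invalidation, a sketch of the multiset-intersection score; normalizing by the longer symbol keeps it in [0, 1], though the exact normalization coq uses is my assumption:

```rust
use std::collections::HashMap;

/// Count the characters shared (with multiplicity) between the
/// previous and current symbol under the cursor, normalized by the
/// longer of the two, yielding a similarity score in [0, 1].
fn similarity(prev: &str, curr: &str) -> f64 {
    if prev.is_empty() && curr.is_empty() {
        return 1.0;
    }
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in prev.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let mut shared = 0;
    for c in curr.chars() {
        if let Some(n) = counts.get_mut(&c) {
            if *n > 0 {
                *n -= 1;
                shared += 1;
            }
        }
    }
    shared as f64 / prev.chars().count().max(curr.chars().count()) as f64
}

/// Keep the cache for as long as the symbols stay similar enough.
fn should_invalidate(prev: &str, curr: &str, threshold: f64) -> bool {
    similarity(prev, curr) < threshold
}
```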


anyways looking forward to this :)

also try your hardest to not get in a serious relationship, you will have no time to program

also I guess one of my biggest fears for projects like this is how much upstream dependency you are taking on. for coq i basically wrote every single line of code, including all dependencies outside of pynvim and pyyaml, because, like, i just dont wanna be dependent on the graces of some upstream dude who may just decide to take a hiatus from open source.

i guess with mlua + nvim-oxi being like 30k+ lines of code that potentially wont be maintained, there is a risk that this project will have to take on some really heavy maintenance burdens

noib3 commented

Hey @ms-jpq,

  • the 30k word source is already taking 7ms from the user inserting a character into the buffer to the results coming back. I'm curious to see how much that's gonna increase once I add fuzzy filtering and UI logic. For my fuzzy search I was thinking of reimplementing whatever fzf is doing. Of all the fuzzy sorters I've used that's by far the best one;

  • for treesitter (and also for other purposes like adding a source that returns the words in the buffer, doing close-to-cursor filtering, etc) I was originally thinking of recreating my own copy of the buffer and gradually updating it as the user types. I'm not sure I'll end up doing it because of a) synchronization issues, and b) memory footprint (basically 2x);

  • regarding the API, the extensibility model I have in mind is simply exposing a CompletionSource trait to implement (see the sketch after this list). This would mean all sources would be written in Rust, with the users writing their configs in Lua. My only issue with that is that if a user wants to write their own completion source they'd have to modify their local copy of completion-plugin and rebuild the plugin from source;

  • I was thinking of doing something very similar for cache invalidation, but also making the logic both buffer-type and source-dependent, like for example in Python you could invalidate the cache when the user types a . (e.g. foo.bar), while for path completions you invalidate at / on unix and \\ on Windows;

  • I'm not too worried about maintenance burdens. I'm actually not using mlua (I wrote nvim-oxi specifically to avoid it), and nvim-oxi is my own project so I'm already maintaining it ;)
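
To make that extensibility model concrete, here's a rough sketch of what such a trait could look like, with the per-source trigger characters from the cache-invalidation point folded in as a method. These names are illustrative, not nvim-completion's actual API:

```rust
/// Minimal completion context; a real one would carry the buffer,
/// cursor position, filetype, etc.
struct CompletionContext {
    /// Text of the current line up to the cursor.
    line_up_to_cursor: String,
}

struct CompletionItem {
    text: String,
    score: f64,
}

trait CompletionSource: Send + Sync {
    /// Characters that invalidate this source's cache, e.g. `.` for
    /// Python members, `/` (and `\` on windows) for paths.
    fn trigger_characters(&self) -> &[char];

    /// Produce completions for the current context.
    fn complete(&self, ctx: &CompletionContext) -> Vec<CompletionItem>;
}

/// Toy source returning a fixed word, just to show the shape.
struct LipsumSource;

impl CompletionSource for LipsumSource {
    fn trigger_characters(&self) -> &[char] {
        &[]
    }

    fn complete(&self, _ctx: &CompletionContext) -> Vec<CompletionItem> {
        vec![CompletionItem { text: "lorem".into(), score: 1.0 }]
    }
}
```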

Anyways, thanks a lot for your advice and for writing coq!

oh shit, thats awesome, didnt know that you also maintained nvim-oxi.

man, and i thought i put a lot of work into nvim, you probably outdid me by 4x haha

noib3 commented

the 30000 word txt file is gonna really suck the first time you use it

well, not as bad as I thought!

I'm going from user input to pixels on the screen in about ~30-35ms (w/ a debug build, not release) using the lipsum source which gives back 30k completions (and that's orders of magnitude more completions than real sources would provide).

About 70% of that is spent fuzzy matching the results, and I still have to implement caching which would remove the other 30% on most keystrokes.

[screen recording: rec.mov]

Also, the fuzzy matching crate I'm using at the moment seems to not be that well optimized, being ~2x slower compared to fzf (which is written in Go).

If I were to rewrite the fzf algo in Rust (which I'm probably gonna do eventually, but not right now) and add caching I could bring that down even more, to around 10-20ms. EDIT: see below.
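
For reference, the naive baseline of subsequence fuzzy matching looks something like the sketch below; fzf's actual algorithm is a much smarter Smith-Waterman-style scorer with bonus/penalty tables, so this only shows the general shape:

```rust
/// Greedy subsequence matcher: every pattern character must appear
/// in the candidate in order; consecutive matches earn a streak
/// bonus. Returns None when the candidate doesn't match at all.
fn fuzzy_score(pattern: &str, candidate: &str) -> Option<i32> {
    let mut score = 0;
    let mut streak = 0;
    let mut chars = candidate.chars();
    for p in pattern.chars() {
        let mut found = false;
        for c in chars.by_ref() {
            if c.eq_ignore_ascii_case(&p) {
                streak += 1;
                score += 1 + streak; // reward consecutive matches
                found = true;
                break;
            }
            streak = 0; // a gap resets the streak
        }
        if !found {
            return None;
        }
    }
    Some(score)
}

fn main() {
    // "fzf" matches "fuzzy-finder" as a subsequence; "xyz" doesn't.
    assert!(fuzzy_score("fzf", "fuzzy-finder").is_some());
    assert!(fuzzy_score("xyz", "fuzzy-finder").is_none());
}
```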

the 30000 word txt file is gonna really suck the first time you use it

haha i meant more like in terms of the result of the dictionary basically overwhelming everything else, by virtue of being better matches :D

but those results are very very fast, :o. 🥇


Here's coq's results on a much much smaller input set, editing its own source code.

Tested on an M1 Max.

| Source | Avg Duration | Q01 Duration | Q50 Duration | Q95 Duration | Q99 Duration |
| ------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| 3P     | 7ms          | 4ms          | 7ms          | 12ms         | 17ms         |
| BUF    | 9ms          | 4ms          | 9ms          | 15ms         | 20ms         |
| LSP    | 11ms         | 4ms          | 7ms          | 30ms         | 57ms         |
| PATH   | 2ms          | 1ms          | 2ms          | 4ms          | 8ms          |
| SNIP   | 7ms          | 4ms          | 7ms          | 13ms         | 15ms         |
| T9     | 50ms         | 2ms          | 73ms         | 90ms         | 91ms         |
| TAG    | 6ms          | 3ms          | 6ms          | 12ms         | 16ms         |
| TMUX   | 4ms          | 2ms          | 3ms          | 6ms          | 12ms         |
| TS     | 4ms          | 2ms          | 3ms          | 7ms          | 13ms         |

This is after I rewrote the entire RPC engine from the ground up (ms-jpq/coq_nvim#520), yielding some improvements in responsiveness here and there.

Honestly I think this is about as fast as coq can go, unless I ported it over to a faster language.

In practice, I did not even perceive any speed improvements from the RPC client rework on my M1, until I stepped down to a slower machine. You'd probably have to do this much sooner haha, maybe develop on a raspberry pi?

Anyways tho, the slowest source just dominates the q50 time no matter what we do; we just can't make LSP / Tabnine go faster, outside of caching.

noib3 commented

haha i meant more like in terms of the result of the dictionary basically overwhelming everything else, by virtue of being better matches :D

got it 👍. Atm I'm only testing one source so I don't have to worry about that yet. What's coq doing to score completions coming from different sources, other than fuzzy matching?

Honestly I think this is about as fast as coq can go, unless I ported it over to a faster language.

I'm actually pretty impressed with the LSP times. I assume you're getting those from the Neovim LSP client and that's the time it takes to get them, sort them w/ the other completions and send them back to Neovim to be displayed? 11ms seems almost too fast to get the response from the Language server and cross the RPC boundary twice.

Anyways tho, the slowest source just dominates the q50 time no matter what we do; we just can't make LSP / Tabnine go faster, outside of caching

Right but are you not sending the results independently? Or are you waiting for all sources to complete before sending the sorted completions?

the measurements are taken from the moment coq sources run, not overall rpc -> rpc

the overall rpc->rpc roundtrip is about 20ish ms on my M1. this does not include what happens after vim receives the results, so no UI time is measured.

i aggregate all results for ranking before they are sent. the reason is, well, for coding (not synthetic benchmarking) the slowest component is how fast you can read the results and decide if they are useful or not.

a naive approach of shoving the cheaper sources to the top, i found, was actually way worse UX

the overall time also includes me calling into nvim to request information on the completion context. actually, i just realized i am doing that stupidly; adjusting it could probably yield more improvements

[screenshot]

edit:

If i turn on tabnine this thing goes to like 80ms average, and honestly maybe my eyes are not good enough: i just cant tell much difference between 80ms and 20ms on a no-brand 60 Hz monitor

noib3 commented

I see. Thanks a lot for sharing coq's benchmarks, even though we're measuring times a bit differently it'll definitely be interesting to see how nvim-completion performs.

im sure it will be 🚀 🚀 🚀 <3

noib3 commented

If I were to rewrite the fzf algo in Rust (which I'm probably gonna do eventually, but not right now) and add caching I could bring that down even more, to around 10-20ms

Well, never mind, I've added caching and wanted to see how much of a difference using a release build of the plugin would make. The whole pipeline takes ~2-3ms, down from ~25-30ms with a debug build 🤯

[screenshot: Screen Shot 2022-09-21 at 11 30 14 PM]

🤠


rip #16 (comment)