dense-analysis/neural

Implement safe analysis of ranges of code

w0rp opened this issue · 5 comments

w0rp commented

Everyone and their mother is writing an OpenAI/ChatGPT or similar plugin. People want to analyse their code with machine learning, but are going about it all wrong. People have started copying and pasting entire regions of code into machine learning tools, whether manually or through editor plugins. This approach is fundamentally flawed for the following reasons.

  1. This is a massive security risk. You could very easily leak passwords or other sensitive information to third parties, and you should never trust a third party.
  2. This presents a massive risk of leaking intellectual property. You can be sure managers will ban from their companies any plugin that might send information which should not be shared to a third party.
  3. The solution is only good for demos. Machine learning tools need to be prompted carefully to produce reliable results.

Instead of simply firing code or text at machine learning tools blindly, Neural will take the following approach (a rough sketch of step 2 follows the list).

  1. Analyse code using local tools (and later local models) so information never leaves the host machine.
  2. Produce reliable intermediate representations of code and text that can be pre-processed into safe prompts to send to machine learning tools.
  3. Send safe data to third parties and return the results.
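As a minimal sketch of step 2 only (an illustration of the idea, not Neural's actual implementation), the following uses Python's standard-library ast module to strip literal values locally, so that only the structure of the code is available when a prompt is built:

```python
import ast


class LiteralRedactor(ast.NodeTransformer):
    """Replace string and numeric literals with placeholders."""

    def visit_Constant(self, node: ast.Constant) -> ast.Constant:
        if isinstance(node.value, str):
            return ast.copy_location(ast.Constant(value='<string>'), node)
        if isinstance(node.value, (int, float, complex)) and not isinstance(node.value, bool):
            return ast.copy_location(ast.Constant(value=0), node)
        return node


def redact(source: str) -> str:
    """Return a structurally equivalent version of the source with literal values removed."""
    tree = LiteralRedactor().visit(ast.parse(source))
    return ast.unparse(tree)


if __name__ == '__main__':
    code = 'db.connect("10.0.0.5", password="hunter2", retries=3)'
    # Prints: db.connect('<string>', password='<string>', retries=0)
    print(redact(code))
```

A real implementation would work from language-agnostic semantic data rather than a Python-only parser, but the principle is the same: secrets never need to become part of the prompt.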

Nothing will ever be able to stop a user manually copying and pasting whole sections of code, but no sane software should automatically or implicitly introduce these risks to unwitting users. Software should lead you in the right direction, not the wrong one. In future, Dense Analysis will be working on and integrating with local FOSS machine learning models, which will offer a lot of power. The future of machine learning is not rate-limited third party providers hosting binary blobs you cannot audit, who share your data with God knows who, but models and tools entirely controlled by you.

Speaking in practical terms, we can implement this feature fairly easily.

  1. Integrate with LSP tooling that already exists, such as the Neovim LSP client and ALE.
  2. Pull out semantic information about code.
  3. Automatically remove potentially sensitive information from the semantic data analysed, producing abstract intermediate representations (IR).
  4. Prompt machine learning tools with that IR instead of the raw code, yielding results similar to wholesale code copying without the aforementioned risks.

I think this plan can be implemented relatively quickly.
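As a rough sketch of steps 2-4: the data shapes below follow the LSP textDocument/documentSymbol response from the specification, but the function names and prompt wording are illustrative and not Neural's actual code. The idea is to flatten the symbol tree an LSP client already holds into a compact IR and use that as prompt context instead of the source text:

```python
from typing import Any, Dict, List

# A few entries from the LSP SymbolKind enumeration (part of the LSP specification).
SYMBOL_KINDS = {5: 'class', 6: 'method', 12: 'function', 13: 'variable'}


def symbols_to_ir(symbols: List[Dict[str, Any]], depth: int = 0) -> List[str]:
    """Flatten a textDocument/documentSymbol tree into indented 'kind name' lines."""
    lines: List[str] = []
    for symbol in symbols:
        kind = SYMBOL_KINDS.get(symbol.get('kind'), 'symbol')
        lines.append('  ' * depth + f"{kind} {symbol['name']}")
        lines.extend(symbols_to_ir(symbol.get('children', []), depth + 1))
    return lines


if __name__ == '__main__':
    # Example documentSymbol response for a file defining a class with two methods.
    response = [
        {'name': 'SessionStore', 'kind': 5, 'children': [
            {'name': 'load', 'kind': 6, 'children': []},
            {'name': 'save', 'kind': 6, 'children': []},
        ]},
    ]
    ir = '\n'.join(symbols_to_ir(response))
    # Only names and structure reach the prompt; no code bodies or literal values.
    print(f"Given a module with this structure:\n{ir}\nSuggest tests to add.")
```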

no sane software should automatically or implicitly introduce these risks to unwitting users

That rules out GitHub then, I suppose, and practically every other SaaS service on the planet? Not sure how AI tools are any different in that regard?

What's important is:

  1. What does the privacy policy say about how the data is being used?
  2. How trustworthy is the service provider?
  3. That people think about the above and use their own judgement.

I think it's good to inform and warn, but let users make their own decisions. The alternative for most users is likely that they'll choose another plugin, or use the web-interface directly anyway.

That said: thanks for all your contributions to the community! Any effort towards making privacy-aware software is more than welcome!

w0rp commented

I think it's good to inform and warn, but let users make their own decisions. The alternative for most users is likely that they'll choose another plugin, or use the web-interface directly anyway.

Yes, this is my intention. With FOSS everyone gets to make their own decisions, including their own forks, and should not live at the behest of some author of some software. You should have control, not I.

Cool, I appreciate your dedication. I guess I was a bit misled by the (now removed) video, as well as the fact that you promote the OpenAI integration. I just think it might make sense to mention that the integration is intentionally limited. I'm just not sure how useful the plugin is if you can't easily use the selection as a query.

It's your decision of course, but I'll probably look for another plugin for now. Will keep an eye on where this goes anyway! Cheers!

Perhaps a way to address the privacy concern is to add a configuration option on the "neural source" that enables this behaviour, off by default for OpenAI sources.

@w0rp In the future, you may have custom sources from personal/company GPT-J instances, perhaps a local GPT-Neo instance, etc. You may even just be working on personal side projects and have no concerns about sending snippets of code to an external party - let's not forget there is a prompt limit, so it is very hard to send an entire code base over.

In those cases, you would want to be able to highlight lines as context for the prompt. It also opens the door to other functionality in the future.
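To make the opt-in suggestion above concrete, here is a hypothetical sketch; the option and source names are invented for illustration and are not Neural's actual configuration:

```python
from dataclasses import dataclass


@dataclass
class SourceConfig:
    """Per-source settings; hypothetical names for illustration only."""
    name: str
    # Whether highlighted lines may be attached to prompts for this source.
    allow_code_context: bool = False  # off by default, so remote sources never see code


# A third-party source stays opted out unless the user flips the flag themselves.
openai_source = SourceConfig(name='openai')
# A self-hosted or local model can be opted in explicitly.
local_source = SourceConfig(name='local-gpt', allow_code_context=True)
```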

@teppix We need to think carefully about this, but I think there are some solutions on the horizon. You may find other plugins that can do it without consideration of the compromises, and that's where people can run into problems.

w0rp commented

Sending text to local models that don't themselves make remote connections is fine. I'm anticipating local models being commonplace in 1-2 years based on how quickly local machine learning tools for image generation have been adopted.

@Angelchev and I also discussed the possibility of "remote" tools that are hosted on a "local" instance run by a programming team. Sending data to those will also be fine.