SeanLee97/AnglE

Code embeddings


Is there any information on whether this is also recommended for extracting embeddings from code snippets, in particular JavaScript and Solidity?

Hi @rragundez, you could try WhereIsAI/UAE-Code-Large-V1. It was trained on the github-issue-similarity dataset, which contains some JavaScript code.

from angle_emb import AnglE

# Load the code-oriented checkpoint and move it to the GPU
angle = AnglE.from_pretrained('WhereIsAI/UAE-Code-Large-V1').cuda()

angle.encode("YOUR CODE")
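One way to judge whether the embeddings are good enough for a given language is to compare snippets by cosine similarity. The sketch below is only an illustration and not part of the AnglE API: `cosine_similarity`, `rank_snippets`, and `embed` are hypothetical helper names, and `embed` assumes the `angle-emb` package and a CUDA GPU are available.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_snippets(query_vec, candidate_vecs):
    """Return (index, score) pairs, most similar candidate first."""
    scores = [(i, cosine_similarity(query_vec, v))
              for i, v in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(snippets):
    """Embed code snippets with the checkpoint suggested in this thread.

    Downloads the model and needs a GPU, so nothing here runs at import time.
    """
    from angle_emb import AnglE  # pip install angle-emb

    angle = AnglE.from_pretrained('WhereIsAI/UAE-Code-Large-V1').cuda()
    # One embedding row is returned per input snippet
    return angle.encode(list(snippets), to_numpy=True)
```

For example, `vecs = embed([query, *candidates])` followed by `rank_snippets(vecs[0], vecs[1:])` orders the candidates by similarity to the query, which makes it easy to spot-check JavaScript or Solidity pairs by hand.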

Let me try it, and I'll report the results back here.

It did work, but the results on Solidity code are not very good. Thanks.

I am going to try an LLM trained on Solidity code, but it is distributed as GGUF files. How would I use those in this library? For example:

https://huggingface.co/mradermacher/Solidity-Llama3-8b-GGUF

Would this work out of the box by just passing the model name as the argument?

Yes. For LLM inference, see the documentation: https://angle.readthedocs.io/en/latest/notes/quickstart.html#infer-llm-based-models

Since this model hasn't been trained for sentence embedding, it is recommended to use a prompt to improve performance. You can specify one with angle.encode(..., prompt="Here is a prompt: {text}.").
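A minimal sketch of the prompt suggestion, under some assumptions: `SOLIDITY_PROMPT`, `preview_prompt`, and `encode_with_prompt` are hypothetical names, and the template wording is only an example. Depending on the installed angle-emb release, inputs may need to be plain strings or `{'text': ...}` dicts when a prompt is given, so the `as_dicts` flag covers both; check the documentation of your version.

```python
from typing import Any, Iterable

# Hypothetical template; the thread only requires a {text} placeholder.
SOLIDITY_PROMPT = "Summarize the following Solidity code in one word: {text}"


def preview_prompt(prompt: str, code: str) -> str:
    """Show the string the model would see once {text} is substituted."""
    if "{text}" not in prompt:
        raise ValueError("prompt must contain a {text} placeholder")
    # str.replace avoids problems with other braces in the template
    return prompt.replace("{text}", code)


def encode_with_prompt(angle: Any, snippets: Iterable[str],
                       prompt: str = SOLIDITY_PROMPT,
                       as_dicts: bool = False):
    """Call angle.encode(..., prompt=...) as suggested in the thread.

    Assumption: as_dicts=True wraps each snippet as {'text': snippet} for
    releases that expect dict inputs alongside a prompt.
    """
    inputs = [{"text": s} if as_dicts else s for s in snippets]
    return angle.encode(inputs, prompt=prompt)
```

Here `angle` is whatever `AnglE.from_pretrained(...)` returned. `preview_prompt` is useful for checking that the template reads sensibly with real code before spending GPU time on encoding.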

There is no need to specify a pretrained_lora_path; just set model_name_or_path directly to andrijdavid/Solidity-Llama3-8b.
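As a rough sketch of that advice, the loader below passes only the model name and omits any LoRA adapter. The helper names `solidity_llm_kwargs` and `load_solidity_llm` are hypothetical, and the `is_llm`, `pooling_strategy='last'`, and `torch_dtype` settings are assumptions borrowed from the LLM quickstart linked earlier, not values confirmed for this particular model.

```python
def solidity_llm_kwargs(
    model_name_or_path: str = "andrijdavid/Solidity-Llama3-8b",
) -> dict:
    """Keyword arguments for AnglE.from_pretrained, with no LoRA adapter."""
    return {
        "model_name_or_path": model_name_or_path,
        "is_llm": True,              # treat the checkpoint as a decoder-only LLM
        "pooling_strategy": "last",  # pool from the last token's hidden state
    }


def load_solidity_llm():
    """Load the model on the GPU in half precision.

    Downloads the full 8B-parameter checkpoint, so call it deliberately.
    """
    import torch
    from angle_emb import AnglE  # pip install angle-emb

    kwargs = solidity_llm_kwargs()
    # pretrained_lora_path is deliberately left out
    return AnglE.from_pretrained(torch_dtype=torch.float16, **kwargs).cuda()
```

The object returned by `load_solidity_llm()` can then be used with `encode` and a prompt in the same way as any other AnglE model.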