salesforce/CodeT5

Code similarity CodeT5-large/small

lyriccoder opened this issue · 2 comments

I am interested in using your CodeT5 model for code similarity tasks, and I have a question about using it in test mode when comparing only two code snippets.
In the CodeXGLUE dataset format, the model takes a list of code snippets and returns the top n most similar examples for a given query. Is there a way to use the model to check the similarity between two specific code snippets instead?
I would like to obtain either a similarity score, such as a probability, or a binary output (0 or 1) indicating whether the two snippets are similar or different. For instance, given the following two code snippets:

public void foo() { System.out.println("Hi"); }
protected DecryptedEndPoint newDecryptedEndPoint()
    {
        return new DecryptedEndPoint();
    }

Can your model provide insights into their similarity or equivalence?

Hi there, to measure code similarity, I would recommend using the CodeT5+ 110M embedding model to extract an embedding for each snippet and then computing the similarity between the two embeddings, e.g., cosine similarity.
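
A minimal sketch of that suggestion, under some assumptions: the checkpoint name `Salesforce/codet5p-110m-embedding` and the `model(input_ids)[0]` call follow the public model card, while `cosine_similarity`, `is_similar`, `embed`, `compare_snippets`, and the 0.8 threshold are hypothetical names and values chosen for illustration, not part of CodeT5. The threshold in particular is an untuned placeholder and would need calibrating on labeled pairs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_similar(score: float, threshold: float = 0.8) -> int:
    """Map a score to 1 (similar) or 0 (different).

    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    return 1 if score >= threshold else 0


def embed(code: str, tokenizer, model) -> list:
    """Embed one snippet with an already loaded tokenizer and model."""
    import torch  # imported lazily so the helpers above work without it

    # Inputs longer than 512 tokens are truncated here.
    input_ids = tokenizer.encode(
        code, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        embedding = model(input_ids)[0]
    return embedding.tolist()


def compare_snippets(
    code_a: str,
    code_b: str,
    checkpoint: str = "Salesforce/codet5p-110m-embedding",  # assumed checkpoint
) -> float:
    """Download the model (network access required) and score one pair."""
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
    model.eval()
    return cosine_similarity(
        embed(code_a, tokenizer, model), embed(code_b, tokenizer, model)
    )
```

Calling `compare_snippets(snippet_a, snippet_b)` on the two Java snippets above would return a score, and `is_similar(score)` would turn it into the 0/1 output asked for. Note that the score is a similarity, not a calibrated probability.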

Hi, the CodeT5+ 110M embedding model has an input limit of 512 tokens. Is there any way to increase the input limit of the model? I would appreciate any advice.
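
This question has no answer in the thread. One generic workaround for fixed-length encoders, which does not raise the limit itself and is not confirmed by the maintainers for CodeT5+, is to split the token sequence into overlapping windows of at most 512 tokens, embed each window, and average the resulting vectors. The sketch below shows only the windowing and averaging steps. `split_into_windows` and `mean_pool` are hypothetical helper names, the stride of 256 is an arbitrary choice, and handling of the tokenizer's special tokens per window is left out.

```python
from typing import List, Sequence


def split_into_windows(
    token_ids: Sequence[int], max_len: int = 512, stride: int = 256
) -> List[List[int]]:
    """Split token ids into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of equal-length vectors (one per window)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Each window would be embedded separately and the per-window embeddings passed to `mean_pool`. Whether an averaged embedding preserves similarity quality for long files is an open question and would need to be checked on real data.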