scalacenter/GoogleSummerOfCode

Idea: Train or fine-tune an LLM on Scala code


Hi all, I'm not sure if this is the right place to discuss or ask this:
Could a SoC project be proposed to train or fine-tune an LLM on Scala code?
Most LLMs are trained primarily on Python code, simply because there is so much of it, which in turn makes them best at generating Python.
For Scala there is less code to train on. On the other hand, models like Phi-2 show that a smaller set of highly curated training data (in this case, well-written code) can make even small models compete with bigger ones trained on "the internet".
A fork of, e.g., CodeLlama 70B, fine-tuned on the full Git history of successful Scala projects, might close the gap in code-generation capabilities between Python and Scala.
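
To give a rough idea of what such a project would involve, here is a minimal sketch of a fine-tuning run using the Hugging Face stack (transformers, peft, datasets). The base model, the data file `scala_corpus.jsonl`, and all hyperparameters are illustrative assumptions, not part of the proposal; a real run would also need curated data preparation and multi-GPU infrastructure for a 70B model.

```python
# Minimal LoRA fine-tuning sketch for a code LLM on a Scala corpus.
# Assumes: transformers, peft, datasets installed; one GPU with bf16 support;
# a JSONL file with one {"text": "..."} record per Scala file or commit diff.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "codellama/CodeLlama-7b-hf"  # 7B stand-in; 70B needs multi-node setup

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical curated corpus of Scala sources.
dataset = load_dataset("json", data_files="scala_corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# LoRA keeps the number of trainable parameters small enough for a modest budget.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="codellama-scala-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Parameter-efficient fine-tuning (LoRA) rather than full fine-tuning is assumed here precisely because of the compute constraints mentioned below; it keeps a single-student project feasible while still adapting the model to Scala idioms.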

That sounds like an interesting project. However, it requires significant computational resources, as well as expertise to guide a student towards that objective. If there is a mentor with that expertise who can also provide the required resources for the student, we are definitely open to it.

Closing this issue for the above-mentioned reasons. If anyone is willing to mentor this one AND has computational resources to provide to the contributor, feel free to submit a PR with the proposal!