Are our Large Language Models good at code-generation when prompted in native languages?
We used GPT-4 to translate the prompts in OpenAI's HumanEval benchmark to Hinglish, then manually verified and corrected the translations. Our benchmark, HinglishEval, is available as a JSON file (HinglishEval.json) in this repository.
Hindi is one of the most widely spoken languages in the world, and the most widely spoken in India. A majority of the population in India does not speak English as their first language, so language models that can understand prompts in native languages are important for wider accessibility. Hinglish is a blend of Hindi and English, with frequent use of English words in sentences that follow standard Hindi grammar. It is not representative of everyday spoken Hindi for most people, but it is common in conversations involving technical language, especially in the context of programming.
Therefore, it is most natural for Hindi-speaking users to prompt LLMs in Hinglish when they want to generate code or ask for help with programming in general (such as explanations or debugging). This benchmark is an attempt to understand how well LLMs can understand such prompts and generate code from them.
We evaluated 18 models on the HinglishEval dataset, at temperature 0 (greedy decoding).
Model | Pass@1 (%) |
---|---|
GPT 4 | 79.27 |
GPT 3.5 Turbo | 58.54 |
Gemma 7B | 31.71 |
Gemma 2B | 17.68 |
Codegen 6B Mono | 15.24 |
Codegen 2B Mono | 9.76 |
Codegen 6B Multi | 8.54 |
Santacoder 1.1B | 6.71 |
Codegen-2 1B | 4.88 |
Codegen 6B NL | 3.66 |
Codegen 350M Mono | 3.05 |
Codegen 350M Multi | 3.05 |
Codegen 2B NL | 2.44 |
Polycoder 2.7B | 1.83 |
Codegen 2B Multi | 1.83 |
Polycoder 0.4B | 0.00 |
Polycoder 160M | 0.00 |
Codegen 350M NL | 0.00 |
The HinglishEval benchmark contains all the problems in the HumanEval benchmark, with their prompts translated to Hinglish. The translation does not modify function signatures or doctests, and is limited to the purpose statement (supplied as a docstring in Python) of each function. The translations were manually verified and corrected to ensure that they sound like spoken Hinglish.
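The invariant described above (only the purpose statement changes, while signatures and doctests stay the same) can be checked mechanically. The following is a minimal sketch, not the procedure used to build the benchmark. It assumes doctests start at the first `>>>` marker in a prompt, which may not hold for every problem. The helper names `invariant_parts` and `translation_preserves_code` are hypothetical, and the example prompt pair is invented for illustration rather than taken from HumanEval or HinglishEval.

```python
from typing import List, Tuple


def invariant_parts(prompt: str) -> Tuple[List[str], str]:
    """Return the parts of a prompt that translation should leave untouched."""
    # Function signature lines: any line whose first token is `def`.
    signatures = [
        line.strip()
        for line in prompt.splitlines()
        if line.lstrip().startswith("def ")
    ]
    # Doctests: everything from the first `>>>` marker onward (assumed layout).
    start = prompt.find(">>>")
    doctests = prompt[start:] if start != -1 else ""
    return signatures, doctests


def translation_preserves_code(original: str, translated: str) -> bool:
    """True if signatures and doctests are identical in both prompts."""
    return invariant_parts(original) == invariant_parts(translated)


# Invented example pair, not taken from HumanEval or HinglishEval.
english = '''def add(a, b):
    """Return the sum of two numbers a and b.
    >>> add(2, 3)
    5
    """
'''
hinglish = '''def add(a, b):
    """Do numbers a aur b ka sum return karo.
    >>> add(2, 3)
    5
    """
'''

print(translation_preserves_code(english, hinglish))
```

A check like this only flags accidental edits to code-bearing parts of a prompt. It says nothing about whether the translated purpose statement reads like natural Hinglish, which still needs manual review.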
We have publicly released the completions generated by all 18 models on the prompts in the HinglishEval benchmark. The raw completions are available in the samples/unsanitized directory, and sanitized versions are available in the samples/sanitized directory. Sanitization clips each completion so that it includes only the function that was asked for, and removes any extraneous text.
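To make the sanitization step concrete, here is a minimal sketch of one way to clip a completion to a single function. It is an illustrative heuristic, not the script used to produce samples/sanitized. The function name `clip_to_function` and its `entry_point` parameter are hypothetical. It assumes the target function starts at column 0 with a single-line `def` and that its body is indented.

```python
def clip_to_function(completion: str, entry_point: str) -> str:
    """Keep only the definition of `entry_point`, dropping surrounding text."""
    lines = completion.splitlines()

    # Find the first top-level definition of the requested function.
    start = None
    for i, line in enumerate(lines):
        if line.startswith(f"def {entry_point}("):
            start = i
            break
    if start is None:
        # The function was not found, so return the completion unchanged.
        return completion

    kept = [lines[start]]
    for line in lines[start + 1:]:
        # A non-blank line at column 0 marks the end of the function body.
        if line.strip() and not line[0].isspace():
            break
        kept.append(line)

    # Drop blank lines left over at the end of the body.
    while kept and not kept[-1].strip():
        kept.pop()
    return "\n".join(kept) + "\n"


raw = (
    "Here is the code:\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "print(add(1, 2))\n"
    "Hope this helps!"
)
print(clip_to_function(raw, "add"))
```

A line-based heuristic is used instead of parsing with `ast` because raw completions often contain prose that is not valid Python. The trade-off is that signatures spanning several lines, or helper functions the completion depends on, would be cut off by this sketch.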
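We evaluated 18 models on the HinglishEval dataset and report the results in the table above. We report only the Pass@1 metric because the models were evaluated at temperature 0 (greedy decoding), which yields a single deterministic completion per problem, so Pass@k for k > 1 would carry no additional information.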
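With one greedy completion per problem, Pass@1 reduces to the percentage of problems whose completion passes all of its tests. The sketch below shows that arithmetic under this assumption. The function name `pass_at_1` and the task identifiers are hypothetical, and the pass/fail outcomes are illustrative rather than real evaluation data. HumanEval contains 164 problems, so the example uses 164 entries.

```python
from typing import Mapping


def pass_at_1(passed: Mapping[str, bool]) -> float:
    """Percentage of problems solved, given one completion per problem."""
    if not passed:
        raise ValueError("no results to score")
    return 100.0 * sum(passed.values()) / len(passed)


# Illustrative outcomes only: 164 tasks, with the first 130 marked as passing.
example = {f"task_{i}": i < 130 for i in range(164)}
score = pass_at_1(example)
print(f"{score:.2f}")  # 79.27
```

Evaluating at a non-zero temperature with several samples per problem would instead call for the unbiased Pass@k estimator, which this sketch does not implement.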
We encourage the community to:
- Interpret the results of this evaluation
- Explore different prompting strategies to improve the performance of models on HinglishEval
- Evaluate more models, particularly models built for native languages, and try different temperature settings
- Extend the benchmark to more native languages