gersteinlab/ML-Bench

The Official Repo of ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code (https://arxiv.org/abs/2311.09835)

Python · MIT License

Issues

The GPT3.5 result could not be reproduced
#3 opened 7 months ago by iiinsight · 1 comment
How to evaluate local custom models?
#2 opened a year ago by VoiceBeer · 4 comments
Any public leaderboards?
#1 opened a year ago by zhimin-z · 0 comments