scicode-bench/SciCode
A benchmark that challenges language models to code solutions for scientific problems
Python · Apache-2.0
Issues
- Add gpt-4.5 to the leaderboard (#37 opened by LuigiPagani)
- Benchmark o1 (#35 opened by LuigiPagani)
- Benchmark Claude Sonnet 3.7 (#36 opened by LuigiPagani)
- add new claude-3-5-sonnet-20241022 (#18 opened by Kreijstal)
- Add o1 17/12 To the Leaderboard (#21 opened by LuigiPagani)
- One click run (#6 opened by ofirpress)
- Deepseek R1 Evaluation Results? (#25 opened by jasonzliang)
- Ground Truth Code for problems_all.jsonl (#20 opened by jasonzliang)
- o1 models results (#17 opened by stalkermustang)
- Solution generation skips certain steps? (#12 opened by idavidrein)
- Dependency versions for evaluation (#2 opened by zrait)