Issues
- 2
Where is the January leaderboard?
#42 opened by kindle939393 - 1
Have updates stopped???
#48 opened by yiliangfang - 1
- 2
How can I use this metric to evaluate a large model I built myself?
#44 opened by AWangji - 1
- 1
What does the tool evaluation mean? Is it Function calling? If not, please add an evaluation for that capability.
#46 opened by goqw - 2
Publish the evaluation set and evaluation criteria
#13 opened by plmsmile - 0
Has Claude3 been evaluated?
#45 opened by Pancat007 - 1
Does it indicate using 5 shots for evaluation?
#39 opened by zhimin-z - 2
Is Alibaba's Tongyi Qianwen not included?
#22 opened by Pancat009 - 3
Is the dataset open source? Where can it be downloaded?
#26 opened by vanshaw2017 - 2
c-eval is really absurd; I hope superclue can update a bit faster, for example once every 1-2 weeks
#31 opened by iammeizu - 0
Where to locate the SuperCLUE-LYB leaderboards?
#34 opened by zhimin-z - 1
- 0
anthropic is misspelled
#30 opened by JerryJiang12923 - 0
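A question: according to the evaluation report, SuperCLUE uses automated, objective evaluation. Could you provide runnable Python sample code for automated evaluation of a given model (via API call or web)?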
#40 opened by Romanzhang2024 - 0
Where to download the benchmark dataset?
#38 opened by zhimin-z - 1
- 0
- 0
How is the role-playing benchmark conducted?
#35 opened by xealml - 1
What are the evaluation criteria for task planning and tool use?
#32 opened by heibaidaolx123 - 0
Could you add an evaluation ranking for translation?
#33 opened by lx0126z - 0
- 0
Could the vicuna-33B model be added to the evaluation?
#28 opened by Mr-wang2016 - 1
What is the reason for the ranking changes?
#24 opened by zhaojiawen-coding - 0
How are model outputs matched against the reference answers during evaluation?
#27 opened by Starry-Hu - 3
Please test the Zhiyuan (智源, BAAI) large model
#23 opened by forkyguo - 1
Is Wenxin Yiyan (文心一言) not included?
#20 opened by p81sunshine - 1
A question about prompt design
#25 opened by lrs1353281004 - 2
Can I test my own model on superclue?
#18 opened by guozhiyao - 3
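Starting the phone-benchmark-leaderboard routine now? GPT4 plays Apple, and the domestic large models play Huawei, Xiaomi, OPPO, and vivo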
#12 opened by ZhuGeRoastedFish - 1
Shouldn't "idea-jiangzhiya" here be "idea-jiangziya"?
#21 opened by ilongshan - 1
When will the test dataset be made public?
#17 opened by wangrui6 - 1
Clarify which "Claude" is benchmarked?
#19 opened by jekbradbury - 8
From my own experience using them, the Spark (星火) large model really is not as good as Wenxin Yiyan (文心一言)..
#3 opened by MysteryMulberry - 2
How many questions are there for each individual capability?
#7 opened by leonall - 5
Thanks to Xu Liang's team for their work~ I have some questions about the evaluation details
#1 opened by lrs1353281004 - 1
The group has passed 200 members; please invite me in
#2 opened by dinglei8908 - 1
How should we cite your work?
#4 opened by MikeGu721 - 3
Does superCLUE include evaluation of toxicity, bias, and similar aspects?
#6 opened by devinbai - 3
It is important that the evaluation data be objective and fair
#8 opened by shichengustc - 1
As a leaderboard, consider following Chinese-LLaMA-Alpaca in documenting and disclosing the evaluation to a reasonable degree
#9 opened by shm007g - 2
The reference value of this evaluation
#10 opened by liuyajun52 - 1
Confidence level
#14 opened by littlepan0413 - 3
How were the human scores obtained?
#15 opened by So0ni - 1
Suggestion: fill in the human "professional ability" data
#16 opened by Triang-jyed-driung - 4
How are generation and creative writing tested in multiple-choice form?
#5 opened by Howardqlz - 1
Why are all the evaluation results integers?
#11 opened by ltz0120