Benchmark?

Question

Closed this issue a month ago · 7 comments

This looks nice; it seems Claude must have been trained on reasoning traces.
That said, it'd be better if the author could evaluate it against other reasoning models (and naively prompted Claude itself) on some reasoning benchmarks like GPQA. I'm happy to cover the costs via WeChat or Venmo! If you're interested, please reach me at sh711@duke.edu or hsm15118082151 or on WeChat.

Answer 1 · 2024-11-14T18:01:15.000Z

Yes, I have thought about it for a while, but then gave it up.

Reasons:
Thinking Claude is not aimed at benchmarks or huge leaps in math or the like, since those are predetermined by the base model (new Claude-3.5 Sonnet).
One of my goals is to explore how much further we can get with Claude's "deep mindset". For example, when using it in your daily tasks, you will find its inner monologue (thinking process) very interesting.

Answer 2 · 2024-11-14T18:39:23.000Z

So, what do you think is the main function of this prompt? Is it merely to imitate the form of o1, or does it provide a better understanding of ambiguous prompts?

Answer 3 · 2024-11-15T02:14:26.000Z

Hmm... there were many reasons, but the main one was just experimenting. I was reviewing for my calculus class and wanted to see how Claude would "think"; I then found some interesting behavior in Claude's thinking, so I started from the initial instructions and kept lengthening them.

Answer 4 · 2024-11-15T04:40:47.000Z

Prompting techniques (CoT etc.) have been shown to improve reasoning over a naively prompted base model. All that any prompting does is tease out some distribution learned during training, so given the results of this prompt I think it is very likely that Claude was trained on o1-like reasoning traces, but for some reason (maybe during post-training) Claude does not surface that capability with naive prompts. So I think evaluating the surfaced capability against the base model is of value, but it's all up to you :) This project is great regardless.

Answer 5 · 2024-11-15T04:45:03.000Z

I think as long as you can cover the API costs, it would be easy for you to run your own evaluation. I’m also very interested and look forward to you sharing your results.
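
For anyone who wants to try, here is a rough sketch of what such a paired comparison could look like: the same multiple-choice questions are asked once with a plain system prompt and once with the Thinking Claude prompt, and the two accuracies are compared. Everything named here is an assumption for illustration, not part of this repo: `Item`, `compare`, the `ask` callable (which you would wire to your own API client), and the `Answer: <letter>` output convention.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical signature: (system_prompt, user_prompt) -> raw model reply.
# Plug in your own API client here; none is assumed by this sketch.
AskFn = Callable[[str, str], str]


@dataclass(frozen=True)
class Item:
    question: str
    choices: Sequence[str]  # up to four options, labelled A-D
    answer: str             # gold letter, e.g. "B"


def format_question(item: Item) -> str:
    """Render one item as a lettered multiple-choice prompt."""
    options = [f"{letter}. {text}" for letter, text in zip("ABCD", item.choices)]
    return "\n".join([item.question, *options, "Finish with 'Answer: <letter>'."])


def extract_letter(reply: str) -> Optional[str]:
    """Take the last 'Answer: X' in the reply so earlier reasoning is ignored."""
    found = re.findall(r"Answer:\s*([A-D])", reply)
    return found[-1] if found else None


def compare(items: Sequence[Item], ask: AskFn,
            baseline_system: str, thinking_system: str) -> dict:
    """Ask every item under both system prompts and tally paired outcomes."""
    if not items:
        raise ValueError("need at least one item")
    base_hits = think_hits = only_base = only_think = 0
    for item in items:
        prompt = format_question(item)
        base_ok = extract_letter(ask(baseline_system, prompt)) == item.answer
        think_ok = extract_letter(ask(thinking_system, prompt)) == item.answer
        base_hits += base_ok
        think_hits += think_ok
        # Discordant pairs: items that only one of the two conditions solved.
        only_base += base_ok and not think_ok
        only_think += think_ok and not base_ok
    n = len(items)
    return {
        "baseline_accuracy": base_hits / n,
        "thinking_accuracy": think_hits / n,
        "only_baseline_correct": only_base,
        "only_thinking_correct": only_think,
    }
```

Reporting the discordant counts alongside the accuracies is a design choice: with a few hundred questions, a small accuracy gap can easily be noise, and the per-item disagreements make that easier to judge.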

Answer 6 · 2024-11-15T06:31:39.000Z

Yeah, I probably will, but I suppose I can't share the results in this repo because of authorship? It would be best and easiest if the author did it, but of course I don't want to put any pressure on anyone.

Answer 7 · 2024-11-15T10:37:12.000Z

Thanks to you both for your enthusiasm @eshoyuan @Dariushuangg !!! I plan to test it, but may not have a specific time frame due to my school work =( Anyway, I'm grateful that people are really focusing on the technical side! That's a good sign!