[Bug]: ClassifierOutput's 'prediction' has incorrect type

Question

hijohnnylin opened this issue 10 months ago · 4 comments

The ClassifierOutput's prediction field is annotated as bool, but in practice it is assigned an int of -1, 0, or 1. I think the -1 case means there was some error in predicting.

We should decide on:

Whether to make it a bool or not
What to do with the error case (consider it false? maybe create a specific type with 3 states?)

Would be good for the solution to be somewhat backward compatible.

https://github.com/EleutherAI/sae-auto-interp/blob/3659ff3bfefbe2628d37484e5bcc0087a5b10a27/sae_auto_interp/scorers/classifier/sample.py#L32

Answer 1 · 2025-01-23T17:03:21.000Z

I don't think considering it false is good, because it will change the score (right now I just filter out results where prediction is -1).
It is not the prettiest, but we could switch its type to an int and explicitly document -1 as the error value?

Answer 2 · 2025-01-24T06:48:43.000Z

Hmm, I think that might be a bit confusing. In Python any non-zero number is truthy, so a check like "if -1" will evaluate as true.
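
To make the concern concrete, here is a minimal sketch. It assumes the 1 / 0 / -1 encoding described earlier in the thread; the list of predictions and the variable names are made up for illustration and are not taken from the library.

```python
# Assumed encoding from the thread: 1 = positive, 0 = negative, -1 = error.
# The sample values below are illustrative only.
predictions = [1, 0, -1, 1]

# A plain truthiness check treats the -1 error value as a positive prediction.
naive_positives = sum(1 for p in predictions if p)

# Comparing against 1 explicitly keeps errors out of the positive count.
explicit_positives = sum(1 for p in predictions if p == 1)

print(naive_positives, explicit_positives)
```

So any caller that writes "if output.prediction:" would silently count errors as positives under the int scheme.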

Answer 3 · 2025-02-11T02:42:06.000Z

We've updated the error value to None and the ClassifierOutput prediction type to bool | None. Let me know if you have any issues with the updated library and I'll resolve them ASAP.
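
For anyone adapting downstream code, a rough sketch of how a bool | None prediction might be consumed. ClassifierOutputSketch, its ground_truth field, and the accuracy helper are hypothetical stand-ins for illustration, not the library's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassifierOutputSketch:
    """Hypothetical stand-in, not the library's real class."""
    prediction: Optional[bool] = None  # equivalent to bool | None; None marks an error
    ground_truth: bool = False  # hypothetical field, used only for scoring here


def accuracy(outputs):
    """Score only the outputs whose prediction did not error."""
    # Use an identity check: "if o.prediction" would also drop valid False predictions.
    scored = [o for o in outputs if o.prediction is not None]
    if not scored:
        return None
    correct = sum(o.prediction == o.ground_truth for o in scored)
    return correct / len(scored)


outputs = [
    ClassifierOutputSketch(True, True),
    ClassifierOutputSketch(False, True),
    ClassifierOutputSketch(None, True),   # errored prediction, excluded from the score
    ClassifierOutputSketch(False, False),
]
result = accuracy(outputs)
```

Code that previously filtered on prediction == -1 would switch to an "is None" check; the score itself should be unaffected because errored samples are still excluded.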

Answer 4 · 2025-02-12T10:12:11.000Z

I'm closing this, and Johnny can open a new issue if anything comes up.