open-spaced-repetition/fsrs-optimizer

[BUG] Initial stability for "Good" can be larger than for "Easy" if "Good" has more datapoints

L-M-Sherlock opened this issue · 24 comments

@L-M-Sherlock I think that in the current version of the optimizer, it's possible for the S0 value for "Good" to be larger than for "Easy" if "Good" has more datapoints.
params, _ = curve_fit(power_forgetting_curve, delta_t, recall, sigma=1/np.sqrt(count), bounds=((0.1), (30 if total_count < 1000 else 365)))
You should probably add some kind of extra cap to ensure that S0 for "Good" cannot be greater than S0 for "Easy" even if total_count is greater than 1000 for "Good" and less than 1000 for "Easy".

Originally posted by @Expertium in open-spaced-repetition/fsrs4anki#348 (comment)
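
One way such a cap could be implemented, sketched below purely as an illustration: fit the four first ratings in descending order (Easy first) and pass the previously fitted S0 in as the upper bound of the next fit. The forgetting-curve form and the helper below are hypothetical placeholders, not the optimizer's actual code.

import numpy as np
from scipy.optimize import curve_fit

def power_forgetting_curve(t, s):
    # placeholder curve form; the optimizer defines its own
    return (1 + t / (9 * s)) ** -1

def fit_s0(delta_t, recall, count, upper_cap):
    # count is a numpy array of review counts per delta_t bin
    total_count = count.sum()
    # cap the upper bound by the S0 fitted for the previous (easier) rating,
    # keeping it strictly above the lower bound of 0.1
    upper = max(min(upper_cap, 30 if total_count < 1000 else 365), 0.2)
    params, _ = curve_fit(
        power_forgetting_curve, delta_t, recall,
        sigma=1 / np.sqrt(count),
        bounds=(0.1, upper),
    )
    return params[0]

# Called in the order Easy -> Good -> Hard -> Again, passing each result as
# upper_cap for the next call, so S0 for a lower rating can never exceed
# S0 for a higher one.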

There are 2 simple ways to solve this:

if S0_good > S0_easy:
    S0_good = S0_easy

or

if S0_good > S0_easy:
    S0_easy = S0_good 

In the first method we artificially decrease S0 for Good; in the second method we artificially increase S0 for Easy. I don't know which one makes more sense, but probably the latter: if S0 for Good is based on a larger number of reviews, then it is calculated more accurately than S0 for Easy, so we shouldn't change it and should instead change the less accurate S0 for Easy.

In my opinion, the second approach makes more sense.

Maybe decide this based on the number of datapoints in each case?

if S0_good > S0_easy:
    if n_datapoints_good > n_datapoints_easy:
        S0_easy = S0_good
    else:
        S0_good = S0_easy

However, if you look at the table in #5 (comment), there are cases where S0_again > S0_hard or S0_hard > S0_good, so this issue is not limited to the Good-Easy pair.

I suppose the idea above should be applied to all pairs: Again-Hard, Hard-Good and Good-Easy.
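
A minimal sketch of what applying this rule to every adjacent pair could look like (the dict-based bookkeeping and rating keys are hypothetical, not the optimizer's actual code):

GRADE_ORDER = ["again", "hard", "good", "easy"]

def enforce_s0_ordering(s0: dict, n_datapoints: dict) -> dict:
    # s0 and n_datapoints are keyed by the first rating.
    # Note: a single left-to-right pass is not guaranteed to leave the whole
    # sequence monotone when the datapoint counts conflict.
    for lower, higher in zip(GRADE_ORDER, GRADE_ORDER[1:]):
        if s0[lower] > s0[higher]:
            if n_datapoints[lower] > n_datapoints[higher]:
                s0[higher] = s0[lower]  # trust the better-supported estimate
            else:
                s0[lower] = s0[higher]
    return s0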

@L-M-Sherlock here are some good ideas:

  1. The one above by nb9618, but apply it to all pairs: Again-Hard, Hard-Good and Good-Easy. There will likely be issues with that, though. I don't expect it to work on the first try without creating new problems.
  2. When using additive smoothing, instead of using retention of the entire collection/deck, only use retention based on second reviews to calculate p0 (the initial guess).
  3. When using the outlier filter based on IQR, use ln(delta_t) rather than delta_t itself. Filtering based on IQR doesn't work well on data that isn't normally distributed, and delta_t certainly isn't.

Of course, all of these changes should be evaluated with statistical significance tests; I hope that by now you have set up an automated system to run them on all 66 collections.
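
For reference, here is a minimal sketch of the kind of paired test this could be, assuming a Wilcoxon signed-rank test over per-collection RMSE values (the thread does not say which test is actually used):

from scipy.stats import wilcoxon

def compare_rmse(rmse_before, rmse_after):
    # rmse_before[i] / rmse_after[i]: weighted RMSE of collection i with the
    # current optimizer and with the proposed change, respectively
    statistic, p_value = wilcoxon(rmse_before, rmse_after)
    return p_value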

Oh, also: in the scheduler code, change // recommended setting: 0.8 ~ 0.9 to // recommended values: 0.75 ~ 0.97.

@L-M-Sherlock you've been inactive for a couple of days, so there is a good chance you missed my comment above. I'm pinging you just to remind you about it.

I am just tired of maintaining the optimizer module. You can check these parameters from the batch training on the collected data: open-spaced-repetition/fsrs4anki#351 (comment). There are some cases where the initial stability of Again is larger than the initial stability of Hard, or the initial stability of Good is larger than the initial stability of Easy. These cases may have different reasons, so we should deal with these problems according to the concrete cases.

Ok, forget about 1, but I would still ask you to test 2 and 3.

For 2, here is an extreme case:

[image: second-review statistics for cards first rated Easy; every one of them was remembered, i.e. 100% retention]

The user always remembers the card in the next review when they pressed Easy during the first learning. In this case, the retention is 100%. If we use this value, the additive smoothing will be useless.

I think you misunderstood my idea a little bit. I didn't mean "use four different initial guesses, one for each grade", I meant "use the same initial guess for every grade". So just calculate the average retention over all second reviews.
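
A minimal sketch of that, with hypothetical variable names and an example pseudo-count of 1 (the optimizer's actual smoothing code may differ):

import numpy as np

def smoothed_recall(recall_mean, count, second_review_outcomes):
    # the same initial guess p0 for every first rating: the average retention
    # over all second reviews (1 = remembered, 0 = forgotten)
    p0 = np.mean(second_review_outcomes)
    pseudo_count = 1  # strength of the prior; example value
    # additive smoothing: pull the grade's measured recall toward p0
    return (recall_mean * count + p0 * pseudo_count) / (count + pseudo_count)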

By the way, have you automated running statistical significance tests on all collections?

3. When using the outlier filter based on IQR, use ln(delta_t) rather than delta_t itself. Filtering based on IQR doesn't work well on data that isn't normally distributed, and delta_t certainly isn't.

I'm testing this in all 66 collections.

I think you misunderstood my idea a little bit. I didn't mean "use four different initial guesses, one for each grade", I meant "use the same initial guess for every grade". So just calculate the average retention over all second reviews.

OK. I will test it after the above test finishes. It will take nearly 3 hours.

I'm testing this in all 66 collections.

Before:

Weighted RMSE: 0.04149183369953192
Weighted Log loss: 0.3815897150075234
Weighted MAE: 0.02342977913950602
Weighted R-squared: 0.7697902622572932

After:

Weighted RMSE: 0.04174954832152736
Weighted Log loss: 0.38212856042129156
Weighted MAE: 0.02374078044685508
Weighted R-squared: 0.7672438581669868

p = 0.0045 (for RMSE)

3. When using the outlier filter based on IQR, use ln(delta_t) rather than delta_t itself. Filtering based on IQR doesn't work well on data that isn't normally distributed, and delta_t certainly isn't.

It's worse than the current version, and the difference is statistically significant.

Here is the code:

def remove_outliers(group: pd.DataFrame) -> pd.DataFrame:
    # earlier threshold candidates, kept for reference:
    # threshold = np.mean(group['delta_t']) * 1.5
    # threshold = group['delta_t'].quantile(0.95)
    # IQR-based upper fence computed on ln(delta_t):
    log_delta_t = group['delta_t'].map(np.log)
    Q1 = log_delta_t.quantile(0.25)
    Q3 = log_delta_t.quantile(0.75)
    IQR = Q3 - Q1
    threshold = Q3 + 1.5 * IQR
    return group[log_delta_t <= threshold]
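
For context, this helper would presumably be applied per first-rating group before the curve fit, along the lines of the hypothetical call below (the actual grouping key in the optimizer may differ):

df = df.groupby(by=['first_rating'], group_keys=False).apply(remove_outliers)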

Huh, I'm surprised. Maybe the more data is removed, the easier it is for FSRS to fit the remaining data well? In other words, what if we cannot rely on RMSE when removing outliers because, between two methods that both aim at removing outliers, the one that removes more data will always result in a lower RMSE?

Removing more data does not always result in a lower RMSE. Removing too much data might lead to underfitting, where the model fails to capture the underlying trend of the data. This can also increase the RMSE.

Alright, then test the idea with p0 for additive smoothing, and that's it.
After that I would like you to benchmark all 5 algorithms; I'll explain it in a bit more detail in the relevant issue.

additive smoothing:

Weighted RMSE: 0.04147353655819303
Weighted Log loss: 0.3815885589708383
Weighted MAE: 0.023376754517799636
Weighted R-squared: 0.7699164899424069

p=0.38

It is slightly better but not statistically significant.

Removing more data does not always result in a lower RMSE. Removing too much data might lead to underfitting, where the model fails to capture the underlying trend of the data. This can also increase the RMSE.

I agree that removing more data would not always result in a lower RMSE.

But here, we are selectively removing the data that lies on the right-hand side of the curve (and not just random data). So, the remaining data is more homogeneous, and this might explain why the RMSE is lower.

So, the remaining data is more homogeneous, and this might explain why the RMSE is lower.

Yeah, I'm just surprised that my approach is somehow worse, even though in theory IQR should work better with normally distributed data.

I think that the increase in RMSE that we saw when using log of delta_t is just an artifact.

For example, when the optimizer filtered out all the cards with first rating = Again in my collection, the RMSE got a crazy low value (0.0056). I first mentioned this here: open-spaced-repetition/fsrs4anki#348 (comment)

I think that the increase in RMSE that we saw when using log of delta_t is just an artifact.

So we should not only consider the RMSE, right? We should have some other criterion to decide whether an idea should be employed in FSRS.

Maybe decide this based on the number of datapoints in each case?

I will adopt this idea, not for the sake of enhancing the model's accuracy, but to alleviate users' confusion. Therefore, I will not run evaluation tests.

I think that the increase in RMSE that we saw when using log of delta_t is just an artifact.

So we should not only consider the RMSE, right? We should have some other criterion to decide whether an idea should be employed in FSRS.

Yes, but I don't know which metric would be appropriate in this case.

Also, let's discuss this further in #16.