
Discussing the new dataset and benchmark #129

Open · Expertium opened this issue Oct 29, 2024 · 26 comments

@Expertium (Contributor) commented Oct 29, 2024

ankitects/anki#3511 (comment)

I have a few questions regarding that:

  1. Will we keep using the default parameters based on the old dataset or on the new one? I think it's better to use the default parameters from the old one, since it has 20k users, and the new one will have 10k. So theoretically, the default parameters based on the old one should be slightly more accurate.
  2. How will you make tables with the metrics? Since we want to compare optimization on the entire collection vs optimization on every deck, I assume you will make the current two tables longer? Or add two more tables?
  3. Regarding siblings. It would be interesting to analyze how much using sibling reviews as "pseudoreviews" could help, assuming you aren't too burned out for that. I see two ways:
  • add a new column, like sibling_review, with values being 0 or 1
  • add new grades. Like this: Again = 1, Hard = 2, Good = 3, Easy = 4, Again (sibling) = 5, Hard (sibling) = 6, Good (sibling) = 7, Easy (sibling) = 8.
    Choose whichever is more convenient. Then insert these pseudoreviews into cards' histories. That way, when running the optimizer, we can use those pseudoreviews to update the memory state. I'll add new parameters.
    Of course, we can't use this in Anki, but it's interesting from a theoretical perspective.

Also, please add the total number of decks used for optimization to the .jsonl output file. I want to plot RMSE as a function of the number of decks used for optimization, to see if there is a magical number of decks such that splitting them any further is not beneficial, or even detrimental.
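For concreteness, this is the kind of script I have in mind for that plot. It's a rough sketch: I'm assuming each line of the .jsonl gains a field like n_decks (the name is hypothetical) next to the per-user metrics.

import json
import matplotlib.pyplot as plt

n_decks, rmse = [], []
with open("FSRS-5-deck.jsonl") as f:  # hypothetical file name
    for line in f:
        result = json.loads(line)
        n_decks.append(result["n_decks"])             # proposed new field
        rmse.append(result["metrics"]["RMSE(bins)"])  # assumed layout

plt.scatter(n_decks, rmse, s=5, alpha=0.5)
plt.xlabel("number of decks used for optimization")
plt.ylabel("RMSE(bins)")
plt.show()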

@user1823 you are welcome to participate

@L-M-Sherlock (Member)

The parameters are very similar, so I think the first question doesn't matter.

$ python evaluate.py --fast
Model: FSRS-5-dev
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-dev LogLoss (mean±std): 0.3271±0.1521
FSRS-5-dev RMSE(bins) (mean±std): 0.0511±0.0331
FSRS-5-dev AUC (mean±std): 0.7015±0.0787

Weighted average by log(reviews):
FSRS-5-dev LogLoss (mean±std): 0.3528±0.1690
FSRS-5-dev RMSE(bins) (mean±std): 0.0704±0.0461
FSRS-5-dev AUC (mean±std): 0.7000±0.0888

Weighted average by users:
FSRS-5-dev LogLoss (mean±std): 0.3561±0.1716
FSRS-5-dev RMSE(bins) (mean±std): 0.0733±0.0478
FSRS-5-dev AUC (mean±std): 0.6990±0.0909

parameters: [0.4026, 1.1495, 3.1455, 15.8164, 7.1329, 0.5388, 1.7808, 0.0087, 1.5174, 0.1203, 1.0013, 1.9055, 0.11, 0.2961, 2.325, 0.2262, 3.0157, 0.5121, 0.6506]
Model: FSRS-5
Total number of users: 9995
Total number of reviews: 355281295
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3154±0.1485
FSRS-5 RMSE(bins) (mean±std): 0.0489±0.0319
FSRS-5 AUC (mean±std): 0.7009±0.0783

Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3427±0.1666
FSRS-5 RMSE(bins) (mean±std): 0.0681±0.0455
FSRS-5 AUC (mean±std): 0.6988±0.0915

Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3468±0.1695
FSRS-5 RMSE(bins) (mean±std): 0.0712±0.0475
FSRS-5 AUC (mean±std): 0.6972±0.0942

parameters: [0.402, 1.182, 3.1332, 15.8757, 7.131, 0.5479, 1.769, 0.0085, 1.5236, 0.1174, 1.0077, 1.905, 0.11, 0.2978, 2.3352, 0.2315, 3.025, 0.5166, 0.6641]
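
As a quick check (my own snippet, lists pasted from the output above), the largest relative difference between the two parameter sets is under 3%:

dev = [0.4026, 1.1495, 3.1455, 15.8164, 7.1329, 0.5388, 1.7808, 0.0087, 1.5174, 0.1203, 1.0013, 1.9055, 0.11, 0.2961, 2.325, 0.2262, 3.0157, 0.5121, 0.6506]
old = [0.402, 1.182, 3.1332, 15.8757, 7.131, 0.5479, 1.769, 0.0085, 1.5236, 0.1174, 1.0077, 1.905, 0.11, 0.2978, 2.3352, 0.2315, 3.025, 0.5166, 0.6641]
# largest relative deviation across all 19 parameters
print(max(abs(a - b) / abs(b) for a, b in zip(dev, old)))  # ≈ 0.0275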

@Expertium (Contributor, Author)

Oh, and please upload the dataset here: https://huggingface.co/open-spaced-repetition
With the same license as the previous one

@L-M-Sherlock (Member)

Oh, and please upload the dataset here

I have asked Dae for permission, but I haven't received a reply yet. I have uploaded the dataset, but the repo is still private for now.

@L-M-Sherlock (Member)

The dataset is hosted at: https://huggingface.co/datasets/open-spaced-repetition/anki-revlogs-10k

@Expertium (Contributor, Author)

Nice. When you benchmark optimizing parameters on separate decks/presets, please don't forget to add the number of decks/presets to the .jsonl output file.

@Expertium (Contributor, Author)

Copying what I said on Discord:

I'm just thinking about how we should approach benchmarking, because there are a lot of options:

  1. Optimize parameters based on every level 1 deck
  2. Optimize parameters based on every level 2/3/n deck (aka subdecks)
  3. Optimize parameters based on every deck (any level), meaning that even tiny subdecks get their own parameters
  4. Optimize parameters based on every preset

@Expertium (Contributor, Author)

Also, @L-M-Sherlock, I would really appreciate it if you implemented number 3 from my original comment in this issue. I want to see how much we can improve FSRS with sibling information.
Here's how I imagine the dataset will look (simplified):
[image: simplified example of the dataset with a sibling_review column]
sibling_review is either 0 or 1; it indicates whether it's a real review or a "pseudoreview". When sibling_review=1, delta_t and grade come from the sibling, not from the card itself.
This will require traversing the entire dataset and inserting new rows. I imagine coding this won't be easy. Still, I really hope that you will do it.
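
To make the "new parameters" part concrete, here is a toy sketch of what I mean. This is NOT the actual FSRS formulas; it's a simplified update rule with a hypothetical w_sibling parameter that controls how strongly a pseudoreview moves the memory state.

def update_stability(stability: float, grade: int, sibling_review: int,
                     w_sibling: float = 0.5) -> float:
    # Toy update rule (not real FSRS): success grows stability, a lapse shrinks it.
    factor = 2.0 if grade >= 2 else 0.5
    if sibling_review:
        # Hypothetical: a pseudoreview only moves stability part of the way.
        factor = 1.0 + (factor - 1.0) * w_sibling
    return stability * factor

# Example: a real Good review, then a sibling's Again as a pseudoreview.
s = 1.0
s = update_stability(s, grade=3, sibling_review=0)  # -> 2.0
s = update_stability(s, grade=1, sibling_review=1)  # -> 2.0 * 0.75 = 1.5
print(s)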

@L-M-Sherlock (Member) commented Dec 1, 2024

I found a problem when trying to optimize parameters on separate decks/presets. If the user deleted their old cards, we cannot know the siblings and presets of the cards from their old review logs. So we would have to discard those review logs from the dataset.

Edit: I will fill in -1 for this case.

@L-M-Sherlock (Member)

The results show that optimization partitioned by preset is worse than optimization on the whole collection:

Model: FSRS-5-1
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-1 LogLoss (mean±std): 0.3271±0.1521
FSRS-5-1 RMSE(bins) (mean±std): 0.0511±0.0331
FSRS-5-1 AUC (mean±std): 0.7015±0.0787

Weighted average by log(reviews):
FSRS-5-1 LogLoss (mean±std): 0.3528±0.1690
FSRS-5-1 RMSE(bins) (mean±std): 0.0704±0.0461
FSRS-5-1 AUC (mean±std): 0.7000±0.0888

Weighted average by users:
FSRS-5-1 LogLoss (mean±std): 0.3561±0.1716
FSRS-5-1 RMSE(bins) (mean±std): 0.0733±0.0478
FSRS-5-1 AUC (mean±std): 0.6990±0.0909

parameters: [0.4026, 1.1495, 3.1455, 15.8164, 7.1329, 0.5388, 1.7808, 0.0087, 1.5174, 0.1203, 1.0013, 1.9055, 0.11, 0.2961, 2.325, 0.2262, 3.0157, 0.5121, 0.6506]
Model: FSRS-5-preset
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-preset LogLoss (mean±std): 0.3293±0.1541
FSRS-5-preset RMSE(bins) (mean±std): 0.0527±0.0351
FSRS-5-preset AUC (mean±std): 0.6999±0.0799

Weighted average by log(reviews):
FSRS-5-preset LogLoss (mean±std): 0.3564±0.1718
FSRS-5-preset RMSE(bins) (mean±std): 0.0731±0.0487
FSRS-5-preset AUC (mean±std): 0.6975±0.0890

Weighted average by users:
FSRS-5-preset LogLoss (mean±std): 0.3600±0.1746
FSRS-5-preset RMSE(bins) (mean±std): 0.0762±0.0508
FSRS-5-preset AUC (mean±std): 0.6967±0.0909

parameters: [0.4026, 1.1839, 3.173, 15.691, 7.1908, 0.5345, 1.8675, 0.0046, 1.5458, 0.1192, 1.0193, 1.903, 0.11, 0.2961, 2.2698, 0.2315, 2.9898, 0.5166, 0.6621]

@Expertium (Contributor, Author) commented Dec 2, 2024

Welp...
I still want that data to analyze how RMSE depends on the number of presets. For that, I want you to add each user's number of presets to the .jsonl output file.

@L-M-Sherlock (Member)

Here you are: https://github.com/open-spaced-repetition/srs-benchmark/blob/main/result/FSRS-5-preset.jsonl

@L-M-Sherlock (Member) commented Dec 2, 2024

sibling_review is either 0 or 1, it indicates whether it's a real review or a "pseudoreview". When sibling_review=1, delta_t and grade come from the sibling, not from the card itself.

Your description is unclear to me. Could you tell me what the expected dataset for those rows is?

card_id day_offset rating state duration elapsed_days elapsed_seconds review_th note_id deck_id parent_id preset_id
2703 994 3 0 6689 -1 -1 10472 3839 13 169 0
2703 994 4 1 6173 0 655 10554 3839 13 169 0
2702 994 3 0 17802 -1 -1 11089 3839 13 169 0
2704 994 3 0 7748 -1 -1 11134 3839 13 169 0
2702 994 3 1 4216 0 727 11157 3839 13 169 0
2704 994 4 1 1297 0 401 11178 3839 13 169 0
2702 995 3 2 5651 1 83624 11534 3839 13 169 0
2703 1004 3 2 6017 10 843309 12133 3839 13 169 0
2704 1009 3 2 3000 15 1258804 12269 3839 13 169 0
2702 1019 3 2 12000 24 2037746 12821 3839 13 169 0
2703 1026 3 2 8781 22 1943942 13925 3839 13 169 0
2704 1155 3 2 7893 146 12631189 21014 3839 13 169 0

@Expertium (Contributor, Author) commented Dec 2, 2024

Everything is copied from the sibling: card ID, interval length, grade, everything. Just insert the sibling's data into the card's history. In order to do that, you will need to calculate the order in which the reviews happened.
And, of course, add a new column that contains either 0 ("true" review) or 1 (sibling pseudoreview). And then this number must be passed into FSRS.
Then I'll add new parameters so that sibling reviews are treated differently compared to "true" reviews.

@L-M-Sherlock (Member) commented Dec 2, 2024

I guess we don't need to insert the sibling data. We just need to copy the note's data and add a new column to mark the sibling_review. If the note has three cards, we can make three copies.
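
A sketch of that approach, assuming one user's revlogs are in a pandas DataFrame with the columns from my sample above (the function name and the choice to relabel card_id to the owning card are mine):

import pandas as pd

def expand_with_siblings(df: pd.DataFrame) -> pd.DataFrame:
    # For each card, merge in the reviews of its siblings (same note_id) as
    # pseudoreviews. A note with three cards produces three copies of its rows.
    out = []
    for _, note_df in df.groupby("note_id"):
        note_df = note_df.sort_values("review_th")  # true chronological order
        for card_id in note_df["card_id"].unique():
            copy = note_df.copy()
            # 1 = pseudoreview taken from a sibling, 0 = the card's own review
            copy["sibling_review"] = (copy["card_id"] != card_id).astype(int)
            copy["card_id"] = card_id  # the whole copy now belongs to this card
            out.append(copy)
    return pd.concat(out, ignore_index=True)

Relabeling card_id keeps each expanded history self-contained; the sibling's original id could be kept in an extra column if it's ever needed.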

@Expertium (Contributor, Author)

I have a hard time imagining it, but if you know how to do it, OK. Remember, we need the grade and the interval length from the sibling review. And the order must be correct: if card B (a sibling) was reviewed between, for example, the 2nd and the 3rd reviews of card A, FSRS must process review 1, then review 2, then the sibling pseudoreview, then review 3.

@L-M-Sherlock (Member)

Did you check my #129 (comment)? The rows there are already sorted.

@L-M-Sherlock (Member)

I did it by hand:

card 2703

rating elapsed_days sibling_review
3 -1 0
4 0 0
3 -1 1
3 -1 1
3 0 1
4 0 1
3 1 1
3 10 0
3 15 1
3 24 1
3 22 0
3 146 1

card 2702

rating elapsed_days sibling_review
3 -1 1
4 0 1
3 -1 0
3 -1 1
3 0 0
4 0 1
3 1 0
3 10 1
3 15 1
3 24 0
3 22 1
3 146 1

card 2704

rating elapsed_days sibling_review
3 -1 1
4 0 1
3 -1 1
3 -1 0
3 0 1
4 0 0
3 1 1
3 10 1
3 15 0
3 24 1
3 22 1
3 146 0

Is this what you expected?

@Expertium (Contributor, Author)

Yes, looks correct.

@Expertium (Contributor, Author)

It seems like analysis.py isn't working with this new .jsonl file:

    if abs(result["parameters"][i] - DEFAULT_PARAMETER[i]) <= 1e-4:
KeyError: 0
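
My guess (untested): in the per-preset file, "parameters" is presumably a mapping from preset id to a parameter list rather than a flat list, so indexing it with 0 raises KeyError. Something along these lines would handle both layouts (the structure below is assumed, not confirmed):

# assumed per-preset layout: {"parameters": {preset_id: [19 parameters], ...}}
DEFAULT_PARAMETER = [0.4026, 1.1495, 3.1455]  # truncated for the example
result = {"parameters": {"169": [0.40, 1.15, 3.14]}}

params = result["parameters"]
param_lists = list(params.values()) if isinstance(params, dict) else [params]
for p in param_lists:
    if all(abs(p[i] - DEFAULT_PARAMETER[i]) <= 1e-4
           for i in range(len(DEFAULT_PARAMETER))):
        print("parameters are still at the defaults")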

@L-M-Sherlock (Member) commented Dec 2, 2024

9a22585 supports the new .jsonl file.

@Expertium (Contributor, Author) commented Dec 2, 2024

Thank you, although I had already analyzed what I wanted anyway.
@DerIshmaelite here's what you wanted to know so badly:
[figure: RMSE as a function of the number of presets]
Correlation coefficient = -0.056

The correlation coefficient between the average RMSE and the number of presets is virtually 0. Visually, I was expecting to see a U-shaped curve with a minimum corresponding to the best number of presets, but nope. And according to the benchmark, RMSE is actually 3-4% worse (relative) when FSRS is optimized on several presets rather than on the entire collection.

Note that I can't extract the number of reviews per preset from the file Jarrett gave me, only the total number of reviews across all presets.

P.S. Out of 9999 collections, the maximum number of presets is 130. So DerIshmaelite, your 273 (or whatever number it was) is literally off the charts.

@brishtibheja

If you calculate RMSE on a lower number of reviews, do you not get a lower RMSE too? (I'm not following this conversation too closely, so I might have misunderstood the fundamentals.)

@Expertium (Contributor, Author)

You get a higher RMSE with a low number of reviews.

@brishtibheja

Yeah, I meant that one. Actually, in layman's terms, I'm asking whether the following is still valid:

1. Create one collection-preset.
2. Optimise on everything.
3. Select a deck.
4. Clone the current preset.
5. In the new preset, click Optimise.
6. If you get new parameters, you're better off.

This works every single time if you have enough reviews in a single deck.

@Expertium (Contributor, Author)

That I don't know.

@Expertium (Contributor, Author) commented Dec 2, 2024

@L-M-Sherlock please add FSRS-5-preset to the results table.
