
Discussing the new dataset and benchmark #129

Open · Expertium opened this issue Oct 29, 2024 · 26 comments

@Expertium (Contributor) commented Oct 29, 2024

ankitects/anki#3511 (comment)

I have a few questions regarding that:

  1. Will we keep using the default parameters based on the old dataset or on the new one? I think it's better to use the default parameters from the old one, since it has 20k users, and the new one will have 10k. So theoretically, the default parameters based on the old one should be slightly more accurate.
  2. How will you make tables with the metrics? Since we want to compare optimization on the entire collection vs optimization on every deck, I assume you will make the current two tables longer? Or add two more tables?
  3. Regarding siblings. It would be interesting to analyze how much using sibling reviews as "pseudoreviews" could help, assuming you aren't too burned out for that. I see two ways:
  • add a new column, like sibling_review, with values being 0 or 1
  • add new grades. Like this: Again = 1, Hard = 2, Good = 3, Easy = 4, Again (sibling) = 5, Hard (sibling) = 6, Good (sibling) = 7, Easy (sibling) = 8.
    Choose whichever is more convenient. Then insert these pseudoreviews into cards' histories. That way, when running the optimizer, we can use those pseudoreviews to update the memory state. I'll add new parameters.
    Of course, we can't use this in Anki, but it's interesting from a theoretical perspective.

Also, please add the total number of decks used for optimization to the .jsonl output file. I want to plot RMSE as a function of the number of decks used for optimization, to see if there is a magical number of decks such that splitting them any further is not beneficial, or even detrimental.
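For concreteness, this is the kind of script I have in mind for that plot. It's a rough sketch: I'm assuming each line of the .jsonl gains a field like n_decks (the name is hypothetical) next to the per-user metrics.

import json
import matplotlib.pyplot as plt

n_decks, rmse = [], []
with open("FSRS-5-deck.jsonl") as f:  # hypothetical file name
    for line in f:
        result = json.loads(line)
        n_decks.append(result["n_decks"])             # proposed new field
        rmse.append(result["metrics"]["RMSE(bins)"])  # assumed layout

plt.scatter(n_decks, rmse, s=5, alpha=0.5)
plt.xlabel("number of decks used for optimization")
plt.ylabel("RMSE(bins)")
plt.show()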

@user1823 you are welcome to participate

@L-M-Sherlock (Member)

The parameters are very similar, so I think the first question doesn't matter.

$ python evaluate.py --fast
Model: FSRS-5-dev
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-dev LogLoss (mean±std): 0.3271±0.1521
FSRS-5-dev RMSE(bins) (mean±std): 0.0511±0.0331
FSRS-5-dev AUC (mean±std): 0.7015±0.0787

Weighted average by log(reviews):
FSRS-5-dev LogLoss (mean±std): 0.3528±0.1690
FSRS-5-dev RMSE(bins) (mean±std): 0.0704±0.0461
FSRS-5-dev AUC (mean±std): 0.7000±0.0888

Weighted average by users:
FSRS-5-dev LogLoss (mean±std): 0.3561±0.1716
FSRS-5-dev RMSE(bins) (mean±std): 0.0733±0.0478
FSRS-5-dev AUC (mean±std): 0.6990±0.0909

parameters: [0.4026, 1.1495, 3.1455, 15.8164, 7.1329, 0.5388, 1.7808, 0.0087, 1.5174, 0.1203, 1.0013, 1.9055, 0.11, 0.2961, 2.325, 0.2262, 3.0157, 0.5121, 0.6506]
Model: FSRS-5
Total number of users: 9995
Total number of reviews: 355281295
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3154±0.1485
FSRS-5 RMSE(bins) (mean±std): 0.0489±0.0319
FSRS-5 AUC (mean±std): 0.7009±0.0783

Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3427±0.1666
FSRS-5 RMSE(bins) (mean±std): 0.0681±0.0455
FSRS-5 AUC (mean±std): 0.6988±0.0915

Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3468±0.1695
FSRS-5 RMSE(bins) (mean±std): 0.0712±0.0475
FSRS-5 AUC (mean±std): 0.6972±0.0942

parameters: [0.402, 1.182, 3.1332, 15.8757, 7.131, 0.5479, 1.769, 0.0085, 1.5236, 0.1174, 1.0077, 1.905, 0.11, 0.2978, 2.3352, 0.2315, 3.025, 0.5166, 0.6641]
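
As a quick check (my own snippet, lists pasted from the output above), the largest relative difference between the two parameter sets is under 3%:

dev = [0.4026, 1.1495, 3.1455, 15.8164, 7.1329, 0.5388, 1.7808, 0.0087, 1.5174, 0.1203, 1.0013, 1.9055, 0.11, 0.2961, 2.325, 0.2262, 3.0157, 0.5121, 0.6506]
old = [0.402, 1.182, 3.1332, 15.8757, 7.131, 0.5479, 1.769, 0.0085, 1.5236, 0.1174, 1.0077, 1.905, 0.11, 0.2978, 2.3352, 0.2315, 3.025, 0.5166, 0.6641]
# largest relative deviation across all 19 parameters
print(max(abs(a - b) / abs(b) for a, b in zip(dev, old)))  # ≈ 0.0275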

@Expertium (Contributor, Author)

Oh, and please upload the dataset here: https://huggingface.co/open-spaced-repetition
With the same license as the previous one

@L-M-Sherlock (Member)

Oh, and please upload the dataset here

I have asked Dae for permission, but I haven't received a reply yet. I have uploaded the dataset, but the repo is still private for now.

@L-M-Sherlock (Member)

The dataset is hosted at: https://huggingface.co/datasets/open-spaced-repetition/anki-revlogs-10k

@Expertium (Contributor, Author)

Nice. When you benchmark optimizing parameters on separate decks/presets, please don't forget to add the number of decks/presets to the .jsonl output file.

@Expertium (Contributor, Author)

Copying what I said on Discord:

I'm just thinking about how we should approach benchmarking, because there are a lot of options:

  1. Optimize parameters based on every level 1 deck
  2. Optimize parameters based on every level 2/3/n deck (aka subdecks)
  3. Optimize parameters based on every deck (any level), meaning that even tiny subdecks get their own parameters
  4. Optimize parameters based on every preset

@Expertium (Contributor, Author)

Also, @L-M-Sherlock, I would really appreciate it if you implemented number 3 from my original comment in this issue. I want to see how much we can improve FSRS with sibling information.
Here's how I imagine the dataset will look (simplified):
[image: simplified example of the dataset with a sibling_review column]
sibling_review is either 0 or 1; it indicates whether it's a real review or a "pseudoreview". When sibling_review=1, delta_t and grade come from the sibling, not from the card itself.
This will require traversing the entire dataset and inserting new rows. I imagine coding this won't be easy. Still, I really hope that you will do it.
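
To make the "new parameters" part concrete, here is a toy sketch of what I mean. This is NOT the actual FSRS formulas; it's a simplified update rule with a hypothetical w_sibling parameter that controls how strongly a pseudoreview moves the memory state.

def update_stability(stability: float, grade: int, sibling_review: int,
                     w_sibling: float = 0.5) -> float:
    # Toy update rule (not real FSRS): success grows stability, a lapse shrinks it.
    factor = 2.0 if grade >= 2 else 0.5
    if sibling_review:
        # Hypothetical: a pseudoreview only moves stability part of the way.
        factor = 1.0 + (factor - 1.0) * w_sibling
    return stability * factor

# Example: a real Good review, then a sibling's Again as a pseudoreview.
s = 1.0
s = update_stability(s, grade=3, sibling_review=0)  # -> 2.0
s = update_stability(s, grade=1, sibling_review=1)  # -> 2.0 * 0.75 = 1.5
print(s)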

@L-M-Sherlock (Member) commented Dec 1, 2024

I found a problem when trying to optimize parameters on separate decks/presets. If the user deleted their old cards, we cannot know the siblings and presets of the cards from their old review logs. So we would have to discard those review logs from the dataset.

Edit: I will fill in -1 for this case.

@L-M-Sherlock (Member)

The results show that optimization partitioned by preset is worse than optimization on the whole collection:

Model: FSRS-5-1
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-1 LogLoss (mean±std): 0.3271±0.1521
FSRS-5-1 RMSE(bins) (mean±std): 0.0511±0.0331
FSRS-5-1 AUC (mean±std): 0.7015±0.0787

Weighted average by log(reviews):
FSRS-5-1 LogLoss (mean±std): 0.3528±0.1690
FSRS-5-1 RMSE(bins) (mean±std): 0.0704±0.0461
FSRS-5-1 AUC (mean±std): 0.7000±0.0888

Weighted average by users:
FSRS-5-1 LogLoss (mean±std): 0.3561±0.1716
FSRS-5-1 RMSE(bins) (mean±std): 0.0733±0.0478
FSRS-5-1 AUC (mean±std): 0.6990±0.0909

parameters: [0.4026, 1.1495, 3.1455, 15.8164, 7.1329, 0.5388, 1.7808, 0.0087, 1.5174, 0.1203, 1.0013, 1.9055, 0.11, 0.2961, 2.325, 0.2262, 3.0157, 0.5121, 0.6506]
Model: FSRS-5-preset
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-preset LogLoss (mean±std): 0.3293±0.1541
FSRS-5-preset RMSE(bins) (mean±std): 0.0527±0.0351
FSRS-5-preset AUC (mean±std): 0.6999±0.0799

Weighted average by log(reviews):
FSRS-5-preset LogLoss (mean±std): 0.3564±0.1718
FSRS-5-preset RMSE(bins) (mean±std): 0.0731±0.0487
FSRS-5-preset AUC (mean±std): 0.6975±0.0890

Weighted average by users:
FSRS-5-preset LogLoss (mean±std): 0.3600±0.1746
FSRS-5-preset RMSE(bins) (mean±std): 0.0762±0.0508
FSRS-5-preset AUC (mean±std): 0.6967±0.0909

parameters: [0.4026, 1.1839, 3.173, 15.691, 7.1908, 0.5345, 1.8675, 0.0046, 1.5458, 0.1192, 1.0193, 1.903, 0.11, 0.2961, 2.2698, 0.2315, 2.9898, 0.5166, 0.6621]

@Expertium (Contributor, Author) commented Dec 2, 2024

Welp...
I still want that data to analyze how RMSE depends on the number of presets. For that, I want you to add each user's number of presets to the .jsonl output file.

@L-M-Sherlock (Member)

Here you are: https://github.com/open-spaced-repetition/srs-benchmark/blob/main/result/FSRS-5-preset.jsonl

@L-M-Sherlock (Member) commented Dec 2, 2024

sibling_review is either 0 or 1, it indicates whether it's a real review or a "pseudoreview". When sibling_review=1, delta_t and grade come from the sibling, not from the card itself.

Your description is unclear to me. Could you tell me what the expected dataset for those rows is?

card_id day_offset rating state duration elapsed_days elapsed_seconds review_th note_id deck_id parent_id preset_id
2703 994 3 0 6689 -1 -1 10472 3839 13 169 0
2703 994 4 1 6173 0 655 10554 3839 13 169 0
2702 994 3 0 17802 -1 -1 11089 3839 13 169 0
2704 994 3 0 7748 -1 -1 11134 3839 13 169 0
2702 994 3 1 4216 0 727 11157 3839 13 169 0
2704 994 4 1 1297 0 401 11178 3839 13 169 0
2702 995 3 2 5651 1 83624 11534 3839 13 169 0
2703 1004 3 2 6017 10 843309 12133 3839 13 169 0
2704 1009 3 2 3000 15 1258804 12269 3839 13 169 0
2702 1019 3 2 12000 24 2037746 12821 3839 13 169 0
2703 1026 3 2 8781 22 1943942 13925 3839 13 169 0
2704 1155 3 2 7893 146 12631189 21014 3839 13 169 0

@Expertium (Contributor, Author) commented Dec 2, 2024

Everything is copied from the sibling: card ID, interval length, grade, everything. Just insert the sibling's data into the card's history. In order to do that, you will need to calculate the order in which the reviews happened.
And, of course, add a new column that contains either 0 ("true" review) or 1 (sibling pseudoreview). And then this number must be passed into FSRS.
Then I'll add new parameters so that sibling reviews are treated differently compared to "true" reviews.

@L-M-Sherlock (Member) commented Dec 2, 2024

I guess we don't need to insert the sibling data. We just need to copy the note's data and add a new column to mark the sibling_review. If the note has three cards, we can make three copies.
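
A sketch of that approach, assuming one user's revlogs are in a pandas DataFrame with the columns from my sample above (the function name and the choice to relabel card_id to the owning card are mine):

import pandas as pd

def expand_with_siblings(df: pd.DataFrame) -> pd.DataFrame:
    # For each card, merge in the reviews of its siblings (same note_id) as
    # pseudoreviews. A note with three cards produces three copies of its rows.
    out = []
    for _, note_df in df.groupby("note_id"):
        note_df = note_df.sort_values("review_th")  # true chronological order
        for card_id in note_df["card_id"].unique():
            copy = note_df.copy()
            # 1 = pseudoreview taken from a sibling, 0 = the card's own review
            copy["sibling_review"] = (copy["card_id"] != card_id).astype(int)
            copy["card_id"] = card_id  # the whole copy now belongs to this card
            out.append(copy)
    return pd.concat(out, ignore_index=True)

Relabeling card_id keeps each expanded history self-contained; the sibling's original id could be kept in an extra column if it's ever needed.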

@Expertium (Contributor, Author)

I have a hard time imagining it, but if you know how to do it, OK. Remember, we need the grade and the interval length from the sibling review. And the order must be correct: if card B (a sibling) was reviewed between, for example, the 2nd and the 3rd reviews of card A, FSRS must process review 1, then review 2, then the sibling pseudoreview, then review 3.

@L-M-Sherlock (Member)

Did you check my #129 (comment)? The rows there are already sorted.

@L-M-Sherlock (Member)

I did it by hand:

card 2703

rating elapsed_days sibling_review
3 -1 0
4 0 0
3 -1 1
3 -1 1
3 0 1
4 0 1
3 1 1
3 10 0
3 15 1
3 24 1
3 22 0
3 146 1

card 2702

rating elapsed_days sibling_review
3 -1 1
4 0 1
3 -1 0
3 -1 1
3 0 0
4 0 1
3 1 0
3 10 1
3 15 1
3 24 0
3 22 1
3 146 1

card 2704

rating elapsed_days sibling_review
3 -1 1
4 0 1
3 -1 1
3 -1 0
3 0 1
4 0 0
3 1 1
3 10 1
3 15 0
3 24 1
3 22 1
3 146 0

Is this what you expected?

@Expertium (Contributor, Author)

Yes, looks correct.

@Expertium (Contributor, Author)

It seems like analysis.py isn't working with this new .jsonl file:

    if abs(result["parameters"][i] - DEFAULT_PARAMETER[i]) <= 1e-4:
KeyError: 0
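
My guess (untested): in the per-preset file, "parameters" is presumably a mapping from preset id to a parameter list rather than a flat list, so indexing it with 0 raises KeyError. Something along these lines would handle both layouts (the structure below is assumed, not confirmed):

# assumed per-preset layout: {"parameters": {preset_id: [19 parameters], ...}}
DEFAULT_PARAMETER = [0.4026, 1.1495, 3.1455]  # truncated for the example
result = {"parameters": {"169": [0.40, 1.15, 3.14]}}

params = result["parameters"]
param_lists = list(params.values()) if isinstance(params, dict) else [params]
for p in param_lists:
    if all(abs(p[i] - DEFAULT_PARAMETER[i]) <= 1e-4
           for i in range(len(DEFAULT_PARAMETER))):
        print("parameters are still at the defaults")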

@L-M-Sherlock (Member) commented Dec 2, 2024

9a22585 supports the new .jsonl file.

@Expertium (Contributor, Author) commented Dec 2, 2024

Thank you, although I had already analyzed what I wanted anyway.
@DerIshmaelite here's what you wanted to know so badly:
[figure: RMSE as a function of the number of presets]
Correlation coefficient = -0.056

The correlation coefficient between the average RMSE and the number of presets is virtually 0. Visually, I was expecting to see a U-shaped curve with a minimum corresponding to the best number of presets, but nope. And according to the benchmark, RMSE is actually 3-4% worse (relative) when FSRS is optimized on several presets rather than on the entire collection.

Note that I can't extract the number of reviews per preset from the file Jarrett gave me, only the total number of reviews across all presets.

P.S. Out of 9999 collections, the maximum number of presets is 130. So DerIshmaelite, your 273 (or whatever number it was) is literally off the charts.

@brishtibheja

If you calculate RMSE on a lower number of reviews, do you not get a lower RMSE too? (I'm not following this conversation too closely, so I might have misunderstood the fundamentals.)

@Expertium (Contributor, Author)

You get a higher RMSE with a low number of reviews.

@brishtibheja

Yeah, I meant that one. Actually, in layman's terms, I'm asking whether the following is still valid:

1. Create one collection-preset.
2. Optimise on everything.
3. Select a deck.
4. Clone the current preset.
5. In the new preset, click Optimise.
6. If you get new parameters, you're better off.

This works every single time if you have enough reviews in a single deck.

@Expertium (Contributor, Author)

That I don't know.

@Expertium (Contributor, Author) commented Dec 2, 2024

@L-M-Sherlock please add FSRS-5-preset to the results table.
