Stratified sampling #83
Good job, seems to be working with MADOS on my end.
One nitpick: the stratification itself takes quite a long time, and one of the use cases for limiting dataset size is a quick test run on a few samples. Could you add a toggle to restore the old behavior of plain random sampling?
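For illustration, a toggle of this kind could look like the sketch below. It is not the code from this PR: the function name `select_indices`, the `stratified` flag, and the assumption of one class label per sample are all hypothetical, chosen only to show the two code paths side by side.

```python
import random
from collections import defaultdict


def select_indices(labels, fraction, stratified=True, seed=0):
    """Return sorted indices for a subset holding roughly `fraction` of the samples.

    Hypothetical helper: `labels` is assumed to hold one class label per sample.
    """
    rng = random.Random(seed)
    n_samples = len(labels)

    if not stratified:
        # Old behavior: plain random sampling, no pass over the labels needed.
        return sorted(rng.sample(range(n_samples), int(n_samples * fraction)))

    # Group sample indices by label, then draw the same fraction from each group.
    indices_per_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_per_class[label].append(idx)

    selected = []
    for class_indices in indices_per_class.values():
        # Keep at least one sample per class so rare classes are not dropped.
        n_pick = max(1, int(len(class_indices) * fraction))
        selected.extend(rng.sample(class_indices, n_pick))
    return sorted(selected)
```

With `stratified=False` the labels are never inspected, which is what makes the quick test run cheap.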
pangaea/run.py
Outdated
range(n_train_samples), int(n_train_samples * cfg.limited_label)
)
train_dataset = Subset(train_dataset, indices)
# n_train_samples = len(train_dataset)
You can remove these, that's why we have git :)
Hi, before merging, please consider the following changes (already discussed in private, but I am reporting them here for everyone else).
To address the comments, some modifications were made:
Added regression stratification. @RituYadav92, is it possible to validate that it works on your side with BioMassters?
…tion of labels from each bin
Previous code: a fraction of the labels was selected from the sorted values. For biomass specifically, this meant selecting the samples with the lowest biomass.
I made two modifications to the code:
Please update the same for classification if it fits well.
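To make the difference concrete, here is a minimal sketch of drawing a fraction of labels from each bin of a regression target, rather than taking the first fraction of the sorted values. It is not the code from this PR: the name `stratified_regression_indices`, the equal-width binning, and the `n_bins` default are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_regression_indices(targets, fraction, n_bins=4, seed=0):
    """Pick roughly `fraction` of the samples from every equal-width target bin.

    Hypothetical helper: `targets` is assumed to hold one scalar per sample.
    """
    rng = random.Random(seed)
    lo, hi = min(targets), max(targets)
    # Fall back to a width of 1.0 when all targets are equal (zero range).
    width = (hi - lo) / n_bins or 1.0

    indices_per_bin = defaultdict(list)
    for idx, value in enumerate(targets):
        # Clamp so the maximum value lands in the last bin.
        bin_id = min(int((value - lo) / width), n_bins - 1)
        indices_per_bin[bin_id].append(idx)

    selected = []
    for bin_indices in indices_per_bin.values():
        n_pick = max(1, int(len(bin_indices) * fraction))
        selected.extend(rng.sample(bin_indices, n_pick))
    return sorted(selected)
```

Because every non-empty bin contributes samples, the subset covers the whole target range instead of only the lowest values.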
The code looks fine now except line 68 in subset_sampler.py. It should be "if bin_id in indices_per_bin:" instead of "if bin_id not in indices_per_bin:"
Please check.
I think it is fine. What do you suspect?
I see now, you didn't adapt the initialization from regression but did it the other way. No problem. Resolved.
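For readers following along, the two checks go with two different ways of initializing the dictionary. The sketch below is a guess at the general pattern, not the actual contents of subset_sampler.py; the function names and the `n_bins` parameter are hypothetical.

```python
def group_lazy(bin_ids):
    """Create each bin's list on first sight, hence the `not in` check."""
    indices_per_bin = {}
    for idx, bin_id in enumerate(bin_ids):
        if bin_id not in indices_per_bin:
            indices_per_bin[bin_id] = []
        indices_per_bin[bin_id].append(idx)
    return indices_per_bin


def group_preinitialised(bin_ids, n_bins):
    """Create every bin up front, hence the `in` check before appending."""
    indices_per_bin = {b: [] for b in range(n_bins)}
    for idx, bin_id in enumerate(bin_ids):
        if bin_id in indices_per_bin:
            indices_per_bin[bin_id].append(idx)
    return indices_per_bin
```

Both variants group the same indices; the pre-initialized one additionally keeps empty bins as keys and silently skips bin ids outside the expected range.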