Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stratified sampling #83

Merged
merged 22 commits into from
Oct 10, 2024
Merged

Stratified sampling #83

merged 22 commits into from
Oct 10, 2024

Conversation

alishibli97
Copy link
Collaborator

No description provided.

Copy link
Collaborator

@KerekesDavid KerekesDavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, seems to be working with MADOS on my end.

One nitpick, since the stratification itself takes quite long, and one of the usecases for limiting dataset size is to have a quick test run on a few samples, could you add a toggle to have the old behavior of just random sampling?

pangaea/run.py Outdated
range(n_train_samples), int(n_train_samples * cfg.limited_label)
)
train_dataset = Subset(train_dataset, indices)
# n_train_samples = len(train_dataset)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove these, that's why we have git :)

@KerekesDavid KerekesDavid linked an issue Oct 4, 2024 that may be closed by this pull request
@VMarsocci
Copy link
Owner

Hi before merging, please consider the following changes (already discussed in private- but I report them here for everyone else).

  1. add the possibility to choose the stratified sampling on a single set (either train or val)
  2. check the last version of HLSBurnScars (in the one you want to merge it is not updated)
  3. replace the "calculate class distribution" function with just reading this distribution in the dataset config
    Thanks a lot :)

@yurujaja
Copy link
Collaborator

yurujaja commented Oct 7, 2024

To address the comments, some modifications are made:

  1. configure sampling for both train and val
  2. configure the limited label strategy: random or stratified
  3. the unused function in hlsburn scar is removed
  4. the "calculate class distribution" function is still needed to calculate image-wise distribution to enable the selection of images

@alishibli97
Copy link
Collaborator Author

Added Regression stratification.
In train.yaml, one can specify the stratification method using the param:
limited_label_strategy: stratified_classification # or stratified_regression, or random

@RituYadav92 is it possible to validate that it works on your side with biomastters?

@yurujaja yurujaja requested a review from RituYadav92 October 8, 2024 15:08
…tion of labels from each bin

Previous code: A fraction of labels were selected from the sorted values. Specifically, for biomass, it was selecting samples with the lowest biomass.
@RituYadav92
Copy link
Collaborator

I made two modifications to the code:

  1. Updated the code to select a fraction of labels from each bin.
  2. Updated variable names "labeled_idx, unlabeled_idx" to "selected_idx, other_idx"

Please update the same for classification if it fits well.

Copy link
Collaborator

@RituYadav92 RituYadav92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine now except line 68 in subset_sampler.py. It should be "if bin_id in indices_per_bin:" instead of "if bin_id not in indices_per_bin:"

Please check.

@alishibli97
Copy link
Collaborator Author

I think it is fine, what do you suspect?
It is only to initialize the dict key if not present

@RituYadav92
Copy link
Collaborator

I see now, you didn't adapt the initialization from regression but did it other way. Np problem. Resolved.

@yurujaja yurujaja merged commit 5794b4c into main Oct 10, 2024
1 check passed
@yurujaja yurujaja deleted the stratified_sampling branch October 10, 2024 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Limited label problem and startified sampling
7 participants