Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sub-unf02 and sub-unf03 datasets #7

Merged
merged 8 commits into from
Jan 13, 2021
Merged

Add sub-unf02 and sub-unf03 datasets #7

merged 8 commits into from
Jan 13, 2021

Conversation

ahill187
Copy link
Contributor

@ahill187 ahill187 commented Jan 5, 2021

Currently, during testing of the ivadomed package, the script called test_script.py makes two copies of the sub-unf01 test data. However this raises two issues:

  1. That script isn't always run first, so some of the tests are reliant on these additional datasets and fail if the test_script.py script hasn't run yet.
  2. Simplicity. I think the tradeoff for downloading a bit of extra data is probably better than making copies of test data during testing; it makes the code simpler to follow, and minimizes room for error.

This pull request saves these copies in the testing data so they do not have to be recreated.

Note: The last merge into master of ivadomed is failing (https://github.com/ivadomed/ivadomed/actions/runs/461790903) due to this issue

Copy link
Member

@jcohenadad jcohenadad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not a big fan of duplicating those datasets under names XX02 and XX03, because according to the BIDS convention, which the lab follows, this implies 02 and 03 are two other subjects, whereas in reality they are the same subject (duplicated). So, if we decide to explicitly store unf-02 and unf-03, i would instead upload the actual unf-02 and unf-03 datasets (which actually exist)

@ahill187
Copy link
Contributor Author

ahill187 commented Jan 5, 2021

@jcohenadad Oh I see, I didn't realize that. If I wanted to use the actual unf-02 and unf-03 datasets, where would I find them?

@jcohenadad
Copy link
Member

@jcohenadad Oh I see, I didn't realize that. If I wanted to use the actual unf-02 and unf-03 datasets, where would I find them?

https://github.com/spine-generic/data-multi-subject#spine-generic-public-database-multi-subject 😊

@ahill187
Copy link
Contributor Author

ahill187 commented Jan 5, 2021

@jcohenadad Okay, I have used the data for sub-unf02 and sub-unf03, from here: https://github.com/spine-generic/data-multi-subject, so they are no longer copies. Should be fixed in this commit: a83212e.

@jcohenadad
Copy link
Member

@jcohenadad Okay, I have used the data for sub-unf02 and sub-unf03, from here: https://github.com/spine-generic/data-multi-subject, so they are no longer copies. Should be fixed in this commit: a83212e.

are you sure you committed the physical file? (not the symlink from the annexed file: /annex/objects/SHA256E-s139692--53de14288ff8d1211cd55c63f426e27f7122a155e451c930a7f9a233e4ebe036.nii.gz) ?

@ahill187
Copy link
Contributor Author

ahill187 commented Jan 5, 2021

are you sure you committed the physical file? (not the symlink from the annexed file: /annex/objects/SHA256E-s139692--53de14288ff8d1211cd55c63f426e27f7122a155e451c930a7f9a233e4ebe036.nii.gz) ?

Oh thanks, I missed the derivatives folder! Fixed here: 92ed86c

df_ref.csv Show resolved Hide resolved
Copy link
Member

@jcohenadad jcohenadad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah... we're hitting another problem here: the size. Data for unf-01 were modified by reducing their size, see details here #2

the reason is to have a dataset as light as possible, because it gets downloaded each time we need to run CI. Ideally the testing dataset should only be a few MB...

@jcohenadad
Copy link
Member

@ahill187 about e4063c3, why not 1mm iso as specified in #2 (comment)?

@ahill187
Copy link
Contributor Author

ahill187 commented Jan 8, 2021

@jcohenadad No reason in particular, I looked at the file sizes in sub-unf02 and sub-unf03 and compared them with sub-unf01, and determined that if I downsampled by a factor of 1/10 I would achieve about the same file size.
I used the command:

sct_resample -i sub-unf03_T2w_labels-disc-manual.nii.gz -f 0.1x0.1x0.1

Will it make a difference? Sorry I'm not so familiar with these files yet :)

@jcohenadad
Copy link
Member

I think the first question is: what do we need these extra data for-- if we need them only for "crash test", then the content doesn't really matter, and we can squeeze the size even more-- if content matters, then the other question is: how many data and what data do we need for integrity testing? if we only need one subject (unf01), then we can squeeze unf2 and unf3 much more-- however if we need "information" in more than unf01, then we should keep some content-- @lrouhier @andreanne-lemay @ivadomed/editors could you pls advise? thanks

@ahill187
Copy link
Contributor Author

ahill187 commented Jan 8, 2021

I believe some of the tests rely on the extra data, although to what extent I am unsure. In test_script.py, these extra copies are created, presumably to be used for some of the tests? Perhaps someone else might have a better idea though.

def test_creation_dataset():
    # Add new file as needed (no empty test/validation)
    # create empty directory for our new files
    os.makedirs("testing_data/sub-test002/anat/", exist_ok=True)
    os.makedirs("testing_data/sub-test003/anat/", exist_ok=True)
    os.makedirs("testing_data/derivatives/labels/sub-test002/anat/", exist_ok=True)
    os.makedirs("testing_data/derivatives/labels/sub-test003/anat/", exist_ok=True)

    # sub-test002 and sub-test003 will just be copy of our only real testing subject
    command = "cp testing_data/sub-unf01/anat/sub-unf01_T2w.nii.gz testing_data/sub-test002/anat/sub-test002" + \
              "_T2w.nii.gz"
    subprocess.check_output(command, shell=True)

    command = "cp testing_data/sub-unf01/anat/sub-unf01_T2w.nii.gz testing_data/sub-test003/anat/sub-test003" + \
              "_T2w.nii.gz"
    subprocess.check_output(command, shell=True)

    # populate derivatives for sub-test002
    derivatives = "testing_data/derivatives/labels/"
    command = "cp " + derivatives + "sub-unf01/anat/sub-unf01_T2w_seg-manual.nii.gz " + \
              derivatives + "sub-test002/anat/sub-test002" + \
              "_T2w_seg-manual.nii.gz"
    subprocess.check_output(command, shell=True)

    command = "cp " + derivatives + "sub-unf01/anat/sub-unf01_T2w_lesion-manual.nii.gz " + \
              derivatives + "sub-test002/anat/sub-test002" + \
              "_T2w_lesion-manual.nii.gz"
    subprocess.check_output(command, shell=True)

    # populate derivatives for sub-test003
    command = "cp " + derivatives + "sub-unf01/anat/sub-unf01_T2w_seg-manual.nii.gz " + \
              derivatives + "sub-test003/anat/sub-test003" + \
              "_T2w_seg-manual.nii.gz"
    subprocess.check_output(command, shell=True)

    command = "cp " + derivatives + "sub-unf01/anat/sub-unf01_T2w_lesion-manual.nii.gz " + \
              derivatives + "sub-test003/anat/sub-test003" + \
              "_T2w_lesion-manual.nii.gz"
    subprocess.check_output(command, shell=True)

    # Model needs to be inside the log_directory since we use a config file.
    command = "cp testing_data/model_unet_test.pt testing_script/best_model.pt"
    subprocess.check_output(command, shell=True)

    list1 = ["sub-test002", "-"]
    list2 = ["sub-test003", "-"]

    # add subjects to participants.tsv
    append_list_as_row("testing_data/participants.tsv", list1)
    append_list_as_row("testing_data/participants.tsv", list2)

@andreanne-lemay
Copy link
Member

The content of the subject doesn't matter (for now) for the test we are doing. We need at least 3 subjects to have one for each set: testing, validation, and training. I personally would go with adding 2 subjects in the testing data with minimal size. This way we don't need a test to duplicate the subjects which is not very clean.

@ahill187
Copy link
Contributor Author

Okay cool! I think that is what we have implemented in this pull request currently. @jcohenadad what are your thoughts?

@ahill187 ahill187 requested a review from jcohenadad January 12, 2021 18:41
@jcohenadad
Copy link
Member

Okay cool! I think that is what we have implemented in this pull request currently. @jcohenadad what are your thoughts?

Could you please try running ivadomed's testing framework on this testing-data branch to see if it passes?

@ahill187
Copy link
Contributor Author

Could you please try running ivadomed's testing framework on this testing-data branch to see if it passes?

Yes, I've run it locally in Python 3.8 and it passes, so long as we remove the test_script.py file. That file is the one that was creating all these duplicated files in the first place, and it also runs CLI tests. The former functionality is made redundant by this pull request, and the latter functionality will be made redundant by the funtional tests PR. I've created a PR here to remove the file:
ivadomed/ivadomed#634

Copy link
Member

@jcohenadad jcohenadad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @ahill187 🙏
Please make sure to click "squash and merge" (to make sure we don't accumulate big files on the main branch)

@ahill187 ahill187 merged commit 8a29a0d into master Jan 13, 2021
@ahill187 ahill187 deleted the ah/update-data branch January 13, 2021 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants