-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support download of manifests that refer to prior IDC versions #100
Comments
regarding 2: We do have https://github.com/ImagingDataCommons/idc-index/blob/main/idc_index/index.py#L1166 I was thinking about about enabling users to work with any idc-version they would want. For that I thought about creating a table that shows which idc-index contains which idc-index-data. We could simply add idc_version as another argument to download functions. Fetch latest idc-index-data for requested idc version I don't know if we should encourage research on cross-versioned data unless necessary like working with a new collection introduced in newer idc releases. |
It is definitely necessary. In the general case, we will not know what version of IDC was used to generate a specific manifest. There is also nothing wrong, in principle, in having a manifest that refers to items from different versions. |
There is a hierarchy of BQ tables, version/collection/study/series/instance, in each idc_vX_dev that captures all idc-collected/generated metadata up through that version. So the hierarchy in idc_v19_dev will capture all such metadata up through v19. (It does not include any dicom metadata. ) |
Re ImagingDataCommons#100 Whenever a crdc_series_uuid from the manifest is not matched to those known to the index, provide error message informing the user of what could be the reasons. Fixed error checking for unrecognized items in the validation function. Report unrecognized items independently of whether validation is requested or not.
@bcli4d this is the query that I understand would give me what I need - details for all of the series that are not available in the latest version. Am I correct? I will also need your advice on how to add additional columns to this table form WITH
cross_series AS (
SELECT
DISTINCT collection_id,
submitter_case_id AS PatientID,
study_instance_uid AS StudyInstanceUID,
series_instance_uid AS SeriesInstanceUID,
se_uuid AS crdc_series_uuid,
IF
(se_sources.tcia, pub_gcs_tcia_url, pub_gcs_idc_url) gcs_bucket,
IF
(se_sources.tcia, pub_aws_tcia_url, pub_aws_idc_url) aws_bucket,
se_init_idc_version,
se_rev_idc_version,
se_final_idc_version
FROM
`idc-dev-etl.idc_v18_dev.all_joined_public`
ORDER BY
collection_id,
PatientID,
StudyInstanceUID,
SeriesInstanceUID)
SELECT
cross_series.*
FROM
cross_series
LEFT JOIN
`bigquery-public-data.idc_v18.dicom_all` AS dicom_all
ON
cross_series.crdc_series_uuid = dicom_all.crdc_series_uuid
WHERE
dicom_all.crdc_series_uuid IS NULL @vkt1414 once we sort this out, I am thinking to add this as a version delta table (right now it is just about 55K rows) to the pypi release, and then use it to try to resolve unmatched UUIDs from the manifest in |
@fedorov, I don't understand why you want to exclude series in the current release, rather that a table of series across all releases. Anyway, the above SQL appears to do what you want. |
Let's say someone cares about reproducibility and was interested in knowing the version of data they are working with. Per our documentation, this answer is currently answered by calling get_idc_version(). so my point again is to absolutely allow the users to work with any version of data, but not cross versions. In my opinion, that is the only way to truly reproduce results. Now, I'm not completely opposed to the solutions we are trying to find, but we must at least notify the user that we are referring to other versions of idc. As for the query, I would simply have one table. i.e we could change our idc-index main query to capture every series, pointing to the highest idc version, wherever applicable. I'll write the query and update this |
The se_rev_idc_version, and se_final_idc_version tell you to which IDC versions a series belongs. I.E. it is in IDC versions se_rev_idc_version through se_final_idc_version if se_final_idc_version != 0. Else it is in IDC versions idc_rev_idc_version through the current IDC version. |
Thanks @bcli4d. That's very helpful info. |
@bcli4d the information about the series in the current release is included already in the main index within this package. The purpose of this query is to be able to resolve
Yes, that's what I thought. If you don't have that ready, I can figure it out.
@vkt1414 the main purpose of the additional delta table is to support download of items referenced from manifests from prior releases. I do not understand what you are referring in the context of reproducibility. For all purposes, the user will work with the main index, which is aligned with the IDC data release in a given package version. The delta index will be separate and will only be useful in case the user wants to know what changed across the versions.
No, I definitely do not want to have a single table that mixes multiple versions. This goes against the way IDC data is organized in BigQuery. |
enh: switch to dict for index overview and fix tests bug fixed test to fix bug on macOS test next test enh: add contributing guidelines Based on https://github.com/Slicer/Slicer/blob/main/CONTRIBUTING.md ENH: improve error reporting for unrecognized items from the manifest Re ImagingDataCommons#100 Whenever a crdc_series_uuid from the manifest is not matched to those known to the index, provide error message informing the user of what could be the reasons. Fixed error checking for unrecognized items in the validation function. Report unrecognized items independently of whether validation is requested or not. ENH: add test of the manifest that has mismatches replaced github release access with hardcoded indices overview wip wip
…ained table. enh: switch to dict for index overview and fix tests bug fixed test to fix bug on macOS test next test enh: add contributing guidelines Based on https://github.com/Slicer/Slicer/blob/main/CONTRIBUTING.md ENH: improve error reporting for unrecognized items from the manifest Re ImagingDataCommons#100 Whenever a crdc_series_uuid from the manifest is not matched to those known to the index, provide error message informing the user of what could be the reasons. Fixed error checking for unrecognized items in the validation function. Report unrecognized items independently of whether validation is requested or not. ENH: add test of the manifest that has mismatches replaced github release access with hardcoded indices overview wip wip
You can only exclude the series that are new in the current release. Any series that were in the previous release but unchanged in the current release must be included in your table. So the table you have in mind will have most, but not all, of the series in the current release. Seems like that could be confusing. |
Interesting. The result of that query contains only 56285 rows, which seems to mean that it does not include most or all of the items from the current release. Maybe we should sit together to define the query to fit my needs? |
|
I'm able to generate the list of seriesInstanceUIDs along with strictly necessary metadata completely using public facing tables. Here are my observations during this process:
this table did not seem to have been updated to v18. As for implementing this feature, I submitted a PR to idc-index-data repo to write the results of this query ( see here). Once we have it available, we can use union_by_name in duckdb to union the main index, and then we can download data from any public facing idc-version ever existed. query to find series not present in v18:
query to find series along with necessary metadata for implementing this feature:
|
@bcli4d can you confirm if this is correct? |
It turns out that seriesInstanceUID is not reliable as multiple crdc_series_uuid's have the same seriesInstanceUID. so there are now a total of 45004 crdc_series_uuids not present in current dicom_all a/c to public tables, but there are 45016 crdc_series_uuids a/c to I'm investigating about those 12 discrepant series uuids. ccb73a3d-d65b-442e-91c0-212621480366
|
There is a lot to digest here, but a couple of quick comments:
I don't have time now to study the SQL above. In general, I recommend using all_joined_public joined with a table of (crdc_series_uuid, Modality, whatever). |
Once the aforementioned PR is integrated, I propose the following. We need to support download by either IDs that may correspond to content that can change across versions (collection ID, patient ID, study/series UID) (DICOM identifiers), or by series_instance_uuid from the manifest (UUID). I think we should treat the new |
@vkt1414 can you please let me know if you have any concerns or questions about the above? |
I don't have any problem with the overall idea. Implementation is more convoluted than one may think, as this touches a lot of methods. I'm working on this in my fork, and have made progress. Downloading a manifest with previous idc-data works, under default settings. Working on s5cmd sync now, and then plan to work on download by uuid. Current implementation of filtering is not ideal, https://github.com/ImagingDataCommons/idc-index/blob/main/idc_index/index.py#L114 does not care if some values passed are invalid. Will fix that and query previous index when needed. |
We don't necessarily need to update all methods at once. I would first update the calls that process download from manifest and resolve CRDC UUIDs. In the following improvements we can handle DICOM UIDs. This would also make code review easier. |
As experienced by @bcli4d, perfectly valid IDC manifests that point to items available in prior versions of IDC would not work with
idc-index
download functionality.I think we should:
The text was updated successfully, but these errors were encountered: