From 1d6982380c3378818c98e6ce288609b8c9349e39 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Fri, 26 Jul 2024 10:21:51 -0400
Subject: [PATCH] bug: fix check for installed index

enh: switch to dict for index overview and fix tests

bug fixed

test to fix bug on macOS

test

next test

enh: add contributing guidelines

Based on https://github.com/Slicer/Slicer/blob/main/CONTRIBUTING.md

ENH: improve error reporting for unrecognized items from the manifest

Re #100

Whenever a crdc_series_uuid from the manifest is not matched to those
known to the index, provide an error message informing the user of the
possible reasons. Fix error checking for unrecognized items in the
validation function. Report unrecognized items regardless of whether
validation is requested.

ENH: add test of the manifest that has mismatches

replaced github release access with hardcoded indices overview

wip

wip
---
 CONTRIBUTING.md                    | 122 ++++++++
 idc_index/index.py                 | 121 +++-----
 pyproject.toml                     |   1 +
 tests/idcindex.py                  | 476 -----------------------------
 tests/prior_version_manifest.s5cmd |   5 +
 5 files changed, 169 insertions(+), 556 deletions(-)
 create mode 100644 CONTRIBUTING.md
 delete mode 100644 tests/idcindex.py
 create mode 100644 tests/prior_version_manifest.s5cmd
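
Note (not part of the commit message): the manifest check described above
can be illustrated with a minimal plain-Python sketch. It only approximates
the idea; the patch itself performs the match in SQL against the index. The
function name `split_manifest_commands`, the regular expression, and the
`known_series_uuids` set are hypothetical and introduced for illustration.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumption: the series UUID is the path component right after the bucket,
# as in the commands of tests/prior_version_manifest.s5cmd.
_SERIES_UUID_RE = re.compile(
    r"^cp\s+(?:s3|gs)://[^/\s]+/([0-9a-f-]{36})/\*\s+\S+\s*$"
)


def split_manifest_commands(manifest_lines, known_series_uuids):
    """Partition copy commands by whether their series UUID is known."""
    recognized, unrecognized = [], []
    for line in manifest_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _SERIES_UUID_RE.match(line)
        if match and match.group(1) in known_series_uuids:
            recognized.append(line)
        else:
            unrecognized.append(line)
    # Report unrecognized items whether or not validation was requested.
    if unrecognized:
        logger.error(
            "Unrecognized manifest copy commands (will not be downloaded):\n%s",
            "\n".join(unrecognized),
        )
    return recognized, unrecognized


manifest = [
    "cp s3://idc-open-data/040fd3e1-0088-4bfd-8439-55e3c5d80a56/*  .",
    "cp s3://idc-open-data/04553d0f-1af9-414d-b631-cc31624aced5/*  .",
]
# Hypothetical index content: only the first series is known.
known_series_uuids = {"040fd3e1-0088-4bfd-8439-55e3c5d80a56"}
recognized, unrecognized = split_manifest_commands(manifest, known_series_uuids)
```

In this sketch the second command is logged as unrecognized, while the first
one would still proceed to download.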
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 00000000..5106ab32
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,122 @@
+# Contributing to idc-index
+
+There are many ways to contribute to idc-index, with varying levels of effort.
+Do try to look through the [documentation][idc-index-docs] first if something is
+unclear, and let us know how we can do better.
+
+- Ask a question on the [IDC forum][idc-forum]
+- Use [idc-index issues][idc-index-issues] to submit a feature request or bug,
+  or add to the discussion on an existing issue
+- Submit a [Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls)
+  to improve idc-index or its documentation
+
+We encourage a range of Pull Requests, from patches that include passing tests
+and documentation, all the way down to half-baked ideas that launch discussions.
+
+## The PR Process, GitHub Actions, and Related Gotchas
+
+### How to submit a PR?
+
+If you are new to idc-index development and you don't have push access to the
+repository, here are the steps:
+
+1. [Fork and clone](https://docs.github.com/get-started/quickstart/fork-a-repo)
+   the repository.
+2. Create a branch dedicated to the feature/bugfix you plan to implement (do not
+   use the `main` branch, as this will complicate further development and
+   collaboration).
+3. [Push](https://docs.github.com/get-started/using-git/pushing-commits-to-a-remote-repository)
+   the branch to your GitHub fork.
+4. Create a
+   [Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls).
+
+This corresponds to the `Fork & Pull Model` described in the
+[GitHub collaborative development](https://docs.github.com/pull-requests/collaborating-with-pull-requests/getting-started/about-collaborative-development-models)
+documentation.
+
+When you submit a PR, the developers following the project are notified. To
+engage specific developers, add a `Cc: @<username>` comment to notify them of
+your awesome contributions. Based on the comments posted by the reviewers, you
+may have to revise your patches.
+
+### How to contribute efficiently?
+
+We encourage all developers to:
+
+- set up pre-commit hooks so that you can remedy various formatting and other
+  issues early, without waiting for the continuous integration (CI) checks to
+  complete: `pre-commit install`
+
+- add or update tests. You can see current tests
+  [here](https://github.com/ImagingDataCommons/idc-index/tree/main/tests). If
+  you contribute new functionality, adding test(s) covering it is mandatory!
+
+- run individual tests from the root of the repository using the following
+  command: `python -m unittest -vv tests.idcindex.TestIDCClient.<test_name>`
+
+### How to write commit messages?
+
+Write your commit messages using the standard prefixes for commit messages:
+
+- `BUG:` Fix for runtime crash or incorrect result
+- `COMP:` Compiler error or warning fix
+- `DOC:` Documentation change
+- `ENH:` New functionality
+- `PERF:` Performance improvement
+- `STYLE:` No logic impact (indentation, comments)
+- `WIP:` Work In Progress not ready for merge
+
+The body of the message should clearly describe the motivation of the commit
+(**what**, **why**, and **how**). To ease the task of reviewing commits, the
+message should adhere to the following guidelines:
+
+1. Leave a blank line between the subject and the body. This helps `git log` and
+   `git rebase` work nicely, and allows for smooth generation of release notes.
+2. Try to keep the subject line below 72 characters, ideally 50.
+3. Capitalize the subject line.
+4. Do not end the subject line with a period.
+5. Use the imperative mood in the subject line (e.g.
+   `BUG: Fix spacing not being considered`).
+6. Wrap the body at 80 characters.
+7. Use semantic line feeds to separate different ideas, which improves the
+   readability.
+8. Be concise, but honor the change: if significant alternative solutions were
+   available, explain why they were discarded.
+9. If the commit refers to a topic discussed on the [IDC forum][idc-forum], or
+   fixes a regression test, provide the link. If it fixes a compiler error,
+   provide a minimal verbatim message of the compiler error. If the commit
+   closes an issue, use the
+   [GitHub issue closing keywords](https://docs.github.com/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue).
+
+Keep in mind that significant time is invested in reviewing commits and
+_pull requests_, so following these guidelines will greatly help the people
+doing reviews.
+
+These guidelines are largely inspired by Chris Beams's
+[How to Write a Commit Message](https://chris.beams.io/posts/git-commit/) post.
+
+### How to integrate a PR?
+
+Getting your contributions integrated is relatively straightforward. Here is the
+checklist:
+
+- All tests pass
+- Consensus is reached. This usually means that at least two reviewers approved
+  the changes (or added an `LGTM` comment) and at least one business day passed
+  without anyone objecting. `LGTM` is an acronym for _Looks Good to Me_.
+- To accommodate developers explicitly asking for more time to test the proposed
+  changes, integration time can be delayed by a few more days.
+- If you do NOT have push access, a core developer will integrate your PR. If
+  you would like to speed up the integration, do not hesitate to add a reminder
+  comment to the PR.
+
+### Automatic testing of pull requests
+
+Every pull request is tested automatically using GitHub Actions each time you
+push a commit to it. The GitHub UI will restrict users from merging pull
+requests until the CI build has returned with a successful result indicating
+that all tests have passed.
+
+[idc-forum]: https://discourse.canceridc.dev
+[idc-index-issues]: https://github.com/ImagingDataCommons/idc-index/issues
+[idc-index-docs]: https://idc-index.readthedocs.io/
diff --git a/idc_index/index.py b/idc_index/index.py
index 8a5e33f7..4089342c 100644
--- a/idc_index/index.py
+++ b/idc_index/index.py
@@ -21,6 +21,7 @@
 
 aws_endpoint_url = "https://s3.amazonaws.com"
 gcp_endpoint_url = "https://storage.googleapis.com"
+asset_endpoint_url = f"https://github.com/ImagingDataCommons/idc-index-data/releases/download/{idc_index_data.__version__}"
 
 logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
 logger = logging.getLogger(__name__)
@@ -67,7 +68,24 @@ def __init__(self):
         self.collection_summary = self.index.groupby("collection_id").agg(
             {"Modality": pd.Series.unique, "series_size_MB": "sum"}
         )
-        self.indices_overview = self.list_indices()
+
+        self.indices_overview = pd.DataFrame(
+            {
+                "index": {"description": None, "installed": True, "url": None},
+                "sm_index": {
+                    "description": None,
+                    "installed": True,
+                    "url": f"{asset_endpoint_url}/sm_index.parquet",
+                },
+                "sm_instance_index": {
+                    "description": None,
+                    "installed": True,
+                    # Build URLs with "/" rather than os.path.join, which
+                    # would use backslashes on Windows.
+                    "url": f"{asset_endpoint_url}/sm_instance_index.parquet",
+                },
+            }
+        )
 
         # Lookup s5cmd
         self.s5cmdPath = shutil.which("s5cmd")
@@ -172,33 +190,6 @@ def get_idc_version():
         idc_version = Version(idc_index_data.__version__).major
         return f"v{idc_version}"
 
-    @staticmethod
-    def _get_latest_idc_index_data_release_assets():
-        """
-        Retrieves a list of the latest idc-index-data release assets.
-
-        Returns:
-            release_assets (list): List of tuples (asset_name, asset_url).
-        """
-        release_assets = []
-        url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"
-        try:
-            response = requests.get(url, timeout=30)
-            if response.status_code == 200:
-                release_data = response.json()
-                assets = release_data.get("assets", [])
-                for asset in assets:
-                    release_assets.append(
-                        (asset["name"], asset["browser_download_url"])
-                    )
-            else:
-                logger.error(f"Failed to fetch releases: {response.status_code}")
-
-        except FileNotFoundError:
-            logger.error(f"Failed to fetch releases: {response.status_code}")
-
-        return release_assets
-
     def list_indices(self):
         """
         Lists all available indices including their installation status.
@@ -207,40 +198,6 @@ def list_indices(self):
             indices_overview (pd.DataFrame): DataFrame containing information per index.
         """
 
-        if "indices_overview" not in locals():
-            indices_overview = {}
-            # Find installed indices
-            for file in distribution("idc-index-data").files:
-                if str(file).endswith("index.parquet"):
-                    index_name = os.path.splitext(
-                        str(file).rsplit("/", maxsplit=1)[-1]
-                    )[0]
-
-                    indices_overview[index_name] = {
-                        "description": None,
-                        "installed": True,
-                        "local_path": os.path.join(
-                            idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
-                            f"{index_name}.parquet",
-                        ),
-                    }
-
-            # Find available indices from idc-index-data
-            release_assets = self._get_latest_idc_index_data_release_assets()
-            for asset_name, asset_url in release_assets:
-                if asset_name.endswith(".parquet"):
-                    asset_name = os.path.splitext(asset_name)[0]
-                    if asset_name not in indices_overview:
-                        indices_overview[asset_name] = {
-                            "description": None,
-                            "installed": False,
-                            "url": asset_url,
-                        }
-
-            self.indices_overview = pd.DataFrame.from_dict(
-                indices_overview, orient="index"
-            )
-
         return self.indices_overview
 
     def fetch_index(self, index) -> None:
@@ -251,14 +208,14 @@ def fetch_index(self, index) -> None:
             index (str): Name of the index to be downloaded.
         """
 
-        if index not in self.indices_overview.index.tolist():
+        if index not in self.indices_overview.keys():
             logger.error(f"Index {index} is not available and can not be fetched.")
-        elif self.indices_overview.loc[index, "installed"]:
+        elif self.indices_overview[index]["installed"]:
             logger.warning(
                 f"Index {index} already installed and will not be fetched again."
             )
         else:
-            response = requests.get(self.indices_overview.loc[index, "url"], timeout=30)
+            response = requests.get(self.indices_overview[index]["url"], timeout=30)
             if response.status_code == 200:
                 filepath = os.path.join(
                     idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
@@ -266,8 +223,7 @@ def fetch_index(self, index) -> None:
                 )
                 with open(filepath, mode="wb") as file:
                     file.write(response.content)
-                self.indices_overview.loc[index, "installed"] = True
-                self.indices_overview.loc[index, "local_path"] = filepath
+                self.indices_overview[index]["installed"] = True
             else:
                 logger.error(f"Failed to fetch index: {response.status_code}")
 
@@ -668,8 +624,8 @@ def _validate_update_manifest_and_get_download_size(
         # create a copy of the index
         index_df_copy = self.index
 
-        # Extract s3 url and crdc_instance_uuid from the manifest copy commands
-        # Next, extract crdc_instance_uuid from aws_series_url in the index and
+        # Extract s3 url and crdc_series_uuid from the manifest copy commands
+        # Next, extract crdc_series_uuid from aws_series_url in the index and
         # try to verify if every series in the manifest is present in the index
 
         # TODO: need to remove the assumption that manifest commands will have 'cp'
@@ -697,8 +653,9 @@ def _validate_update_manifest_and_get_download_size(
                 seriesInstanceuid,
                 s3_url,
                 series_size_MB,
-                index_crdc_series_uuid==manifest_crdc_series_uuid AS crdc_series_uuid_match,
+                index_crdc_series_uuid IS NOT NULL AS crdc_series_uuid_match,
                 s3_url==series_aws_url AS s3_url_match,
+                manifest_temp.manifest_cp_cmd,
             CASE
                 WHEN s3_url==series_aws_url THEN 'aws'
             ELSE
@@ -717,19 +674,23 @@ def _validate_update_manifest_and_get_download_size(
 
         endpoint_to_use = None
 
-        if validate_manifest:
-            # Check if crdc_instance_uuid is found in the index
-            if not all(merged_df["crdc_series_uuid_match"]):
-                missing_manifest_cp_cmds = merged_df.loc[
-                    ~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
-                ]
-                missing_manifest_cp_cmds_str = f"The following manifest copy commands do not have any associated series in the index: {missing_manifest_cp_cmds.tolist()}"
-                raise ValueError(missing_manifest_cp_cmds_str)
+        # Check if any crdc_series_uuid is not found in the index
+        if not all(merged_df["crdc_series_uuid_match"]):
+            missing_manifest_cp_cmds = merged_df.loc[
+                ~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
+            ]
+            logger.error(
+                "The following manifest copy commands are not recognized as referencing any associated series in the index.\n"
+                "This means either these commands are invalid, or they may correspond to files available in a release of IDC\n"
+                f"different from {self.get_idc_version()} used in this version of idc-index. The corresponding files will not be downloaded.\n"
+            )
+            logger.error("\n" + "\n".join(missing_manifest_cp_cmds.tolist()))
 
-            # Check if there are more than one endpoints
+        if validate_manifest:
+            # Check if there is more than one endpoint
             if len(merged_df["endpoint"].unique()) > 1:
                 raise ValueError(
-                    "Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls. If so, please use urls from one provider only"
+                    "Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls."
                 )
 
             if (
diff --git a/pyproject.toml b/pyproject.toml
index a4d8c825..92304920 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -125,6 +125,7 @@ disallow_incomplete_defs = true
 
 [tool.ruff]
 src = ["idc_index"]
+extend-exclude = ["./CONTRIBUTING.md"]
 
 [tool.ruff.lint]
 extend-select = [
diff --git a/tests/idcindex.py b/tests/idcindex.py
deleted file mode 100644
index c0806134..00000000
--- a/tests/idcindex.py
+++ /dev/null
@@ -1,476 +0,0 @@
-from __future__ import annotations
-
-import logging
-import os
-import tempfile
-import unittest
-from itertools import product
-from pathlib import Path
-
-import pandas as pd
-import pytest
-from click.testing import CliRunner
-from idc_index import IDCClient, cli
-
-# Run tests using the following command from the root of the repository:
-# python -m unittest -vv tests/idcindex.py
-
-logging.basicConfig(level=logging.DEBUG)
-
-
-@pytest.fixture(autouse=True)
-def _change_test_dir(request, monkeypatch):
-    monkeypatch.chdir(request.fspath.dirname)
-
-
-class TestIDCClient(unittest.TestCase):
-    def setUp(self):
-        self.client = IDCClient()
-        self.download_from_manifest = cli.download_from_manifest
-        self.download_from_selection = cli.download_from_selection
-        self.download = cli.download
-
-        logger = logging.getLogger("idc_index")
-        logger.setLevel(logging.DEBUG)
-
-    def test_get_collections(self):
-        collections = self.client.get_collections()
-        self.assertIsNotNone(collections)
-
-    def test_get_idc_version(self):
-        idc_version = self.client.get_idc_version()
-        self.assertIsNotNone(idc_version)
-        self.assertTrue(idc_version.startswith("v"))
-
-    def test_get_patients(self):
-        # Define the values for each optional parameter
-        output_format_values = ["list", "dict", "df"]
-        collection_id_values = [
-            "htan_ohsu",
-            ["ct_phantom4radiomics", "cmb_gec"],
-        ]
-
-        # Test each combination
-        for collection_id in collection_id_values:
-            for output_format in output_format_values:
-                patients = self.client.get_patients(
-                    collection_id=collection_id, outputFormat=output_format
-                )
-
-                # Check if the output format matches the expected type
-                if output_format == "list":
-                    self.assertIsInstance(patients, list)
-                    self.assertTrue(bool(patients))  # Check that the list is not empty
-                elif output_format == "dict":
-                    self.assertTrue(
-                        isinstance(patients, dict)
-                        or (
-                            isinstance(patients, list)
-                            and all(isinstance(i, dict) for i in patients)
-                        )
-                    )  # Check that the output is either a dictionary or a list of dictionaries
-                    self.assertTrue(
-                        bool(patients)
-                    )  # Check that the output is not empty
-                elif output_format == "df":
-                    self.assertIsInstance(patients, pd.DataFrame)
-                    self.assertFalse(
-                        patients.empty
-                    )  # Check that the DataFrame is not empty
-
-    def test_get_studies(self):
-        # Define the values for each optional parameter
-        output_format_values = ["list", "dict", "df"]
-        patient_id_values = ["PCAMPMRI-00001", ["PCAMPMRI-00001", "NoduleLayout_1"]]
-
-        # Test each combination
-        for patient_id in patient_id_values:
-            for output_format in output_format_values:
-                studies = self.client.get_dicom_studies(
-                    patientId=patient_id, outputFormat=output_format
-                )
-
-                # Check if the output format matches the expected type
-                if output_format == "list":
-                    self.assertIsInstance(studies, list)
-                    self.assertTrue(bool(studies))  # Check that the list is not empty
-                elif output_format == "dict":
-                    self.assertTrue(
-                        isinstance(studies, dict)
-                        or (
-                            isinstance(studies, list)
-                            and all(isinstance(i, dict) for i in studies)
-                        )
-                    )  # Check that the output is either a dictionary or a list of dictionaries
-                    self.assertTrue(bool(studies))  # Check that the output is not empty
-                elif output_format == "df":
-                    self.assertIsInstance(studies, pd.DataFrame)
-                    self.assertFalse(
-                        studies.empty
-                    )  # Check that the DataFrame is not empty
-
-    def test_get_series(self):
-        """
-        Query used for selecting the smallest series/studies:
-
-        SELECT
-            StudyInstanceUID,
-            ARRAY_AGG(DISTINCT(collection_id)) AS collection,
-            ARRAY_AGG(DISTINCT(series_aws_url)) AS aws_url,
-            ARRAY_AGG(DISTINCT(series_gcs_url)) AS gcs_url,
-            COUNT(DISTINCT(SOPInstanceUID)) AS num_instances,
-            SUM(instance_size) AS series_size
-        FROM
-            `bigquery-public-data.idc_current.dicom_all`
-        GROUP BY
-            StudyInstanceUID
-        HAVING
-            num_instances > 2
-        ORDER BY
-            series_size asc
-        LIMIT
-            10
-        """
-        # Define the values for each optional parameter
-        output_format_values = ["list", "dict", "df"]
-        study_instance_uid_values = [
-            "1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118199124641098335511",
-            [
-                "1.3.6.1.4.1.14519.5.2.1.1239.1759.691327824408089993476361149761",
-                "1.3.6.1.4.1.14519.5.2.1.1239.1759.272272273744698671736205545239",
-            ],
-        ]
-
-        # Test each combination
-        for study_instance_uid in study_instance_uid_values:
-            for output_format in output_format_values:
-                series = self.client.get_dicom_series(
-                    studyInstanceUID=study_instance_uid, outputFormat=output_format
-                )
-
-                # Check if the output format matches the expected type
-                if output_format == "list":
-                    self.assertIsInstance(series, list)
-                    self.assertTrue(bool(series))  # Check that the list is not empty
-                elif output_format == "dict":
-                    self.assertTrue(
-                        isinstance(series, dict)
-                        or (
-                            isinstance(series, list)
-                            and all(isinstance(i, dict) for i in series)
-                        )
-                    )  # Check that the output is either a dictionary or a list of dictionaries
-                elif output_format == "df":
-                    self.assertIsInstance(series, pd.DataFrame)
-                    self.assertFalse(
-                        series.empty
-                    )  # Check that the DataFrame is not empty
-
-    def test_download_dicom_series(self):
-        with tempfile.TemporaryDirectory() as temp_dir:
-            self.client.download_dicom_series(
-                seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.7695.1700.153974929648969296590126728101",
-                downloadDir=temp_dir,
-            )
-            self.assertEqual(sum([len(files) for r, d, files in os.walk(temp_dir)]), 3)
-
-    def test_download_with_template(self):
-        dirTemplateValues = [
-            None,
-            "%collection_id_%PatientID/%Modality-%StudyInstanceUID%SeriesInstanceUID",
-            "%collection_id%PatientID-%Modality_%StudyInstanceUID/%SeriesInstanceUID",
-            "%collection_id-%PatientID_%Modality/%StudyInstanceUID-%SeriesInstanceUID",
-            "%collection_id_%PatientID/%Modality/%StudyInstanceUID_%SeriesInstanceUID",
-        ]
-        for template in dirTemplateValues:
-            with tempfile.TemporaryDirectory() as temp_dir:
-                self.client.download_from_selection(
-                    downloadDir=temp_dir,
-                    studyInstanceUID="1.3.6.1.4.1.14519.5.2.1.7695.1700.114861588187429958687900856462",
-                    dirTemplate=template,
-                )
-                self.assertEqual(
-                    sum([len(files) for r, d, files in os.walk(temp_dir)]), 3
-                )
-
-    def test_download_from_selection(self):
-        # Define the values for each optional parameter
-        dry_run_values = [True, False]
-        quiet_values = [True, False]
-        show_progress_bar_values = [True, False]
-        use_s5cmd_sync_values = [True, False]
-
-        # Generate all combinations of optional parameters
-        combinations = product(
-            dry_run_values,
-            quiet_values,
-            show_progress_bar_values,
-            use_s5cmd_sync_values,
-        )
-
-        # Test each combination
-        for (
-            dry_run,
-            quiet,
-            show_progress_bar,
-            use_s5cmd_sync,
-        ) in combinations:
-            with tempfile.TemporaryDirectory() as temp_dir:
-                self.client.download_from_selection(
-                    downloadDir=temp_dir,
-                    dry_run=dry_run,
-                    patientId=None,
-                    studyInstanceUID="1.3.6.1.4.1.14519.5.2.1.7695.1700.114861588187429958687900856462",
-                    seriesInstanceUID=None,
-                    quiet=quiet,
-                    show_progress_bar=show_progress_bar,
-                    use_s5cmd_sync=use_s5cmd_sync,
-                )
-
-                if not dry_run:
-                    self.assertNotEqual(len(os.listdir(temp_dir)), 0)
-
-    def test_sql_queries(self):
-        df = self.client.sql_query("SELECT DISTINCT(collection_id) FROM index")
-
-        self.assertIsNotNone(df)
-
-    def test_download_from_aws_manifest(self):
-        # Define the values for each optional parameter
-        quiet_values = [True, False]
-        validate_manifest_values = [True, False]
-        show_progress_bar_values = [True, False]
-        use_s5cmd_sync_values = [True, False]
-        dirTemplateValues = [
-            None,
-            "%collection_id/%PatientID/%Modality/%StudyInstanceUID/%SeriesInstanceUID",
-            "%collection_id%PatientID%Modality%StudyInstanceUID%SeriesInstanceUID",
-        ]
-        # Generate all combinations of optional parameters
-        combinations = product(
-            quiet_values,
-            validate_manifest_values,
-            show_progress_bar_values,
-            use_s5cmd_sync_values,
-            dirTemplateValues,
-        )
-        # Test each combination
-        for (
-            quiet,
-            validate_manifest,
-            show_progress_bar,
-            use_s5cmd_sync,
-            dirTemplate,
-        ) in combinations:
-            with tempfile.TemporaryDirectory() as temp_dir:
-                self.client.download_from_manifest(
-                    manifestFile="./study_manifest_aws.s5cmd",
-                    downloadDir=temp_dir,
-                    quiet=quiet,
-                    validate_manifest=validate_manifest,
-                    show_progress_bar=show_progress_bar,
-                    use_s5cmd_sync=use_s5cmd_sync,
-                    dirTemplate=dirTemplate,
-                )
-
-                if sum([len(files) for _, _, files in os.walk(temp_dir)]) != 9:
-                    print(
-                        f"Failed for {quiet} {validate_manifest} {show_progress_bar} {use_s5cmd_sync} {dirTemplate}"
-                    )
-                    self.assertFalse(True)
-
-    def test_download_from_gcp_manifest(self):
-        # Define the values for each optional parameter
-        quiet_values = [True, False]
-        validate_manifest_values = [True, False]
-        show_progress_bar_values = [True, False]
-        use_s5cmd_sync_values = [True, False]
-        dirTemplateValues = [
-            None,
-            "%collection_id/%PatientID/%Modality/%StudyInstanceUID/%SeriesInstanceUID",
-            "%collection_id_%PatientID_%Modality_%StudyInstanceUID_%SeriesInstanceUID",
-        ]
-        # Generate all combinations of optional parameters
-        combinations = product(
-            quiet_values,
-            validate_manifest_values,
-            show_progress_bar_values,
-            use_s5cmd_sync_values,
-            dirTemplateValues,
-        )
-
-        # Test each combination
-        for (
-            quiet,
-            validate_manifest,
-            show_progress_bar,
-            use_s5cmd_sync,
-            dirTemplate,
-        ) in combinations:
-            with tempfile.TemporaryDirectory() as temp_dir:
-                self.client.download_from_manifest(
-                    manifestFile="./study_manifest_gcs.s5cmd",
-                    downloadDir=temp_dir,
-                    quiet=quiet,
-                    validate_manifest=validate_manifest,
-                    show_progress_bar=show_progress_bar,
-                    use_s5cmd_sync=use_s5cmd_sync,
-                    dirTemplate=dirTemplate,
-                )
-
-                self.assertEqual(
-                    sum([len(files) for r, d, files in os.walk(temp_dir)]), 9
-                )
-
-    def test_download_from_bogus_manifest(self):
-        # Define the values for each optional parameter
-        quiet_values = [True, False]
-        validate_manifest_values = [True, False]
-        show_progress_bar_values = [True, False]
-        use_s5cmd_sync_values = [True, False]
-
-        # Generate all combinations of optional parameters
-        combinations = product(
-            quiet_values,
-            validate_manifest_values,
-            show_progress_bar_values,
-            use_s5cmd_sync_values,
-        )
-
-        # Test each combination
-        for (
-            quiet,
-            validate_manifest,
-            show_progress_bar,
-            use_s5cmd_sync,
-        ) in combinations:
-            with tempfile.TemporaryDirectory() as temp_dir:
-                self.client.download_from_manifest(
-                    manifestFile="./study_manifest_bogus.s5cmd",
-                    downloadDir=temp_dir,
-                    quiet=quiet,
-                    validate_manifest=validate_manifest,
-                    show_progress_bar=show_progress_bar,
-                    use_s5cmd_sync=use_s5cmd_sync,
-                )
-
-                self.assertEqual(len(os.listdir(temp_dir)), 0)
-
-    """
-    disabling these tests due to a consistent server timeout issue
-    def test_citations(self):
-        citations = self.client.citations_from_selection(
-            collection_id="tcga_gbm",
-            citation_format=index.IDCClient.CITATION_FORMAT_APA,
-        )
-        self.assertIsNotNone(citations)
-
-        citations = self.client.citations_from_selection(
-            seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.7695.4164.588007658875211151397302775781",
-            citation_format=index.IDCClient.CITATION_FORMAT_BIBTEX,
-        )
-        self.assertIsNotNone(citations)
-
-        citations = self.client.citations_from_selection(
-            studyInstanceUID="1.2.840.113654.2.55.174144834924218414213677353968537663991",
-            citation_format=index.IDCClient.CITATION_FORMAT_BIBTEX,
-        )
-        self.assertIsNotNone(citations)
-
-        citations = self.client.citations_from_manifest("./study_manifest_aws.s5cmd")
-        self.assertIsNotNone(citations)
-    """
-
-    def test_cli_download_from_selection(self):
-        runner = CliRunner()
-        with tempfile.TemporaryDirectory() as temp_dir:
-            result = runner.invoke(
-                self.download_from_selection,
-                [
-                    "--download-dir",
-                    temp_dir,
-                    "--dry-run",
-                    False,
-                    "--quiet",
-                    True,
-                    "--show-progress-bar",
-                    True,
-                    "--use-s5cmd-sync",
-                    False,
-                    "--study-instance-uid",
-                    "1.3.6.1.4.1.14519.5.2.1.7695.1700.114861588187429958687900856462",
-                ],
-            )
-            assert len(os.listdir(temp_dir)) != 0
-
-    def test_cli_download_from_manifest(self):
-        runner = CliRunner()
-        with tempfile.TemporaryDirectory() as temp_dir:
-            result = runner.invoke(
-                self.download_from_manifest,
-                [
-                    "--manifest-file",
-                    "./study_manifest_aws.s5cmd",
-                    "--download-dir",
-                    temp_dir,
-                    "--quiet",
-                    True,
-                    "--show-progress-bar",
-                    True,
-                    "--use-s5cmd-sync",
-                    False,
-                ],
-            )
-            assert len(os.listdir(temp_dir)) != 0
-
-    def test_singleton_attribute(self):
-        # singleton, initialized on first use
-        i1 = IDCClient.client()
-        i2 = IDCClient.client()
-
-        # new instances created via constructor (through init)
-        i3 = IDCClient()
-        i4 = self.client
-
-        # all must be not none
-        assert i1 is not None
-        assert i2 is not None
-        assert i3 is not None
-        assert i4 is not None
-
-        # singletons must return the same instance
-        assert i1 == i2
-
-        # new instances must be different
-        assert i1 != i3
-        assert i1 != i4
-        assert i3 != i4
-
-        # all must be instances of IDCClient
-        assert isinstance(i1, IDCClient)
-        assert isinstance(i2, IDCClient)
-        assert isinstance(i3, IDCClient)
-        assert isinstance(i4, IDCClient)
-
-    def test_cli_download(self):
-        runner = CliRunner()
-        with runner.isolated_filesystem():
-            result = runner.invoke(
-                self.download,
-                ["1.3.6.1.4.1.14519.5.2.1.7695.1700.114861588187429958687900856462"],
-            )
-            assert len(os.listdir(Path.cwd())) != 0
-
-    def test_list_indices(self):
-        i = IDCClient()
-        assert not i.indices_overview.empty  # assert that df was created
-
-    def test_fetch_index(self):
-        i = IDCClient()
-        assert i.indices_overview["sm_index", "installed"] is False
-        i.fetch_index("sm_index")
-        assert i.indices_overview["sm_index", "installed"] is True
-
-
-if __name__ == "__main__":
-    unittest.main()
diff --git a/tests/prior_version_manifest.s5cmd b/tests/prior_version_manifest.s5cmd
new file mode 100644
index 00000000..1c91a450
--- /dev/null
+++ b/tests/prior_version_manifest.s5cmd
@@ -0,0 +1,5 @@
+cp s3://idc-open-data/040fd3e1-0088-4bfd-8439-55e3c5d80a56/*  .
+cp s3://idc-open-data/04553d0f-1af9-414d-b631-cc31624aced5/*  .
+cp s3://idc-open-data/068346bf-16ef-4e45-87bf-87feb576a21c/*  .
+cp s3://idc-open-data/07908d47-5e85-45f3-9649-79c15f606f52/*  .
+cp s3://idc-open-data/099d180f-1d79-402d-abad-bfd8e2736b04/*  .