bug: fix check for installed index

enh: switch to dict for index overview and fix tests bug fixed test to fix bug on macOS test next test enh: add contributing guidelines Based on https://github.com/Slicer/Slicer/blob/main/CONTRIBUTING.md ENH: improve error reporting for unrecognized items from the manifest Re ImagingDataCommons#100 Whenever a crdc_series_uuid from the manifest is not matched to those known to the index, provide error message informing the user of what could be the reasons. Fixed error checking for unrecognized items in the validation function. Report unrecognized items independently of whether validation is requested or not. ENH: add test of the manifest that has mismatches replaced github release access with hardcoded indices overview wip wip
DanielaSchacherer · Jul 31, 2024 · 1d69823 · 1d69823
1 parent c52a2a1
commit 1d69823
Show file tree

Hide file tree

Showing 5 changed files with 169 additions and 556 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,122 @@
+# Contributing to idc-index
+
+There are many ways to contribute to idc-index, with varying levels of effort.
+Do try to look through the [documentation](idc-index-docs) first if something is
+unclear, and let us know how we can do better.
+
+- Ask a question on the [IDC forum][idc-forum]
+- Use [idc-index issues][idc-index-issues] to submit a feature request or bug,
+  or add to the discussion on an existing issue
+- Submit a [Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls)
+  to improve idc-index or its documentation
+
+We encourage a range of Pull Requests, from patches that include passing tests
+and documentation, all the way down to half-baked ideas that launch discussions.
+
+## The PR Process, Circle CI, and Related Gotchas
+
+### How to submit a PR ?
+
+If you are new to idc-index development and you don't have push access to the
+repository, here are the steps:
+
+1. [Fork and clone](https://docs.github.com/get-started/quickstart/fork-a-repo)
+   the repository.
+2. Create a branch dedicated to the feature/bugfix you plan to implement (do not
+   use `main` branch - this will complicate further development and
+   collaboration)
+3. [Push](https://docs.github.com/get-started/using-git/pushing-commits-to-a-remote-repository)
+   the branch to your GitHub fork.
+4. Create a
+   [Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls).
+
+This corresponds to the `Fork & Pull Model` described in the
+[GitHub collaborative development](https://docs.github.com/pull-requests/collaborating-with-pull-requests/getting-started/about-collaborative-development-models)
+documentation.
+
+When submitting a PR, the developers following the project will be notified.
+That said, to engage specific developers, you can add `Cc: @<username>` comment
+to notify them of your awesome contributions. Based on the comments posted by
+the reviewers, you may have to revisit your patches.
+
+### How to efficiently contribute ?
+
+We encourage all developers to:
+
+- set up pre-commit hooks so that you can remedy various formatting and other
+  issues early, without waiting for the continuous integration (CI) checks to
+  complete: `pre-commit install`
+
+- add or update tests. You can see current tests
+  [here](https://github.com/ImagingDataCommons/idc-index/tree/main/tests). If
+  you contribute new functionality, adding test(s) covering it is mandatory!
+
+- you can run individual tests from the root repository using the following
+  command: `python -m unittest -vv tests.idcindex.TestIDCClient.<test_name>`
+
+### How to write commit messages ?
+
+Write your commit messages using the standard prefixes for commit messages:
+
+- `BUG:` Fix for runtime crash or incorrect result
+- `COMP:` Compiler error or warning fix
+- `DOC:` Documentation change
+- `ENH:` New functionality
+- `PERF:` Performance improvement
+- `STYLE:` No logic impact (indentation, comments)
+- `WIP:` Work In Progress not ready for merge
+
+The body of the message should clearly describe the motivation of the commit
+(**what**, **why**, and **how**). In order to ease the task of reviewing
+commits, the message body should follow the following guidelines:
+
+1. Leave a blank line between the subject and the body. This helps `git log` and
+   `git rebase` work nicely, and allows to smooth generation of release notes.
+2. Try to keep the subject line below 72 characters, ideally 50.
+3. Capitalize the subject line.
+4. Do not end the subject line with a period.
+5. Use the imperative mood in the subject line (e.g.
+   `BUG: Fix spacing not being considered.`).
+6. Wrap the body at 80 characters.
+7. Use semantic line feeds to separate different ideas, which improves the
+   readability.
+8. Be concise, but honor the change: if significant alternative solutions were
+   available, explain why they were discarded.
+9. If the commit refers to a topic discussed on the [IDC forum][idc-forum], or
+   fixes a regression test, provide the link. If it fixes a compiler error,
+   provide a minimal verbatim message of the compiler error. If the commit
+   closes an issue, use the
+   [GitHub issue closing keywords](https://docs.github.com/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue).
+
+Keep in mind that the significant time is invested in reviewing commits and
+_pull requests_, so following these guidelines will greatly help the people
+doing reviews.
+
+These guidelines are largely inspired by Chris Beam's
+[How to Write a Commit Message](https://chris.beams.io/posts/git-commit/) post.
+
+### How to integrate a PR ?
+
+Getting your contributions integrated is relatively straightforward, here is the
+checklist:
+
+- All tests pass
+- Consensus is reached. This usually means that at least two reviewers approved
+  the changes (or added a `LGTM` comment) and at least one business day passed
+  without anyone objecting. `LGTM` is an acronym for _Looks Good to Me_.
+- To accommodate developers explicitly asking for more time to test the proposed
+  changes, integration time can be delayed by few more days.
+- If you do NOT have push access, a core developer will integrate your PR. If
+  you would like to speed up the integration, do not hesitate to add a reminder
+  comment to the PR
+
+### Automatic testing of pull requests
+
+Every pull request is tested automatically using GitHub Actions each time you
+push a commit to it. The GitHub UI will restrict users from merging pull
+requests until the CI build has returned with a successful result indicating
+that all tests have passed.
+
+[idc-forum]: https://discourse.canceridc.dev
+[idc-index-issues]: https://github.com/ImagingDataCommons/idc-index/issues
+[idc-index-docs]: https://idc-index.readthedocs.io/
diff --git a/idc_index/index.py b/idc_index/index.py
@@ -21,6 +21,7 @@
 
 aws_endpoint_url = "https://s3.amazonaws.com"
 gcp_endpoint_url = "https://storage.googleapis.com"
+asset_endpoint_url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"
 
 logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
 logger = logging.getLogger(__name__)
@@ -67,7 +68,24 @@ def __init__(self):
         self.collection_summary = self.index.groupby("collection_id").agg(
             {"Modality": pd.Series.unique, "series_size_MB": "sum"}
         )
-        self.indices_overview = self.list_indices()
+
+        self.indices_overview = pd.DataFrame(
+            {
+                "index": {"description": None, "installed": True, "url": None},
+                "sm_index": {
+                    "description": None,
+                    "installed": True,
+                    "url": os.path.join(asset_endpoint_url, "sm_index.parquet"),
+                },
+                "sm_instance_index": {
+                    "description": None,
+                    "installed": True,
+                    "url": os.path.join(
+                        asset_endpoint_url, "sm_instance_index.parquet"
+                    ),
+                },
+            }
+        )
 
         # Lookup s5cmd
         self.s5cmdPath = shutil.which("s5cmd")
@@ -172,33 +190,6 @@ def get_idc_version():
         idc_version = Version(idc_index_data.__version__).major
         return f"v{idc_version}"
 
-    @staticmethod
-    def _get_latest_idc_index_data_release_assets():
-        """
-        Retrieves a list of the latest idc-index-data release assets.
-
-        Returns:
-            release_assets (list): List of tuples (asset_name, asset_url).
-        """
-        release_assets = []
-        url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"
-        try:
-            response = requests.get(url, timeout=30)
-            if response.status_code == 200:
-                release_data = response.json()
-                assets = release_data.get("assets", [])
-                for asset in assets:
-                    release_assets.append(
-                        (asset["name"], asset["browser_download_url"])
-                    )
-            else:
-                logger.error(f"Failed to fetch releases: {response.status_code}")
-
-        except FileNotFoundError:
-            logger.error(f"Failed to fetch releases: {response.status_code}")
-
-        return release_assets
-
     def list_indices(self):
         """
         Lists all available indices including their installation status.
@@ -207,40 +198,6 @@ def list_indices(self):
             indices_overview (pd.DataFrame): DataFrame containing information per index.
         """
 
-        if "indices_overview" not in locals():
-            indices_overview = {}
-            # Find installed indices
-            for file in distribution("idc-index-data").files:
-                if str(file).endswith("index.parquet"):
-                    index_name = os.path.splitext(
-                        str(file).rsplit("/", maxsplit=1)[-1]
-                    )[0]
-
-                    indices_overview[index_name] = {
-                        "description": None,
-                        "installed": True,
-                        "local_path": os.path.join(
-                            idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
-                            f"{index_name}.parquet",
-                        ),
-                    }
-
-            # Find available indices from idc-index-data
-            release_assets = self._get_latest_idc_index_data_release_assets()
-            for asset_name, asset_url in release_assets:
-                if asset_name.endswith(".parquet"):
-                    asset_name = os.path.splitext(asset_name)[0]
-                    if asset_name not in indices_overview:
-                        indices_overview[asset_name] = {
-                            "description": None,
-                            "installed": False,
-                            "url": asset_url,
-                        }
-
-            self.indices_overview = pd.DataFrame.from_dict(
-                indices_overview, orient="index"
-            )
-
         return self.indices_overview
 
     def fetch_index(self, index) -> None:
@@ -251,23 +208,22 @@ def fetch_index(self, index) -> None:
             index (str): Name of the index to be downloaded.
         """
 
-        if index not in self.indices_overview.index.tolist():
+        if index not in self.indices_overview.keys():
             logger.error(f"Index {index} is not available and can not be fetched.")
-        elif self.indices_overview.loc[index, "installed"]:
+        elif self.indices_overview[index]["installed"]:
             logger.warning(
                 f"Index {index} already installed and will not be fetched again."
             )
         else:
-            response = requests.get(self.indices_overview.loc[index, "url"], timeout=30)
+            response = requests.get(self.indices_overview[index]["url"], timeout=30)
             if response.status_code == 200:
                 filepath = os.path.join(
                     idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
                     f"{index}.parquet",
                 )
                 with open(filepath, mode="wb") as file:
                     file.write(response.content)
-                self.indices_overview.loc[index, "installed"] = True
-                self.indices_overview.loc[index, "local_path"] = filepath
+                self.indices_overview[index]["installed"] = True
             else:
                 logger.error(f"Failed to fetch index: {response.status_code}")
 
@@ -668,8 +624,8 @@ def _validate_update_manifest_and_get_download_size(
         # create a copy of the index
         index_df_copy = self.index
 
-        # Extract s3 url and crdc_instance_uuid from the manifest copy commands
-        # Next, extract crdc_instance_uuid from aws_series_url in the index and
+        # Extract s3 url and crdc_series_uuid from the manifest copy commands
+        # Next, extract crdc_series_uuid from aws_series_url in the index and
         # try to verify if every series in the manifest is present in the index
 
         # TODO: need to remove the assumption that manifest commands will have 'cp'
@@ -697,8 +653,9 @@ def _validate_update_manifest_and_get_download_size(
                 seriesInstanceuid,
                 s3_url,
                 series_size_MB,
-                index_crdc_series_uuid==manifest_crdc_series_uuid AS crdc_series_uuid_match,
+                index_crdc_series_uuid is not NULL as crdc_series_uuid_match,
                 s3_url==series_aws_url AS s3_url_match,
+                manifest_temp.manifest_cp_cmd,
             CASE
                 WHEN s3_url==series_aws_url THEN 'aws'
             ELSE
@@ -717,19 +674,23 @@ def _validate_update_manifest_and_get_download_size(
 
         endpoint_to_use = None
 
-        if validate_manifest:
-            # Check if crdc_instance_uuid is found in the index
-            if not all(merged_df["crdc_series_uuid_match"]):
-                missing_manifest_cp_cmds = merged_df.loc[
-                    ~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
-                ]
-                missing_manifest_cp_cmds_str = f"The following manifest copy commands do not have any associated series in the index: {missing_manifest_cp_cmds.tolist()}"
-                raise ValueError(missing_manifest_cp_cmds_str)
+        # Check if any crdc_series_uuid are not found in the index
+        if not all(merged_df["crdc_series_uuid_match"]):
+            missing_manifest_cp_cmds = merged_df.loc[
+                ~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
+            ]
+            logger.error(
+                "The following manifest copy commands are not recognized as referencing any associated series in the index.\n"
+                "This means either these commands are invalid, or they may correspond to files available in a release of IDC\n"
+                f"different from {self.get_idc_version()} used in this version of idc-index. The corresponding files will not be downloaded.\n"
+            )
+            logger.error("\n" + "\n".join(missing_manifest_cp_cmds.tolist()))
 
-            # Check if there are more than one endpoints
+        if validate_manifest:
+            # Check if there is more than one endpoint
             if len(merged_df["endpoint"].unique()) > 1:
                 raise ValueError(
-                    "Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls. If so, please use urls from one provider only"
+                    "Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls. "
                 )
 
             if (

diff --git a/pyproject.toml b/pyproject.toml
@@ -125,6 +125,7 @@ disallow_incomplete_defs = true
 
 [tool.ruff]
 src = ["idc_index"]
+extend-exclude = ["./CONTRIBUTING.md"]
 
 [tool.ruff.lint]
 extend-select = [