Issue/120/check estimator columns #164

hangqianjun · 2024-09-06T15:54:32Z

Problem & Solution Description (including issue #)

This PR addresses #120 and integrates the check_columns function from tables_io into the RAIL base classes.

Code Quality

My code follows the code style of this project
I have written unit tests or justified all instances of #pragma: no cover; in the case of a bugfix, a new test that breaks as a result of the bug has been added
My code contains relevant comments and necessary documentation for future maintainers; the change is reflected in applicable demos/tutorials (with output cleared!) and added/updated docstrings use the NumPy docstring format
Any breaking changes, described above, are accompanied by backwards compatibility and deprecation warnings

codecov · 2024-09-18T15:14:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.38%. Comparing base (30a570f) to head (ee912d9).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #164      +/-   ##
==========================================
+ Coverage   98.35%   98.38%   +0.02%     
==========================================
  Files          45       45              
  Lines        2497     2536      +39     
==========================================
+ Hits         2456     2495      +39     
  Misses         41       41


hangqianjun · 2024-09-19T14:02:24Z

Some notes with the PR (and my confusions):

The check_columns functions (they all have similar names) defined in tables_io, as well as in rail.core.data and rail.core.stage, all take **kwargs: unspecified parameters needed when reading the files, which depend on the file type. For now these parameters are not included, registered, or propagated through RAIL, whether running interactively or from a ceci pipeline. So when these check_columns functions are called in a specific stage (informer and estimator), the **kwargs are omitted. This should be revisited later.
The _check_column_names() function in rail.core.stage accepts input data as a) a data handle with just a path, b) a data handle with data, or c) an actual table of data. In interactive mode, all stages call set_data(), which always gives case b). I'm not sure when cases a) and c) would be required; perhaps when running in pipeline mode?
I have added check_columns methods for TableHandle, but not yet for qp files. It would be great if someone could advise on whether this is needed for qp files.
The validate() function doesn't exist in the ceci public release yet, so the dependency is changed to the ceci main branch so that the tests pass. We can wait until ceci has made a release before merging this.
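
To make the three input cases in the second note concrete, here is a minimal sketch of how they could be told apart. `SketchHandle` and `classify_input` are hypothetical names invented for illustration; they are not the RAIL or tables_io classes, and the real `_check_column_names()` may dispatch differently.

```python
class SketchHandle:
    """Hypothetical stand-in for a data handle (not RAIL's DataHandle)."""

    def __init__(self, path=None, data=None):
        self.path = path
        self.data = data


def classify_input(data):
    """Return 'a', 'b' or 'c' following the case labels used in the note."""
    if isinstance(data, SketchHandle):
        # b) the handle already holds data; a) it only knows a path
        return "b" if data.data is not None else "a"
    # c) anything else is treated as an in-memory table
    return "c"
```

Under these assumptions, case a) is the only one that would need to open the file to read column names, which is where the file-type-dependent **kwargs would come into play.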

JaimeRZP

Some minor comments; a more senior member should review the bigger picture.

JaimeRZP · 2024-09-19T13:57:29Z

src/rail/core/data.py

+
+    @classmethod
+    def _check_data_columns(cls, path, columns_to_check, parent_groupname=None, **kwargs):
+        raise NotImplementedError # pragma: no cover


Some documentation of what the parameters are meant to be would be useful
(what cls is, what path points to, etc.).

Added explanation of these parameters.

JaimeRZP · 2024-09-19T13:59:07Z

src/rail/core/stage.py

@@ -481,3 +481,43 @@ def _finalize_tag(self, tag):
        final_name = PipelineStage._finalize_tag(self, tag)
        handle.path = final_name
        return final_name
+
+    def _check_column_names(self, data, columns_to_check, **kwargs):
+        try:


Why do we need the try here?
If get fails, it will always default to self.config.hdf5_groupname, right?

I think maybe the try is in case self.config.hdf5_groupname is not defined
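
For reference, a minimal sketch of handling a possibly undefined group name without a try block. It assumes the config behaves like a plain Python object with optional attributes, which may not hold for the real ceci config class; `resolve_groupname` is a hypothetical helper, not part of this PR.

```python
from types import SimpleNamespace


def resolve_groupname(config, default=None):
    """Return config.hdf5_groupname, or `default` if it is not defined."""
    # getattr with a default covers the "attribute not defined" case
    # that the try block appears to guard against.
    return getattr(config, "hdf5_groupname", default)


# Stand-in configs for illustration only.
with_group = SimpleNamespace(hdf5_groupname="photometry")
without_group = SimpleNamespace()
```

Here `resolve_groupname(with_group)` gives "photometry" and `resolve_groupname(without_group)` gives None.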

JaimeRZP · 2024-09-19T14:00:20Z

src/rail/core/stage.py

+                else:
+                    col_list = list(data[groupname].keys())
+            # check columns
+            intersection = set(columns_to_check).intersection(col_list)
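
A minimal sketch of how the intersection in this hunk can be turned into a list of missing columns. `find_missing_columns` is a hypothetical helper for illustration; the code that follows this line in the PR is not shown here and may report the mismatch differently.

```python
def find_missing_columns(columns_to_check, col_list):
    """Return the requested columns that are absent from col_list."""
    intersection = set(columns_to_check).intersection(col_list)
    # Walk the requested list so the result keeps the caller's order.
    return [name for name in columns_to_check if name not in intersection]
```

An empty result means every requested column is present.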


JaimeRZP · 2024-09-19T14:04:14Z

src/rail/estimation/algos/random_gauss.py

+        self._check_column_names(data, self.stage_columns)
+
+    def _get_stage_columns(self):
+        self.stage_columns=[]


why not just:
self.stage_columns = [self.config.column_name]

Also, this might lead to confusion between the names and the columns themselves.

Changed to above. Indeed the names are confusing...

empEvil

Looks fine from my end; the validate() calls look sensible. I might suggest adding more comments on the functions, which will make it easier to understand the ideas implemented in the future.

empEvil · 2024-09-19T15:19:42Z

src/rail/estimation/algos/random_gauss.py

@@ -71,3 +71,16 @@ def _process_chunk(self, start, end, data, first):
        )
        qp_d.set_ancil(dict(zmode=zmode))
        self._do_chunk_output(qp_d, start, end, first)
+
+
+    def validate(self):


A bit more description of what you intend to validate with this function might be good for future code readers.

Added a description.

empEvil · 2024-09-19T15:20:58Z

src/rail/estimation/informer.py

        self.set_data("input", training_data)
+        self.validate()


It might be worth adding at line 102/104 that validate() also needs to be defined in the subclass.
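
A minimal sketch of the pattern being discussed: the base class calls validate() right after the data is set, and each subclass supplies its own check. All class and attribute names here are hypothetical, and the default behaviour of validate() in ceci may differ from the NotImplementedError used below.

```python
class SketchInformer:
    """Hypothetical base class; not RAIL's informer."""

    def __init__(self):
        self.data = None

    def validate(self):
        # Assumption for this sketch: subclasses must override this.
        raise NotImplementedError("subclasses define validate()")

    def inform(self, training_data):
        self.data = training_data
        self.validate()  # fail early, before any training work starts
        return "trained"


class ColumnInformer(SketchInformer):
    """Hypothetical subclass that checks for one required column."""

    required = ["mag_i"]

    def validate(self):
        missing = [c for c in self.required if c not in self.data]
        if missing:
            raise KeyError(f"missing columns: {missing}")
```

With this layout, `ColumnInformer().inform({"mag_i": [1.0]})` proceeds, while an input without `mag_i` raises before training begins.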

eacharles · 2024-09-19T17:07:21Z

pyproject.toml

@@ -20,7 +20,8 @@ dependencies = [
    "pyyaml",
    "numpy<2.0.0",
    "click",
-    "ceci>=2.0.1",
+    #"ceci>=2.0.1",
+    "ceci@git+https://github.com/LSSTDESC/ceci",


I think that this will prevent you from pushing this to PyPI. I think you will need to tag and release a version of ceci to be able to do that.

Yes, I will change this once ceci has rolled a new release (sometime soon maybe? @empEvil)

eacharles

Changes look good, but I think you will need to fix the pyproject.toml file to include a version of ceci before you can push this to pypi

hangqianjun added 8 commits September 6, 2024 08:49

adding relevant core functions

ff98394

adding relevant core functions in informer

7d17829

fix typo

d74000a

Adding some functionalities for column name checks in informer

3e4ac34

point to ceci github main branch rather than pypi

4033cbe

point to ceci github main branch rather than pypi

6574065

adding validate() in random_gauss

da0e459

making things right...

14d1772

hangqianjun added 11 commits September 18, 2024 08:27

adding validate() in estimator

6b36c8d

adding a new test for coverage

dca7c86

adding import os in test_algos

e2780ab

adding import RAILDIR

a9feb43

adding import PqHandle

b17a1ae

adding import pytest

0d5c9f8

adding comment

058e5fe

add test

6f2c5f5

split tests

6836a7a

adding another test for coverage

af2b439

fix type

3173b7f

hangqianjun marked this pull request as ready for review September 19, 2024 13:42

hangqianjun requested review from eacharles, aimalz and sschmidt23 September 19, 2024 13:42

JaimeRZP approved these changes Sep 19, 2024

empEvil approved these changes Sep 19, 2024

eacharles reviewed Sep 19, 2024

eacharles approved these changes Sep 19, 2024

Changes required by the comments

ee912d9

tms-epcc mentioned this pull request Oct 10, 2024

Continue leadership of RAIL topical team lsst-uk/photo-redshift-WP3.6#61

Open
