Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue/120/check estimator columns #164

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

hangqianjun
Copy link
Contributor

@hangqianjun hangqianjun commented Sep 6, 2024

Problem & Solution Description (including issue #)

This PR addresses #120 and implement the check_columns function in tables_io into the rail base classes.

Code Quality

  • My code follows the code style of this project
  • I have written unit tests or justified all instances of #pragma: no cover; in the case of a bugfix, a new test that breaks as a result of the bug has been added
  • My code contains relevant comments and necessary documentation for future maintainers; the change is reflected in applicable demos/tutorials (with output cleared!) and added/updated docstrings use the NumPy docstring format
  • Any breaking changes, described above, are accompanied by backwards compatibility and deprecation warnings

Copy link

codecov bot commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.38%. Comparing base (30a570f) to head (ee912d9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #164      +/-   ##
==========================================
+ Coverage   98.35%   98.38%   +0.02%     
==========================================
  Files          45       45              
  Lines        2497     2536      +39     
==========================================
+ Hits         2456     2495      +39     
  Misses         41       41              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hangqianjun hangqianjun marked this pull request as ready for review September 19, 2024 13:42
@hangqianjun
Copy link
Contributor Author

hangqianjun commented Sep 19, 2024

Some notes with the PR (and my confusions):

  • The check_columns functions (they are all called similar names) defined in tables_io as well as inrail.core.data and rail.core.stages all have **kwargs which are unspecified parameters needed when reading in the files, dependent on the file type. These parameters for now are not included, registered, or propagated through RAIL when running interactively or from a ceci pipeline. So, when these check_columns functions are called in a specific stage (informer & estimator), the **kwargs are omitted. This should be considered later on.
  • The _check_column_names() function in rail.core.stages allows the input data to be a) a datahandle with just path, b) a datahandle with data, and c) an actual table of data. When running interactive mode, all stages will call set_data(), which will always give case b). I'm not sure when will cases a) and c) be required, perhaps when running in pipeline mode?
  • I have added methods for check_columns for TableHandle, but not for qp files yet. Would be great if someone can give some suggestions on whether this is needed for qp files.
  • Thenvalidate() function doens't exist in the ceci public release yet, so the dependence is changed to the ceci main branch in order to have the tests pass. We can wait until ceci has made a release to merge this.

Copy link

@JaimeRZP JaimeRZP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments a more senior member should review the bigger picture.


@classmethod
def _check_data_columns(cls, path, columns_to_check, parent_groupname=None, **kwargs):
raise NotImplementedError # pragma: no cover

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some documentation of what the fields are meant to be would be useful:
(what are the cls, path to what, ... etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added explanation of these parameters.

@@ -481,3 +481,43 @@ def _finalize_tag(self, tag):
final_name = PipelineStage._finalize_tag(self, tag)
handle.path = final_name
return final_name

def _check_column_names(self, data, columns_to_check, **kwargs):
try:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the try here?
If get fails, it will always default to self.config.hdf5_groupname, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe the try is in case self.config.hdf5_groupname is not defined

else:
col_list = list(data[groupname].keys())
# check columns
intersection = set(columns_to_check).intersection(col_list)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

self._check_column_names(data, self.stage_columns)

def _get_stage_columns(self):
self.stage_columns=[]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just:
self.stage_columns = [self.config.column_name]

Also, this might lead to confusion between the names and the columns themselves.

Copy link
Contributor Author

@hangqianjun hangqianjun Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to above. Indeed the names are confusing...

Copy link

@empEvil empEvil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine from my end, the valiate() calls looks to make sense. I might suggest adding more comments on the functions, this will make it easier for the future to understand the ideas implemented

@@ -71,3 +71,16 @@ def _process_chunk(self, start, end, data, first):
)
qp_d.set_ancil(dict(zmode=zmode))
self._do_chunk_output(qp_d, start, end, first)


def validate(self):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit more descriptive of what you intend to validate with this function might be good for future code readers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a description.

self.set_data("input", training_data)
self.validate()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth adding to line 102/104 that validate also needs to be defined in the sub-class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now added.

@@ -20,7 +20,8 @@ dependencies = [
"pyyaml",
"numpy<2.0.0",
"click",
"ceci>=2.0.1",
#"ceci>=2.0.1",
"ceci@git+https://github.com/LSSTDESC/ceci",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this will prevent you from pushing this to pypi. I think you will need to tag & release a version of ceci to be able to do ath.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will change this once ceci has rolled a new release (sometime soon maybe? @empEvil)

Copy link
Collaborator

@eacharles eacharles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, but I think you will need to fix the pyproject.toml file to include a version of ceci before you can push this to pypi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants