Skip to content
This repository has been archived by the owner on Dec 18, 2023. It is now read-only.

Add helper function to container base class to replace NaNs with NoneType to accommodate JSON outputs #3

Merged
merged 27 commits into from
Nov 28, 2022

Conversation

esherman-credo
Copy link
Contributor

JSON cannot support NaN type. Need recast all pandas dataframes with Nones rather than NaNs before converting to evidence.

This changes does NOT check for NaNs in non-DataFrame evidences (e.g. deepchecks) and that may pose an issue down the road (not an issue at the moment, as we are hand-crafting a return DataFrame for deepchecks results)

Change description

Added a helper function to the container base class to replace NaNs.

Modified each EvidenceContainer class to call the helper at the start of get_evidence() function call

Type of change

  • [ X] Bug fix (fixes an issue)

…Type to accommodate JSON outputs. JSON cannot support NaN type
@@ -14,6 +14,7 @@ def __init__(self, data, labels: dict = None, metadata: dict = None):
super().__init__(DataProfilerEvidence, data, labels, metadata)

def to_evidence(self, **metadata):
self.remove_NaNs()
Copy link
Collaborator

@IanAtCredo IanAtCredo Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataProfiler data is not a pandas dataframe. It's a pandas profiler. get_description returns a dictionary.

You can do something like:
scrubbed_data = self.remove_NaNs(self._data.get_description())

then pass the scrubbed_data.

@@ -61,6 +62,15 @@ def _validate_inputs(self, data):
def _validate(self, data):
pass

def remove_NaNs(self):
Copy link
Collaborator

@IanAtCredo IanAtCredo Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may need to be more robust. What happens when this is a dictionary? Or a dictionary of dictionaries? Or a list of dictionaries? For those non-pandas cases, I'd probably use a recursive function. Not sure if this is the best way, but this is how I coded up the check_subset function

Also, the way you use it (always calling it in to_evidence) you don't need to change _data in place. Just have it do the transformation on the data, and return the cleaned data for export.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still working on the data profiler sanitization. It is complicated because it contains nested dictionaries and then some DataFrames/Series within those dictionaries. It's probably safe to assume those DataFrames aren't further nested (i.e. they just contain elementary types) but if we don't want to assume that we'll need a very complicated sanitizer.

On further thought, I changed the paradigm of the call. This should be a forced sanitization so I've moved the call to the base class EvidenceContainer's init function. The function is now abstract and forces subclasses to implement it. This will help prevent some future developer from implementing a new evidence while forgetting to sanitize for JSONs.

I don't really see the harm in having self._data reflect a de-NaNified structure. As currently written, Evidence always gets converted to a JSON. The prior implementation isn't really "in place" --> I was passing the _data object on the RHS and the return was a copy assigned to the same variable name. The way I'm doing it now precludes that by just sanitizing at the start (sanitizing a copy so we don't have to worry about deep copy issues).

Copy link
Collaborator

@IanAtCredo IanAtCredo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments with more details but:

  • Believe this function can be more robust
  • Not applied properly to PandasProfiler
  • Doesn't need to modify the data in place.

…class. Each class must implement its own NaN sanitization function to ensure future evidences don't forget to do so.
@esherman-credo
Copy link
Contributor Author

Not sure why the Lint rule is failing. Some Node.js error?

@esherman-credo
Copy link
Contributor Author

New solution for data_profiler uses dictionary helper in the containers base file. It's...not pretty.

Similarity between dictionary helper and list helper is unfortunate. Not clear if there's a workaround though, since one relies on a class function (data.items()) and the other relies on a function applied to a list (enumerate(data)). Tried thinking about ways to store or pass those as a function object func but the latter requires a call when it's declared and it seems like either way you'd end up with some stupid-large if-statement

@esherman-credo esherman-credo marked this pull request as draft November 21, 2022 15:56
@esherman-credo
Copy link
Contributor Author

Encountered a bug. Converting to draft.

@esherman-credo
Copy link
Contributor Author

Now fixes #4

@esherman-credo esherman-credo marked this pull request as ready for review November 21, 2022 17:36
@esherman-credo esherman-credo requested review from IanAtCredo and removed request for IanAtCredo November 28, 2022 15:15
Copy link
Collaborator

@IanAtCredo IanAtCredo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Please review my changes and see if they still work and I didn't miss anything. I've tested on your integration notebook.

Copy link
Contributor Author

@esherman-credo esherman-credo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

It won't let me approve since I'm original PR author Nevermind once you approved it's good.

@esherman-credo esherman-credo merged commit c816b2c into develop Nov 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants