Add helper function to container base class to replace NaNs with NoneType to accommodate JSON outputs #3
…Type to accommodate JSON outputs. JSON cannot support NaN type
@@ -14,6 +14,7 @@ def __init__(self, data, labels: dict = None, metadata: dict = None):
        super().__init__(DataProfilerEvidence, data, labels, metadata)

    def to_evidence(self, **metadata):
        self.remove_NaNs()
DataProfiler data is not a pandas DataFrame. It's a pandas profiler. get_description returns a dictionary.
You can do something like:
scrubbed_data = self.remove_NaNs(self._data.get_description())
then pass the scrubbed_data.
connect/evidence/containers.py
Outdated
@@ -61,6 +62,15 @@ def _validate_inputs(self, data):
    def _validate(self, data):
        pass

    def remove_NaNs(self):
I think this may need to be more robust. What happens when this is a dictionary? Or a dictionary of dictionaries? Or a list of dictionaries? For those non-pandas cases, I'd probably use a recursive function. Not sure if this is the best way, but this is how I coded up the check_subset function.
Also, the way you use it (always calling it in to_evidence), you don't need to change _data in place. Just have it do the transformation on the data, and return the cleaned data for export.
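A minimal sketch of the recursive approach suggested in this comment, using only the standard library. The name replace_nans is hypothetical and this is not the code from the PR; it assumes the data consists of dicts, lists, tuples, and scalar values, and it does not handle pandas objects.

```python
import json
import math


def replace_nans(obj):
    """Return a copy of obj with every float NaN replaced by None.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if isinstance(obj, dict):
        return {key: replace_nans(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples come back as lists, which is how JSON would store them anyway.
        return [replace_nans(item) for item in obj]
    if isinstance(obj, float) and math.isnan(obj):
        return None
    return obj


nested = {"a": float("nan"), "b": [1.0, {"c": float("nan")}], "d": "ok"}
cleaned = replace_nans(nested)

# allow_nan=False makes json.dumps raise if any NaN slipped through.
print(json.dumps(cleaned, allow_nan=False))
```

Because the function builds new containers at every level, the input is left untouched, which matches the suggestion to return cleaned data for export rather than change _data in place.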
Still working on the data profiler sanitization. It is complicated because it contains nested dictionaries and then some DataFrames/Series within those dictionaries. It's probably safe to assume those DataFrames aren't further nested (i.e. they just contain elementary types) but if we don't want to assume that we'll need a very complicated sanitizer.
On further thought, I changed the paradigm of the call. This should be a forced sanitization, so I've moved the call to the base class EvidenceContainer's init function. The function is now abstract and forces subclasses to implement it. This will help prevent some future developer from implementing a new evidence while forgetting to sanitize for JSONs.
I don't really see the harm in having self._data reflect a de-NaNified structure. As currently written, Evidence always gets converted to a JSON. The prior implementation isn't really "in place" --> I was passing the _data object on the RHS and the return was a copy assigned to the same variable name. The way I'm doing it now precludes that by just sanitizing at the start (sanitizing a copy so we don't have to worry about deep copy issues).
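The forced-sanitization pattern described above could look roughly like the sketch below. All class names here are hypothetical stand-ins, and the remove_NaNs signature (taking the data and returning a cleaned copy) is an assumption for illustration, not the signature used in connect/evidence/containers.py.

```python
import copy
import math
from abc import ABC, abstractmethod


class EvidenceContainerSketch(ABC):
    """Hypothetical stand-in for the base container class."""

    def __init__(self, data):
        # Sanitize a deep copy so the caller's object is never modified.
        self._data = self.remove_NaNs(copy.deepcopy(data))

    @abstractmethod
    def remove_NaNs(self, data):
        """Return data with NaNs replaced by None."""


class DictContainerSketch(EvidenceContainerSketch):
    def remove_NaNs(self, data):
        # Flat dictionaries only; nested data would need a recursive helper.
        return {
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in data.items()
        }


class ForgetfulContainerSketch(EvidenceContainerSketch):
    """A subclass that forgot to implement remove_NaNs."""


raw = {"score": float("nan"), "n": 3}
container = DictContainerSketch(raw)
print(container._data)  # {'score': None, 'n': 3}

try:
    ForgetfulContainerSketch(raw)
except TypeError as err:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    print("cannot instantiate:", type(err).__name__)
```

The abstract method turns a forgotten sanitizer into an error at construction time, instead of a failure that only appears later when the evidence is serialized.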
Comments with more details but:
- Believe this function can be more robust
- Not applied properly to PandasProfiler
- Doesn't need to modify the data in place.
…class. Each class must implement its own NaN sanitization function to ensure future evidences don't forget to do so.
Not sure why the Lint rule is failing. Some Node.js error?
…ather than internal _data object
New solution for data_profiler uses dictionary helper in the containers base file. It's...not pretty. Similarity between dictionary helper and list helper is unfortunate. Not clear if there's a workaround though, since one relies on a class function (…
Encountered a bug. Converting to draft.
Now fixes #4
…which overwrites data
…ather than internal _data object
Looks good to me. Please review my changes and see if they still work and I didn't miss anything. I've tested on your integration notebook.
1bdaa02 to a344080
Looks good to me.
It won't let me approve since I'm the original PR author. Never mind, once you approved it's good.
JSON cannot support the NaN type. We need to recast all pandas DataFrames with None rather than NaN before converting to evidence.
This change does NOT check for NaNs in non-DataFrame evidences (e.g. deepchecks), which may pose an issue down the road (not an issue at the moment, as we are hand-crafting a return DataFrame for deepchecks results).
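To illustrate the underlying problem with the standard library alone: Python's json module writes NaN as a bare NaN token by default, which is not valid JSON, and it raises ValueError when strict compliance is requested with allow_nan=False. The dictionary below is made-up sample data, not output from any evidence container.

```python
import json

value = {"mean": float("nan")}

# Default behaviour: emits a NaN token that strict JSON parsers reject.
lenient = json.dumps(value)
print(lenient)  # {"mean": NaN}

# Strict behaviour: refuses to serialize NaN at all.
try:
    json.dumps(value, allow_nan=False)
except ValueError as err:
    print("strict mode rejected it:", err)

# None maps cleanly onto JSON null, which is why NaN is recast to None.
print(json.dumps({"mean": None}))  # {"mean": null}
```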
Change description
Added a helper function to the container base class to replace NaNs.
Modified each EvidenceContainer class to call the helper at the start of the get_evidence() function call.
Type of change