-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement Kolmogorov Smirnov query logic with sqlalchemy's Language Expression API #44
Changes from 3 commits
94ae647
6d6cd33
47034c0
5c8952b
bf19817
ddb6441
92afc22
c0a1efd
ada70bc
2387817
6df75e8
44c0cdd
73e7e99
7e9f44f
61eb97e
d641748
1353488
f396053
3c9408c
397443f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -288,7 +288,13 @@ def get_column(self, engine): | |
f"Trying to access column of DataReference " | ||
f"{self.get_string()} yet none is given." | ||
) | ||
return self.get_columns(engine)[0] | ||
columns = self.get_columns(engine) | ||
if len(columns) > 1: | ||
raise ValueError( | ||
"DataReference was expected to only have a single column but had multiple: " | ||
f"{columns}" | ||
) | ||
return columns[0] | ||
|
||
def get_columns(self, engine): | ||
"""Fetch all relevant columns of a DataReference.""" | ||
|
@@ -905,6 +911,12 @@ def get_column_array_agg( | |
|
||
|
||
def _cdf_selection(engine, ref: DataReference, cdf_label: str, value_label: str): | ||
"""Create an empirical cumulative distribution function values. | ||
|
||
Concretely, create a selection with values from ``value_label`` as well as | ||
the empirical cumulative didistribution function values, labeled as | ||
``cdf_label``. | ||
""" | ||
col = ref.get_column(engine) | ||
selection = ref.get_selection(engine).subquery() | ||
|
||
|
@@ -927,7 +939,6 @@ def _cdf_selection(engine, ref: DataReference, cdf_label: str, value_label: str) | |
.group_by(cdf_selection.c[value_label]) | ||
.subquery() | ||
) | ||
|
||
return grouped_cdf_selection | ||
|
||
|
||
|
@@ -1026,7 +1037,10 @@ def get_ks_2sample( | |
ref2: DataReference, | ||
): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you not annotate return types? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See above. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Am I missing a comment somewhere? What's the idea? Shouldn't we annotate as much as possible? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Have you read this [0]?
It's not clear to me why we don't care. [0] #44 (comment) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll take the liberty to merge for now. Yet, if you consider this an open topic still, happy to further discuss this and address it as a follow-up @YYYasin19 . |
||
""" | ||
Runs the query for the two-sample Kolmogorov-Smirnov test and returns the test statistic d. | ||
Run the query for the two-sample Kolmogorov-Smirnov test and return the test statistic d. | ||
|
||
For a raw-sql version of this query, please see this PR: | ||
https://github.com/Quantco/datajudge/pull/28/ | ||
""" | ||
cdf_label = "cdf" | ||
value_label = "val" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this one