Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement Kolmogorov Smirnov query logic with sqlalchemy's Language Expression API #44

Merged
merged 20 commits into from
Aug 1, 2022

Conversation

kklein
Copy link
Collaborator

@kklein kklein commented Jul 28, 2022

See linked issues for context.

Importantly, I have high hopes that the isolation and explicit testing of _cross_cdf_selection will now be useful for other kinds of Constraints, e.g. #45.

@kklein kklein linked an issue Jul 28, 2022 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Jul 28, 2022

Codecov Report

Merging #44 (397443f) into main (da059c5) will increase coverage by 0.05%.
The diff coverage is 97.87%.

@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
+ Coverage   93.84%   93.90%   +0.05%     
==========================================
  Files          15       15              
  Lines        1577     1607      +30     
==========================================
+ Hits         1480     1509      +29     
- Misses         97       98       +1     
Impacted Files Coverage Δ
src/datajudge/db_access.py 94.10% <96.96%> (+0.11%) ⬆️
src/datajudge/constraints/stats.py 94.00% <100.00%> (+0.66%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us.

@kklein kklein linked an issue Jul 29, 2022 that may be closed by this pull request
@@ -63,9 +63,10 @@ def check_acceptance(
def c(alpha: float):
return math.sqrt(-math.log(alpha / 2.0 + 1e-10) * 0.5)

return d_statistic <= c(accepted_level) * math.sqrt(
threshold = c(accepted_level) * math.sqrt(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more convenient to debug.

@kklein kklein changed the title [WIP] Reimplement KS test with sqlalchemy's Language Expression API Reimplement KS test with sqlalchemy's Language Expression API Jul 29, 2022
@kklein kklein changed the title Reimplement KS test with sqlalchemy's Language Expression API Reimplement Kolmogorov Smirnov test with sqlalchemy's Language Expression API Jul 29, 2022
("value_0_1", "value_1_1", 0.3924, 0.0),
],
)
def test_ks_2sample_implementation(engine, random_normal_table, configuration):
Copy link
Collaborator Author

@kklein kklein Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this test to a different test suite. Not a hard constraint but the general idea so far was:
Everything that needs a database goes into tests/integration. Tests operating on a TestResult - obtained via the test method of a Requirement - go into tests/integration/test_integration.py.

@@ -1925,31 +1913,41 @@ def test_ks_2sample_constraint_perfect_between(engine, int_table1, data):
assert operation(test_result.outcome), test_result.failure_message


# TODO: Enable this test once the bug is fixed.
@pytest.mark.skip(reason="This is a known bug and unintended behaviour.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer skipping this test (as well as adding further examples) should indicate that this PR solves #42

@kklein kklein marked this pull request as ready for review July 29, 2022 17:07
@kklein kklein changed the title Reimplement Kolmogorov Smirnov test with sqlalchemy's Language Expression API Reimplement Kolmogorov Smirnov query logic with sqlalchemy's Language Expression API Jul 29, 2022
@kklein kklein requested review from jonashaag and YYYasin19 July 29, 2022 17:25
Copy link
Contributor

@jonashaag jonashaag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the SQLAlchemy code incredibly difficult to follow. Not saying that it is because you are writing complicated SQLAlchemy. Maybe it's just more difficult for me (with little SQLAlchemy experience) to follow than to follow SQL. Maybe it would help people like me if we had a simplified version of the query in SQL somewhere in a docstring to help get an overview of the code.

col = ref.get_column(engine)
selection = ref.get_selection(engine).subquery()

# Step 1: Calculate the CDF over the value column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: Is possible to merge the two steps? Like so

sa.select([
  selection.c[col],
  sa.func.max(sa.func.cume_dist().over(order_by=col)),
])
.group_by(selection.c[col])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered the same and doing it that way leads to an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError) aggregate function calls cannot contain window function calls

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add in the docstring a bit more information about the objective/idea behind this method?

It's great to have the comments on the step-by-step like in the SQL version before, but a summary would be a great addition to it, particularly clarifying and being explicit about the meaning of the arguments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.
f396053


@staticmethod
def calculate_statistic(
engine,
ref1: DataReference,
ref2: DataReference,
) -> Tuple[float, Optional[float], int, int]:
) -> Tuple[float, Optional[float], int, int, List]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List of what?

Copy link
Collaborator Author

@kklein kklein Jul 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite frankly we/I haven't figured out yet what the latest common SQLAlchemy ancestor type yet. Throughout almost all of db_access we don't annotate the type of the selections because the types, iirc, differ. Quite a few are simply sqlalchemy.sql.selectable.Select or sqlalchemy.sql.selectable.Subquery. Yet, some aren't and still support the necessary method interfaces.

Now there certainly are remedies to this situation but we haven't considered this to be 'sufficiently important' up until now.

def get_ks_2sample(
engine: sa.engine.Engine,
ref1: DataReference,
ref2: DataReference,
) -> float:
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you not annotate return types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing a comment somewhere? What's the idea? Shouldn't we annotate as much as possible?
Even using Any makes sense because you are proactively declaring that you don't care while not annotating leaves the user to guess where it's (1) unknown (2) not important (3) missing

Copy link
Collaborator Author

@kklein kklein Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing a comment somewhere?

Have you read this [0]?

because you are proactively declaring that you don't care

It's not clear to me why we don't care.

[0] #44 (comment)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take the liberty to merge for now. Yet, if you consider this an open topic still, happy to further discuss this and address it as a follow-up @YYYasin19 .

@kklein
Copy link
Collaborator Author

kklein commented Jul 30, 2022

I find the SQLAlchemy code incredibly difficult to follow. Not saying that it is because you are writing complicated SQLAlchemy. Maybe it's just more difficult for me (with little SQLAlchemy experience) to follow than to follow SQL.

I definitely understand where you're coming from. A couple of hopefully related comments from my side:

  • Personally, I see multiple upsides of using the language expression API.

    • Firstly, it conveniently enables the use of abstractions such as ref.get_selection(engine). IIrc not being able to use this abstraction caused a lot of the development effort in the raw-sql version of this query logic. Moreover, it caused a bug in relation to Conditions.
    • Secondly, some raw sql queries might actually run just fine against multiple dialects. Yet, this is definitely not the case for all of our queries. When it is not possible, the language expression API actually allows us to 'write once, run everywhere™'. I'm not sure whether having a patchwork of raw sql and language expression api would be desirable in the long run.
    • Thirdly, I find it easier to share logic between different queries than with raw sql queries - see comment about reusing _cross_cdf_selection for another hypothesis test.
  • As you gently allude to, I think I have indeed seen a development on my end when it comes to reading language expression code. I believe to find it easier to read now than two years ago.

  • I still don't find it easy to read.

  • To me, this very example seems to be on one side of the query logic complexity spectrum, at least in how far datajudge is concerned.

  • Personally - even though I admire the rigid structure and invaluable comments introduced by @YYYasin19 - I didn't find the raw sql code easy to follow either. It took me some debugging and reimplementing to understand some of the intricacies.

Maybe it would help people like me if we had a simplified version of the query in SQL somewhere in a docstring to help get an overview of the code.

I'm definitely open to the idea.

I made a point out of mostly preserving @YYYasin19's structure of iterative steps - in contrast to recursion - to allow the curious developer to create, read and run these intermediate steps. What I often do is to set breakpoints and run str(intermediate_selection) or even engine.connect().execute(intermediate_selection) via pdb while executing the integration tests. Moreover, I hoped to have made the query building a little more clear with this[0] example as well as the dedicated test case[1].

Nevertheless, if you have a concrete suggestion I'm happy to follow your lead. :)

[0] https://github.com/Quantco/datajudge/pull/44/files#diff-74c4ccf9b9cba732c4562df6e1e05a5ecdee4e9c04f38276b54e85f850155660R943-R945
[1] https://github.com/Quantco/datajudge/pull/44/files#diff-294af7c1d98bcac87ae111bebc4c13a1b1d8155cce13ea48446397e950f9db32R7-R28

Copy link
Collaborator

@ivergara ivergara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

col = ref.get_column(engine)
selection = ref.get_selection(engine).subquery()

# Step 1: Calculate the CDF over the value column.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add in the docstring a bit more information about the objective/idea behind this method?

It's great to have the comments on the step-by-step like in the SQL version before, but a summary would be a great addition to it, particularly clarifying and being explicit about the meaning of the arguments.

):
"""Create a cross cumulative distribution function selection given two samples.

Concretely, both ``DataReference``s are expected to have specified a single relevant column.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to explicitly enforce that expectation at the beginning of the method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, great point! Did it for all at once:
3c9408c

@ivergara
Copy link
Collaborator

Maybe it would help people like me if we had a simplified version of the query in SQL somewhere in a docstring to help get an overview of the code.

I'm definitely open to the idea.

Why not put a link to the full version in Yasin's repository containing the SQL version?

@YYYasin19
Copy link
Contributor

Very nice (re-)implementation in the expression language API!
I agree with the fact that it is hard to parse without having experience with the API. I think we can split the query up as much as possible here, since the final compilation and optimization will still be done by the database engine(s), so our goal should be readability and maintainability -- though I don't see any obvious wins left here.

@kklein kklein merged commit 8adeefd into main Aug 1, 2022
@kklein kklein deleted the ks_le2 branch August 1, 2022 18:33
@@ -288,7 +288,13 @@ def get_column(self, engine):
f"Trying to access column of DataReference "
f"{self.get_string()} yet none is given."
)
return self.get_columns(engine)[0]
columns = self.get_columns(engine)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this one

(col,) = self.get_columns(engine)
return col

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KolmogorovSmirnov2Sample: Conditions are ignored Re-implement KS test in sqlalchemy expression language API
4 participants