[FEATURE] Update dataframe batch.validate workflow #10165
Conversation
✅ Deploy Preview for niobium-lead-7998 canceled.
Codecov Report
Attention: patch coverage details and impacted files:

@@            Coverage Diff             @@
##           develop    #10165    +/-   ##
===========================================
- Coverage    79.21%    79.20%   -0.02%
===========================================
  Files          456       456
  Lines        39720     39703      -17
===========================================
- Hits         31465     31445      -20
- Misses        8255      8258       +3
@@ -129,13 +129,7 @@ def _init_fluent_datasource(self, name: str, ds: FluentDatasource) -> FluentData
    datasource_name=name,
    data_asset_name=asset.name,
)
cached_data_asset = self._in_memory_data_assets.get(in_memory_asset_name)
This code block is removed because we no longer store the dataframe on the asset.
@@ -477,7 +479,11 @@ def _get_batch_metadata_from_batch_request(self, batch_request: BatchRequest) ->
batch_metadata = _ConfigurationSubstitutor().substitute_all_config_variables(
    data=batch_metadata, replace_variables_dict=config_variables
)
batch_metadata.update(copy.deepcopy(batch_request.options))
batch_metadata.update(
This was added since the dataframe is now always passed in via options and we don't want to copy it into the metadata (Spark will actually fail if we try). Previously, for runtime dataframes we forced batch_request.options to be {} and added a special argument for the dataframe, which broke the Liskov Substitution and arguably the Interface Segregation principles of SOLID.
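A minimal sketch of the filtering this comment describes, assuming the dataframe arrives under the "dataframe" key in batch_request.options (the function name and shape here are illustrative, not the exact implementation in this PR):

```python
import copy


def merge_options_into_metadata(batch_metadata: dict, options: dict) -> dict:
    # Copy every option into the batch metadata except the dataframe itself,
    # since deep-copying an in-memory (Spark) DataFrame is expensive or fails outright.
    batch_metadata.update(
        {key: copy.deepcopy(value) for key, value in options.items() if key != "dataframe"}
    )
    return batch_metadata
```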
@@ -994,7 +1000,7 @@ def __init__(  # noqa: PLR0913
def _create_id(self) -> str:
    options_list = []
    for key, value in self.batch_request.options.items():
        if key != "path":
        if key not in ("path", "dataframe"):
We don't want the dataframe to be part of the Batch id. We should probably make this configurable at the asset/batch definition level, since it seems strange for the top-level batch to know specifics about the different assets/batch definitions that produce it, but I am following the existing pattern here and we can refactor later.
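A rough, illustrative sketch of the kind of id construction being discussed, assuming only that the options dict may contain "path" and "dataframe" keys (this is not the actual GX implementation):

```python
import hashlib
import json


def create_batch_id(batch_request_options: dict) -> str:
    # Build a deterministic id from the batch request options, skipping keys whose
    # values are not meaningful identifiers: the file path and the dataframe itself.
    options_list = [
        f"{key}={value}"
        for key, value in sorted(batch_request_options.items())
        if key not in ("path", "dataframe")
    ]
    return hashlib.md5(json.dumps(options_list).encode()).hexdigest()
```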
@@ -356,21 +355,13 @@ def _short_id() -> str:
    return str(uuid.uuid4()).replace("-", "")[:11]


class DataFrameAsset(_PandasDataAsset, Generic[_PandasDataFrameT]):
class DataFrameAsset(_PandasDataAsset):
We remove the dataframe from the asset since it now needs to be passed in when making the batch definition. This lets us remove all the dataframe logic and this Generic from the asset.
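A minimal usage sketch of the new shape, assuming a context that already contains the pandas datasource and dataframe asset used in the test below (datasource and asset names are taken from that test; this is illustrative, not the full API surface):

```python
import pandas as pd
import great_expectations as gx

df = pd.DataFrame({"a": [1, 2, 3]})

context = gx.get_context()
asset = context.get_datasource(datasource_name="fluent_pandas_datasource").get_asset(
    asset_name="my_df_asset"
)

# The dataframe is no longer stored on the asset; it travels with the batch request.
batch_request = asset.build_batch_request(options={"dataframe": df})
```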
context = gx.get_context(context_root_dir=context.root_directory, cloud_mode=False)
dataframe_asset = context.get_datasource(datasource_name="fluent_pandas_datasource").get_asset(
    asset_name="my_df_asset"
)
_ = dataframe_asset.build_batch_request(dataframe=df)
assert dataframe_asset.dataframe.equals(df)  # type: ignore[attr-defined] # _PandasDataFrameT
reloaded_batch_def = dataframe_asset.get_batch_definition(batch_definition_name="bd")
We should change the argument to name instead of batch_definition_name. I can do that in a separate PR.
I actually see both patterns in the code depending on the domain object. I'd prefer these to all be name, since we already know the type of the object from the method name, e.g. get_<domain_object_type>.
assert new_batch.data.dataframe.toPandas().equals(df)


@dataclass
From here to the end of the file are new e2e tests that verify the validation workflows for batches and validation_definitions.
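A condensed sketch of what such an end-to-end check might look like, written against the 1.0-style fluent API; the datasource, asset, and expectation names are assumptions and may differ from the code on this branch:

```python
import pandas as pd
import great_expectations as gx
import great_expectations.expectations as gxe


def test_batch_validate_whole_dataframe():
    df = pd.DataFrame({"passenger_count": [1, 2, 3]})

    context = gx.get_context()
    asset = context.data_sources.add_pandas(name="my_datasource").add_dataframe_asset(
        name="my_asset"
    )
    batch_definition = asset.add_batch_definition_whole_dataframe("my_batch_definition")

    # The dataframe is supplied at runtime, not stored on the asset.
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    result = batch.validate(
        gxe.ExpectColumnValuesToBeBetween(column="passenger_count", min_value=0)
    )
    assert result.success
```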
This looks good to me!
@@ -44,7 +44,7 @@

# Python
# <snippet name="docs/docusaurus/docs/snippets/get_existing_data_asset_from_existing_datasource_pandas_filesystem_example.py build_batch_request_with_dataframe">
my_batch_request = my_asset.build_batch_request(dataframe=dataframe)
my_batch_request = my_asset.build_batch_request(options={"dataframe": dataframe})
First impression here: does this interface put us in a place where things are less predictable? Should there just be a different method for building a batch request via dataframe?
Hmm, I like the unified method call; perhaps we can add a specialized method later if the need arises?
We need to have a better runtime story; this is a stopgap for the short term. I think options and the other parameters currently exist on the call but it throws if you pass them, so this change adds consistency. It also prevents you from adding dataframes onto the asset, which had some weird consequences. But overall, I agree that this isn't ideal.
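If we did later want the specialized entry point suggested above, it could stay a thin wrapper over the unified call; a hypothetical sketch (the name build_batch_request_from_dataframe does not exist in the codebase):

```python
# Hypothetical convenience wrapper: keeps the options-based call as the single
# code path while giving users an explicit, discoverable method for dataframes.
def build_batch_request_from_dataframe(asset, dataframe):
    return asset.build_batch_request(options={"dataframe": dataframe})
```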
dataframe: SparkDataFrame


def _validate_whole_dataframe_batch_definition(
nit: it feels weird to have this method defined in between its uses. Obviously non-blocking.
Slightly more of a nit, but also not blocking: this pattern of putting the asserts and other non-setup logic in a helper function makes the test somewhat harder to reason about, and likely invites overloading even more logic into the helper. One thing I'd prefer in tests: asserts live in the test itself, not in a helper method. I don't want to have to read the helper implementation to know what expectation my data is tested against.
Yes, these tests should really be parametrized and the helper functions should be inlined. I can follow up with that fix.
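A sketch of the parametrized, asserts-inline shape being suggested; the batch_definition fixture and the expectation used are hypothetical placeholders:

```python
import pandas as pd
import pytest
import great_expectations.expectations as gxe


@pytest.mark.parametrize(
    "column,min_value,expected_success",
    [
        ("passenger_count", 0, True),
        ("passenger_count", 10, False),
    ],
)
def test_batch_validate(column, min_value, expected_success, batch_definition):
    # batch_definition is a hypothetical fixture that wires up the datasource/asset.
    df = pd.DataFrame({"passenger_count": [1, 2, 3]})
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    result = batch.validate(
        gxe.ExpectColumnValuesToBeBetween(column=column, min_value=min_value)
    )

    # The assert lives in the test itself, not in a helper.
    assert result.success == expected_success
```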
The shape of the API we are implementing is described in:

invoke lint (uses ruff format + ruff check)

For more information about contributing, see Contribute.
After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!