
[FEATURE] Update dataframe batch.validate workflow #10165

Merged
merged 28 commits into develop from f/v1-384/pandas-runtime-1 on Aug 6, 2024

Conversation

billdirks
Contributor

@billdirks billdirks commented Aug 1, 2024

The shape of the API we are implementing is:

# Initial data asset setup
import great_expectations as gx
context = gx.get_context()
datasource = context.data_sources.add_spark(name="myspark")
asset = datasource.add_dataframe_asset(name="spark_asset")

# Validating a batch
bd = asset.add_batch_definition_whole_dataframe(name="bd")
batch = bd.get_batch(batch_parameters={"dataframe": df})
# my_expectation is an Expectation
batch.validate(my_expectation)

# Validating a batch definition
bd = asset.add_batch_definition_whole_dataframe(name="another_bd")
# es is an ExpectationSuite
validation_definition = context.add_validation_definition(batch_definition=bd, expectation_suite=es)
validation_definition.validate(batch_parameters={"dataframe": df})
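As a hedged usage sketch (assuming the result objects expose a success attribute, which is not spelled out above):

# Both validate calls return a validation result; for example:
results = validation_definition.validate(batch_parameters={"dataframe": df})
print(results.success)  # True if every expectation in the suite passed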
  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, see Contribute.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!


netlify bot commented Aug 1, 2024

Deploy Preview for niobium-lead-7998 canceled.

Name Link
🔨 Latest commit 92d252e
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/66b26ac464fb5400080f8acd

@billdirks billdirks changed the title Update pandas dataframe batch.validate workflow [FEATURE] Update pandas dataframe batch.validate workflow Aug 1, 2024

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 68.29268% with 13 lines in your changes missing coverage. Please review.

Project coverage is 79.20%. Comparing base (a4636f9) to head (92d252e).

Files Patch % Lines
...xpectations/datasource/fluent/pandas_datasource.py 57.14% 9 Missing ⚠️
...expectations/datasource/fluent/spark_datasource.py 69.23% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10165      +/-   ##
===========================================
- Coverage    79.21%   79.20%   -0.02%     
===========================================
  Files          456      456              
  Lines        39720    39703      -17     
===========================================
- Hits         31465    31445      -20     
- Misses        8255     8258       +3     
Flag Coverage Δ
3.10 65.82% <46.34%> (+0.04%) ⬆️
3.10 aws_deps ?
3.10 big ?
3.10 databricks ?
3.10 filesystem ?
3.10 mssql ?
3.10 mysql ?
3.10 postgresql ?
3.10 snowflake ?
3.10 spark ?
3.10 trino ?
3.11 65.82% <46.34%> (+0.04%) ⬆️
3.11 aws_deps ?
3.11 big ?
3.11 databricks ?
3.11 filesystem ?
3.11 mssql ?
3.11 mysql ?
3.11 postgresql ?
3.11 snowflake ?
3.11 spark ?
3.11 trino ?
3.12 64.39% <46.34%> (+0.02%) ⬆️
3.12 aws_deps 45.56% <34.14%> (+0.01%) ⬆️
3.12 big 54.27% <46.34%> (-0.01%) ⬇️
3.12 filesystem 60.06% <53.65%> (+0.01%) ⬆️
3.12 mssql 49.66% <29.26%> (+0.01%) ⬆️
3.12 mysql 49.72% <29.26%> (+0.01%) ⬆️
3.12 postgresql 53.87% <34.14%> (+0.01%) ⬆️
3.12 spark 57.20% <51.21%> (+0.18%) ⬆️
3.12 trino 51.79% <31.70%> (+0.01%) ⬆️
3.8 65.86% <46.34%> (+0.02%) ⬆️
3.8 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds 54.29% <34.14%> (+0.01%) ⬆️
3.8 aws_deps 45.58% <34.14%> (+0.01%) ⬆️
3.8 big 54.29% <46.34%> (-0.01%) ⬇️
3.8 databricks 46.75% <31.70%> (+0.01%) ⬆️
3.8 filesystem 60.08% <53.65%> (+0.01%) ⬆️
3.8 mssql 49.65% <29.26%> (+0.01%) ⬆️
3.8 mysql 49.71% <29.26%> (+0.01%) ⬆️
3.8 postgresql 53.86% <34.14%> (+0.01%) ⬆️
3.8 snowflake 47.66% <31.70%> (+0.01%) ⬆️
3.8 spark 57.17% <51.21%> (+0.18%) ⬆️
3.8 trino 51.78% <31.70%> (+0.01%) ⬆️
3.9 65.85% <46.34%> (+0.04%) ⬆️
3.9 aws_deps ?
3.9 big ?
3.9 databricks ?
3.9 filesystem ?
3.9 mssql ?
3.9 mysql ?
3.9 postgresql ?
3.9 snowflake ?
3.9 spark ?
3.9 trino ?
cloud 0.00% <0.00%> (ø)
docs-basic 49.20% <46.34%> (+<0.01%) ⬆️
docs-creds-needed 49.72% <46.34%> (+<0.01%) ⬆️
docs-spark 48.42% <46.34%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@billdirks billdirks changed the title [FEATURE] Update pandas dataframe batch.validate workflow [FEATURE] Update dataframe batch.validate workflow Aug 5, 2024
@@ -129,13 +129,7 @@ def _init_fluent_datasource(self, name: str, ds: FluentDatasource) -> FluentData
datasource_name=name,
data_asset_name=asset.name,
)
cached_data_asset = self._in_memory_data_assets.get(in_memory_asset_name)
Contributor Author

This code block is removed because we no longer store the dataframe on the asset.

@@ -477,7 +479,11 @@ def _get_batch_metadata_from_batch_request(self, batch_request: BatchRequest) ->
batch_metadata = _ConfigurationSubstitutor().substitute_all_config_variables(
data=batch_metadata, replace_variables_dict=config_variables
)
batch_metadata.update(copy.deepcopy(batch_request.options))
batch_metadata.update(
Contributor Author
@billdirks billdirks Aug 5, 2024

This was added since the dataframe is now always passed in via options and we don't want to copy it into the metadata (Spark will actually fail if we try). Previously, for runtime dataframes we forced batch_request.options to be {} and added a special argument for the dataframe, which broke the Liskov substitution and arguably the interface segregation principles of SOLID.
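A hedged sketch of what the truncated batch_metadata.update(...) call above plausibly does (the dict-comprehension filter is illustrative, not the actual implementation; only the "dataframe" key name comes from this PR):

# Copy the batch request options into the metadata, but skip the
# "dataframe" entry so we never deep-copy a live (e.g. Spark) dataframe:
batch_metadata.update(
    copy.deepcopy(
        {k: v for k, v in batch_request.options.items() if k != "dataframe"}
    )
)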

@@ -994,7 +1000,7 @@ def __init__( # noqa: PLR0913
def _create_id(self) -> str:
options_list = []
for key, value in self.batch_request.options.items():
if key != "path":
if key not in ("path", "dataframe"):
Contributor Author

We don't want the dataframe to be part of the Batch id. We should probably make this configurable at the asset/batch-definition level, since it seems strange for the top-level Batch to know specifics about the assets/batch definitions that produce it, but I am following the existing pattern here and we can refactor later.

@@ -356,21 +355,13 @@ def _short_id() -> str:
return str(uuid.uuid4()).replace("-", "")[:11]


class DataFrameAsset(_PandasDataAsset, Generic[_PandasDataFrameT]):
class DataFrameAsset(_PandasDataAsset):
Contributor Author

We remove the dataframe from the asset since it is now passed in as a batch parameter. This lets us remove all the dataframe logic and this Generic from the asset.
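A hedged before/after sketch of the workflow change (the "after" shape follows the PR description; the "before" line paraphrases the removed behavior):

# Before: the dataframe was stored on the asset itself
# asset = datasource.add_dataframe_asset(name="spark_asset", dataframe=df)

# After: the asset is dataframe-free; the dataframe arrives as a batch parameter
asset = datasource.add_dataframe_asset(name="spark_asset")
bd = asset.add_batch_definition_whole_dataframe(name="bd")
batch = bd.get_batch(batch_parameters={"dataframe": df})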

context = gx.get_context(context_root_dir=context.root_directory, cloud_mode=False)
dataframe_asset = context.get_datasource(datasource_name="fluent_pandas_datasource").get_asset(
asset_name="my_df_asset"
)
_ = dataframe_asset.build_batch_request(dataframe=df)
assert dataframe_asset.dataframe.equals(df) # type: ignore[attr-defined] # _PandasDataFrameT
reloaded_batch_def = dataframe_asset.get_batch_definition(batch_definition_name="bd")
Contributor Author

We should change the argument to name instead of batch_definition_name. I can do that in a separate PR.

Contributor Author

I actually see both patterns in the code depending on the domain object. I'd prefer these all to be name, since we know the type of the object from the method name, e.g. get_<domain_object_type>.

assert new_batch.data.dataframe.toPandas().equals(df)


@dataclass
Contributor Author

From here to the end of the file are new e2e tests that verify the validation workflows for batches and validation_definitions.
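A hedged skeleton of one such end-to-end check, built only from the API in the PR description (df and my_expectation are placeholders; the success attribute is assumed):

bd = asset.add_batch_definition_whole_dataframe(name="bd")
batch = bd.get_batch(batch_parameters={"dataframe": df})
result = batch.validate(my_expectation)
assert result.success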

Contributor
@tyler-hoffman tyler-hoffman left a comment

This looks good to me!

@@ -44,7 +44,7 @@

# Python
# <snippet name="docs/docusaurus/docs/snippets/get_existing_data_asset_from_existing_datasource_pandas_filesystem_example.py build_batch_request_with_dataframe">
my_batch_request = my_asset.build_batch_request(dataframe=dataframe)
my_batch_request = my_asset.build_batch_request(options={"dataframe": dataframe})
Contributor

First impression here: does this interface put us in a place where things are less predictable? Should there just be a different method for building a batch request via dataframe?

Member

Hmm, I like the unified method call; perhaps we can add a specialized method based on need?

Contributor Author

We need a better runtime story; this is a stopgap for the short term. Currently, options and other parameters exist and we throw if you pass them, so this adds consistency. It also prevents you from adding dataframes onto the asset, which had some weird consequences.
But overall, I agree that this isn't ideal.
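To ground the discussion, a hedged sketch of the two shapes being weighed (the specialized method below is hypothetical and not part of this PR):

# Unified call (what this PR ships): the dataframe rides in options
my_batch_request = my_asset.build_batch_request(options={"dataframe": dataframe})

# A possible specialized method, per the suggestion above (hypothetical):
# my_batch_request = my_asset.build_batch_request_with_dataframe(dataframe)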

dataframe: SparkDataFrame


def _validate_whole_dataframe_batch_definition(
Contributor

nit: it feels weird to have this method defined in between its uses. Obviously non-blocking.

Contributor

Slightly more of a nit, but also not blocking: this pattern of putting the asserts and some other non-setup logic in a helper function makes the test somewhat harder to reason about, and likely invites overloading even more logic into it. I guess some things I'd prefer in tests:

  • asserts live in the test itself, not a helper method
  • I don't want to have to read the helper implementation to know what expectation my data is tested on.

Contributor Author

Yes, these tests should really be parameterized and the helper functions should be inlined. I can follow up with that fix.
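A hedged sketch of that follow-up, assuming pytest and illustrative fixture names (asset, df, my_expectation):

import pytest

@pytest.mark.parametrize("batch_definition_name", ["bd", "another_bd"])
def test_validate_whole_dataframe(batch_definition_name, asset, df, my_expectation):
    # Setup, action, and asserts all live in the test itself
    bd = asset.add_batch_definition_whole_dataframe(name=batch_definition_name)
    batch = bd.get_batch(batch_parameters={"dataframe": df})
    result = batch.validate(my_expectation)
    assert result.success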

@billdirks billdirks enabled auto-merge August 6, 2024 18:24
@billdirks billdirks added this pull request to the merge queue Aug 6, 2024
Merged via the queue into develop with commit 624653d Aug 6, 2024
67 checks passed
@billdirks billdirks deleted the f/v1-384/pandas-runtime-1 branch August 6, 2024 18:50