diff --git a/docs/docusaurus/docs/cloud/connect/connect_python.md b/docs/docusaurus/docs/cloud/connect/connect_python.md index e0ec5f1aef3a..18b3176d9dcd 100644 --- a/docs/docusaurus/docs/cloud/connect/connect_python.md +++ b/docs/docusaurus/docs/cloud/connect/connect_python.md @@ -61,7 +61,7 @@ Environment variables securely store your GX Cloud access credentials. ``` :::note - After you save your **GX_CLOUD_ACCESS_TOKEN** and **GX_CLOUD_ORGANIZTION_ID**, you can use Python scripts to access GX Cloud and complete other tasks. See the [GX OSS guides](/core/introduction/about_gx.md). + After you save your **GX_CLOUD_ACCESS_TOKEN** and **GX_CLOUD_ORGANIZATION_ID**, you can use Python scripts to access GX Cloud and complete other tasks. See the [GX Core guides](/core/introduction/introduction.md). ::: 2. Optional. If you created a temporary file to record your user access token and Organization ID, delete it. diff --git a/docs/docusaurus/docs/components/examples_under_test.py b/docs/docusaurus/docs/components/examples_under_test.py index 1a64944aa592..849a656d6b9e 100644 --- a/docs/docusaurus/docs/components/examples_under_test.py +++ b/docs/docusaurus/docs/components/examples_under_test.py @@ -8,6 +8,23 @@ docs_tests = [] +try_gx = [ + IntegrationTestFixture( + # To test, run: + # pytest --docs-tests -k "try_gx_exploratory" tests/integration/test_script_runner.py + name="try_gx_exploratory", + user_flow_script="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py", + backend_dependencies=[], + ), + # To test, run: + # pytest --docs-tests --postgresql -k "try_gx_end_to_end" tests/integration/test_script_runner.py + IntegrationTestFixture( + name="try_gx_end_to_end", + user_flow_script="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py", + backend_dependencies=[BackendDependencies.POSTGRESQL], + ), +] + create_a_data_context = [ IntegrationTestFixture( # To test, run: @@ -524,6 +541,7 @@ # Extend the docs_tests list with the above sublists (only the
docs_tests list is imported # into `test_script_runner.py` and actually used in CI checks). +docs_tests.extend(try_gx) docs_tests.extend(create_a_data_context) diff --git a/docs/docusaurus/docs/core/installation_and_setup/manage_data_contexts.md b/docs/docusaurus/docs/core/installation_and_setup/manage_data_contexts.md index 3682543b8dbe..7ab8dd46fe29 100644 --- a/docs/docusaurus/docs/core/installation_and_setup/manage_data_contexts.md +++ b/docs/docusaurus/docs/core/installation_and_setup/manage_data_contexts.md @@ -288,7 +288,7 @@ Environment variables securely store your GX Cloud access credentials. export GX_CLOUD_ORGANIZATION_ID= ``` - After you save your **GX_CLOUD_ACCESS_TOKEN** and **GX_CLOUD_ORGANIZTION_ID**, you can use Python scripts to access GX Cloud and complete other tasks. See the [GX OSS guides](/core/introduction/about_gx.md). + After you save your **GX_CLOUD_ACCESS_TOKEN** and **GX_CLOUD_ORGANIZATION_ID**, you can use Python scripts to access GX Cloud and complete other tasks. See the [Introduction to GX Core](/core/introduction/introduction.md). 2. Optional. If you created a temporary file to record your user access token and Organization ID, delete it. diff --git a/docs/docusaurus/docs/core/introduction/about_gx.md b/docs/docusaurus/docs/core/introduction/about_gx.md deleted file mode 100644 index 8197f10c1179..000000000000 --- a/docs/docusaurus/docs/core/introduction/about_gx.md +++ /dev/null @@ -1,75 +0,0 @@ ---- -title: About Great Expectations ---- -import GxData from '../_core_components/_data.jsx' -import PythonVersion from '../_core_components/_python_version.md' -import GxCloudAdvert from '/static/docs/_static_components/_gx_cloud_advert.md' - -Great Expectations (GX) is the leading tool for validating and documenting your data. {GxData.product_name} is the open source Python library that supports this tool. With -{GxData.product_name} you can further customize, automate, and expand on GX's processes to suite your specialized use cases.
- -Software developers have long known that automated testing is essential for managing complex codebases. GX brings the same discipline, confidence, and acceleration to data science and data engineering teams. - -## Why use GX? - -With GX, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are basically unit tests for your data. Not only that, but GX also creates data documentation and data quality reports from those Expectations. Data science and data engineering teams use GX to: - -- Test data they ingest from other teams or vendors and ensure its validity. -- Validate data they transform as a step in their data pipeline in order to ensure the correctness of transformations. -- Prevent data quality issues from slipping into data products. -- Streamline knowledge capture from subject-matter experts and make implicit knowledge explicit. -- Develop rich, shared documentation of their data. - -To learn more about how data teams are using GX, see [GX case studies](https://greatexpectations.io/case-studies/). - -## Key Features - -GX provides a unique framework for describing your data and validating it to ensure that it meets the standards you've defined. In the process, GX will generate human-readable reports about the state of your data. Additionally, GX's support for multiple backends ensures you can run your validations on data stored in different formats with minimal effort. - -### Expectations - -Expectations are assertions about your data. In GX, those assertions are expressed in a declarative language in the form of simple, human-readable Python methods. 
For example, in order to assert that you want the column “passenger_count” to be between 1 and 6, you can say: - -```python title="Python code" -expect_column_values_to_be_between( - column="passenger_count", - min_value=1, - max_value=6 -) -``` - -GX then uses this statement to validate whether the column passenger_count in a given table is indeed between 1 and 6, and returns a success or failure result. The library currently provides several dozen highly expressive built-in Expectations, and allows you to write custom Expectations. - -### Data validation - -Once you’ve created your Expectations, GX can load any batch or several batches of data to validate with your suite of Expectations. GX tells you whether each Expectation in an Expectation Suite passes or fails, and returns any unexpected values that failed a test, which can significantly speed up debugging data issues! - -### Data Docs - -GX renders Expectations in a clean, human-readable format called Data Docs. These HTML docs contain both your Expectation Suites and your data Validation Results each time validation is run – think of it as a continuously updated data quality report. The following image shows a sample Data Doc: - -![Screenshot of Data Docs](/docs/oss/guides/images/datadocs.png) - -### Support for various Data Sources and Store backends - -GX currently supports native execution of Expectations against various Data Sources, such as Pandas dataframes, Spark dataframes, and SQL databases via SQLAlchemy. This means you’re not tied to having your data in a database in order to validate it: You can also run GX against CSV files or any piece of data you can load into a dataframe. - -GX is highly configurable. It allows you to store all relevant metadata, such as the Expectations and Validation Results in file systems, database backends, as well as cloud storage such as S3 and Google Cloud Storage, by configuring metadata Stores. - -## What does GX NOT do? - -GX is NOT a pipeline execution framework. 
- -GX integrates seamlessly with DAG execution tools such as [Airflow](https://airflow.apache.org/), [dbt](https://www.getdbt.com/), [Prefect](https://www.prefect.io/), [Dagster](https://github.com/dagster-io/dagster), and [Kedro](https://github.com/quantumblacklabs/kedro). GX does not execute your pipelines for you, but instead, validation can simply be run as a step in your pipeline. - -GX is NOT a data versioning tool. - -GX does not store data itself. Instead, it deals in metadata about data: Expectations, Validation Results, etc. If you want to bring your data itself under version control, check out tools like: [DVC](https://dvc.org/), [Quilt](https://github.com/quiltdata/quilt), and [lakeFS](https://github.com/treeverse/lakeFS/). - -GX is NOT an independent executable. - -{GxData.product_name} is a Python library. To use GX, you will need an installation of . Ideally, you will also configure a Python virtual environment to in which you can install and run GX. Guidance on setting up your Python environment and installing the GX Python library is provided under [Set up a GX environment](/core/installation_and_setup/installation_and_setup.md) in the GX docs. - -## GX Cloud - - diff --git a/docs/docusaurus/docs/core/introduction/gx_overview.md b/docs/docusaurus/docs/core/introduction/gx_overview.md index b92126f9458d..6b0ec4bcc205 100644 --- a/docs/docusaurus/docs/core/introduction/gx_overview.md +++ b/docs/docusaurus/docs/core/introduction/gx_overview.md @@ -1,101 +1,88 @@ --- -title: Great Expectations overview +title: GX Core overview --- -This overview is for new users of Great Expectations (GX) and those looking for an understanding of its components and its primary workflows. It does not require an in-depth understanding of GX code, and is an ideal place to start before moving to more advanced topics, or if you want a better understanding of GX functionality. 
+This overview is for new users of GX Core and those looking for an improved understanding of GX Core components and primary workflows. It is an ideal place to start before exploring more advanced topics found in the GX Core documentation. -## What is GX +## GX Core components and workflows -GX is a framework for describing data using expressive tests and then validating that the data meets those criteria. +**Great Expectations (GX)** is a framework for describing data using expressive tests and then validating that the data meets test criteria. **GX Core** is a Python library that provides a programmatic interface to building and running data validation workflows using GX. -## GX core components +GX Core is versatile and supports a variety of workflows. It can be used for interactive, exploratory data validation as well as data validation within production deployments. -GX is built around the following five core components: +**GX components** are Python classes that represent your data and data validation entities. -- **[Data Sources:](#data-sources)** Connect to your data, and organize data for testing. -- **[Expectations:](#expectations)** Identify the standards to which your data should conform. -- **[Validation Definitions:](#validation-definitions)** Link a set of Expectations to a specific set of data. -- **[Checkpoints:](#checkpoints)** Facilitate the integration of GX into data pipelines by allowing you to run automated actions based on the results of validations. -- **[Data Context:](#data-context)** Manages the settings and metadata for a GX project, and provides an entry point to the GX Python API. +**GX workflows** are programmatically defined data validation processes. GX workflows are built using GX components. 
+## The pattern of a GX workflow -## Data Sources +All GX workflows share a common pattern: -Data Sources connect GX to data such as CSV files in a folder, a PostgreSQL database hosted on AWS, or any combination of data formats and environments. Regardless of the format of your Data Asset or where it resides, Data Sources provide GX with a unified API for working with it. +1. Set up a GX environment +2. Connect to data +3. Define Expectations +4. Run Validations -### Data Assets +At each workflow step, different GX components are defined and used. This section introduces the key GX Core components required to create a data validation workflow. -Data Assets are collections of records within a Data Source, like tables in a database or files in a cloud storage bucket. A Data Source tells GX how to connect to your data and Data Assets tell GX how to organize that data. +![GX workflow pattern with related GX components](./overview_images/gx_workflow_steps_and_components.png) -Data Assets should be defined in a way that makes sense for your data and your use case. For instance, you could define a Data Asset based on a SQL view that joins multiple tables or selects a subset of a table, such as all of the records with a given status in a specific field. +### Set up a GX environment -## Batches +A **Data Context** manages the settings and metadata for a GX workflow. In GX Core, the Data Context is a Python object that serves as the entrypoint for the [GX Python API](/reference/index.md). You use the Data Context to define and run a GX workflow; the Data Context provides access to the configurations, metadata, and actions of your GX workflow components and the results of data validations. -All validation in GX is performed on Batches of data. You can validate the entire data asset as a single batch, or you can partition the data asset into multiple batches and validate each one separately. +All GX workflows start with the creation of a Data Context. 
-### Batch Definitions +For more information on the types of Data Context, see [Create a Data Context](/core/set_up_a_gx_environment/create_a_data_context.md). -A Batch Definition tells GX how to organize the records in a Data Asset into Batches for retrieval. For example, if a table is updated with new records each day, you could define each day's worth of data as a different Batch. Batch Definitions allow you to retrieve a specific Batch based on parameters provided at runtime. +### Connect to data -Multiple Batch Definitions can be added to a Data Asset. That feature allows you to apply different Expectations to different subsets of the same data. For instance, you could define one Batch Definition that returns all the records within a Data Asset. You might then configure a second Batch Definition to only return the most recent day's records. And you could also create a Batch Definition that returns all the records for a given year and month which you only specify at runtime in a script. +A **Data Source** is the GX representation of a data store. The Data Source tells GX how to connect to your data, and supports connection to different types of data stores, including databases, schemas, and data files in cloud object storage. -## Expectations +A **Data Asset** is a collection of records within a Data Source. A useful analogy is: if a Data Source is a relational database, then a Data Asset is a table within that database, or the results of a select query on a table within that database. -An Expectation is a verifiable assertion about data. Similar to assertions in traditional Python unit tests, Expectations provide a flexible, declarative language for describing expected behaviors. Unlike traditional unit tests which describe the expected behavior of code given a specific input, Expectations apply to the input data itself. For example, you can define an Expectation that a column contains no null values. 
When GX runs that Expectation on your data it generates a report which indicates if a null value was found. +A **Batch Definition** tells GX how to organize the records within a Data Asset. The Batch Definition Python object enables you to retrieve a **Batch**, or collection of records from a Data Asset, for validation at runtime. A Data Asset can be validated as a single Batch, or partitioned into multiple Batches for separate validations. -Expectations can be built directly from the domain knowledge of subject matter experts, interactively while introspecting a set of data, or through automated tools provided by GX. +For more information on connecting to data, see [Connect to data](/core/connect_to_data/connect_to_data.md). -For a list of available Expectations, see [the Expectation Gallery](https://greatexpectations.io/expectations/). +### Define Expectations -### Expectation Suites +An **Expectation** is a verifiable assertion about data. Similar to assertions in traditional Python unit tests, Expectations provide a flexible, declarative language for describing expected data qualities. An Expectation can be used to validate a Batch of data. -Expectation Suites are collections of Expectations describing your data. When GX validates data, an Expectation Suite helps streamline the process by running all the contained Expectations against that data. +For a full list of available Expectations, see [the Expectation Gallery](https://greatexpectations.io/expectations/). -You can define multiple Expectation Suites for the same data to cover different use cases, and you can apply the same Expectation Suite to different Data Assets. +An **Expectation Suite** is a collection of Expectations. Expectation Suites can be used to validate a Batch of data using multiple Expectations, streamlining the validation process. You can define multiple Expectation Suites for the same data to cover different use cases, and you can apply the same Expectation Suite to different Batches. 
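To make the Expectation concept concrete, here is a plain-Python illustration — not the GX API; the helper below is invented for this sketch — of the declarative check an Expectation such as `expect_column_values_to_be_between` performs over a Batch of records:

```python
# Conceptual sketch of what an Expectation verifies over a Batch of records.
batch = [
    {"passenger_count": 1},
    {"passenger_count": 4},
    {"passenger_count": 6},
]

def expect_values_between(records, column, min_value, max_value):
    """Return a GX-style result: a success flag plus any unexpected values."""
    unexpected = [
        r[column] for r in records if not (min_value <= r[column] <= max_value)
    ]
    return {"success": not unexpected, "unexpected_values": unexpected}

result = expect_values_between(batch, "passenger_count", 1, 6)
print(result)  # {'success': True, 'unexpected_values': []}
```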
- -## Validation Definitions +For more information on defining Expectations and creating Expectation Suites, see [Define Expectations](/core/define_expectations/define_expectations.md). -Validation Definitions tell GX what Expectations to apply to specific data for validation. It connects a Data Asset's Batch Definition to a specific Expectation Suite. +### Run Validations -Because an Expectation Suite is decoupled from a specific source of data, you can apply the same Expectation Suite against different data by reusing it in different Validation Definitions. +A **Validation Definition** explicitly associates a Batch Definition with an Expectation Suite, defining what data should be validated against which Expectations. -The same holds true for Batch Definitions: because they are decoupled from a specific Expectation Suite, you can run multiple Expectation Suites against the same Batch of data by reusing the Batch Definition in different Validation Definitions. As an example, you could have one Validation Definition that links a permissive Expectation Suite to a Batch Definition. Then you could have a second Validation Definition that links a more strict Expectation Suite to that same Batch Batch Definition to verify different quality parameters. +A **Validation Result** is returned by GX after data validation. Validation Results tell you how your data corresponds to what you expected of it. -In Python, Validation Definition objects also provide the API for running their defined validation and returning Validation Results. +A **Checkpoint** is the primary means for validating data in a production deployment of GX. Checkpoints enable you to run a list of Validation Definitions with shared parameters. Checkpoints can be configured to run Actions, and can pass Validation Results to a list of predefined Actions for processing.
-### Validation Results +**Actions** provide a mechanism to integrate Checkpoints into your data pipeline infrastructure by automatically processing Validation Results. Typical use cases include sending email alerts, Slack messages, or custom notifications based on the result of data validation. -The Validation Results returned by GX tell you how your data corresponds to what you expected of it. You can view this information in the Data Docs that are configured in your Data Context. Evaluating your Validation Results helps you identify issues with your data. If the Validation Results show that your data meets your Expectations, you can confidently use it. +**Data Docs** are human-readable documentation generated by GX that host your Expectation Suite definitions and Validation Results. Using Checkpoints and Actions, you can configure your GX workflow to automatically write Validation Results to a chosen Data Docs site. -## Checkpoints +For more information on defining and running Validations, see [Run Validations](/core/run_validations/run_validations.md). -A Checkpoint is the primary means for validating data in a production deployment of GX. Checkpoints allow you to run a list of Validation Definitions with shared parameters and then pass the Validation Results to a list of automated Actions. +## Customize GX Core workflows -### Actions +While all GX Core workflows follow a shared pattern, the outcome and operation of a workflow can be customized based on how you create Batches, define Expectations, and run Validations. GX Core components are building blocks that can be applied in a variety of ways to satisfy your data validation use case. -One of the most powerful features of Checkpoints is that you can configure them to run Actions. The Validation Results generated when a Checkpoint runs determine what Actions are performed. Typical use cases include sending email, Slack messages, or custom notifications. Another common use case is updating Data Docs sites. 
Actions can be used to do anything you are capable of programming in Python. Actions are a versatile tool for integrating Checkpoints in your pipeline's workflow. +For instance, a GX Core workflow might: -## Data Context +* Create a Batch using data from a Spark DataFrame and allow you to interactively validate the Batch with Expectations and immediately review the Validation Results. This workflow could serve to inform your exploration of which Expectations you want to use in a production deployment of GX. -A Data Context manages the settings and metadata for a GX project. In Python, the Data Context object serves as the entry point for the GX API and manages various classes to limit the objects you need to directly manage yourself. A Data Context contains all the metadata used by GX, the configurations for GX objects, and the output from validating data. +* Connect to data in a SQL table, define multiple Expectation Suites that each test for a desired data quality characteristic, and use a Checkpoint to run all Expectation Suites. This workflow, when integrated with and triggered by an orchestrator, could enable automated, scheduled data quality testing on an essential data table. -The following are the available Data Context types: -- **Ephemeral Data Context:** Exists in memory, and does not persist beyond the current Python session. -- **File Data Context:** Exists as a folder and configuration files. Its contents persist between Python sessions. -- **Cloud Data Context:** Supports persistence between Python sessions, but additionally serves as the entry point for GX Cloud. +* Connect to a group of SQL tables and define a collection of Data Assets, each batched on a time-based column, and validate the data within each Data Asset using the same Expectation Suite. This workflow could provide a way to implement consistent data quality testing across a sharded data infrastructure. 
-### The GX API +Equipped with an understanding of the GX Core components, you can design data validation workflows that logically and effectively validate your data across a variety of data store types, environments, and business use cases. -A Data Context object in Python provides methods for configuring and interacting with GX. These methods and the objects and additional methods accessed through them compose the GX public API. +## Next steps -For more information, see [The GX API reference](/reference/api_reference.md). - -### Stores - -Stores contain the metadata GX uses. This includes configurations for GX objects, information that is recorded when GX validates data, and credentials used for accessing data sources or remote environments. GX utilizes one Store for each type of metadata, and the Data Context contains the settings that tell GX where that Store should reside and how to access it. - -### Data Docs - -Data Docs are human-readable documentation generated by GX. Data Docs describe the standards that you expect your data to conform to, and the results of validating your data against those standards. The Data Context manages the storage and retrieval of this information. - -You can configure where your Data Docs are hosted. Unlike Stores, you can define configurations for multiple Data Docs sites. You can also specify what information each Data Doc site provides, allowing you to format and provide different Data Docs for different use cases. \ No newline at end of file +Visit [Try GX Core](/core/introduction/try_gx.md) to see example workflows implemented using GX Core. 
\ No newline at end of file diff --git a/docs/docusaurus/docs/core/introduction/introduction.md b/docs/docusaurus/docs/core/introduction/introduction.md index 17adc5140ac6..60daf9eb4481 100644 --- a/docs/docusaurus/docs/core/introduction/introduction.md +++ b/docs/docusaurus/docs/core/introduction/introduction.md @@ -1,6 +1,6 @@ --- -title: Introduction to Great Expectations -description: Learn about the key features of GX, how to connect with the GX community, and try GX in Python. +title: Introduction to GX Core +description: Learn about GX Core components and workflows and try out the GX Core Python library. hide_feedback_survey: true hide_title: true --- @@ -11,41 +11,25 @@ import LinkCard from '@site/src/components/LinkCard'; import OverviewCard from '@site/src/components/OverviewCard'; - Learn about the key features of Great Expectations (GX). Connect with the GX community, and try GX in Python using provided sample data. + Learn about key Great Expectations (GX) Core components and workflows. Use the GX Core Python library and provided sample data to create a data validation workflow. 
- - - - \ No newline at end of file diff --git a/docs/docusaurus/docs/core/introduction/overview_images/gx_overview.drawio b/docs/docusaurus/docs/core/introduction/overview_images/gx_overview.drawio new file mode 100644 index 000000000000..14c21976a8a4 --- /dev/null +++ b/docs/docusaurus/docs/core/introduction/overview_images/gx_overview.drawio @@ -0,0 +1,79 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/docs/docusaurus/docs/core/introduction/overview_images/gx_workflow_steps_and_components.png b/docs/docusaurus/docs/core/introduction/overview_images/gx_workflow_steps_and_components.png new file mode 100644 index 000000000000..b3c75b9f3955 Binary files /dev/null and b/docs/docusaurus/docs/core/introduction/overview_images/gx_workflow_steps_and_components.png differ diff --git a/docs/docusaurus/docs/core/introduction/try_gx.md b/docs/docusaurus/docs/core/introduction/try_gx.md index cc2f6b439642..4e1b77980be9 100644 --- a/docs/docusaurus/docs/core/introduction/try_gx.md +++ b/docs/docusaurus/docs/core/introduction/try_gx.md @@ -1,15 +1,16 @@ --- -title: Try Great Expectations +title: Try GX Core --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -import GxData from '../_core_components/_data.jsx'; import PrereqPythonInstalled from '../_core_components/prerequisites/_python_installation.md'; import ReleaseVersionBox from '../../components/versions/_gx_version_code_box.mdx' import GxCloudAdvert from '/static/docs/_static_components/_gx_cloud_advert.md' -Start here to learn how to connect to sample data, build an Expectation, validate sample data, and review Validation Results. This is an ideal place to start if you're new to {GxData.product_name} and want to experiment with features and see what it offers. +Start here to learn how to connect to data, create Expectations, validate data, and review Validation Results. 
This is an ideal place to start if you're new to GX Core and want to experiment with features and see what it offers. + +To complement your code exploration, check out the [GX Core overview](/core/introduction/gx_overview.md) for a primer on the GX Core components and workflow pattern used in the examples. ## Prerequisites @@ -17,110 +18,179 @@ Start here to learn how to connect to sample data, build an Expectation, validat ## Setup -{GxData.product_name} is a Python library you can install with the Python `pip` tool. +GX Core is a Python library you can install with the Python `pip` tool. -For more comprehensive guidance on setting up a Python environment, installing {GxData.product_name}, and installing additional dependencies for specific data formats and storage environments, see [Set up a GX environment](/core/installation_and_setup/install_gx.md). +For more comprehensive guidance on setting up a Python environment, installing GX Core, and installing additional dependencies for specific data formats and storage environments, see [Set up a GX environment](/core/installation_and_setup/install_gx.md). -1. Run the following terminal command to install the {GxData.product_name} library: +1. Run the following terminal command to install the GX Core library: ```bash title="Terminal input" pip install great_expectations ``` -2. Verify {GxData.product_name} installed successfully: +2. 
Verify GX Core installed successfully by running the command below in your Python interpreter, IDE, notebook, or script: - ```bash title="Terminal input" - great_expectations --version + ```python title="Python input" + import great_expectations as gx + + print(gx.__version__) ``` - The following output appears when {GxData.product_name} is successfully installed: + The following output appears when GX Core is successfully installed: -## Test features and functionality +## Sample data +The examples provided on this page use a sample of [NYC taxi trip record data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The sample data is provided using multiple mediums (CSV file, Postgres table) to support each workflow. + +When using the taxi data, you can make certain assumptions. For example: +* The passenger count should be greater than zero because at least one passenger needs to be present for a ride. And, taxis can accommodate a maximum of six passengers. +* Trip fares should be greater than zero. + +## Validate data in a DataFrame +This example workflow walks you through connecting to data in a Pandas DataFrame and validating the data using a single Expectation. + +:::tip Pandas install +This example requires that [Pandas](https://pandas.pydata.org/) is installed in the same Python environment where you are running GX Core. +::: - + -1. Import the `great_expectations` library and `expectations` module. +Run the following steps in a Python interpreter, IDE, notebook, or script. - The `great_expectations` module is the root of the GX library and contains shortcuts and convenience methods for starting a GX project in a Python session. +1. Import the `great_expectations` library. - The `expectations` module contains all the Expectation classes that are provided by the GX library. + The `great_expectations` module is the root of the GX Core library and contains shortcuts and convenience methods for starting a GX project in a Python session. 
- Run the following code in a Python interpreter, IDE, or script: + The `pandas` library is used to ingest sample data for this example. - ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx.py imports" + ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py import gx library" ``` -3. Create a temporary Data Context and connect to sample data. +2. Download and read the sample data into a Pandas DataFrame. - In Python, a Data Context provides the API for interacting with many common GX objects. + ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py import sample data" + ``` + +3. Create a Data Context. - Run the following code to initialize a Data Context and then use it to read the contents of a `.csv` file into a Batch of sample data: + A Data Context object serves as the entrypoint for interacting with GX components. - ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx.py set up" + ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py create data context" ``` - - You'll use this sample data to test your Expectations. -3. Create an Expectation. +4. Connect to data and create a Batch. - Expectations are a fundamental component of GX. They allow you to explicitly define the state to which your data should conform. + Define a Data Source, Data Asset, Batch Definition, and Batch. The Pandas DataFrame is provided to the Batch Definition at runtime to create the Batch. + + ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py connect to data and get batch" + ``` + +5. Create an Expectation. - The sample data you're using is taxi trip record data. With this data, you can make certain assumptions. For example, the passenger count shouldn't be zero because at least one passenger needs to be present. 
Additionally, a taxi can accomodate a maximum of six passengers.
+   Expectations are a fundamental component of GX. They allow you to explicitly define the state to which your data should conform.
-   Run the following code to define an Expectation that the contents of the column `passenger_count` consist of values ranging from `1` to `6`:
+   Run the following code to define an Expectation that the contents of the column `passenger_count` consist of values ranging from `1` to `6`:
-   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx.py create an expectation"
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py create expectation"
    ```
-4. Run the following code to validate the sample data against your Expectation and view the results:
+6. Run the following code to validate the sample data against your Expectation and view the results:
-   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx.py validate and view results"
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py validate batch"
    ```
    The sample data conforms to the defined Expectation and the following Validation Results are returned:
-   ```python title="Python output" name="docs/docusaurus/docs/core/introduction/try_gx.py output1"
+   ```python title="Python output" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py passing output"
+   ```
+
+
+
+```python title="Full example code" name="docs/docusaurus/docs/core/introduction/try_gx_exploratory.py full exploratory script"
+```
+
+
+
+
+
+## Validate data in a SQL table
+This example workflow walks you through connecting to data in a Postgres table, creating an Expectation Suite, and setting up a Checkpoint to validate the data.
+
+
+
+
+
+Run the following steps in a Python interpreter, IDE, notebook, or script.
+
+1. Import the `great_expectations` library.
+
+   The `great_expectations` module is the root of the GX Core library and contains shortcuts and convenience methods for starting a GX project in a Python session.
+
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py import gx library"
    ```
-5. Optional. Create an Expectation that will fail when validated against the provided data.
+2. Create a Data Context.
-   A failed Expectation lets you know there is something wrong with the data, such as missing or incorrect values, or there is a misunderstanding about the data.
+   A Data Context object serves as the entrypoint for interacting with GX components.
-   Run the following code to create an Expectation that fails because it assumes that a taxi can seat a maximum of three passengers:
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py create data context"
+   ```
+
+3. Connect to data and create a Batch.
+
+   Define a Data Source, Data Asset, Batch Definition, and Batch. The connection string is used by the Data Source to connect to the cloud Postgres database hosting the sample data.
-   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx.py validate and view failed results"
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py connect to data and get batch"
    ```
-   When an Expectation fails, the Validation Results of the failed Expectation include metrics to help you assess the severity of the issue:
+4. Create an Expectation Suite.
-   ```python title="Python output" name="docs/docusaurus/docs/core/introduction/try_gx.py failed output"
+   Expectations are a fundamental component of GX. They allow you to explicitly define the state to which your data should conform. Expectation Suites are collections of Expectations.
+
+   Run the following code to define an Expectation Suite containing two Expectations.
The first Expectation expects that the column `passenger_count` consists of values ranging from `1` to `6`, and the second expects that the column `fare_amount` contains non-negative values.
+
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py create expectation suite"
    ```
-   To reduce the size of the report and make it easier to review, only a portion of the failed values and record indexes are included in the Validation Results. The failed counts and percentages correspond to the failed records in the validated data.
+5. Create a Validation Definition.
-6. Optional. Go to the [Expectations Gallery](https://greatexpectations.io/expectations) and experiment with other Expectations.
+   The Validation Definition explicitly ties the Batch of data to be validated to the Expectation Suite used to validate it.
-
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py create validation definition"
+   ```
-
+6. Create and run a Checkpoint to validate the data based on the supplied Validation Definition. `.describe()` is a convenience method to view a summary of the Checkpoint results.
-```python title="Full example script" name="docs/docusaurus/docs/core/introduction/try_gx.py full example script"
-```
+   ```python title="Python input" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py create and run checkpoint"
+   ```
+
+The returned results show that one Expectation passed and one failed.
+
+When an Expectation fails, the Validation Results of the failed Expectation include metrics to help you assess the severity of the issue:
+
+   ```python title="Python output" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py checkpoint result"
+   ```
+
+To reduce the size of the results and make it easier to review, only a portion of the failed values and record indexes are included in the Checkpoint results.
The failed counts and percentages correspond to the failed records in the validated data.
+
+```python title="Full example code" name="docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py full end2end script"
+```
+
-## Next steps
-
+## Next steps
-- To learn more about {GxData.product_name}, see [Community resources](/core/introduction/community_resources.md).
+- Go to the [Expectations Gallery](https://greatexpectations.io/expectations) and experiment with other Expectations.
-- If you're ready to start using {GxData.product_name} with your own data, the [Set up a GX environment](/core/installation_and_setup/install_gx.md) documentation provides a more comprehensive guide to setting up GX to work with specific data formats and environments.
+- If you're ready to start using GX Core with your own data, the [Set up a GX environment](/core/installation_and_setup/install_gx.md) documentation provides a more comprehensive guide to setting up GX to work with specific data formats and environments.
+-
\ No newline at end of file
diff --git a/docs/docusaurus/docs/core/introduction/try_gx.py b/docs/docusaurus/docs/core/introduction/try_gx.py
deleted file mode 100644
index ce0eb5511294..000000000000
--- a/docs/docusaurus/docs/core/introduction/try_gx.py
+++ /dev/null
@@ -1,164 +0,0 @@
-"""
-This example script allows the user to try out GX by validating Expectations
-    against sample data.
-
-The snippet tags are used to insert the corresponding code into the
-    Great Expectations documentation. They can be disregarded by anyone
-    reviewing this script.
-"""
-
-#
-# Import required modules from the GX library
-#
-import great_expectations as gx
-import great_expectations.expectations as gxe
-
-#
-
-# Create a temporary Data Context and connect to provided sample data.
-#
-context = gx.get_context()
-batch = context.data_sources.pandas_default.read_csv(
-    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
-)
-#
-
-# Create an Expectation
-# highlight-start
-#
-expectation = gxe.ExpectColumnValuesToBeBetween(
-    column="passenger_count", min_value=1, max_value=6
-)
-#
-# highlight-end
-
-# Validate the sample data against your Expectation and view the results
-# highlight-start
-#
-validation_result = batch.validate(expectation)
-print(validation_result.describe())
-#
-# highlight-end
-#
-
-output1 = """
-#
-{
-  "type": "expect_column_values_to_be_between",
-  "success": true,
-  "kwargs": {
-    "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
-    "column": "passenger_count",
-    "min_value": 1.0,
-    "max_value": 6.0
-  },
-  "result": {
-    "element_count": 10000,
-    "unexpected_count": 0,
-    "unexpected_percent": 0.0,
-    "partial_unexpected_list": [],
-    "missing_count": 0,
-    "missing_percent": 0.0,
-    "unexpected_percent_total": 0.0,
-    "unexpected_percent_nonmissing": 0.0,
-    "partial_unexpected_counts": [],
-    "partial_unexpected_index_list": []
-  }
-}
-#
-"""
-
-output1 = output1.split(">", maxsplit=1)[1].split("#", maxsplit=1)[0].strip()
-assert validation_result.describe() == output1
-
-
-#
-# highlight-start
-failed_expectation = gxe.ExpectColumnValuesToBeBetween(
-    column="passenger_count", min_value=1, max_value=3
-)
-# highlight-end
-failed_validation_result = batch.validate(failed_expectation)
-print(failed_validation_result.describe())
-#
-
-failed_output = """
-#
-{
-  "type": "expect_column_values_to_be_between",
-  "success": false,
-  "kwargs": {
-    "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset",
-    "column": "passenger_count",
-    "min_value": 1.0,
-    "max_value": 3.0
-  },
-  "result": {
-    "element_count": 10000,
-    "unexpected_count": 853,
-    "unexpected_percent": 8.53,
-    "partial_unexpected_list": [
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4,
-      4
-    ],
-    "missing_count": 0,
-    "missing_percent": 0.0,
-    "unexpected_percent_total": 8.53,
-    "unexpected_percent_nonmissing": 8.53,
-    "partial_unexpected_counts": [
-      {
-        "value": 4,
-        "count": 20
-      }
-    ],
-    "partial_unexpected_index_list": [
-      9147,
-      9148,
-      9149,
-      9150,
-      9151,
-      9152,
-      9153,
-      9154,
-      9155,
-      9156,
-      9157,
-      9158,
-      9159,
-      9160,
-      9161,
-      9162,
-      9163,
-      9164,
-      9165,
-      9166
-    ]
-  }
-}
-#
-"""
-
-# This section removes the snippet tags from the failed_output string and then verifies
-#   that the script ran as expected. It can be disregarded.
-failed_output = (
-    failed_output.split(">", maxsplit=1)[1].split("#", maxsplit=1)[0].strip()
-)
-assert failed_validation_result.describe() == failed_output
diff --git a/docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py b/docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py
new file mode 100644
index 000000000000..dd37266f8ec5
--- /dev/null
+++ b/docs/docusaurus/docs/core/introduction/try_gx_end_to_end.py
@@ -0,0 +1,189 @@
+"""
+This example script allows the user to try out GX by validating Expectations
+    against sample data.
+
+The snippet tags are used to insert the corresponding code into the
+    Great Expectations documentation. They can be disregarded by anyone
+    reviewing this script.
+"""
+
+#
+
+# Import required modules from GX library.
+#
+import great_expectations as gx
+
+#
+# Create Data Context.
+#
+context = gx.get_context()
+#
+
+# Connect to data.
+# Create Data Source, Data Asset, Batch Definition, and Batch.
+#
+connection_string = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_example_db"
+
+data_source = context.data_sources.add_postgres(
+    "postgres db", connection_string=connection_string
+)
+data_asset = data_source.add_table_asset(name="taxi data", table_name="nyc_taxi_data")
+
+batch_definition = data_asset.add_batch_definition_whole_table("batch definition")
+batch = batch_definition.get_batch()
+#
+
+# Create Expectation Suite containing two Expectations.
+#
+suite = context.suites.add(
+    gx.core.expectation_suite.ExpectationSuite(name="expectations")
+)
+suite.add_expectation(
+    gx.expectations.ExpectColumnValuesToBeBetween(
+        column="passenger_count", min_value=1, max_value=6
+    )
+)
+suite.add_expectation(
+    gx.expectations.ExpectColumnValuesToBeBetween(column="fare_amount", min_value=0)
+)
+#
+
+# Create Validation Definition.
+#
+validation_definition = context.validation_definitions.add(
+    gx.core.validation_definition.ValidationDefinition(
+        name="validation definition",
+        data=batch_definition,
+        suite=suite,
+    )
+)
+#
+
+# Create Checkpoint, run Checkpoint, and capture result.
+#
+checkpoint = context.checkpoints.add(
+    gx.checkpoint.checkpoint.Checkpoint(
+        name="checkpoint", validation_definitions=[validation_definition]
+    )
+)
+
+checkpoint_result = checkpoint.run()
+print(checkpoint_result.describe())
+#
+
+#
+# Above snippet ends the full end-to-end script.
+
+end_to_end_output = """
+#
+{
+  "success": false,
+  "statistics": {
+    "evaluated_validations": 1,
+    "success_percent": 0.0,
+    "successful_validations": 0,
+    "unsuccessful_validations": 1
+  },
+  "validation_results": [
+    {
+      "success": false,
+      "statistics": {
+        "evaluated_expectations": 2,
+        "successful_expectations": 1,
+        "unsuccessful_expectations": 1,
+        "success_percent": 50.0
+      },
+      "expectations": [
+        {
+          "expectation_type": "expect_column_values_to_be_between",
+          "success": true,
+          "kwargs": {
+            "batch_id": "postgres db-taxi data",
+            "column": "passenger_count",
+            "min_value": 1.0,
+            "max_value": 6.0
+          },
+          "result": {
+            "element_count": 20000,
+            "unexpected_count": 0,
+            "unexpected_percent": 0.0,
+            "partial_unexpected_list": [],
+            "missing_count": 0,
+            "missing_percent": 0.0,
+            "unexpected_percent_total": 0.0,
+            "unexpected_percent_nonmissing": 0.0,
+            "partial_unexpected_counts": []
+          }
+        },
+        {
+          "expectation_type": "expect_column_values_to_be_between",
+          "success": false,
+          "kwargs": {
+            "batch_id": "postgres db-taxi data",
+            "column": "fare_amount",
+            "min_value": 0.0
+          },
+          "result": {
+            "element_count": 20000,
+            "unexpected_count": 14,
+            "unexpected_percent": 0.06999999999999999,
+            "partial_unexpected_list": [
+              -0.01,
+              -52.0,
+              -0.1,
+              -5.5,
+              -3.0,
+              -52.0,
+              -4.0,
+              -0.01,
+              -52.0,
+              -0.1,
+              -5.5,
+              -3.0,
+              -52.0,
+              -4.0
+            ],
+            "missing_count": 0,
+            "missing_percent": 0.0,
+            "unexpected_percent_total": 0.06999999999999999,
+            "unexpected_percent_nonmissing": 0.06999999999999999,
+            "partial_unexpected_counts": [
+              {
+                "value": -52.0,
+                "count": 4
+              },
+              {
+                "value": -5.5,
+                "count": 2
+              },
+              {
+                "value": -4.0,
+                "count": 2
+              },
+              {
+                "value": -3.0,
+                "count": 2
+              },
+              {
+                "value": -0.1,
+                "count": 2
+              },
+              {
+                "value": -0.01,
+                "count": 2
+              }
+            ]
+          }
+        }
+      ],
+      "result_url": null
+    }
+  ]
+}
+#
+"""
+
+checkpoint_summary = checkpoint_result.describe_dict()
+
+assert checkpoint_summary["success"] is False
+assert
len(checkpoint_summary["validation_results"][0]["expectations"]) == 2
diff --git a/docs/docusaurus/docs/core/introduction/try_gx_exploratory.py b/docs/docusaurus/docs/core/introduction/try_gx_exploratory.py
new file mode 100644
index 000000000000..74c33ea2ec7a
--- /dev/null
+++ b/docs/docusaurus/docs/core/introduction/try_gx_exploratory.py
@@ -0,0 +1,100 @@
+"""
+This example script allows the user to try out GX by validating Expectations
+    against sample data.

+The snippet tags are used to insert the corresponding code into the
+    Great Expectations documentation. They can be disregarded by anyone
+    reviewing this script.
+"""
+# ruff: noqa: I001
+# Adding noqa rule so that GX and Pandas imports don't get reordered by linter.
+
+#
+
+# Import required modules from GX library.
+#
+import great_expectations as gx
+
+import pandas as pd
+#
+
+# Create Data Context.
+#
+context = gx.get_context()
+#
+
+# Import sample data into Pandas DataFrame.
+#
+df = pd.read_csv(
+    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
+)
+#
+
+# Connect to data.
+# Create Data Source, Data Asset, Batch Definition, and Batch.
+#
+data_source = context.data_sources.add_pandas("pandas")
+data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")
+
+batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
+batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
+#
+
+# Create Expectation.
+#
+expectation = gx.expectations.ExpectColumnValuesToBeBetween(
+    column="passenger_count", min_value=1, max_value=6
+)
+#
+
+# Validate Batch using Expectation.
+#
+validation_result = batch.validate(expectation)
+#
+
+#
+# Above snippet ends the full exploratory script.
+
+exploratory_output = """
+#
+{
+  "success": true,
+  "expectation_config": {
+    "type": "expect_column_values_to_be_between",
+    "kwargs": {
+      "batch_id": "pandas-pd dataframe asset",
+      "column": "passenger_count",
+      "min_value": 1.0,
+      "max_value": 6.0
+    },
+    "meta": {}
+  },
+  "result": {
+    "element_count": 10000,
+    "unexpected_count": 0,
+    "unexpected_percent": 0.0,
+    "partial_unexpected_list": [],
+    "missing_count": 0,
+    "missing_percent": 0.0,
+    "unexpected_percent_total": 0.0,
+    "unexpected_percent_nonmissing": 0.0,
+    "partial_unexpected_counts": [],
+    "partial_unexpected_index_list": []
+  },
+  "meta": {},
+  "exception_info": {
+    "raised_exception": false,
+    "exception_traceback": null,
+    "exception_message": null
+  }
+}
+#
+"""
+
+# Test workflow output with passing Expectation.
+assert validation_result["success"] is True
+assert (
+    validation_result["expectation_config"]["type"]
+    == "expect_column_values_to_be_between"
+)
+assert validation_result["result"]["element_count"] == 10_000
diff --git a/docs/docusaurus/docs/gx_welcome.md b/docs/docusaurus/docs/gx_welcome.md
index 02d1c091ba54..d327c88618b8 100644
--- a/docs/docusaurus/docs/gx_welcome.md
+++ b/docs/docusaurus/docs/gx_welcome.md
@@ -24,7 +24,7 @@ import GXCard from '@site/src/components/GXCard';
-
+
diff --git a/docs/docusaurus/docs/resources/get_support.md b/docs/docusaurus/docs/resources/get_support.md
index 7f7f7bbbc7e4..8a910e2d3732 100644
--- a/docs/docusaurus/docs/resources/get_support.md
+++ b/docs/docusaurus/docs/resources/get_support.md
@@ -32,7 +32,7 @@
 Search the docs you're using currently for an answer to your issue or question.
 If you're new to GX Cloud, review [About GX Cloud](/cloud/about_gx.md).
-If you're new to GX OSS, see [About Great Expectations](/core/introduction/about_gx.md).
+If you're new to GX Core, see the [Introduction to GX Core](/core/introduction/introduction.md).
### Include all the relevant information
diff --git a/docs/docusaurus/sidebars.js b/docs/docusaurus/sidebars.js
index bb82b4a76051..e7b12c2684f7 100644
--- a/docs/docusaurus/sidebars.js
+++ b/docs/docusaurus/sidebars.js
@@ -2,29 +2,19 @@ module.exports = {
   gx_core: [
     {
       type: 'category',
-      label: 'Introduction to Great Expectations',
+      label: 'Introduction to GX Core',
       link: {type: 'doc', id: 'core/introduction/introduction'},
       items: [
-        {
-          type: 'doc',
-          id: 'core/introduction/about_gx',
-          label: 'About GX'
-        },
         {
           type: 'doc',
           id: 'core/introduction/gx_overview',
-          label: 'GX overview'
+          label: 'GX Core overview'
        },
        {
          type: 'doc',
          id: 'core/introduction/try_gx',
-          label: 'Try GX'
-        },
-        {
-          type: 'doc',
-          id: 'core/introduction/community_resources',
-          label: 'Community resources'
-        },
+          label: 'Try GX Core'
+        }
      ],
    },
    {
@@ -149,6 +139,11 @@ module.exports = {
       id: 'oss/changelog',
       label: 'Changelog'
     },
+    {
+      type: 'doc',
+      id: 'core/introduction/community_resources',
+      label: 'Community resources'
+    }
   ],
   gx_cloud: [
     {type: 'doc', id: 'cloud/why_gx_cloud'},