Implement scan database command (#292)

* Implement database_coverage command
* Add tests for find_missing_dataset_fields
* Change database-coverage to only count uncategorized fields
* Fix issues merging from main
* Change command to use new name / format
* Add integration test for database_coverage
* Change command to use percent not decimal
* Add manifest-dir option and check for all datasets now
* Small fixes and ci violations
* Fix broken tests
* doc strings updates
* moved the coverage percentage to the bottom of the error message
* add integration tests for manifest, update coverage used in tests to int
* Apply suggestions from code review: small docs change (Co-authored-by: Phil Salant <[email protected]>)
* Address comments, mostly refactoring
* Fix command formatting after some testing
* a few output nits
* Update docs to include a guide for using the scan command
* Rename guides to How-To Guides

Co-authored-by: Eduardo Armendariz <[email protected]>
Co-authored-by: Thomas La Piana <[email protected]>
Co-authored-by: SteveDMurphy <[email protected]>
Co-authored-by: Thomas La Piana <[email protected]>
Co-authored-by: Phil Salant <[email protected]>
ethyca · Jan 5, 2022 · 90bda47 · 90bda47
1 parent 840e1de
commit 90bda47
Showing 8 changed files with 686 additions and 95 deletions.
diff --git a/docs/fides/docs/guides/generate_dataset.md b/docs/fides/docs/guides/generate_dataset.md
@@ -0,0 +1,69 @@
+# Generating a Dataset
+
+As an alternative to manually creating dataset resource files like in our [tutorial](../tutorial/dataset.md), it is possible to generate these files using the `generate-dataset` CLI command. The CLI will connect to a given resource and automatically generate a non-annotated resource YAML file in the specified location, based on the database schema. 
+
+Not only is this the simplest way to begin annotating your resources, but the generated file also follows the format fidesctl expects. This matters because some commands, like `scan`, expect resources to follow this format. 
+
+# Working With a Database
+
+The `generate-dataset` command can connect to a database and automatically generate a resource YAML file. Given a database schema with a single `users` table as follows:
+
+```shell
+flaskr=# SELECT * FROM users;
+ id |     created_at      |       email       |              password              | first_name | last_name
+----+---------------------+-------------------+------------------------------------+------------+-----------
+  1 | 2020-01-01 00:00:00 | [email protected] | pbkdf2:sha256:260000$O87nanbSkl... | Admin      | User
+  2 | 2020-01-03 00:00:00 | [email protected]  | pbkdf2:sha256:260000$PGcBy5NzZe... | Example    | User
+(2 rows)
+```
+
+We can invoke `generate-dataset` by providing a connection URL for this database and the path of the file to write:
+```sh
+./venv/bin/fidesctl generate-dataset \
+  postgresql://postgres:postgres@localhost:5432/flaskr \
+  fides_resources/flaskr_postgres_dataset.yml
+```
+
+The result is a resource file with a dataset with collections and fields to represent our schema:
+```
+dataset:
+- fides_key: public
+  organization_fides_key: default_organization
+  name: public
+  description: 'Fides Generated Description for Schema: public'
+  meta: null
+  data_categories: []
+  data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+  collections:
+  - name: public.users
+    description: 'Fides Generated Description for Table: public.users'
+    data_categories: []
+    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    fields:
+    - name: created_at
+      description: 'Fides Generated Description for Column: created_at'
+      data_categories: []
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: email
+      description: 'Fides Generated Description for Column: email'
+      data_categories: []
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: first_name
+      description: 'Fides Generated Description for Column: first_name'
+      data_categories: []
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: id
+      description: 'Fides Generated Description for Column: id'
+      data_categories: []
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: last_name
+      description: 'Fides Generated Description for Column: last_name'
+      data_categories: []
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: password
+      description: 'Fides Generated Description for Column: password'
+      data_categories: []
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+```
+<!-- TODO: Link to the `annotate-dataset` usage documentation below, when it exists. -->
+The resulting file still requires annotating the dataset with data categories to represent what is stored. 
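
The generated file in the guide above has a regular shape: one dataset per schema, one collection per table, and one unannotated field per column. The following is a minimal sketch of that mapping, not the fidesctl implementation. `build_dataset` is a hypothetical helper, and it assumes the table and column names are already available as a plain dictionary rather than being read from a live database.

```python
from typing import Dict, List

# Default qualifier copied from the generated file shown in the guide.
DEFAULT_QUALIFIER = (
    "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
)


def build_dataset(schema: str, tables: Dict[str, List[str]]) -> dict:
    """Hypothetical helper (not part of fidesctl): build an unannotated dataset."""
    collections = []
    for table in sorted(tables):
        collection_name = f"{schema}.{table}"
        collections.append(
            {
                "name": collection_name,
                "description": f"Fides Generated Description for Table: {collection_name}",
                "data_categories": [],
                "data_qualifier": DEFAULT_QUALIFIER,
                "fields": [
                    {
                        "name": column,
                        "description": f"Fides Generated Description for Column: {column}",
                        "data_categories": [],  # left empty for later annotation
                        "data_qualifier": DEFAULT_QUALIFIER,
                    }
                    for column in sorted(tables[table])
                ],
            }
        )
    return {
        "fides_key": schema,
        "organization_fides_key": "default_organization",
        "name": schema,
        "description": f"Fides Generated Description for Schema: {schema}",
        "meta": None,
        "data_categories": [],
        "data_qualifier": DEFAULT_QUALIFIER,
        "collections": collections,
    }


dataset = build_dataset(
    "public",
    {"users": ["id", "created_at", "email", "password", "first_name", "last_name"]},
)
field_names = [field["name"] for field in dataset["collections"][0]["fields"]]
print(field_names)
```

Every `data_categories` list starts empty, which is why the generated file still needs annotating before a coverage check can pass.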
diff --git a/docs/fides/docs/guides/scan_resource.md b/docs/fides/docs/guides/scan_resource.md
@@ -0,0 +1,81 @@
+# Scanning a Resource
+
+As you annotate resources with fidesctl, it is important to keep them up to date. The `scan` command compares a resource, such as a database, against what is defined in your fidesctl server or resource files, and outputs any part of the resource that is not defined or not categorized. The command exits with an error if the coverage threshold is not met. 
+
+The `scan` command works best when used in tandem with the `generate-dataset` command, because `generate-dataset` creates resources in the expected format. Datasets must follow the fidesctl format for coverage to be tracked. 
+
+# Scanning a Database
+
+The `scan` command can connect to a database and compare its schema to your already defined resources. Given a database schema with a single `users` table as follows:
+
+```shell
+flaskr=# SELECT * FROM users;
+ id |     created_at      |       email       |              password              | first_name | last_name
+----+---------------------+-------------------+------------------------------------+------------+-----------
+  1 | 2020-01-01 00:00:00 | [email protected] | pbkdf2:sha256:260000$O87nanbSkl... | Admin      | User
+  2 | 2020-01-03 00:00:00 | [email protected]  | pbkdf2:sha256:260000$PGcBy5NzZe... | Example    | User
+(2 rows)
+```
+
+We have fully annotated this schema before with the following dataset resource file:
+```
+dataset:
+- fides_key: public
+  organization_fides_key: default_organization
+  name: public
+  description: 'Fides Generated Description for Schema: public'
+  meta: null
+  data_categories: []
+  data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+  collections:
+  - name: public.users
+    description: 'Fides Generated Description for Table: public.users'
+    data_categories: []
+    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    fields:
+    - name: created_at
+      description: 'Fides Generated Description for Column: created_at'
+      data_categories: [system.operations]
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: email
+      description: 'Fides Generated Description for Column: email'
+      data_categories: [user.provided.identifiable.contact.email]
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: first_name
+      description: 'Fides Generated Description for Column: first_name'
+      data_categories: [user.provided.identifiable.name]
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: id
+      description: 'Fides Generated Description for Column: id'
+      data_categories: [user.derived.identifiable.unique_id]
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: last_name
+      description: 'Fides Generated Description for Column: last_name'
+      data_categories: [user.provided.identifiable.name]
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+    - name: password
+      description: 'Fides Generated Description for Column: password'
+      data_categories: [user.provided.identifiable.credentials.password]
+      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
+```
+
+fidesctl scan --manifest-dir dataset.yml database postgresql+psycopg2://postgres:fidesctl@fidesctl-db:5432/postgres 
+
+
+We can invoke `scan` by providing the manifest location and a connection URL for this database:
+```sh
+./venv/bin/fidesctl scan \
+  --manifest-dir dataset.yml \
+  database \
+  postgresql+psycopg2://postgres:fidesctl@fidesctl-db:5432/postgres
+```
+
+The command output confirms our database resource is covered fully!
+```sh
+Loading resource manifests from: dataset.yml
+Taxonomy successfully created.
+Successfully scanned the following datasets:
+	public
+
+Annotation coverage: 100%
+```
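
The coverage rule described in the guide and in the `scan` docstring (a field counts as covered only if it is listed in a dataset and has at least one data category) can be sketched as follows. This is an illustration under stated assumptions, not the code added in this commit: `find_uncovered_fields`, `coverage_percent`, and `meets_threshold` are hypothetical names, database fields are assumed to be given as `collection.field` strings, and rounding down to a whole percent is an assumption.

```python
from typing import List, Set


def find_uncovered_fields(db_fields: Set[str], dataset: dict) -> List[str]:
    """Return database fields that are missing from the dataset or uncategorized."""
    covered = set()
    for collection in dataset.get("collections", []):
        for field in collection.get("fields", []):
            # An empty data_categories list does not count as coverage.
            if field.get("data_categories"):
                covered.add(f"{collection['name']}.{field['name']}")
    return sorted(db_fields - covered)


def coverage_percent(total_fields: int, uncovered_fields: int) -> int:
    """Whole-number percentage of covered fields (rounded down, by assumption)."""
    if total_fields == 0:
        return 100  # nothing to cover
    return (total_fields - uncovered_fields) * 100 // total_fields


def meets_threshold(coverage: int, threshold: int) -> bool:
    """True when coverage reaches the threshold; a CLI would exit non-zero otherwise."""
    return coverage >= threshold


db_fields = {"public.users.id", "public.users.email", "public.users.password"}
dataset = {
    "collections": [
        {
            "name": "public.users",
            "fields": [
                {"name": "id", "data_categories": ["user.derived.identifiable.unique_id"]},
                {"name": "email", "data_categories": []},  # listed but uncategorized
                # "password" is not listed at all
            ],
        }
    ]
}

uncovered = find_uncovered_fields(db_fields, dataset)
coverage = coverage_percent(len(db_fields), len(uncovered))
print(uncovered, coverage, meets_threshold(coverage, 100))
```

In this example both `email` (listed but uncategorized) and `password` (not listed) count against coverage, so a threshold of 100 would not be met.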
diff --git a/docs/fides/mkdocs.yml b/docs/fides/mkdocs.yml
@@ -30,6 +30,9 @@ nav:
       - Write a Policy: tutorial/policy.md
       - Add Google Analytics: tutorial/google.md
       - Manage Google Analytics with Fidesctl: tutorial/pass.md
+  - How-To Guides: 
+      - Generating a Dataset: guides/generate_dataset.md
+      - Scanning a Resource: guides/scan_resource.md
   - CI/CD Reference: ci_reference.md
   - Fides Language:
       - Overview: language/overview.md

diff --git a/fidesctl/src/fidesctl/cli/__init__.py b/fidesctl/src/fidesctl/cli/__init__.py
@@ -7,6 +7,7 @@
     delete,
     evaluate,
     generate_dataset,
+    scan,
     annotate_dataset,
     get,
     init_db,
@@ -26,6 +27,7 @@
     apply,
     delete,
     generate_dataset,
+    scan,
     get,
     init_db,
     ls,

diff --git a/fidesctl/src/fidesctl/cli/cli.py b/fidesctl/src/fidesctl/cli/cli.py
@@ -140,6 +140,39 @@ def generate_dataset(
     _generate_dataset.generate_dataset(connection_string, output_filename)
 
 
+@click.command()
+@click.pass_context
+@click.argument("source_type", type=click.Choice(["database"]))
+@click.argument("connection_string", type=str)
+@click.option("-m", "--manifest-dir", type=str, default="")
+@click.option("-c", "--coverage-threshold", type=click.IntRange(0, 100), default=100)
+def scan(
+    ctx: click.Context,
+    source_type: str,
+    connection_string: str,
+    manifest_dir: str,
+    coverage_threshold: int,
+) -> None:
+    """
+    Connect to a database directly via a SQLAlchemy-style connection string and
+    compare the database objects to existing datasets.
+
+    If there are fields within the database that aren't listed and categorized
+    within one of the datasets, this counts as lacking coverage.
+
+    Outputs missing fields and has a non-zero exit if coverage is
+    under the stated threshold.
+    """
+    config = ctx.obj["CONFIG"]
+    _generate_dataset.database_coverage(
+        connection_string=connection_string,
+        manifest_dir=manifest_dir,
+        coverage_threshold=coverage_threshold,
+        url=config.cli.server_url,
+        headers=config.user.request_headers,
+    )
+
+
 @click.command()
 @click.pass_context
 @click.argument("input_filename", type=str)