Implement scan database command (#292)
* Implement database_coverage command

* Add tests for find_missing_dataset_fields

* Change database-coverage to only count uncategorized fields

* Fix issues merging from main

* Change command to use new name / format

* Add integration test for database_coverage

* Change command to use percent not decimal

* Add manifest-dir option and check for all datasets now

* Small fixes and ci violations

* Fix broken tests

* Docstring updates

* Move the coverage percentage to the bottom of the error message

* Add integration tests for manifest; update coverage used in tests to int

* Apply suggestions from code review

small docs change

Co-authored-by: Phil Salant <[email protected]>

* Address comments, mostly refactoring

* Fix command formatting after some testing

* A few output nits

* Update docs to include a guide for using the scan command

* Rename guides to How-To Guides

Co-authored-by: Eduardo Armendariz <[email protected]>
Co-authored-by: Thomas La Piana <[email protected]>
Co-authored-by: SteveDMurphy <[email protected]>
Co-authored-by: Thomas La Piana <[email protected]>
Co-authored-by: Phil Salant <[email protected]>
6 people authored Jan 5, 2022
1 parent 840e1de commit 90bda47
Showing 8 changed files with 686 additions and 95 deletions.
69 changes: 69 additions & 0 deletions docs/fides/docs/guides/generate_dataset.md
@@ -0,0 +1,69 @@
# Generating a Dataset

As an alternative to manually creating dataset resource files as in our [tutorial](../tutorial/dataset.md), you can generate these files with the `generate-dataset` CLI command. The CLI connects to a given resource and, based on the database schema, automatically generates a non-annotated resource YAML file in the specified location.

Not only is this the simplest way to begin annotating your resources, but the output also follows the expected fidesctl format. This matters because some commands, such as `scan`, expect resources in this format.

## Working With a Database

The `generate-dataset` command can connect to a database and automatically generate a resource YAML file. Given a database schema with a single `users` table:

```shell
flaskr=# SELECT * FROM users;
id | created_at | email | password | first_name | last_name
----+---------------------+-------------------+------------------------------------+------------+-----------
1 | 2020-01-01 00:00:00 | [email protected] | pbkdf2:sha256:260000$O87nanbSkl... | Admin | User
2 | 2020-01-03 00:00:00 | [email protected] | pbkdf2:sha256:260000$PGcBy5NzZe... | Example | User
(2 rows)
```

We can invoke `generate-dataset` by providing a connection URL for this database:
```sh
./venv/bin/fidesctl generate-dataset \
  postgresql://postgres:postgres@localhost:5432/flaskr \
  fides_resources/flaskr_postgres_dataset.yml
```

The result is a resource file containing a dataset, with collections and fields that represent our schema:
```yaml
dataset:
- fides_key: public
  organization_fides_key: default_organization
  name: public
  description: 'Fides Generated Description for Schema: public'
  meta: null
  data_categories: []
  data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
  collections:
  - name: public.users
    description: 'Fides Generated Description for Table: public.users'
    data_categories: []
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    fields:
    - name: created_at
      description: 'Fides Generated Description for Column: created_at'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: email
      description: 'Fides Generated Description for Column: email'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: first_name
      description: 'Fides Generated Description for Column: first_name'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: id
      description: 'Fides Generated Description for Column: id'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: last_name
      description: 'Fides Generated Description for Column: last_name'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: password
      description: 'Fides Generated Description for Column: password'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
```
<!-- TODO: Link to the `annotate-dataset` usage documentation below, when it exists. -->
The resulting file still requires the dataset to be annotated with data categories describing what is stored.
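As an illustrative excerpt, reusing the categories that appear later in the scanning guide, the generated `email` field might be annotated like this (a sketch only; choose the categories that actually apply to your data):

```yaml
    fields:
    - name: email
      description: 'Fides Generated Description for Column: email'
      # Annotation added by hand after generation:
      data_categories: [user.provided.identifiable.contact.email]
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
```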
81 changes: 81 additions & 0 deletions docs/fides/docs/guides/scan_resource.md
@@ -0,0 +1,81 @@
# Scanning a Resource

As you annotate resources with fidesctl, it is important to keep those resources up to date. The `scan` command compares a data source against what is defined in your fidesctl server or resource files. It outputs any part of the dataset that is not defined or categorized, and it exits with an error if a coverage threshold is not met.

The `scan` command works best in tandem with the `generate-dataset` command, which creates resources in the expected format. The fidesctl dataset format must be followed in order to track coverage.
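For reference, a typical invocation takes the following shape. The manifest path and threshold value below are placeholders; `database` is currently the only supported source type, and `--coverage-threshold` defaults to 100.

```sh
# Scan a database and compare it against locally defined datasets.
#   -m / --manifest-dir        path to local resource manifests to include
#   -c / --coverage-threshold  minimum acceptable coverage percentage (0-100)
fidesctl scan \
  --manifest-dir fides_resources/ \
  --coverage-threshold 90 \
  database \
  postgresql://postgres:postgres@localhost:5432/flaskr
```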

## Scanning a Database

The `scan` command can connect to a database and compare its schema to your already-defined resources. Given a database schema with a single `users` table:

```shell
flaskr=# SELECT * FROM users;
id | created_at | email | password | first_name | last_name
----+---------------------+-------------------+------------------------------------+------------+-----------
1 | 2020-01-01 00:00:00 | [email protected] | pbkdf2:sha256:260000$O87nanbSkl... | Admin | User
2 | 2020-01-03 00:00:00 | [email protected] | pbkdf2:sha256:260000$PGcBy5NzZe... | Example | User
(2 rows)
```

We have previously annotated this schema in full with the following dataset resource file:
```yaml
dataset:
- fides_key: public
  organization_fides_key: default_organization
  name: public
  description: 'Fides Generated Description for Schema: public'
  meta: null
  data_categories: []
  data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
  collections:
  - name: public.users
    description: 'Fides Generated Description for Table: public.users'
    data_categories: []
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    fields:
    - name: created_at
      description: 'Fides Generated Description for Column: created_at'
      data_categories: [system.operations]
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: email
      description: 'Fides Generated Description for Column: email'
      data_categories: [user.provided.identifiable.contact.email]
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: first_name
      description: 'Fides Generated Description for Column: first_name'
      data_categories: [user.provided.identifiable.name]
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: id
      description: 'Fides Generated Description for Column: id'
      data_categories: [user.derived.identifiable.unique_id]
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: last_name
      description: 'Fides Generated Description for Column: last_name'
      data_categories: [user.provided.identifiable.name]
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: password
      description: 'Fides Generated Description for Column: password'
      data_categories: [user.provided.identifiable.credentials.password]
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
```

We can invoke `scan` by providing a connection URL for this database:
```sh
./venv/bin/fidesctl scan \
  --manifest-dir dataset.yml \
  database \
  postgresql+psycopg2://postgres:fidesctl@fidesctl-db:5432/postgres
```

The command output confirms that our database resource is fully covered:
```sh
Loading resource manifests from: dataset.yml
Taxonomy successfully created.
Successfully scanned the following datasets:
public

Annotation coverage: 100%
```
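If full coverage is not yet realistic, the bar can be lowered with the `--coverage-threshold` option; the scan exits non-zero whenever coverage falls below the threshold. A sketch of such an invocation, using an illustrative threshold of 80%:

```sh
# Accept the scan as long as at least 80% of the discovered fields are
# defined and categorized in the supplied datasets.
./venv/bin/fidesctl scan \
  --manifest-dir dataset.yml \
  --coverage-threshold 80 \
  database \
  postgresql+psycopg2://postgres:fidesctl@fidesctl-db:5432/postgres
```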
3 changes: 3 additions & 0 deletions docs/fides/mkdocs.yml
@@ -30,6 +30,9 @@ nav:
      - Write a Policy: tutorial/policy.md
      - Add Google Analytics: tutorial/google.md
      - Manage Google Analytics with Fidesctl: tutorial/pass.md
  - How-To Guides:
      - Generating a Dataset: guides/generate_dataset.md
      - Scanning a Resource: guides/scan_resource.md
  - CI/CD Reference: ci_reference.md
  - Fides Language:
      - Overview: language/overview.md
2 changes: 2 additions & 0 deletions fidesctl/src/fidesctl/cli/__init__.py
@@ -7,6 +7,7 @@
    delete,
    evaluate,
    generate_dataset,
    scan,
    annotate_dataset,
    get,
    init_db,
@@ -26,6 +27,7 @@
    apply,
    delete,
    generate_dataset,
    scan,
    get,
    init_db,
    ls,
33 changes: 33 additions & 0 deletions fidesctl/src/fidesctl/cli/cli.py
@@ -140,6 +140,39 @@ def generate_dataset(
    _generate_dataset.generate_dataset(connection_string, output_filename)


@click.command()
@click.pass_context
@click.argument("source_type", type=click.Choice(["database"]))
@click.argument("connection_string", type=str)
@click.option("-m", "--manifest-dir", type=str, default="")
@click.option("-c", "--coverage-threshold", type=click.IntRange(0, 100), default=100)
def scan(
    ctx: click.Context,
    source_type: str,
    connection_string: str,
    manifest_dir: str,
    coverage_threshold: int,
) -> None:
    """
    Connect to a database directly via a SQLAlchemy-style connection string and
    compare the database objects to existing datasets.
    If there are fields within the database that aren't listed and categorized
    within one of the datasets, this counts as lacking coverage.
    Outputs missing fields and has a non-zero exit if coverage is
    under the stated threshold.
    """
    config = ctx.obj["CONFIG"]
    _generate_dataset.database_coverage(
        connection_string=connection_string,
        manifest_dir=manifest_dir,
        coverage_threshold=coverage_threshold,
        url=config.cli.server_url,
        headers=config.user.request_headers,
    )


@click.command()
@click.pass_context
@click.argument("input_filename", type=str)
