Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Implement `scan database` command (#292)
* Implement database_coverage command
* Add tests for find_missing_dataset_fields
* Change database-coverage to only count uncategorized fields
* Fix issues merging from main
* Change command to use new name / format
* Add integration test for database_coverage
* Change command to use percent not decimal
* Add manifest-dir option and check for all datasets now
* Small fixes and ci violations
* Fix broken tests
* doc strings updates
* moved the coverage percentage to the bottom of the error message
* add integration tests for manifest, update coverage used in tests to int
* Apply suggestions from code review
  small docs change
  Co-authored-by: Phil Salant <[email protected]>
* Address comments, mostly refactoring
* Fix command formatting after some testing
* a few output nits
* Update docs to include a guide for using the scan command
* Rename guides to How-To Guides

Co-authored-by: Eduardo Armendariz <[email protected]>
Co-authored-by: Thomas La Piana <[email protected]>
Co-authored-by: SteveDMurphy <[email protected]>
Co-authored-by: Thomas La Piana <[email protected]>
Co-authored-by: Phil Salant <[email protected]>
1 parent 840e1de, commit 90bda47
Showing 8 changed files with 686 additions and 95 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
# Generating a Dataset | ||
|
||
As an alternative to manually creating dataset resource files as in our [tutorial](../tutorial/dataset.md), you can generate these files with the `generate-dataset` CLI command. The CLI connects to a given resource and, based on the database schema, automatically generates an unannotated resource YAML file in the specified location. | ||
|
||
Not only is this the simplest way to begin annotating your resources, but it also follows the expected fidesctl format for these resources. This is important as some commands, like `scan`, expect resources to follow this format. | ||
|
||
# Working With a Database | ||
|
||
The `generate-dataset` command can connect to a database and automatically generate a resource YAML file. Given a database schema with a single `users` table as follows: | ||
|
||
```shell | ||
flaskr=# SELECT * FROM users; | ||
id | created_at | email | password | first_name | last_name | ||
----+---------------------+-------------------+------------------------------------+------------+----------- | ||
1 | 2020-01-01 00:00:00 | [email protected] | pbkdf2:sha256:260000$O87nanbSkl... | Admin | User | ||
2 | 2020-01-03 00:00:00 | [email protected] | pbkdf2:sha256:260000$PGcBy5NzZe... | Example | User | ||
(2 rows) | ||
``` | ||
|
||
We can invoke the `generate-dataset` command by providing a connection URL for this database and the path of the file to write: | ||
```sh | ||
./venv/bin/fidesctl generate-dataset \ | ||
postgresql://postgres:postgres@localhost:5432/flaskr \ | ||
fides_resources/flaskr_postgres_dataset.yml | ||
``` | ||
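The connection URL packs the driver, credentials, host, port, and database name into a single string. As a quick sanity check before passing it to `generate-dataset`, the parts can be inspected with Python's standard library. This is only an illustrative sketch and is independent of however fidesctl itself handles the URL; the `connection` dictionary and its key names are made up for this example.

```python
from urllib.parse import urlsplit

# The same connection URL used in the command above.
url = "postgresql://postgres:postgres@localhost:5432/flaskr"

parts = urlsplit(url)

# Hypothetical key names, chosen only to label the pieces of the URL.
connection = {
    "driver": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "database": parts.path.lstrip("/"),
}

for key, value in connection.items():
    print(f"{key}: {value}")
```

A typo in the host, port, or database name shows up immediately in this breakdown, which is easier to spot than in the raw URL.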
|
||
The result is a resource file containing a dataset whose collections and fields represent our schema: | ||
``` | ||
dataset: | ||
- fides_key: public | ||
organization_fides_key: default_organization | ||
name: public | ||
description: 'Fides Generated Description for Schema: public' | ||
meta: null | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
collections: | ||
- name: public.users | ||
description: 'Fides Generated Description for Table: public.users' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
fields: | ||
- name: created_at | ||
description: 'Fides Generated Description for Column: created_at' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: email | ||
description: 'Fides Generated Description for Column: email' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: first_name | ||
description: 'Fides Generated Description for Column: first_name' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: id | ||
description: 'Fides Generated Description for Column: id' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: last_name | ||
description: 'Fides Generated Description for Column: last_name' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: password | ||
description: 'Fides Generated Description for Column: password' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
``` | ||
<!-- TODO: Link to the `annotate-dataset` usage documentation below, when it exists. --> | ||
The generated file is not yet complete: the dataset still needs to be annotated with data categories that describe what each field stores. |
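To see how much annotation work remains, it can help to list every field whose `data_categories` is still empty. The sketch below does this over a hand-written, trimmed-down copy of the generated dataset, expressed as plain Python data so that it has no dependencies. The helper `unannotated_fields` is hypothetical and is not part of fidesctl.

```python
from typing import Dict, List

# Trimmed-down copy of the generated dataset above (descriptions and
# qualifiers omitted), written as plain Python data for illustration.
dataset: Dict = {
    "fides_key": "public",
    "collections": [
        {
            "name": "public.users",
            "fields": [
                {"name": "created_at", "data_categories": []},
                {"name": "email", "data_categories": []},
                {"name": "first_name", "data_categories": []},
                {"name": "id", "data_categories": []},
                {"name": "last_name", "data_categories": []},
                {"name": "password", "data_categories": []},
            ],
        }
    ],
}


def unannotated_fields(dataset: Dict) -> List[str]:
    """Return 'collection.field' names whose data_categories list is empty."""
    missing = []
    for collection in dataset.get("collections", []):
        for field in collection.get("fields", []):
            if not field.get("data_categories"):
                missing.append(f"{collection['name']}.{field['name']}")
    return missing


todo = unannotated_fields(dataset)
for name in todo:
    print(name)
```

For a freshly generated file every field is listed, since the generator leaves all `data_categories` empty.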
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
# Scanning a Resource | ||
|
||
As you annotate resources with fidesctl, it is important to keep those annotations up to date. The `scan` command compares a resource, such as a database, with what is defined in your fidesctl server or resource files, and outputs any part of the dataset that is not defined or categorized. The command exits with an error if the coverage threshold is not met. | ||
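The idea of a coverage threshold can be illustrated with a small calculation: count the fields, count those without any data category, and compare the resulting percentage with a minimum. The sketch below is an assumption about the general approach, not fidesctl's actual implementation. The function `coverage_report`, the `THRESHOLD` value, and the partially annotated example dataset are all hypothetical, and the sketch only checks categorization, not fields that are missing from the dataset altogether.

```python
from typing import Dict, List, Tuple


def coverage_report(dataset: Dict) -> Tuple[float, List[str]]:
    """Return (percent of categorized fields, names of uncategorized fields)."""
    total = 0
    uncategorized: List[str] = []
    for collection in dataset.get("collections", []):
        for field in collection.get("fields", []):
            total += 1
            if not field.get("data_categories"):
                uncategorized.append(f"{collection['name']}.{field['name']}")
    if total == 0:
        # Nothing to annotate, so treat the dataset as fully covered.
        return 100.0, uncategorized
    return 100.0 * (total - len(uncategorized)) / total, uncategorized


# Hypothetical, partially annotated dataset: three of four fields categorized.
example = {
    "fides_key": "public",
    "collections": [
        {
            "name": "public.users",
            "fields": [
                {"name": "created_at", "data_categories": ["system.operations"]},
                {"name": "email", "data_categories": ["user.provided.identifiable.contact.email"]},
                {"name": "first_name", "data_categories": ["user.provided.identifiable.name"]},
                {"name": "password", "data_categories": []},
            ],
        }
    ],
}

THRESHOLD = 100  # hypothetical minimum coverage, in percent

percent, missing = coverage_report(example)
for name in missing:
    print(f"Uncategorized: {name}")
print(f"Annotation coverage: {int(percent)}%")

# A CLI would typically turn this into its exit status.
exit_code = 0 if percent >= THRESHOLD else 1
```

Here one uncategorized field out of four yields 75% coverage, which falls below the hypothetical threshold and would map to a non-zero exit status.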
|
||
The `scan` command works best in tandem with the `generate-dataset` command, which creates resources in the expected format. Coverage can only be tracked for datasets that follow the fidesctl format. | ||
|
||
# Scanning a Database | ||
|
||
The `scan` command can connect to a database and compare its schema to your already defined resources. Given a database schema with a single `users` table as follows: | ||
|
||
```shell | ||
flaskr=# SELECT * FROM users; | ||
id | created_at | email | password | first_name | last_name | ||
----+---------------------+-------------------+------------------------------------+------------+----------- | ||
1 | 2020-01-01 00:00:00 | [email protected] | pbkdf2:sha256:260000$O87nanbSkl... | Admin | User | ||
2 | 2020-01-03 00:00:00 | [email protected] | pbkdf2:sha256:260000$PGcBy5NzZe... | Example | User | ||
(2 rows) | ||
``` | ||
|
||
We have already fully annotated this schema in the following dataset resource file: | ||
``` | ||
dataset: | ||
- fides_key: public | ||
organization_fides_key: default_organization | ||
name: public | ||
description: 'Fides Generated Description for Schema: public' | ||
meta: null | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
collections: | ||
- name: public.users | ||
description: 'Fides Generated Description for Table: public.users' | ||
data_categories: [] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
fields: | ||
- name: created_at | ||
description: 'Fides Generated Description for Column: created_at' | ||
data_categories: [system.operations] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: email | ||
description: 'Fides Generated Description for Column: email' | ||
data_categories: [user.provided.identifiable.contact.email] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: first_name | ||
description: 'Fides Generated Description for Column: first_name' | ||
data_categories: [user.provided.identifiable.name] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: id | ||
description: 'Fides Generated Description for Column: id' | ||
data_categories: [user.derived.identifiable.unique_id] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: last_name | ||
description: 'Fides Generated Description for Column: last_name' | ||
data_categories: [user.provided.identifiable.name] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
- name: password | ||
description: 'Fides Generated Description for Column: password' | ||
data_categories: [user.provided.identifiable.credentials.password] | ||
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified | ||
``` | ||
|
||
We can invoke the `scan` command by providing the manifest location and a connection URL for this database: | ||
```sh | ||
./venv/bin/fidesctl scan \ | ||
--manifest-dir dataset.yml \ | ||
database \ | ||
postgresql+psycopg2://postgres:fidesctl@fidesctl-db:5432/postgres | ||
``` | ||
|
||
The command output confirms that our database resource is fully covered: | ||
```sh | ||
Loading resource manifests from: dataset.yml | ||
Taxonomy successfully created. | ||
Successfully scanned the following datasets: | ||
public | ||
|
||
Annotation coverage: 100% | ||
``` |
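Because `scan` exits with an error when coverage falls short, it can serve as a gate in an automated check. The sketch below wraps the invocation shown above in Python. The helper names `build_scan_command` and `run_scan` are hypothetical, and the sketch assumes `fidesctl` is on the `PATH` and that a failed scan produces a non-zero exit status.

```python
import subprocess
from typing import List


def build_scan_command(
    manifest_dir: str, connection_url: str, executable: str = "fidesctl"
) -> List[str]:
    """Assemble the argument list for the scan invocation shown above."""
    return [
        executable,
        "scan",
        "--manifest-dir",
        manifest_dir,
        "database",
        connection_url,
    ]


def run_scan(command: List[str]) -> bool:
    """Run the command and report whether it exited with status 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


command = build_scan_command(
    "dataset.yml",
    "postgresql+psycopg2://postgres:fidesctl@fidesctl-db:5432/postgres",
)
print(" ".join(command))

# In a CI job one might then do:
#     import sys
#     sys.exit(0 if run_scan(command) else 1)
```

Passing the arguments as a list avoids shell quoting problems with the connection URL.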