feat: Add support for excluding list of exact column names #20

ethan-cartwright · 2023-12-16T21:57:17Z

Description

This PR adds support for specifying a list of ExcludeName values to be excluded from classification per info_type.

mardikark-gslab

Overall, the changes look good to me. The solution could be generalized by applying some string preprocessing before the exact match.

mardikark-gslab · 2023-12-18T13:38:24Z

datahub-classify/src/datahub_classify/reference_input.py

@@ -3,11 +3,12 @@
 input1 = {
    "Email_Address": {
        "Prediction_Factors_and_Weights": {
-            "Name": 0.4,
+            "Name": 1,


Do we need to change the default weight?

mardikark-gslab · 2023-12-18T13:38:43Z

datahub-classify/src/datahub_classify/reference_input.py

            "Description": 0,
            "Datatype": 0,
-            "Values": 0.6,
+            "Values": 0,


Do we need to change the default weight?

mardikark-gslab · 2023-12-18T13:49:35Z

datahub-classify/src/datahub_classify/infotype_utils.py

+            f"does not meet minimum threshold for {infotype}"
+        )
+        basic_checks_status = False
+    elif config_dict[EXCLUDE_NAME] is not None and metadata.name in config_dict.get(


I suggest preprocessing metadata.name and the exclude name list (e.g. removing special characters, lowercasing) before performing an exact match. It might help generalize the solution.
E.g. a case like metadata.name="Email Sent" and ExcludeName=["email_sent"] would then find a match.

Hmm, not sure whether they'd want this or not, but I think it's useful enough to put behind a configurable flag. I'll do that.
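
The preprocessing suggested in this thread could look roughly like the sketch below. The helper name `normalize_name` and its exact rules (lowercase, drop everything except letters and digits) are assumptions for illustration, not the code merged in this PR.

```python
import re


def normalize_name(name: str) -> str:
    """Hypothetical helper: lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Normalize the exclusion list once, then compare normalized column names to it.
exclude_names = ["email_sent"]
normalized_excludes = {normalize_name(n) for n in exclude_names}

# "Email Sent" and "email_sent" both normalize to "emailsent".
print(normalize_name("Email Sent") in normalized_excludes)
```

Normalizing both sides keeps the comparison an exact match, just on a canonical form of each name.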

mardikark-gslab · 2023-12-18T13:52:42Z

datahub-classify/README.md

@@ -40,6 +40,7 @@ Infotype configuration is a dictionary with all infotypes at root level key. Eac
  2. Description
  3. Datatype
  4. Values
+- `ExcludeName` - optional exact match list for column names to exclude from classification for this info_type


Any plan to support regex? Are we thinking of supporting it in the future?

Good question. At this point, only if driven by a customer requirement.

mayurinehate · 2023-12-19T09:16:45Z

Hey @ethan-cartwright, can you please fix the lint failures?
The commands are the same as for the main datahub PR and are also mentioned here.

mardikark-gslab

Everything looks good. I'm curious to know the reasoning behind introducing the "strip_exclusion_formatting" flag.

mardikark-gslab · 2023-12-26T12:56:07Z

datahub-classify/src/datahub_classify/reference_input.py

    "Email_Address": {
        "Prediction_Factors_and_Weights": {
            "Name": 0.4,
            "Description": 0,
            "Datatype": 0,
            "Values": 0.6,
        },
+        "ExcludeName": ["email_sent", "email_recieved"],


As this is a generic input config, let's avoid adding customer-specific config details. I would prefer an empty list.

mardikark-gslab · 2023-12-27T03:59:03Z

datahub-classify/src/datahub_classify/infotype_predictor.py

+            if EXCLUDE_NAME in config_dict and config_dict[EXCLUDE_NAME] is not None:
+                config_dict[EXCLUDE_NAME] = (
+                    set(config_dict[EXCLUDE_NAME])
+                    if not strip_exclusion_formatting


Just thinking aloud: do we really require the "strip_exclusion_formatting" flag? We can call "strip_formatting" for all the names present in the EXCLUDE_NAME list without any switch (if condition). This preprocessing will cover both cases: 1. exact matches and 2. strings that differ only in special characters or case.

Currently, the customer we're working with may want the EXCLUDE_NAME list to pertain only to exact matches. The strip_exclusion_formatting flag would allow them to decide whether they want only exact matches (strip_exclusion_formatting: false), or both exact matches and matches after stripping special characters and case changes (strip_exclusion_formatting: true).

Does that make sense? Please let me know if there is a flaw in my logic.

Okay, so if one of the customer needs is an exact match, then the strip_exclusion_formatting flag seems appropriate.
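
A minimal sketch of the two modes discussed above, assuming a simple implementation of `strip_formatting` and two hypothetical helpers, `build_exclusion_set` and `is_excluded`. Only the flag name `strip_exclusion_formatting` and the function name `strip_formatting` come from this thread; the bodies are illustrative and may differ from the merged code.

```python
import re
from typing import Iterable, Optional, Set


def strip_formatting(name: str) -> str:
    """Assumed behavior: lowercase and remove all non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_exclusion_set(
    names: Optional[Iterable[str]], strip_exclusion_formatting: bool
) -> Set[str]:
    """Hypothetical helper: prepare the exclusion set once per infotype."""
    if names is None:
        return set()
    if strip_exclusion_formatting:
        return {strip_formatting(n) for n in names}
    return set(names)


def is_excluded(
    column_name: str, exclusions: Set[str], strip_exclusion_formatting: bool
) -> bool:
    """Hypothetical helper: compare using the same mode the set was built with."""
    key = strip_formatting(column_name) if strip_exclusion_formatting else column_name
    return key in exclusions


exact = build_exclusion_set(["email_sent"], strip_exclusion_formatting=False)
loose = build_exclusion_set(["email_sent"], strip_exclusion_formatting=True)

print(is_excluded("Email Sent", exact, False))  # exact mode: formatting matters
print(is_excluded("Email Sent", loose, True))   # stripped mode: formatting ignored
```

With the flag off, only byte-for-byte identical names are excluded. With it on, exact matches are still excluded because both sides pass through the same normalization.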

mardikark-gslab · 2023-12-29T08:03:24Z

As per the customer requirements, all changes look good to me.

mardikark-gslab

Considering the customer-specific requirements, the changes look good to me.

mayurinehate · 2024-01-12T12:46:25Z

datahub-classify/src/datahub_classify/infotype_utils.py

+            f"The number of values for column {metadata.name}"
+            f"does not meet minimum threshold for {infotype}"
+        )
+        basic_checks_status = False


@ethan-cartwright Can you change this to a debug log? This is creating a lot of noise in the datahub ingestion logs.

done in this PR: #21

mayurinehate · 2024-01-12T12:48:44Z

datahub-classify/src/datahub_classify/infotype_utils.py

+        )
+        basic_checks_status = False
+    elif exclude_name is not None and metadata.name in exclude_name:
+        logger.warning(f"Excluding match for {infotype} on column {metadata.name}")


Same for this. As this is executed for every column of every table, the real ingestion logs become pretty difficult to debug.

If I remember correctly, there is already an aggregated table-level warning log statement if columns were skipped due to basic checks. It does not display the reason details, but that should be good enough as a first indication.

done in this PR: #21
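
The logging pattern requested in this thread (per-column detail at debug level, one aggregated warning per table) can be sketched as follows. The function `filter_columns`, its parameters, and the logger name are hypothetical and only illustrate the idea; they are not taken from datahub-classify or from PR #21.

```python
import logging
from typing import Iterable, List, Set

logger = logging.getLogger("classification_sketch")


def filter_columns(table: str, columns: Iterable[str], exclude_name: Set[str]) -> List[str]:
    """Hypothetical helper: drop excluded columns, logging details at debug level."""
    kept: List[str] = []
    skipped = 0
    for column in columns:
        if column in exclude_name:
            # Per-column message stays at debug level to keep ingestion logs quiet.
            logger.debug("Excluding column %s in table %s", column, table)
            skipped += 1
        else:
            kept.append(column)
    if skipped:
        # One aggregated warning per table gives a first indication.
        logger.warning(
            "Skipped %d column(s) in table %s; enable debug logging for details",
            skipped,
            table,
        )
    return kept
```

This keeps the number of warning lines proportional to the number of tables rather than the number of columns.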

ethan-cartwright added 3 commits December 15, 2023 19:49

working solution except for running pytest

cb53421

add git ignore

4cbfbcb

remove driver call in test file

9d26152

ethan-cartwright requested review from mardikark-gslab and mayurinehate December 16, 2023 21:57

ethan-cartwright added 3 commits December 16, 2023 16:59

add documentation

73fba17

add note that excludeName is optional

6762f0f

add code to handle unspecified excludeName better

9fd8fb6

ethan-cartwright mentioned this pull request Dec 17, 2023

feat(classifier): Add support for excluding list of exact column names datahub-project/datahub#9472

Merged

5 tasks

mardikark-gslab reviewed Dec 18, 2023

ethan-cartwright added 4 commits December 21, 2023 20:13

add strip_exclusion_formatting flag

0e36df2

add test for strip formatting

20886a6

fix linting

0a481cb

fix linting

5bec60a

ethan-cartwright requested a review from mardikark-gslab December 22, 2023 02:00

ethan-cartwright added 2 commits December 22, 2023 13:09

fix type annotation errors

071e8cc

delete unused test file

dc61449

mardikark-gslab reviewed Dec 27, 2023

address pr comment

fa05dcf

mardikark-gslab approved these changes Dec 29, 2023

hsheth2 changed the title ~~Add support for excluding list of exact column names~~ feat: Add support for excluding list of exact column names Jan 2, 2024

hsheth2 merged commit 63b8397 into main Jan 2, 2024
2 checks passed

hsheth2 deleted the exclusion_names_support branch January 2, 2024 20:59

mayurinehate reviewed Jan 12, 2024
