-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for excluding list of exact column names #20
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall changes looks good to me. The solution can be generalized by some string preprocessing before exact match.
@@ -3,11 +3,12 @@ | |||
input1 = { | |||
"Email_Address": { | |||
"Prediction_Factors_and_Weights": { | |||
"Name": 0.4, | |||
"Name": 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to change the default weight?
"Description": 0, | ||
"Datatype": 0, | ||
"Values": 0.6, | ||
"Values": 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to change the default weight?
f"does not meet minimum threshold for {infotype}" | ||
) | ||
basic_checks_status = False | ||
elif config_dict[EXCLUDE_NAME] is not None and metadata.name in config_dict.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel let's preprocess the metadata.name and exclude name list (for e.g. remove special characters, use lowercase letters, etc.) before performing a exact match. It might help in generalizing the solution.
E.g. Cases like metadata.name="Email Sent" and ExcludeName=["email_sent"] will find a match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure if they'd want this or not, but it's useful enough to put behind a configurable flag I think. I'll do that
@@ -40,6 +40,7 @@ Infotype configuration is a dictionary with all infotypes at root level key. Eac | |||
2. Description | |||
3. Datatype | |||
4. Values | |||
- `ExcludeName` - optional exact match list for column names to exclude from classification for this info_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any plan to support regex? Are we thinking to support it in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question- at this point, only if driven by customer requirement
Hey @ethan-cartwright can you please fix the lint failures ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good, curious to know reasoning behind introducing the "strip_exclusion_formatting" flag.
"Email_Address": { | ||
"Prediction_Factors_and_Weights": { | ||
"Name": 0.4, | ||
"Description": 0, | ||
"Datatype": 0, | ||
"Values": 0.6, | ||
}, | ||
"ExcludeName": ["email_sent", "email_recieved"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is a generic input config, let's avoid to add customer specific config details. I would prefer a empty list.
if EXCLUDE_NAME in config_dict and config_dict[EXCLUDE_NAME] is not None: | ||
config_dict[EXCLUDE_NAME] = ( | ||
set(config_dict[EXCLUDE_NAME]) | ||
if not strip_exclusion_formatting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just thinking aloud, do we really require the flag "strip_exclusion_formatting"? we can call "strip_formatting" for all the names present in EXCLUDE_NAME list without any switch (if condition). This preprocessing will cover comparison of both cases 1. exact match and 2. strings with special characters or case change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the customer we're working with may want the ability to have the EXCLUDE_NAME
list only pertain to exact matches. The strip_exclussion_formatting
flag would allow them to decide if they only want to use exact matches (strip_exclussion_formatting: false
), or if they want both exact matches and matches after stripping special chars & case changes (strip_exclussion_formatting: true
).
Does that make sense? Please let me know if there is flaw in my logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so if one of the customer needs is an exact match, then the strip_exclussion_formatting flag seems appropriate.
As per customer requirements, all changes looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering customer specific requirements, changes looks good to me.
f"The number of values for column {metadata.name}" | ||
f"does not meet minimum threshold for {infotype}" | ||
) | ||
basic_checks_status = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ethan-cartwright Can you change this to debug log ? This is creating a lot of noise in datahub ingestion logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in this PR: #21
) | ||
basic_checks_status = False | ||
elif exclude_name is not None and metadata.name in exclude_name: | ||
logger.warning(f"Excluding match for {infotype} on column {metadata.name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for this. As this is executed for every column of every table, the real ingestion logs become pretty difficult to debug.
If I remember correctly, there is already an aggregated table level warning logs statement if the columns were skipped due to basic It does not display the reason details, but that should be good enough to get first indication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in this PR: #21
Description
This PR adds support for specifying a list of
ExcludeName
values to be excluded from classification perinfo_type
.