Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for excluding list of exact column names #20

Merged
merged 13 commits into from
Jan 2, 2024

Conversation

ethan-cartwright
Copy link
Contributor

Description

This PR adds support for specifying a list of ExcludeName values to be excluded from classification per info_type.

Copy link
Contributor

@mardikark-gslab mardikark-gslab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall changes looks good to me. The solution can be generalized by some string preprocessing before exact match.

@@ -3,11 +3,12 @@
input1 = {
"Email_Address": {
"Prediction_Factors_and_Weights": {
"Name": 0.4,
"Name": 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to change the default weight?

"Description": 0,
"Datatype": 0,
"Values": 0.6,
"Values": 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to change the default weight?

f"does not meet minimum threshold for {infotype}"
)
basic_checks_status = False
elif config_dict[EXCLUDE_NAME] is not None and metadata.name in config_dict.get(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel let's preprocess the metadata.name and exclude name list (for e.g. remove special characters, use lowercase letters, etc.) before performing a exact match. It might help in generalizing the solution.
E.g. Cases like metadata.name="Email Sent" and ExcludeName=["email_sent"] will find a match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure if they'd want this or not, but it's useful enough to put behind a configurable flag I think. I'll do that

@@ -40,6 +40,7 @@ Infotype configuration is a dictionary with all infotypes at root level key. Eac
2. Description
3. Datatype
4. Values
- `ExcludeName` - optional exact match list for column names to exclude from classification for this info_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any plan to support regex? Are we thinking to support it in future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question- at this point, only if driven by customer requirement

@mayurinehate
Copy link
Contributor

Hey @ethan-cartwright can you please fix the lint failures ?
The commands are same as main datahub PR and are also mentioned here.

Copy link
Contributor

@mardikark-gslab mardikark-gslab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good, curious to know reasoning behind introducing the "strip_exclusion_formatting" flag.

"Email_Address": {
"Prediction_Factors_and_Weights": {
"Name": 0.4,
"Description": 0,
"Datatype": 0,
"Values": 0.6,
},
"ExcludeName": ["email_sent", "email_recieved"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is a generic input config, let's avoid to add customer specific config details. I would prefer a empty list.

if EXCLUDE_NAME in config_dict and config_dict[EXCLUDE_NAME] is not None:
config_dict[EXCLUDE_NAME] = (
set(config_dict[EXCLUDE_NAME])
if not strip_exclusion_formatting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking aloud, do we really require the flag "strip_exclusion_formatting"? we can call "strip_formatting" for all the names present in EXCLUDE_NAME list without any switch (if condition). This preprocessing will cover comparison of both cases 1. exact match and 2. strings with special characters or case change

Copy link
Contributor Author

@ethan-cartwright ethan-cartwright Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the customer we're working with may want the ability to have the EXCLUDE_NAME list only pertain to exact matches. The strip_exclussion_formatting flag would allow them to decide if they only want to use exact matches (strip_exclussion_formatting: false), or if they want both exact matches and matches after stripping special chars & case changes (strip_exclussion_formatting: true).

Does that make sense? Please let me know if there is flaw in my logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so if one of the customer needs is an exact match, then the strip_exclussion_formatting flag seems appropriate.

@mardikark-gslab
Copy link
Contributor

As per customer requirements, all changes looks good to me.

Copy link
Contributor

@mardikark-gslab mardikark-gslab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering customer specific requirements, changes looks good to me.

@hsheth2 hsheth2 changed the title Add support for excluding list of exact column names feat: Add support for excluding list of exact column names Jan 2, 2024
@hsheth2 hsheth2 merged commit 63b8397 into main Jan 2, 2024
2 checks passed
@hsheth2 hsheth2 deleted the exclusion_names_support branch January 2, 2024 20:59
f"The number of values for column {metadata.name}"
f"does not meet minimum threshold for {infotype}"
)
basic_checks_status = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ethan-cartwright Can you change this to debug log ? This is creating a lot of noise in datahub ingestion logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in this PR: #21

)
basic_checks_status = False
elif exclude_name is not None and metadata.name in exclude_name:
logger.warning(f"Excluding match for {infotype} on column {metadata.name}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for this. As this is executed for every column of every table, the real ingestion logs become pretty difficult to debug.

If I remember correctly, there is already an aggregated table level warning logs statement if the columns were skipped due to basic It does not display the reason details, but that should be good enough to get first indication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in this PR: #21

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants