Optimize queries used in Snowflake lineage connector #797

Merged

mars-lan
Contributor

@mars-lan mars-lan commented Mar 14, 2024

🤔 Why?

Running the Snowflake lineage crawler against a large SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view takes an exceedingly long time (> 1.5 hours).

It turns out that, as with ACCESS_HISTORY, we must also specify a filter on START_TIME in order to query QUERY_HISTORY efficiently.

🤓 What?

  • Specify the START_TIME filter for QUERY_HISTORY and use an inner JOIN when joining with ACCESS_HISTORY in the Snowflake lineage connector.
  • Qualify the filters in the base Snowflake connector to avoid confusion.
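
For illustration, the shape of the optimized query might look like the sketch below. It is not the connector's actual code: the function name `build_lineage_query`, the selected columns, and the `%s` placeholder style are assumptions made for this example. It only shows the two ideas from the list above, a time-range filter on both views and an inner JOIN on QUERY_ID, with every column qualified by a table alias.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def build_lineage_query(
    start: datetime, end: datetime
) -> Tuple[str, Tuple[datetime, datetime, datetime, datetime]]:
    """Return (sql, params) for a time-bounded lineage query.

    Hypothetical helper for illustration; not taken from the connector.
    """
    if start >= end:
        raise ValueError("start must be earlier than end")

    # Both views get their own time-range predicate, so neither side of
    # the join is scanned over its full history. Columns are qualified
    # with the aliases `a` and `q` to make clear which view each filter
    # applies to.
    sql = """
    SELECT q.QUERY_ID, q.QUERY_TEXT, a.OBJECTS_MODIFIED, a.BASE_OBJECTS_ACCESSED
    FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY a
    INNER JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q
      ON a.QUERY_ID = q.QUERY_ID
    WHERE a.QUERY_START_TIME >= %s AND a.QUERY_START_TIME < %s
      AND q.START_TIME >= %s AND q.START_TIME < %s
    """
    # Parameters are bound by the driver rather than formatted into the SQL.
    return sql, (start, end, start, end)


# Example: a seven-day lookback window ending at a fixed timestamp.
end_time = datetime(2024, 3, 14, tzinfo=timezone.utc)
sql, params = build_lineage_query(end_time - timedelta(days=7), end_time)
```

The resulting `sql` and `params` could then be passed to a cursor's `execute` method, assuming a driver configured to accept `%s` placeholders.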

🧪 Tested?

Verified that the MCEs before and after the changes are identical. Observed a roughly 8x improvement in performance (217.1s down to 27.6s):

Before:

Ended running with RunStatus.SUCCESS at 2024-03-14 08:54:28.949943, fetched 78 entities, took 217.1s

After:

Ended running with RunStatus.SUCCESS at 2024-03-14 08:47:54.532785, fetched 78 entities, took 27.6s

☑️ Checks

  • My PR contains actual code changes, and I have updated the version number in pyproject.toml.

@mars-lan mars-lan enabled auto-merge (squash) March 14, 2024 16:52
Contributor

@alyiwang alyiwang left a comment

LGTM

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines | Covered | Coverage | Threshold | Status
16051 | 14848 | 93% | 85% | 🟢

New Files

No new covered files...

Modified Files

File | Coverage | Status
metaphor/snowflake/extractor.py | 86% | 🟢
metaphor/snowflake/lineage/extractor.py | 68% | 🟢
TOTAL | 77% | 🟢

updated for commit: 61d2a7a by action🐍

@mars-lan mars-lan merged commit 7ef524a into main Mar 14, 2024
4 checks passed
@mars-lan mars-lan deleted the marslan/sc-25106/snowflake-lineage-crawler-timing-out-due branch March 14, 2024 16:57