Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(metadata-ingestion): Add support for druid #2235

Merged
merged 7 commits into from
Mar 18, 2021
Merged

feat(metadata-ingestion): Add support for druid #2235

merged 7 commits into from
Mar 18, 2021

Conversation

pedro93
Copy link
Collaborator

@pedro93 pedro93 commented Mar 15, 2021

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

Adds support for druid as a metadata source to be crawled using the ingestion framework.

Test

This was tested in 2 ways:

@pedro93
Copy link
Collaborator Author

pedro93 commented Mar 15, 2021

This is a WIP as there are 2 unknowns at this time.
1.) How does the metadata framework plugin system work and what changes do I need to do to make it work?

2.) When running the crawler it fails with an exception despite creating the expected MCEs:

$ datahub ingest -c /home/pedro/dev/DCData/Data.DataHub/helm/datahub/configs/druid-metadata-crawler/crawler-config.yaml
[2021-03-15 10:41:20,408] DEBUG    {datahub.entrypoints:67} - Using config: {'source': {'type': 'druid', 'config': {'host_port': 'localhost:8082'}}, 'sink': {'type': 'file', 'config': {'filename': './druid.json'}}}
[2021-03-15 10:41:20,408] DEBUG    {datahub.ingestion.run.pipeline:63} - Source type:druid,<class 'datahub.ingestion.source.druid.DruidSource'> configured
[2021-03-15 10:41:20,408] INFO     {datahub.ingestion.sink.file:27} - Will write to druid.json
[2021-03-15 10:41:20,409] DEBUG    {datahub.ingestion.run.pipeline:69} - Sink type:file,<class 'datahub.ingestion.sink.file.FileSink'> configured
[2021-03-15 10:41:20,409] DEBUG    {datahub.ingestion.source.sql_common:197} - sql_alchemy_url=druid://localhost:8082/druid/v2/sql/
[2021-03-15 10:41:20,603] DEBUG    {datahub.ingestion.run.pipeline:38} - sink called success callback
Traceback (most recent call last):
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1215, in _fetchone_impl
    return self.cursor.fetchone()
AttributeError: 'NoneType' object has no attribute 'fetchone'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/bin/datahub", line 33, in <module>
    sys.exit(load_entry_point('datahub', 'console_scripts', 'datahub')())
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/entrypoints.py", line 73, in ingest
    pipeline.run()
  File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 81, in run
    for wu in self.source.get_workunits():
  File "/home/pedro/dev/datahub/metadata-ingestion/src/datahub/ingestion/source/sql_common.py", line 205, in get_workunits
    for table in inspector.get_table_names(schema):
  File "<string>", line 2, in get_table_names
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/deprecations.py", line 139, in warned
    return fn(*args, **kwargs)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/reflection.py", line 201, in get_table_names
    tnames = self.dialect.get_table_names(
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/pydruid/db/sqlalchemy.py", line 152, in get_table_names
    return [row.TABLE_NAME for row in result]
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/pydruid/db/sqlalchemy.py", line 152, in <listcomp>
    return [row.TABLE_NAME for row in result]
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1010, in __iter__
    row = self.fetchone()
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1343, in fetchone
    self.connection._handle_dbapi_exception(
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1514, in _handle_dbapi_exception
    util.raise_(exc_info[1], with_traceback=exc_info[2])
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1336, in fetchone
    row = self._fetchone_impl()
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1217, in _fetchone_impl
    return self._non_result(None, err)
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/engine/result.py", line 1236, in _non_result
    util.raise_(
  File "/home/pedro/dev/datahub/metadata-ingestion/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
sqlalchemy.exc.ResourceClosedError: This result object does not return rows. It has been closed automatically.

I believe this is due to the logic defined in SQLAlchemySource#get_workunits. My understading is that this particular logic assumes all tables in the metadata store in druid are meant to be crawled when they should not.

I'm not familiar enough with either the database metadata architect or the ingestion framework to understand why the tables are empty and if they should be crawled to begin with.

@hsheth2
Copy link
Collaborator

hsheth2 commented Mar 15, 2021

For (1) - to add a plugin:

As for how the plugin system works: in source_registry.py, if the source can be imported successfully then we register the plugin as enabled. The setup.py defines Python extra requirements, which will install the right set of dependencies and ensure that the source gets enabled.

The second issue is more opaque to me, since it varies by SQLAlchemy driver. At a high level, we're using SQLAlchemy's fine-grained inspection to extract metadata from each source.

Based on the stack trace, it seems to be failing on the call to get_table_names(), while corresponds to pydruid's get_table_names() implementation: https://github.com/druid-io/pydruid/blob/a1d9b7ec877800c7b322fb671f3fb46738a58438/pydruid/db/sqlalchemy.py#L144. Somehow that query is failing?

@pedro93
Copy link
Collaborator Author

pedro93 commented Mar 15, 2021

Thank you for the detailed explanation.
About point 2 I have opened an issue in pydruid with the hope that someone there can shed a light on what we are seeing.

@pedro93
Copy link
Collaborator Author

pedro93 commented Mar 16, 2021

FYI @hsheth2, while I don't figure out what is happening on pydruid's side I've added a schema deny pattern to not crawl what seems to be internal druid metadata structures since those are not metadata I think are meant to be shown.

It should suffice for this PRs scope, my only concern is whether user-specified table & schema patterns may overwrite this behaviour.

Do you see any way to handle such a case?

@pedro93 pedro93 changed the title [WIP] feat(metadata-ingestion): Add support for druid feat(metadata-ingestion): Add support for druid Mar 16, 2021
@@ -42,6 +42,8 @@ ignore_missing_imports = yes
ignore_missing_imports = yes
[mypy-snowflake.*]
ignore_missing_imports = yes
[mypy-druid.*]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'd want [mypy-pydruid.*], since the actual python package is called pydruid

@pedro93
Copy link
Collaborator Author

pedro93 commented Mar 17, 2021

CI pipeline failed due to code coverage percentage under target (79.16% vs the intended 80%).

@hsheth2
Copy link
Collaborator

hsheth2 commented Mar 17, 2021

If you don’t mind, can you drop the coverage requirement in setup.cfg to 75%?

Eventually I’ll get around to writing more comprehensive tests, but I don’t want to block you on that

@pedro93
Copy link
Collaborator Author

pedro93 commented Mar 17, 2021

Done, appreciate the understanding!

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - awesome work @pedro93!

host_port: localhost:8082
schema_pattern:
deny:
- "^(lookup|sys).*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should "hardcode" this inside the Druid config class so this is always added on regardless of what the user configures.

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shirshanka shirshanka merged commit 6a0c402 into datahub-project:master Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants