OpenAPI crawler #1021

Merged on Oct 28, 2024 (22 commits)
Changes from 6 commits
1 change: 1 addition & 0 deletions README.md
@@ -72,6 +72,7 @@ Each connector is placed under its own directory under [metaphor](./metaphor) an
 | [monte_carlo](metaphor/monte_carlo/) | Data monitor |
 | [mssql](metaphor/mssql/) | Schema |
 | [mysql](metaphor/mysql/) | Schema, description |
+| [openapi](metaphor/openapi/) | API, description |
 | [oracle](metaphor/oracle/) | Schema, description, queries |
 | [notion](metaphor/notion/) | Document embeddings |
 | [postgresql](metaphor/postgresql/) | Schema, description, statistics |
8 changes: 6 additions & 2 deletions metaphor/common/event_util.py
@@ -8,6 +8,7 @@

 from metaphor import models  # type: ignore
 from metaphor.models.metadata_change_event import (
+    API,
     Dashboard,
     Dataset,
     ExternalSearchDocument,
@@ -26,6 +27,7 @@
 logger.setLevel(logging.INFO)

 ENTITY_TYPES = Union[
+    API,
     Dashboard,
     Dataset,
     ExternalSearchDocument,
@@ -57,9 +59,11 @@ def _build_event(**kwargs) -> MetadataChangeEvent:
         return MetadataChangeEvent(**kwargs)

     @staticmethod
-    def build_event(entity: ENTITY_TYPES):
+    def build_event(entity: ENTITY_TYPES):  # noqa: C901
         """Build MCE given an entity"""
-        if type(entity) is Dashboard:
+        if type(entity) is API:
+            return EventUtil._build_event(api=entity)
+        elif type(entity) is Dashboard:
             return EventUtil._build_event(dashboard=entity)
         elif type(entity) is Dataset:
             return EventUtil._build_event(dataset=entity)
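
The hunk above adds one more branch to an exact-type dispatch chain. As a rough illustration of that pattern only, the sketch below uses hypothetical stand-in classes (not the real `metaphor.models` types) and a lookup table in place of the `if`/`elif` chain. Raising `TypeError` for unknown types is an assumption; the rest of `build_event` is not visible in this hunk.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type


# Hypothetical stand-ins for the generated model classes; the real ones live
# in metaphor.models.metadata_change_event and carry many more fields.
@dataclass
class API:
    name: str = ""


@dataclass
class Dashboard:
    name: str = ""


# Maps each entity class to the keyword used when building the event,
# mirroring the `api=` / `dashboard=` keywords seen in the diff.
_EVENT_FIELD: Dict[Type[Any], str] = {API: "api", Dashboard: "dashboard"}


def build_event(entity: Any) -> Dict[str, Any]:
    """Return the keyword arguments an event would be built from."""
    # Exact-type lookup, equivalent to the `type(entity) is X` checks:
    # subclasses of API or Dashboard are deliberately not matched.
    field = _EVENT_FIELD.get(type(entity))
    if field is None:
        # Assumption: unsupported types are rejected loudly.
        raise TypeError(f"Unsupported entity type: {type(entity).__name__}")
    return {field: entity}
```

With a table like this, supporting a new entity type is a one-line change, which is one way to avoid the growing branch count that the `# noqa: C901` marker silences.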
41 changes: 41 additions & 0 deletions metaphor/openapi/README.md
@@ -0,0 +1,41 @@
# OpenAPI Connector

This connector extracts APIs from an OpenAPI Specification JSON.

## Config File

Create a YAML config file based on the following template.

### Required Configurations

```yaml
base_url: <url> # BaseUrl for endpoints in OAS
openapi_json_path: <path or url> # URL or path of OAS
```

### Optional Configurations

If accessing the OAS JSON requires authentication, please include an optional auth configuration.

```yaml
auth:
  basic_auth:
    user: <user>
    password: <password>
```

#### Output Destination

See [Output Config](../common/docs/output.md) for more information on the optional `output` config.

## Testing

Follow the [installation](../../README.md) instructions to install `metaphor-connectors` in your environment (or virtualenv). Make sure to include the `openapi` or `all` extra.

Run the following command to test the connector locally:

```shell
metaphor openapi <config_file>
```

Manually verify the output after the run finishes.
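
For a quick local run, a small fixture can be generated with a script along these lines. This is only a sketch: the spec contents, the `https://api.example.com` base URL, and the `write_fixture` helper are made up for illustration, and the optional `output` section is left out.

```python
import json
import tempfile
from pathlib import Path

# A deliberately tiny OpenAPI document; the title, server URL, and paths are
# illustrative only.
SPEC = {
    "openapi": "3.0.0",
    "info": {"title": "Pet Store", "version": "1.0.0"},
    "servers": [{"url": "/v1"}],
    "paths": {
        "/pets": {
            "get": {"summary": "List pets"},
            "post": {"summary": "Create a pet"},
        }
    },
}


def write_fixture(directory: Path) -> Path:
    """Write the spec and a matching connector config; return the config path."""
    spec_path = directory / "openapi.json"
    spec_path.write_text(json.dumps(SPEC, indent=2))

    # Only the two required keys from the template above are written.
    config_path = directory / "config.yml"
    config_path.write_text(
        "base_url: https://api.example.com\n"
        f"openapi_json_path: {spec_path}\n"
    )
    return config_path


if __name__ == "__main__":
    # Prints the path to pass to `metaphor openapi <config_file>`.
    print(write_fixture(Path(tempfile.mkdtemp())))
```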
6 changes: 6 additions & 0 deletions metaphor/openapi/__init__.py
@@ -0,0 +1,6 @@
from metaphor.common.cli import cli_main
from metaphor.openapi.extractor import OpenAPIExtractor


def main(config_file: str):
    cli_main(OpenAPIExtractor, config_file)
25 changes: 25 additions & 0 deletions metaphor/openapi/config.py
@@ -0,0 +1,25 @@
from typing import Optional

from pydantic.dataclasses import dataclass

from metaphor.common.base_config import BaseConfig
from metaphor.common.dataclass import ConnectorConfig


@dataclass(config=ConnectorConfig)
class BasicAuth:
    user: str
    password: str


@dataclass(config=ConnectorConfig)
class OpenAPIAuthConfig:
    basic_auth: Optional[BasicAuth] = None


@dataclass(config=ConnectorConfig)
class OpenAPIRunConfig(BaseConfig):
    openapi_json_path: str  # URL or file path
    base_url: str  # base_url of endpoints

    auth: Optional[OpenAPIAuthConfig] = None
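
To make the nesting concrete, here is a rough sketch of how a parsed YAML mapping lines up with these three classes. It uses standard-library dataclasses and a hypothetical `parse_config` helper as stand-ins; the connector itself relies on pydantic dataclasses and `BaseConfig.from_yaml_file`, whose behaviour is not reproduced here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


# Standard-library stand-ins that only mirror the field layout of the
# config classes in the diff; validation is not reproduced.
@dataclass
class BasicAuth:
    user: str
    password: str


@dataclass
class AuthConfig:
    basic_auth: Optional[BasicAuth] = None


@dataclass
class RunConfig:
    openapi_json_path: str
    base_url: str
    auth: Optional[AuthConfig] = None


def parse_config(raw: Dict[str, Any]) -> RunConfig:
    """Build a RunConfig from a parsed YAML mapping (hypothetical helper)."""
    auth = None
    if raw.get("auth") is not None:
        basic = raw["auth"].get("basic_auth")
        auth = AuthConfig(basic_auth=BasicAuth(**basic) if basic else None)
    return RunConfig(
        openapi_json_path=raw["openapi_json_path"],
        base_url=raw["base_url"],
        auth=auth,
    )
```

Both `auth` and `auth.basic_auth` default to `None`, so a config with only the two required keys stays valid.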
159 changes: 159 additions & 0 deletions metaphor/openapi/extractor.py
@@ -0,0 +1,159 @@
import json
from collections import OrderedDict
from typing import Collection, List, Optional
from urllib.parse import urljoin

import requests

from metaphor.common.base_extractor import BaseExtractor
from metaphor.common.event_util import ENTITY_TYPES
from metaphor.common.logger import get_logger
from metaphor.common.utils import md5_digest
from metaphor.models.crawler_run_metadata import Platform
from metaphor.models.metadata_change_event import (
    API,
    APILogicalID,
    APIPlatform,
    AssetPlatform,
    AssetStructure,
    Hierarchy,
    HierarchyInfo,
    HierarchyLogicalID,
    HierarchyType,
    OpenAPI,
    OpenAPIMethod,
    OpenAPISpecification,
    OperationType,
)
from metaphor.openapi.config import OpenAPIRunConfig

logger = get_logger()


class OpenAPIExtractor(BaseExtractor):
    """OpenAPI metadata extractor"""

    _description = "OpenAPI metadata crawler"
    _platform = Platform.OPEN_API

    @staticmethod
    def from_config_file(config_file: str) -> "OpenAPIExtractor":
        return OpenAPIExtractor(OpenAPIRunConfig.from_yaml_file(config_file))
        # Codecov (codecov/patch): added line #L41 was not covered by tests.

    def __init__(self, config: OpenAPIRunConfig):
        super().__init__(config)

        self._base_url = config.base_url
        self._api_id = md5_digest(config.base_url.encode("utf-8"))
        self._openapi_json_path = config.openapi_json_path
        self._auth = config.auth
        self._requests_session = requests.sessions.Session()

    async def extract(self) -> Collection[ENTITY_TYPES]:
        logger.info(f"Fetching metadata from {self._openapi_json_path}")

        self._initial_session()
        openapi_json = self._get_openapi_json()

        if not openapi_json:
            logger.error("Unable to get OAS json")
            return []
            # Codecov (codecov/patch): added lines #L59 - L60 were not covered by tests.

        endpoints = self._extract_paths(openapi_json)
        hierarchies = self._build_hierarchies(openapi_json)

        return hierarchies + endpoints

    def _initial_session(self):
        if not self._auth:
            return

        if self._auth.basic_auth:
            basic_auth = self._auth.basic_auth
            self._requests_session.auth = (basic_auth.user, basic_auth.password)
            # Codecov (codecov/patch): added lines #L71 - L73 were not covered by tests.

    def _get_openapi_json(self) -> Optional[dict]:
        if not self._openapi_json_path.startswith("http"):
            with open(self._openapi_json_path, "r") as f:
                return json.load(f)

        headers = OrderedDict(
            {
                "User-Agent": None,
                "Accept": None,
                "Connection": None,
                "Accept-Encoding": None,
            }
        )
        resp = self._requests_session.get(self._openapi_json_path, headers=headers)

        if resp.status_code != 200:
            return None

        return resp.json()

    def _extract_paths(self, openapi: dict) -> List[API]:
        endpoints: List[API] = []
        servers = openapi.get("servers")

        for path, path_item in openapi["paths"].items():
            path_servers = path_item.get("servers")
            base_path = (
                path_servers[0]["url"]
                if path_servers
                else servers[0]["url"] if servers else ""
            )

            if not base_path.startswith("http"):
                endpoint_url = urljoin(self._base_url, base_path + path)
            else:
                endpoint_url = urljoin(base_path + "/", f"./{path}")

            endpoint = API(
                logical_id=APILogicalID(
                    name=endpoint_url, platform=APIPlatform.OPEN_API
                ),
                open_api=OpenAPI(path=path, methods=self._extract_methods(path_item)),
                structure=AssetStructure(directories=[self._api_id], name=path),
            )
            endpoints.append(endpoint)
        return endpoints

    def _extract_methods(self, path_item: dict) -> List[OpenAPIMethod]:
        def to_operation_type(method: str) -> Optional[OperationType]:
            try:
                operation_type = OperationType(method.upper())
                return operation_type
            except ValueError:
                return None

        methods: List[OpenAPIMethod] = []
        for method, item in path_item.items():
            operation_type = to_operation_type(method)

            if not operation_type:
                continue

            methods.append(
                OpenAPIMethod(
                    summary=item.get("summary") or None,
                    description=item.get("description") or None,
                    type=operation_type,
                )
            )
        return methods

    def _build_hierarchies(self, openapi: dict) -> List[Hierarchy]:
        title = openapi["info"]["title"]
        hierarchy = Hierarchy(
            logical_id=HierarchyLogicalID(
                path=[AssetPlatform.OPEN_API.value] + [self._api_id],
            ),
            hierarchy_info=HierarchyInfo(
                name=title,
                open_api=OpenAPISpecification(definition=json.dumps(openapi)),
                type=HierarchyType.OPEN_API,
            ),
        )

        return [hierarchy]
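
The URL handling in `_extract_paths` is easy to misread, so the sketch below pulls the same expressions into a standalone function that can be run without the connector. The function name `resolve_endpoint_url` and the example URLs are hypothetical; the `urljoin` calls are copied from the diff.

```python
from typing import List, Optional
from urllib.parse import urljoin


def resolve_endpoint_url(
    base_url: str,
    path: str,
    servers: Optional[List[dict]] = None,
    path_servers: Optional[List[dict]] = None,
) -> str:
    """Standalone copy of the URL logic in `_extract_paths` (hypothetical name)."""
    # Path-level servers win over top-level servers; only the first entry of
    # whichever list is present is used.
    base_path = (
        path_servers[0]["url"]
        if path_servers
        else servers[0]["url"] if servers else ""
    )

    if not base_path.startswith("http"):
        # Relative server URL (or none): resolve against the configured base_url.
        return urljoin(base_url, base_path + path)
    # Absolute server URL: the configured base_url is ignored entirely.
    return urljoin(base_path + "/", f"./{path}")


if __name__ == "__main__":
    print(resolve_endpoint_url("https://api.example.com", "/pets", [{"url": "/v1"}]))
    print(resolve_endpoint_url("https://api.example.com/base", "/pets"))
    print(
        resolve_endpoint_url(
            "https://api.example.com", "/pets", [{"url": "https://other.example.com/api"}]
        )
    )
```

One consequence worth noting: because OpenAPI paths start with `/`, `urljoin` treats `base_path + path` as an absolute path, so any path component in `base_url` (such as `/base` in the second call) is dropped rather than prepended.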
24 changes: 12 additions & 12 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "metaphor-connectors"
-version = "0.14.135"
+version = "0.14.136"
 license = "Apache-2.0"
 description = "A collection of Python-based 'connectors' that extract metadata from various sources to ingest into the Metaphor app."
 authors = ["Metaphor <[email protected]>"]
@@ -42,7 +42,7 @@ llama-index-readers-confluence = { version = "^0.1.4", optional = true }
 llama-index-readers-notion = { version = "^0.1.6", optional = true }
 looker-sdk = { version = "^24.2.0", optional = true }
 lxml = { version = "~=5.0.0", optional = true }
-metaphor-models = "0.40.5"
+metaphor-models = "0.41.0"
 more-itertools = { version = "^10.1.0", optional = true }
 msal = { version = "^1.28.0", optional = true }
 msgraph-beta-sdk = { version = "~1.4.0", optional = true }
2 changes: 2 additions & 0 deletions tests/common/test_event_utils.py
@@ -1,5 +1,6 @@
 from metaphor.common.event_util import EventUtil
 from metaphor.models.metadata_change_event import (
+    API,
     Dashboard,
     Dataset,
     ExternalSearchDocument,
@@ -19,6 +20,7 @@
 def test_build_event():
     event_utils = EventUtil()

+    assert event_utils.build_event(API()) == MetadataChangeEvent(api=API())
     assert event_utils.build_event(Dashboard()) == MetadataChangeEvent(
         dashboard=Dashboard()
     )
Empty file added tests/openapi/__init__.py
Empty file.
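
The test package is empty in this view. As a possible starting point, the sketch below exercises the method-filtering idea from `_extract_methods` in isolation. `OperationType` here is a hypothetical stand-in enum, not the `metaphor.models` class, and `extract_method_types` returns plain tuples instead of `OpenAPIMethod` objects.

```python
from enum import Enum
from typing import List, Optional, Tuple


class OperationType(Enum):
    """Hypothetical stand-in; the members of the real enum are assumed."""

    GET = "GET"
    PUT = "PUT"
    POST = "POST"
    DELETE = "DELETE"
    OPTIONS = "OPTIONS"
    HEAD = "HEAD"
    PATCH = "PATCH"
    TRACE = "TRACE"


def to_operation_type(method: str) -> Optional[OperationType]:
    """Map a path-item key to an operation type, or None if it is not one."""
    try:
        return OperationType(method.upper())
    except ValueError:
        return None


Method = Tuple[OperationType, Optional[str], Optional[str]]


def extract_method_types(path_item: dict) -> List[Method]:
    """Keep only HTTP-method keys; empty strings are normalised to None."""
    methods: List[Method] = []
    for key, item in path_item.items():
        operation_type = to_operation_type(key)
        if operation_type is None:
            # Keys such as "summary", "parameters", or "servers" are skipped.
            continue
        methods.append(
            (operation_type, item.get("summary") or None, item.get("description") or None)
        )
    return methods


def test_extract_method_types():
    path_item = {
        "summary": "Pets",
        "parameters": [],
        "get": {"summary": "List pets", "description": ""},
        "post": {},
    }
    assert extract_method_types(path_item) == [
        (OperationType.GET, "List pets", None),
        (OperationType.POST, None, None),
    ]
```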