Add web/knowledgebase crawler #771

Merged
merged 27 commits into from
Feb 14, 2024
5aa4c64
Initial extractor for webpages
rishimo Jan 10, 2024
68c4889
Updated tests and lxml dependency version
rishimo Jan 10, 2024
e39fc2b
Changed hashing, fixed subpage list
rishimo Jan 17, 2024
4d02695
merged main
rishimo Jan 17, 2024
ffd1d43
Added configurable recursion depth for page scraping
rishimo Jan 17, 2024
caaf9a8
Updated unit tests and extractor logic
rishimo Jan 17, 2024
790d1f7
some more test tweaks
rishimo Jan 17, 2024
0ee419c
Merged main
rishimo Jan 24, 2024
5e4dcc9
Updated unit tests and static web crawler for proper URL parsing and …
rishimo Jan 29, 2024
c37e05e
Merged main, bumped version
rishimo Jan 29, 2024
00cdbe0
change test to check url exactly
rishimo Jan 29, 2024
95a4ef2
dropped endpoint default from configuration
rishimo Jan 31, 2024
66c5877
Merged main
rishimo Jan 31, 2024
43b72f7
added _description and _platform
rishimo Jan 31, 2024
1834654
explicitly set llm to none to suppress openai api key warning
rishimo Feb 1, 2024
cca7f13
big rewrite of web crawler to reduce complication and make recursion …
rishimo Feb 5, 2024
1ff608b
updated unit tests and big refactoring
rishimo Feb 5, 2024
97c65f3
merged main
rishimo Feb 5, 2024
b29dd82
Added test for _process_subpages
rishimo Feb 7, 2024
3514c22
Merge branch 'main' into rishimohan/sc-23110/web-crawler
rishimo Feb 7, 2024
e97fc1f
Merge branch 'main' into rishimohan/sc-23110/web-crawler
rishimo Feb 12, 2024
08371c1
Forced llama-index version to 0.9.48
rishimo Feb 12, 2024
fc8f4f1
Merge branch 'main' into rishimohan/sc-23110/web-crawler
rishimo Feb 12, 2024
3c55ffc
Clean up some code inconsistency
rishimo Feb 14, 2024
f0860e5
Add another helper function for verifying a page and generating a Doc…
rishimo Feb 14, 2024
b3fc334
Added a test for no infinite recursion, sample HTML files, helper fun…
rishimo Feb 14, 2024
dc43b31
Added a test for shallow recursion
rishimo Feb 14, 2024
2 changes: 1 addition & 1 deletion metaphor/notion/README.md
@@ -35,7 +35,7 @@ azure_openAI_version: <azure_openAI_version> # "2023-12-01-preview"
azure_openAI_model_name: <azure_openAI_model_name> # "Embedding_ada002"
azure_openAI_model: <azure_openAI_model> # "text-embedding-ada-002"

-notion_api_version: <api_key_version> # "2022-06-08"
+notion_api_version: <api_key_version> # "2022-06-28"
include_text: <include_text> # False
```

2 changes: 1 addition & 1 deletion metaphor/notion/config.py
@@ -22,4 +22,4 @@ class NotionRunConfig(BaseConfig):
     include_text: bool = False

     # Notion API version
-    notion_api_version: str = "2022-06-08"
+    notion_api_version: str = "2022-06-28"
31 changes: 31 additions & 0 deletions metaphor/static_web/README.md
@@ -0,0 +1,31 @@
# Static Webpage Connector

## Setup

## Config File

Create a YAML config file based on the following template.

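Each entry in `depths` applies to the entry at the same position in `links`. A depth of `0` scrapes only the specified page. A depth of `1` scrapes the specified page and the same-domain subpages it links to. A depth of `n` repeats this recursively, following links up to `n` levels below the specified page.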
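
A minimal sketch of these depth semantics, using a hypothetical in-memory link graph instead of real HTTP requests. The names `SITE` and `crawl` are illustrative only and are not part of the connector. The sketch assumes, as the extractor in this PR does, that a depth of `0` visits only the starting page and that already-visited pages are skipped.

```python
from typing import Dict, List, Set

# Hypothetical site: each page maps to the pages it links to.
SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl(start: str, depth: int) -> List[str]:
    """Return the pages visited from `start`, following links `depth` levels down."""
    visited: Set[str] = set()
    order: List[str] = []

    def visit(page: str, remaining: int) -> None:
        if page in visited:  # never process the same page twice
            return
        visited.add(page)
        order.append(page)
        if remaining == 0:  # depth budget exhausted, do not follow links
            return
        for child in SITE.get(page, []):
            visit(child, remaining - 1)

    visit(start, depth)
    return order


print(crawl("/", 0))  # only the starting page
print(crawl("/", 1))  # starting page plus its direct subpages
print(crawl("/", 2))  # one more level of links
```

The visited set is what keeps pages that link back to each other from causing unbounded recursion.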

### Required Configurations

```yaml
output:
  file:
    directory: <output_directory>
```

### Optional Configurations

## Testing

Follow the [Installation](../../README.md) instructions to install `metaphor-connectors` in your environment (or virtualenv). Make sure to include either the `all` or the `static_web` extra.

To test the connector locally, change the config file to output to a local path and run the following command:

```shell
metaphor static_web <config_file>
```

Manually verify the output after the command finishes.
6 changes: 6 additions & 0 deletions metaphor/static_web/__init__.py
@@ -0,0 +1,6 @@
from metaphor.common.cli import cli_main
from metaphor.static_web.extractor import StaticWebExtractor


def main(config_file: str):
    cli_main(StaticWebExtractor, config_file)
25 changes: 25 additions & 0 deletions metaphor/static_web/config.py
@@ -0,0 +1,25 @@
from pydantic.dataclasses import dataclass

from metaphor.common.base_config import BaseConfig
from metaphor.common.dataclass import ConnectorConfig


@dataclass(config=ConnectorConfig)
class StaticWebRunConfig(BaseConfig):
    # Top-level URLs to scrape content from
    links: list

    # Scraping depth for each URL in `links`, matched by position
    depths: list

    # Azure OpenAI services configs
    azure_openAI_key: str
    azure_openAI_endpoint: str

    # Default Azure OpenAI services configs
    azure_openAI_version: str = "2023-12-01-preview"
    azure_openAI_model: str = "text-embedding-ada-002"
    azure_openAI_model_name: str = "Embedding_ada002"

    # Store the document's content alongside embeddings
    include_text: bool = False
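
The extractor pairs `links` with `depths` positionally using `zip`, which silently stops at the shorter list. The following is a rough sketch of a check that could run before crawling. The helper name `pair_links_with_depths` is hypothetical and does not exist in this PR.

```python
from typing import List, Tuple


def pair_links_with_depths(links: List[str], depths: List[int]) -> List[Tuple[str, int]]:
    """Pair each URL with its depth, rejecting mismatched or negative input."""
    if len(links) != len(depths):
        # zip() would otherwise drop the unmatched tail without any warning
        raise ValueError(
            f"links has {len(links)} entries but depths has {len(depths)}"
        )
    if any(depth < 0 for depth in depths):
        raise ValueError("depths must be non-negative")
    return list(zip(links, depths))


pairs = pair_links_with_depths(["https://example.com/docs"], [1])
print(pairs)
```

Failing early here would surface a misconfigured YAML file instead of quietly skipping URLs.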
241 changes: 241 additions & 0 deletions metaphor/static_web/extractor.py
@@ -0,0 +1,241 @@
import datetime
from typing import Collection, List, Tuple
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
from llama_index import Document
from requests.exceptions import HTTPError, RequestException

from metaphor.common.base_extractor import BaseExtractor
from metaphor.common.embeddings import embed_documents, map_metadata, sanitize_text
from metaphor.common.logger import get_logger
from metaphor.common.utils import md5_digest
from metaphor.models.crawler_run_metadata import Platform
from metaphor.static_web.config import StaticWebRunConfig

logger = get_logger()

embedding_chunk_size = 512
embedding_overlap_size = 50


class StaticWebExtractor(BaseExtractor):
    """Static webpage extractor."""

    _description = "Crawls webpages and extracts documents & embeddings."
    _platform = Platform.UNKNOWN

    @staticmethod
    def from_config_file(config_file: str) -> "StaticWebExtractor":
        return StaticWebExtractor(StaticWebRunConfig.from_yaml_file(config_file))

    def __init__(self, config: StaticWebRunConfig):
        super().__init__(config=config)

        self.target_URLs = config.links
        self.target_depths = config.depths

        self.azure_openAI_key = config.azure_openAI_key
        self.azure_openAI_version = config.azure_openAI_version
        self.azure_openAI_endpoint = config.azure_openAI_endpoint
        self.azure_openAI_model = config.azure_openAI_model
        self.azure_openAI_model_name = config.azure_openAI_model_name

        self.include_text = config.include_text

    async def extract(self) -> Collection[dict]:
        logger.info("Scraping provided URLs")
        self.docs = list()  # type: List[Document]
        self.visited_pages = set()  # type: set

        for page, depth in zip(self.target_URLs, self.target_depths):
            logger.info(f"Processing {page} with depth {depth}")
            self.current_parent_page = page

            # Fetch target content
            success, content = self._check_page_make_document(page)

            if success:
                logger.info(f"Done with parent page {page}")
                if depth:  # recursive subpage processing
                    await self._process_subpages(page, content, depth)

        # Embedding process
        logger.info("Starting embedding process")
        vector_store_index = embed_documents(
            self.docs,
            self.azure_openAI_key,
            self.azure_openAI_version,
            self.azure_openAI_endpoint,
            self.azure_openAI_model,
            self.azure_openAI_model_name,
            embedding_chunk_size,
            embedding_overlap_size,
        )

        embedded_nodes = map_metadata(
            vector_store_index, include_text=self.include_text
        )

        return embedded_nodes

    async def _process_subpages(
        self,
        parent_URL: str,
        parent_content: str,
        target_depth: int,
        current_depth: int = 1,
    ) -> None:
        logger.info(f"Processing subpages of {parent_URL}")
        subpages = self._get_subpages_from_HTML(parent_content, parent_URL)

        if current_depth > target_depth:  # on recursion depth reached
            return

        for subpage in subpages:
            if subpage in self.visited_pages:
                continue

            logger.info(f"Processing subpage {subpage} of parent {parent_URL}")
            success, content = self._check_page_make_document(subpage)

            if success:
                logger.info(f"Done with subpage {subpage}")
                await self._process_subpages(
                    subpage, content, target_depth, current_depth + 1
                )

    def _check_page_make_document(self, page: str) -> Tuple[bool, str]:
        """
        Gets a page's HTML and adds to the visited pages set.
        If page has valid content, extracts the text and title and generates
        a Document object for the page.

        Returns a bool and the page_content:
            out[0]: False if the page content is invalid, True otherwise.
            out[1]: "" if the page content is invalid, page_content otherwise
        """

        page_content = self._get_page_HTML(page)
        self.visited_pages.add(page)

        if page_content == "ERROR IN PAGE RETRIEVAL":
            return (False, "")
        else:
            page_text = self._get_text_from_HTML(page_content)
            page_title = self._get_title_from_HTML(page_content)

            page_doc = self._make_document(page, page_title, page_text)
            self.docs.append(page_doc)

        return (True, page_content)

    def _get_page_HTML(self, input_URL: str) -> str:
        """
        Fetches a webpage's content, returning an error message on failure.
        """
        try:
            r = requests.get(input_URL, timeout=5)
            r.raise_for_status()
            return r.text
        except (HTTPError, RequestException) as e:
            logger.warning(f"Error in retrieving {input_URL}, error {e}")
            return "ERROR IN PAGE RETRIEVAL"

    def _get_subpages_from_HTML(self, html_content: str, input_URL: str) -> List[str]:
        """
        Extracts and returns a list of subpage URLs from a given page's HTML and URL.
        Subpage URLs are reconstructed to be absolute URLs and anchor links are trimmed.
        """
        # Parse the input page's HTML and collect its anchor tags
        soup = BeautifulSoup(html_content, "lxml")
        links = soup.find_all("a", href=True)

        # Restrict subpages to the domain of the current top-level page
        input_domain = urlparse(self.current_parent_page).netloc
        subpages = [input_URL]

        # Find eligible links
        for link in links:
            href = link["href"]
            full_url = urljoin(input_URL, href)

            # Check if the domain of the full URL matches the input domain
            if urlparse(full_url).netloc == input_domain:
                # Remove any query parameters or fragments
                full_url = urljoin(full_url, urlparse(full_url).path)
                if full_url not in subpages:
                    subpages.append(full_url)

        return subpages

    def _get_text_from_HTML(self, html_content: str) -> str:
        """
        Extracts and returns visible text from given HTML content as a single string.
        Designed to handle output from _get_page_HTML.
        """

        def filter_visible(el):
            if el.parent.name in [
                "style",
                "script",
                "head",
                "title",
                "meta",
                "[document]",
            ]:
                return False
            elif isinstance(el, Comment):
                return False
            else:
                return True

        # Use bs4 to find visible text elements
        soup = BeautifulSoup(html_content, "lxml")
        visible_text = filter(filter_visible, soup.find_all(string=True))
        return "\n".join(t.strip() for t in visible_text)

    def _get_title_from_HTML(self, html_content: str) -> str:
        """
        Extracts the title of a webpage given HTML content as a single string.
        Designed to handle output from _get_page_HTML.
        """

        soup = BeautifulSoup(html_content, "lxml")
        title_tag = soup.find("title")

        if title_tag:
            return title_tag.text

        else:
            return ""

    def _make_document(
        self, page_URL: str, page_title: str, page_text: str
    ) -> Document:
        """
        Constructs Document objects from webpage URLs
        and their content, including extra metadata.

        Cleans text content and includes data like page title,
        platform URL, page link, refresh timestamp, and page ID.
        """
        netloc = urlparse(page_URL).netloc
        current_time = str(datetime.datetime.utcnow())

        doc = Document(
            text=sanitize_text(page_text),
            extra_info={
                "title": page_title,
                "platform": netloc,
                "link": page_URL,
                "lastRefreshed": current_time,
                # Create a pageId based on page_URL - is this necessary?
                "pageId": md5_digest(page_URL.encode()),
            },
        )

        return doc
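
The link handling in `_get_subpages_from_HTML` can be illustrated without BeautifulSoup or network access. This is a standalone sketch that uses only `urllib.parse`. The function name `same_domain_links` and the example URLs are assumptions made for illustration and are not part of the PR. It follows the same steps as the method above: resolve each `href` against the page URL, keep only same-domain results, then drop query strings and fragments.

```python
from typing import List
from urllib.parse import urljoin, urlparse


def same_domain_links(base_url: str, hrefs: List[str]) -> List[str]:
    """Resolve hrefs against base_url and keep unique same-domain URLs."""
    domain = urlparse(base_url).netloc
    result = [base_url]  # the page itself comes first, as in the extractor

    for href in hrefs:
        full_url = urljoin(base_url, href)  # make relative links absolute
        if urlparse(full_url).netloc != domain:
            continue  # skip external sites and schemes such as mailto:
        # Re-join with the bare path to drop the query string and fragment
        full_url = urljoin(full_url, urlparse(full_url).path)
        if full_url not in result:
            result.append(full_url)

    return result


links = same_domain_links(
    "https://example.com/docs/",
    ["intro.html#setup", "/about?ref=nav", "https://other.org/x", "intro.html"],
)
print(links)
```

Because fragments and query strings are stripped before the duplicate check, `intro.html#setup` and `intro.html` collapse into a single entry.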