Add web/knowledgebase crawler #771

rishimo · 2024-01-29T20:42:21Z

🤔 Why?

Adds a connector that can crawl webpages with an adjustable recursion depth. Retrieves all unique subpages found and creates External Search Documents accordingly.
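The depth-limited crawl described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `crawl` function, the injected `fetch` callable, the same-host restriction, and the in-memory `SITE` dictionary are all assumptions made so the example runs without network access.

```python
from html.parser import HTMLParser
from typing import Callable, List, Set
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url: str, max_depth: int, fetch: Callable[[str], str]) -> List[str]:
    """Return the unique same-host pages reachable within max_depth link hops.

    `fetch` is a hypothetical callable that maps a URL to its HTML.
    """
    host = urlparse(start_url).netloc
    seen: Set[str] = {start_url}
    ordered = [start_url]
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            parser = _LinkParser()
            parser.feed(fetch(url))
            for href in parser.hrefs:
                # Resolve relative links and drop #fragments so the same
                # page is not counted twice.
                link, _frag = urldefrag(urljoin(url, href))
                # Assumption: only follow links on the starting host.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    ordered.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return ordered


# Hypothetical in-memory site used in place of real HTTP requests.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.com/b": '<a href="https://other.example/x">external</a>',
    "https://example.com/c": "",
}

pages_depth_1 = crawl("https://example.com/", 1, lambda u: SITE.get(u, ""))
pages_depth_2 = crawl("https://example.com/", 2, lambda u: SITE.get(u, ""))
```

With a depth of 1 only the links on the start page are collected; raising the depth to 2 also picks up `/c`, which is linked from `/a`. The `seen` set is what keeps the result unique when several pages link to the same subpage.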

🤓 What?

Adds an extractor and methods to:

Retrieve a given page's subpages, up to a configurable recursion depth
Extract the visible text on the page
Extract the page title
Create Document objects, generate embeddings, and map the metadata back to the text nodes
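To illustrate the title and visible-text steps listed above, here is a small sketch using only the standard library. It is an assumption-laden illustration rather than the extractor in this PR: the `extract_title_and_text` helper, the `_HIDDEN_TAGS` set, and the sample `PAGE` are all hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Assumption: text inside these tags is never rendered as page content.
_HIDDEN_TAGS = {"script", "style", "head", "noscript", "template"}


class _VisibleTextParser(HTMLParser):
    """Collects the <title> text and the text outside hidden tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks: List[str] = []
        self._hidden_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in _HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in _HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._hidden_depth == 0:
            self.chunks.append(data)


def extract_title_and_text(html: str) -> Tuple[str, str]:
    """Return (title, visible text) with runs of whitespace collapsed."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())
    return parser.title.strip(), text


# Hypothetical sample page.
PAGE = """
<html>
  <head><title>Docs Home</title><style>p { color: red; }</style></head>
  <body>
    <h1>Welcome</h1>
    <script>console.log("hidden");</script>
    <p>Read the <a href="/guide">guide</a>.</p>
  </body>
</html>
"""

title, text = extract_title_and_text(PAGE)
```

The script and style contents are skipped because they sit inside hidden tags, while the title is captured separately even though it lives in `<head>`. A sketch like this does not account for elements hidden through CSS, which would need a rendering step.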

🧪 Tested?

Covered by unit tests and tested manually on several different pages; the extractor generates valid documents.

This only works on pages where the subpage URLs are immediately visible in the page; links that are revealed only after another element is clicked (or a script runs) will not be found.
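The limitation can be seen in a small, hypothetical example: a parser that reads the HTML as delivered finds the anchor that is already in the markup, but not one that a script would create after a click. The `static_links` helper and the sample `PAGE` below are illustrative assumptions, not code from this PR.

```python
from html.parser import HTMLParser
from typing import List


class _AnchorParser(HTMLParser):
    """Collects href values from <a> tags present in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def static_links(html: str) -> List[str]:
    """Return the links visible to a parser that does not execute scripts."""
    parser = _AnchorParser()
    parser.feed(html)
    return parser.hrefs


# Hypothetical page: one link is in the markup, the other is only
# created when the button's click handler runs in a browser.
PAGE = """
<a href="/pricing">Pricing</a>
<button onclick="showMore()">More</button>
<script>
  function showMore() {
    var a = document.createElement("a");
    a.href = "/hidden-docs";
    document.body.appendChild(a);
  }
</script>
"""

found = static_links(PAGE)
```

Here `found` contains `/pricing` only. Discovering `/hidden-docs` would require executing the page's JavaScript, for example with a headless browser, which is outside the scope described in this PR.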

☑️ Checks

My PR contains actual code changes, and I have updated the version number in pyproject.toml.


shortcut-integration · 2024-01-29T20:42:24Z

This pull request has been linked to Shortcut Story #23110: Create initial extractor for static / public pages.

tests/static_web/test_extractor.py

github-actions · 2024-01-31T22:22:16Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
15966	14762	92%	85%	🟢

New Files

File	Coverage	Status
metaphor/static_web/config.py	100%	🟢
metaphor/static_web/extractor.py	98%	🟢
TOTAL	99%	🟢

Modified Files

File	Coverage	Status
metaphor/notion/config.py	100%	🟢
TOTAL	100%	🟢

updated for commit: dc43b31 by action🐍

codecov · 2024-01-31T22:22:37Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (a5b9785) 92.21% compared to head (dc43b31) 92.45%.

Files	Patch %	Lines
metaphor/static_web/extractor.py	98.24%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #771      +/-   ##
==========================================
+ Coverage   92.21%   92.45%   +0.24%     
==========================================
  Files         194      156      -38     
  Lines       15999    15966      -33     
==========================================
+ Hits        14754    14762       +8     
+ Misses       1245     1204      -41
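As a sanity check, the figures in the diff above can be reproduced from the hit and line counts. The sketch below assumes the percentages are truncated (not rounded) to two decimals; that rounding mode is inferred from the numbers shown here, not taken from Codecov documentation, and `coverage_pct` is a hypothetical helper.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage percentage truncated to two decimals (assumed rounding mode)."""
    return math.floor(hits / lines * 10_000) / 100


# Counts taken from the coverage diff.
base_hits, base_lines = 14754, 15999
head_hits, head_lines = 14762, 15966

base = coverage_pct(base_hits, base_lines)
head = coverage_pct(head_hits, head_lines)
delta = round(head - base, 2)

# Misses are simply the lines that were not hit.
base_misses = base_lines - base_hits
head_misses = head_lines - head_hits

print(base, head, delta, base_misses, head_misses)
```

Coverage rises even though the hit count grows by only 8, because the total line count drops by 33 and the miss count by 41.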



metaphor/static_web/config.py

tests/static_web/test_extractor.py

metaphor/static_web/extractor.py

tests/static_web/test_extractor.py

metaphor/static_web/extractor.py

…ument

…ction to load text files, cleaned up extractor code.

prith189

LGTM

rishimo added 10 commits January 10, 2024 12:19

Initial extractor for webpages

5aa4c64

Updated tests and lxml dependency version

68c4889

Changed hashing, fixed subpage list

e39fc2b

merged main

4d02695

Added configurable recursion depth for page scraping

ffd1d43

Updated unit tests and extractor logic

caaf9a8

some more test tweaks

790d1f7

Merged main

0ee419c

Updated unit tests and static web crawler for proper URL parsing and …

5e4dcc9

…only visible text retrieval

Merged main, bumped version

c37e05e

github-advanced-security bot found potential problems Jan 29, 2024


tests/static_web/test_extractor.py Fixed

rishimo added 4 commits January 29, 2024 12:50

change test to check url exactly

00cdbe0

dropped endpoint default from configuration

95a4ef2

Merged main

66c5877

added _description and _platform

43b72f7

rishimo and others added 5 commits January 31, 2024 16:17

explicitly set llm to none to suppress openai api key warning

1834654

big rewrite of web crawler to reduce complication and make recursion …

cca7f13

…better

updated unit tests and big refactoring

1ff608b

merged main

97c65f3

Added test for _process_subpages

b29dd82

rishimo marked this pull request as ready for review February 7, 2024 19:04

rishimo requested review from alyiwang, prith189 and mars-lan February 7, 2024 19:05

rishimo mentioned this pull request Feb 7, 2024

Raised timeout to 15 seconds #778

Merged

1 task

rishimo added 2 commits February 7, 2024 13:42

Merge branch 'main' into rishimohan/sc-23110/web-crawler

3514c22

Merge branch 'main' into rishimohan/sc-23110/web-crawler

e97fc1f

rishimo added 2 commits February 12, 2024 15:11

Forced llama-index version to 0.9.48

08371c1

Merge branch 'main' into rishimohan/sc-23110/web-crawler

fc8f4f1