Add web/knowledgebase crawler #771
…only visible text retrieval
This pull request has been linked to Shortcut Story #23110: Create initial extractor for static / public pages.
Codecov Report
Additional details and impacted files (coverage diff, main vs. #771):

Metric     main      #771      +/-
Coverage   92.21%    92.45%    +0.24%
Files      194       156       -38
Lines      15999     15966     -33
Hits       14754     14762     +8
Misses     1245      1204      -41
…ction to load text files, cleaned up extractor code.
LGTM
🤔 Why?
SC23110
Adds a connector that crawls web pages up to an adjustable recursion depth. It retrieves every unique subpage it finds and creates External Search Documents from them.
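As a rough illustration of depth-limited crawling with de-duplication, here is a minimal sketch. It is not the connector's actual code: the `crawl` function, the `get_links` callback, and the in-memory `SITE` map are all hypothetical names chosen for this example, and a breadth-first traversal is assumed.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_depth: int = 1) -> list[str]:
    """Breadth-first crawl; returns each unique URL within max_depth hops."""
    start = urldefrag(start_url).url
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for href in get_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            target = urldefrag(urljoin(url, href)).url
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "https://example.com/": ["/a", "/b", "/a#top"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/b": [],
    "https://example.com/c": ["/d"],
}


def fake_links(url: str) -> list[str]:
    return SITE.get(url, [])


if __name__ == "__main__":
    print(crawl("https://example.com/", fake_links, max_depth=2))
```

Passing the link source in as a callback keeps the traversal testable without network access; a real connector would plug in a function that fetches the page and extracts its links.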
🤓 What?
Adds an extractor and methods to:
🧪 Tested?
Covered by unit tests and tested manually on several different pages; the crawler generates valid documents.
This only works on pages where the subpage URLs are immediately visible. Links that are revealed only after interacting with another element (a click, a script, etc.) will not be found.
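The limitation can be illustrated with a small sketch, assuming links are collected from the static markup only. This uses the standard library's `html.parser` for demonstration; the `LinkCollector` class, `static_links` function, and `PAGE` sample are hypothetical and do not reflect the parser the extractor actually uses.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags present in the raw markup."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one static link, one link created only when a script runs.
PAGE = """
<a href="/docs">Docs</a>
<button onclick="showMore()">More</button>
<script>
  function showMore() {
    var a = document.createElement("a");
    a.href = "/hidden";
    document.body.appendChild(a);
  }
</script>
"""

if __name__ == "__main__":
    print(static_links(PAGE))
```

Because the markup is never executed, the `/hidden` link that the script would add on click is not discovered; finding it would require rendering the page in a browser engine, which is outside what this sketch covers.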
☑️ Checks
pyproject.toml
.