This is my code to extract content from a web page through its URL, but I am getting the following error for some of the URLs.
WARNING:trafilatura.core:discarding data: None
WARNING:trafilatura.core:discarding data: None
Extracted content: None
I assume this is related to a relatively rare combination: a homepage with no main text and also no paragraphs (the text sits in div elements). It could be an occasion to improve the baseline extraction, but without precise text markers and boundaries, finding the right text elements is difficult.
import requests
from main_content_extractor import MainContentExtractor

# Fetch the page and decode it as UTF-8
url = "https://testing.nbnhchurch.org/"
response = requests.get(url)
response.encoding = 'utf-8'
content = response.text

# Extract the main content as HTML (default) and as Markdown
extracted_html = MainContentExtractor.extract(content)
extracted_markdown = MainContentExtractor.extract(content, output_format="markdown")
print("Extracted content:", extracted_markdown)
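For pages like the one discussed here, where the text lives only in div elements and the extractor returns None, one possible workaround is to fall back to collecting all visible text. The following is only a minimal sketch using the standard library's `html.parser`; the names `VisibleTextParser`, `fallback_text`, and `extract_or_fallback` are hypothetical helpers for illustration, not part of trafilatura or `main_content_extractor`, and the result will include navigation and footer text because no main-content detection is attempted.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring content inside non-visible elements."""

    # Elements whose text content should not appear in the output.
    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize whitespace and keep only non-empty, visible text.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def fallback_text(html):
    """Return every visible text node in the page, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_or_fallback(html, extractor):
    """Use `extractor` (any callable taking HTML) and fall back if it yields nothing."""
    result = extractor(html)
    return result if result else fallback_text(html)
```

With the code from this issue, the extractor argument could be something like `lambda h: MainContentExtractor.extract(h, output_format="markdown")`. Note that each text node becomes its own line, so inline elements such as bold text are split from their surrounding sentence; this is a rough fallback, not a replacement for proper extraction.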