Extracting content from a URL returns None #586

Fabiha15 · 2024-05-05T12:01:12Z

import requests
from main_content_extractor import MainContentExtractor

# Fetch the page and force UTF-8 decoding of the response body
url = "https://testing.nbnhchurch.org/"
response = requests.get(url)
response.encoding = 'utf-8'
content = response.text

# Extract the main content, first with the default output format, then as Markdown
extracted_html = MainContentExtractor.extract(content)
extracted_markdown = MainContentExtractor.extract(content, output_format="markdown")
print("Extracted content:", extracted_markdown)

This is my code to extract content from a web page given its URL. For some URLs, however, I get the following warnings and the result is None:
WARNING:trafilatura.core:discarding data: None
WARNING:trafilatura.core:discarding data: None
Extracted content: None
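
Until the cause is known, a caller can at least guard against the None result by trying several extraction strategies in order. The following is a minimal sketch under stated assumptions: `first_non_empty` is a hypothetical helper, not part of main_content_extractor or trafilatura, and the lambdas in the usage lines are stand-ins for real extractor calls.

```python
from typing import Callable, Iterable, Optional


def first_non_empty(
    html: str,
    extractors: Iterable[Callable[[str], Optional[str]]],
) -> Optional[str]:
    """Return the first non-blank result from a sequence of extractor callables.

    Hypothetical helper: each extractor takes an HTML string and returns
    extracted text, or None (or blank text) when it finds nothing.
    """
    for extractor in extractors:
        result = extractor(html)
        if result and result.strip():
            return result
    return None


# Stand-in extractors for illustration only: the first one mimics an
# extractor that gives up, the second one mimics a cruder fallback.
strict = lambda html: None
crude = lambda html: "fallback text"
print(first_non_empty("<div>Hello</div>", [strict, crude]))
```

In practice the first callable would wrap the main extraction call and later ones would wrap progressively cruder fallbacks, so that None only reaches the caller when every strategy fails.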

adbar · 2024-05-06T15:45:10Z

I assume this is related to a relatively rare combination: a homepage with no main text and also no paragraphs (the text sits in div elements). It could be an occasion to improve the baseline extraction, but without precise text markers and boundaries, finding the right text elements is difficult.
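
One rough way to check whether a page matches this pattern is to count its `<p>` elements and look at what visible text remains. The sketch below uses only the Python standard library; `ParagraphProbe` and `probe` are hypothetical names, the sample HTML is invented for illustration, and none of this reflects how trafilatura works internally.

```python
from html.parser import HTMLParser


class ParagraphProbe(HTMLParser):
    """Count <p> elements and collect visible text outside <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.p_count = 0
        self.text_chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_count += 1
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside <script> or <style>
        if self._skip_depth == 0 and data.strip():
            self.text_chunks.append(data.strip())


def probe(html):
    """Return (number of <p> elements, visible text joined by spaces)."""
    parser = ParagraphProbe()
    parser.feed(html)
    parser.close()
    return parser.p_count, " ".join(parser.text_chunks)


# Invented sample: all text lives in <div> elements, none in <p>
sample = (
    "<html><body><div>Welcome</div><div>Service times: 10am</div>"
    "<script>var x = 1;</script></body></html>"
)
p_count, text = probe(sample)
print(p_count, repr(text))
```

A paragraph count of zero alongside non-empty visible text suggests the page falls into the div-only case described above.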

adbar added the question (Further information is requested) label on May 6, 2024
