Support for sitemap parsing from text instead of URLs #751

Open
NiClassic opened this issue Nov 27, 2024 · 1 comment
Labels
feedback Feedback from users requested

Comments

@NiClassic

While working with your library, I noticed that content can be extracted from a page by passing the response text to the extract() function. The sitemap_search() function, however, only accepts a URL pointing to the sitemap, which is a problem in environments with limited internet access (e.g., behind a proxy). I expected behavior similar to the following snippet:

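import httpx
from trafilatura import sitemaps

# proxy_mounts is assumed to be defined elsewhere (proxy transports for httpx).
with httpx.Client(mounts=proxy_mounts) as client:
    url = "http://example.com/sitemap.xml"
    res = client.get(url)

# Desired behavior: pass the already-fetched sitemap text instead of a URL.
# The result gets its own name so it does not shadow the sitemaps module.
links = sitemaps.sitemap_search(res.text)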

Please let me know if this is something that could be implemented in the future.
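Until something like this is supported, the links can be pulled out of sitemap text that has already been fetched, using only the standard library. This is a minimal sketch, not part of trafilatura's API: the function name extract_sitemap_links and the sample XML are made up for illustration, and it does not follow nested sitemap indexes or apply any of the filtering that sitemap_search() performs.

```python
import xml.etree.ElementTree as ET


def extract_sitemap_links(xml_text: str) -> list[str]:
    """Return the URLs found in <loc> elements of a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare the local name only.
        # Note: this matches <loc> in any namespace.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            links.append(element.text.strip())
    return links


# Hypothetical sitemap content, standing in for res.text from the snippet above.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page-1</loc></url>
  <url><loc>http://example.com/page-2</loc></url>
</urlset>"""

print(extract_sitemap_links(sample))
```

For a sitemap index, the returned links are themselves sitemap URLs that would have to be fetched through the proxy and parsed the same way.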

adbar added the feedback (Feedback from users requested) label on Nov 28, 2024
@adbar
Owner

adbar commented Nov 28, 2024

Hi @NiClassic, the sitemap_search function attempts to guess a sitemap's address if you pass a homepage instead of a direct sitemap URL.

If you pass https://example.org/ as input and if there is a sitemap at https://example.org/sitemap.xml it should be found automatically.

Does that solve your problem or is there an issue with the guessing mechanism?
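The homepage-to-sitemap mapping in the example can be illustrated with a small standard-library sketch. It only reproduces the one case mentioned here (homepage plus /sitemap.xml); it is not trafilatura's actual guessing logic, which may try other locations, and the helper name default_sitemap_url is hypothetical.

```python
from urllib.parse import urljoin


def default_sitemap_url(homepage: str) -> str:
    """Build the conventional /sitemap.xml address for a homepage URL."""
    # An absolute path replaces whatever path the homepage URL carries.
    return urljoin(homepage, "/sitemap.xml")


print(default_sitemap_url("https://example.org/"))
```

Note that this guessing still requires sitemap_search() to fetch the URL itself, so it does not by itself address setups where requests must go through a custom client.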
