Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NewsSiteMapParserBolt: do not detect feeds as sitemaps #35

Open
sebastian-nagel opened this issue Jan 9, 2020 · 3 comments
Open

NewsSiteMapParserBolt: do not detect feeds as sitemaps #35

sebastian-nagel opened this issue Jan 9, 2020 · 3 comments

Comments

@sebastian-nagel
Copy link
Collaborator

If a news feed uses the sitemaps namespace it is erroneously detected as sitemap which causes that it's processed as sitemap (without being properly parsed) and not as feed. One example feed:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.drudge.com/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wordzilla="http://www.cadenhead.org/workbench/wordzilla/namespace" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
@silentninja
Copy link

silentninja commented Jan 7, 2025

In addition to checking for a sitemap tag match which we are doing now, can we also check for the presence of blacklisted tags, such as <rss>, and short-circuit the org.commoncrawl.stormcrawler.news.ContentDetector#getFirstMatch function to return as no matches found?
This would prevent identifying feeds as sitemaps

@sebastian-nagel
Copy link
Collaborator Author

Hi @silentninja, thanks for the suggestions: yes, it would be possible to implement contradictory patterns in the ContentDetector. In doubt, I would first go with a positive pattern which has a higher priority than the sitemap namespace(s), same as done for the news sitemap namespace and the <sitemapindex pattern. This solution wouldn't require changes in the ContentDetector.

@silentninja
Copy link

silentninja commented Jan 9, 2025

@sebastian-nagel Thanks for your response. While I understand the reasoning behind reusing the existing logic, I believe it’s cleaner and more intuitive to ensure positive matches are sitemaps and do not include feeds.

Mixing the two could lead to false assumptions that all positive matches are valid sitemaps, which could introduce confusion or errors down the line. Keeping feeds separate from sitemap matches ensures better clarity and avoids potential misclassification. For reference, the current approach in NewsSiteMapDetectorBolt highlights a potential pitfall where any positive match is assumed to be a sitemap

An alternate solution if we don't want to edit the ContentDetector would be to use a separate ContentDetector for detecting feeds and returning early if a match is found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants