NewsSiteMapParserBolt: do not detect feeds as sitemaps #35

sebastian-nagel · 2020-01-09T16:13:50Z

If a news feed declares the sitemaps namespace, it is erroneously detected as a sitemap. As a result, it is processed as a sitemap (without being parsed properly) and not as a feed. One example feed:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.drudge.com/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wordzilla="http://www.cadenhead.org/workbench/wordzilla/namespace" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

silentninja · 2025-01-07T15:36:03Z

In addition to checking for a sitemap tag match, which we are doing now, can we also check for the presence of blacklisted tags such as <rss> and short-circuit the org.commoncrawl.stormcrawler.news.ContentDetector#getFirstMatch function so that it reports no match?
This would prevent feeds from being identified as sitemaps.
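
A minimal sketch of the short-circuit idea, assuming a simplified detector that scans the first bytes of the content for string clues. The class `ClueDetectorSketch`, its constructor, and the `-1` "no match" convention are hypothetical and only illustrate the proposal; they are not the actual ContentDetector API.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch, not the project's ContentDetector. */
final class ClueDetectorSketch {
    private final List<String> positiveClues;
    private final List<String> negativeClues;
    private final int maxOffset;

    ClueDetectorSketch(List<String> positiveClues, List<String> negativeClues, int maxOffset) {
        this.positiveClues = positiveClues;
        this.negativeClues = negativeClues;
        this.maxOffset = maxOffset;
    }

    /**
     * Returns the list index of the first positive clue found in the
     * leading bytes, or -1 if no positive clue is found or if any
     * negative ("blacklisted") clue is present.
     */
    int getFirstMatch(byte[] content) {
        int len = Math.min(content.length, maxOffset);
        // ISO-8859-1 maps every byte to one char, so decoding never fails
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        for (String negative : negativeClues) {
            if (head.contains(negative)) {
                return -1; // short-circuit: looks like a feed, report no match
            }
        }
        for (int i = 0; i < positiveClues.size(); i++) {
            if (head.contains(positiveClues.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

With `<rss` as a negative clue, the example feed above would yield no match even though it declares the sitemaps namespace.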

sebastian-nagel · 2025-01-07T18:07:30Z

Hi @silentninja, thanks for the suggestion: yes, it would be possible to implement contradictory patterns in the ContentDetector. If in doubt, I would first go with a positive pattern that has a higher priority than the sitemap namespace(s), the same as is done for the news sitemap namespace and the <sitemapindex pattern. This solution wouldn't require changes in the ContentDetector.
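
To illustrate the priority idea, here is a rough sketch that assumes clues are tried in a fixed order and the first clue found decides the content type. The class `PriorityCluesSketch`, the `Kind` enum, the feed clue strings, and the exact ordering are assumptions made for this example and are not taken from the project code.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of ordered positive clues; names are illustrative. */
final class PriorityCluesSketch {
    enum Kind { FEED, SITEMAP_INDEX, NEWS_SITEMAP, SITEMAP, UNKNOWN }

    // LinkedHashMap keeps insertion order, so earlier entries win.
    private static final Map<String, Kind> CLUES = new LinkedHashMap<>();
    static {
        CLUES.put("<rss", Kind.FEED);   // assumed feed clues, highest priority
        CLUES.put("<feed", Kind.FEED);
        CLUES.put("<sitemapindex", Kind.SITEMAP_INDEX);
        CLUES.put("http://www.google.com/schemas/sitemap-news/0.9", Kind.NEWS_SITEMAP);
        CLUES.put("http://www.sitemaps.org/schemas/sitemap/0.9", Kind.SITEMAP);
    }

    /** Classifies content by the highest-priority clue in its first maxOffset bytes. */
    static Kind classify(byte[] content, int maxOffset) {
        int len = Math.min(content.length, maxOffset);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, Kind> clue : CLUES.entrySet()) {
            if (head.contains(clue.getKey())) {
                return clue.getValue();
            }
        }
        return Kind.UNKNOWN;
    }
}
```

Because the feed clues come first, a feed that also declares the sitemaps namespace is classified as a feed, and the caller can decide what to do with that result.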

silentninja · 2025-01-09T15:17:54Z

@sebastian-nagel Thanks for your response. While I understand the reasoning behind reusing the existing logic, I believe it’s cleaner and more intuitive to ensure that positive matches are always sitemaps and never feeds.

Mixing the two could lead to the false assumption that all positive matches are valid sitemaps, which could introduce confusion or errors down the line. Keeping feeds separate from sitemap matches is clearer and avoids potential misclassification. For reference, the current approach in NewsSiteMapDetectorBolt highlights a potential pitfall: any positive match is assumed to be a sitemap.

An alternative solution, if we don't want to edit the ContentDetector, would be to use a separate ContentDetector for detecting feeds and to return early if a match is found.
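
Roughly what I have in mind, as a sketch only: the nested `Detector` class stands in for a ContentDetector instance, and the `route` method, the clue strings, and the 1024-byte limit are all hypothetical placeholders rather than existing code.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch: one detector for feeds, one for sitemaps. */
final class TwoDetectorSketch {

    /** Stand-in for a clue-based detector; not the project's class. */
    static final class Detector {
        private final List<String> clues;
        private final int maxOffset;

        Detector(List<String> clues, int maxOffset) {
            this.clues = clues;
            this.maxOffset = maxOffset;
        }

        /** True if any clue occurs within the first maxOffset bytes. */
        boolean matches(byte[] content) {
            int len = Math.min(content.length, maxOffset);
            String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
            return clues.stream().anyMatch(head::contains);
        }
    }

    private static final Detector FEEDS =
            new Detector(List.of("<rss", "<feed"), 1024);
    private static final Detector SITEMAPS =
            new Detector(List.of("<sitemapindex",
                    "http://www.sitemaps.org/schemas/sitemap/0.9"), 1024);

    /** Checks the feed detector first and returns early on a match. */
    static String route(byte[] content) {
        if (FEEDS.matches(content)) {
            return "feed";
        }
        if (SITEMAPS.matches(content)) {
            return "sitemap"; // a positive match here is never a feed
        }
        return "other";
    }
}
```

This keeps the sitemap detector's positive matches limited to sitemaps, at the cost of scanning the head of the content twice.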
