NewsSiteMapParserBolt: do not detect feeds as sitemaps #35
Comments
In addition to checking for a sitemap tag match which we are doing now, can we also check for the presence of blacklisted tags, such as |
Hi @silentninja, thanks for the suggestions: yes, it would be possible to implement contradictory patterns in the ContentDetector. When in doubt, I would first go with a positive pattern that has a higher priority than the sitemap namespace(s), the same as is done for the news sitemap namespace and the |
@sebastian-nagel Thanks for your response. While I understand the reasoning behind reusing the existing logic, I believe it's cleaner and more intuitive to ensure that positive matches are sitemaps and never feeds. Mixing the two could lead to the false assumption that every positive match is a valid sitemap, which could introduce confusion or errors down the line. Keeping feeds separate from sitemap matches gives better clarity and avoids misclassification. For reference, the current approach in NewsSiteMapDetectorBolt shows this pitfall: any positive match is assumed to be a sitemap. An alternative, if we don't want to edit the ContentDetector, would be to use a separate ContentDetector for detecting feeds and to return early if a match is found.
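To make the "separate feed check, return early" idea concrete, here is a minimal, self-contained sketch. It does not use the project's ContentDetector API; the class name `FeedFirstDetector`, the `Kind` enum, the clue strings and the `maxOffset` parameter are all assumptions made for illustration. It only shows the ordering: feed clues are tested first, and the sitemap clues are consulted only if no feed clue matched.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of the project: classifies the head of a
// document as feed, sitemap or other, checking feed clues first.
final class FeedFirstDetector {

    enum Kind { FEED, SITEMAP, OTHER }

    // Assumed feed clues (root elements of RSS, Atom and RDF-based feeds).
    // Plain substring matching is crude: "<feed" would also match "<feedback".
    private static final List<String> FEED_CLUES =
            List.of("<rss", "<feed", "<rdf:RDF");

    // Assumed sitemap clues: the sitemaps namespace and the two root elements.
    private static final List<String> SITEMAP_CLUES =
            List.of("http://www.sitemaps.org/schemas/sitemap/0.9",
                    "<urlset", "<sitemapindex");

    static Kind classify(byte[] content, int maxOffset) {
        // Only look at the first maxOffset bytes; ISO-8859-1 maps every byte
        // to one char, so decoding cannot fail on a truncated multi-byte sequence.
        int len = Math.min(content.length, maxOffset);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);

        // Feed check first: return early so a feed that merely declares the
        // sitemaps namespace is never reported as a sitemap.
        for (String clue : FEED_CLUES) {
            if (head.contains(clue)) {
                return Kind.FEED;
            }
        }
        for (String clue : SITEMAP_CLUES) {
            if (head.contains(clue)) {
                return Kind.SITEMAP;
            }
        }
        return Kind.OTHER;
    }
}
```

With this ordering, a caller can treat `Kind.SITEMAP` as "sitemap and not a feed", which is the property argued for above. The same effect could presumably be reached with a higher-priority positive pattern inside a single detector, as suggested earlier in the thread.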
If a news feed uses the sitemaps namespace, it is erroneously detected as a sitemap. As a result, it is processed as a sitemap (without being properly parsed) and not as a feed. One example feed: