Websites and departments in my organisation usually have a robots.txt with a simple entry containing a Disallow rule and a Sitemap statement. I am not sure how to deal with it when crawling with Heritrix 3.4. I tend to set <property name="robotsPolicyName" value="ignore"/>, but I wonder whether this a) is considered friendly and b) has negative side effects. So the questions are:
How does Heritrix deal with the Disallow statement in that entry? In my interpretation, it excludes only URLs with a ? anywhere. But could Heritrix treat this more "greedily", i.e. disallow everything?
Does Heritrix consider the Sitemap statement?
Heritrix does not currently support sitemaps (although there's a draft pull request adding it: #262) and does not support wildcards in Disallow lines (feature request #250). I haven't tested it, but I would guess the rule Disallow: /*?* will be interpreted as matching only paths that actually start with the literal string /*?. It will not match /index.html?foo. If that guess is right, the rule is neither "greedy" nor effective against URLs with query strings.
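To make the difference concrete, here is a rough Python sketch comparing the two readings of Disallow: /*?*. It is not Heritrix's code. The function names are made up for illustration, and dropping one trailing * in the literal reading is an assumption chosen only to match the guess above that the effective prefix is /*?.

```python
import re


def literal_prefix_disallows(rule_path: str, url_path: str) -> bool:
    """Literal reading: the rule is a plain string prefix, '*' is not special.

    Assumption (to mirror the guess in this thread): one trailing '*' is
    dropped, so '/*?*' becomes the literal prefix '/*?'.
    """
    if rule_path.endswith("*"):
        rule_path = rule_path[:-1]
    return url_path.startswith(rule_path)


def wildcard_disallows(rule_path: str, url_path: str) -> bool:
    """Wildcard reading: '*' matches any run of characters, and a trailing
    '$' anchors the rule at the end of the path."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape every literal piece and join the pieces with '.*'.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, url_path) is not None


RULE = "/*?*"
for path in ["/index.html?foo", "/index.html", "/*?literal"]:
    print(path, literal_prefix_disallows(RULE, path), wildcard_disallows(RULE, path))
```

Under the literal reading only the odd path /*?literal is blocked, while the wildcard reading blocks every path containing a ?. Neither reading disallows /index.html, so the rule would not be "greedy" either way.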