Question on robots.txt #371

oschihin · 2021-03-17T10:43:35Z

Websites / departments in my organisation usually have a robots.txt with the following simple entry:

User-agent: *
Disallow: /*?*
Sitemap: https://www.[domain].org/sitemaps/[domain].xml

I am not sure how to deal with it when crawling with Heritrix 3.4. I tend to set <property name="robotsPolicyName" value="ignore"/>, but wonder whether this a) is considered friendly and b) has negative side effects. So my questions are:

How does Heritrix deal with the Disallow statement above? As I read it, the rule excludes only URLs that contain a ? anywhere. Could Heritrix interpret it more greedily, i.e. disallow everything?
Does Heritrix take the Sitemap statement into account?


ato · 2021-05-12T14:28:29Z

Heritrix does not currently support sitemaps (although there's a draft pull request adding it: #262) and does not support wildcards in Disallow lines (feature request #250). I haven't tested it, but I would guess the rule Disallow: /*?* will be interpreted as matching only paths that actually start with the literal string /*?. It will not match /index.html?foo.
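
To illustrate the difference between the two readings, here is a minimal Python sketch. It is not Heritrix code: prefix_match and wildcard_match are hypothetical helpers written for this comparison. The first treats the rule as a plain literal prefix (the behaviour guessed at above), the second applies the wildcard convention used by the major search engines, where * matches any run of characters and a trailing $ anchors the end of the path.

```python
import re

RULE = "/*?*"  # the Disallow value from the robots.txt in question


def prefix_match(rule: str, path: str) -> bool:
    """Literal prefix matching: '*' and '?' have no special meaning."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Wildcard matching: '*' matches any characters, trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything except '*', which becomes '.*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None


if __name__ == "__main__":
    for path in ["/index.html", "/index.html?foo", "/*?*literal"]:
        print(path, prefix_match(RULE, path), wildcard_match(RULE, path))
```

Under the literal-prefix reading, /index.html?foo is not disallowed, so a crawler without wildcard support would fetch it even with robots.txt obeyed. Under the wildcard reading, only URLs containing a ? are blocked, and /index.html stays allowed in both cases.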

ato added the question label May 12, 2021

oschihin closed this as completed Dec 1, 2021

internetarchive locked and limited conversation to collaborators Sep 30, 2022

ato converted this issue into discussion #529 Sep 30, 2022
