This issue was moved to a discussion.

Question on robots.txt #371

Closed
oschihin opened this issue Mar 17, 2021 · 1 comment

@oschihin
Contributor

Websites / departments in my organisation usually have a robots.txt with the following simple entry:

User-agent: *
Disallow: /*?*
Sitemap: https://www.[domain].org/sitemaps/[domain].xml

I am not sure how to deal with this when crawling with Heritrix 3.4. I tend to set <property name="robotsPolicyName" value="ignore"/>, but I wonder whether this a) is considered friendly and b) has negative side effects. So my questions are:

  • How does Heritrix handle the Disallow statement above? In my interpretation, it excludes exactly those URLs that contain a ? anywhere. But could Heritrix treat it more greedily, i.e. disallow everything?
  • Does Heritrix take the Sitemap statement into account?
@ato ato added the question label May 12, 2021
@ato
Collaborator

ato commented May 12, 2021

Heritrix does not currently support sitemaps (although there's a draft pull request adding it: #262) and does not support wildcards in Disallow lines (feature request #250). I haven't tested it but I would guess the rule Disallow: /*?* will be interpreted as matching paths that actually start with the literal string /*?. It will not match /index.html?foo.
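
To make the difference between the two readings concrete, here is a minimal Python sketch. It is not Heritrix code and has not been checked against Heritrix's behaviour; the function names literal_prefix_match and wildcard_match are made up for illustration. The first function models the untested guess above (the rule is treated as a plain path prefix), the second models the wildcard reading the question assumed, where * matches any run of characters and a trailing $ anchors the end.

```python
import re


def literal_prefix_match(rule: str, url_path: str) -> bool:
    """Prefix reading: the rule is compared character by character,
    so '*' and '?' have no special meaning."""
    return url_path.startswith(rule)


def wildcard_match(rule: str, url_path: str) -> bool:
    """Wildcard reading: '*' matches any run of characters and a
    trailing '$' anchors the end of the path (hypothetical helper)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything except '*', which becomes the regex '.*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, url_path) is not None


RULE = "/*?*"
for path in ("/index.html?foo", "/index.html", "/*?*x"):
    print(path, literal_prefix_match(RULE, path), wildcard_match(RULE, path))
```

Under the prefix reading only a path that literally begins with /*?* is blocked, so /index.html?foo stays crawlable. Under the wildcard reading every path containing a ? is blocked, and neither reading blocks /index.html.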

@oschihin oschihin closed this as completed Dec 1, 2021
@internetarchive internetarchive locked and limited conversation to collaborators Sep 30, 2022
@ato ato converted this issue into discussion #529 Sep 30, 2022
