ignore www prefix #614
Comments
I'm not alone :) #596
One solution could be to define two crawlers, one with the URL normalization to always use If you do not know up front all the domains that will be crawled, it could get tricky for sure. We could make this a feature request, but I am not sure what solution could be generic enough. Especially knowing that Maybe we could have a smart URL Normalizer where you can indicate your preference (
Do you think an (optional) feature like that could do it?
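To make the "preference" idea a bit more concrete, here is a rough Python sketch of what such a normalizer could do. It is not Norconex code: the function name `apply_www_preference` and the preference values `"add"`, `"remove"` and `"keep"` are made up for illustration, and it naively treats any host without a leading `www.` as a candidate for `"add"` (so it would also prefix other sub-domains).

```python
from urllib.parse import urlsplit, urlunsplit


def apply_www_preference(url: str, preference: str = "keep") -> str:
    """Rewrite the host of `url` according to a www preference.

    `preference` is a hypothetical setting: "add", "remove" or "keep".
    """
    if preference not in ("add", "remove", "keep"):
        raise ValueError(f"unknown preference: {preference!r}")

    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is returned lower-cased
    if preference == "keep" or not host:
        return url

    has_www = host.startswith("www.")
    if preference == "remove" and has_www:
        new_host = host[len("www."):]
    elif preference == "add" and not has_www:
        new_host = "www." + host
    else:
        return url  # already in the preferred form

    # Rebuild the network location, keeping user info and port if present.
    netloc = new_host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username:
        userinfo = parts.username
        if parts.password:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"

    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

A single global preference like this still has the limitation discussed in this thread: it cannot know which variant a given site actually serves, so it would probably need to be combined with redirect handling.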
Maybe we could simplify the logic as follows:
I assume it would not be that easy to implement the options. What do you think? Thanks a lot!
Plenty of good ideas. I just marked this as a feature request.
We need to crawl many Internet sites and have run into an issue with the www prefix: some sites redirect to their domain without www, others the other way round.
Unfortunately, this case cannot be handled by NC in a general way (globally). We can normalize URLs by removing the www prefix, and if a site redirects to www.some.site again, the collector will follow, as it is configured to follow sub-domains. But there will be cases where a site is available with the www prefix only (e.g. https://www.pony.at/ does not work without www), so we would miss such sites again.
So, I'm looking for a general solution to that problem.
Any ideas are very welcome! Thank you!
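One possible direction, sketched in Python rather than as collector configuration: instead of forcing one variant globally, treat the www and non-www hosts as the same site and remember, per site, which variant a redirect actually landed on. Everything here (`bare_host`, `WwwVariantRegistry` and its methods) is a hypothetical illustration, not an existing NC feature, and `example.org` is only a placeholder host.

```python
from urllib.parse import urlsplit


def bare_host(host: str) -> str:
    """Return the host lower-cased and without a leading 'www.'."""
    host = host.lower()
    return host[len("www."):] if host.startswith("www.") else host


class WwwVariantRegistry:
    """Remembers, per site, which www variant redirects pointed to."""

    def __init__(self) -> None:
        self._canonical = {}  # bare host -> host the site redirected to

    def record_redirect(self, source_url: str, target_url: str) -> None:
        """Learn the preferred variant from an observed redirect."""
        src = urlsplit(source_url).hostname
        dst = urlsplit(target_url).hostname
        # Only learn from redirects that differ by the www prefix alone.
        if src and dst and src != dst and bare_host(src) == bare_host(dst):
            self._canonical[bare_host(dst)] = dst

    def canonical_host(self, url: str):
        """Return the learned host for this URL, or its own host."""
        host = urlsplit(url).hostname
        if not host:
            return None
        return self._canonical.get(bare_host(host), host)

    @staticmethod
    def same_site(url_a: str, url_b: str) -> bool:
        """True when both URLs share a host, ignoring the www prefix."""
        host_a = urlsplit(url_a).hostname
        host_b = urlsplit(url_b).hostname
        return bool(host_a and host_b) and bare_host(host_a) == bare_host(host_b)


registry = WwwVariantRegistry()
registry.record_redirect("https://example.org/", "https://www.example.org/")
print(registry.canonical_host("https://example.org/about"))
print(WwwVariantRegistry.same_site("https://www.pony.at/", "https://pony.at/x"))
```

Sites that serve only one variant without redirecting from the other would never be learned this way; for those, the crawler would still have to be started with the working URL.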
Common requirements for a crawler: