Handling redirected URLs #25

Open
fritzmg opened this issue Feb 28, 2022 · 2 comments

Comments

@fritzmg
Contributor

fritzmg commented Feb 28, 2022

As mentioned in contao/contao#4213 subscribers might want to handle or react to redirected URLs. Or rather: the CrawlUri passed to the crawler subscribers will currently not be the actual URL that has been crawled, if a redirect was involved.

When a URL responds with a redirect, the HTTP client automatically follows that redirect (up to the max_redirects setting). This means, however, that when subscribers are notified, the passed CrawlUri instance will not be the actual URL that has been crawled in the end. In contao/contao#4218 this is rectified by analysing the response info that the ResponseInterface of the Symfony Http Client provides. But maybe there is a better way of doing this.

The Symfony Http Client does not offer much utility when it comes to redirects. If you want to handle redirects more granularly, you can set max_redirects to 0 and then handle the RedirectionException - and then decide whether another request should be made or not. At the same time, you could also directly update the CrawlUri instance with the new URL (or even track each individual URL in a stack) before it is passed to subscribers.
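To illustrate the "track each individual URL in a stack" idea, here is a minimal sketch of following redirects by hand. It is not how Escargot does it: `followRedirects()` and the `$fetch` callable are hypothetical names, and the sketch assumes `$fetch` returns the status code plus an already absolute redirect target (or null).

```php
<?php

declare(strict_types=1);

/**
 * Follows redirects manually and records every URL along the way.
 *
 * $fetch is a hypothetical callable, fn(string $url): array{int, ?string},
 * returning the status code and the absolute redirect target, if any.
 *
 * @return list<string> all URLs in the chain, starting with the requested one
 */
function followRedirects(callable $fetch, string $url, int $maxRedirects = 20): array
{
    $stack = [$url];

    for ($i = 0; $i < $maxRedirects; ++$i) {
        [$status, $location] = $fetch($url);

        // Anything other than a 3xx response with a target ends the chain.
        if ($status < 300 || $status >= 400 || null === $location) {
            break;
        }

        // Stop on redirect loops instead of bouncing until the limit is hit.
        if (in_array($location, $stack, true)) {
            break;
        }

        $stack[] = $url = $location;
    }

    return $stack;
}
```

With the Symfony Http Client, such a callable could send the request with max_redirects set to 0 and read the status code and the Location header from the response. The last entry of the returned stack is then the URL a subscriber would actually want to see (unless the redirect limit was reached first).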

@Toflar
Member

Toflar commented Mar 4, 2022

Implemented in 54f6e82 :)
You can now access it via $crawlUri->getRedirectedTo(), which might be null of course.
Note that for this to work, I had to add another column in the DoctrineQueue, so we cannot just update like that because afaik in Contao we define the table ourselves instead of using a schema listener that forwards to Escargot (which would now be possible as of b551530 too). Care to work on this in the Core? 😊
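For subscribers, a small helper along these lines could pick the URL that was actually crawled. This is only a sketch: `crawledUrl()` is a hypothetical name, and it assumes that both `getRedirectedTo()` and `getUri()` return something that can be cast to a string.

```php
<?php

declare(strict_types=1);

/**
 * Returns the URL that was crawled in the end for a CrawlUri-like object:
 * the redirect target if there was one, otherwise the original URI.
 */
function crawledUrl(object $crawlUri): string
{
    // getRedirectedTo() is null when no redirect was involved.
    return (string) ($crawlUri->getRedirectedTo() ?? $crawlUri->getUri());
}
```

A subscriber could call this on the CrawlUri it receives before deciding what to index or log.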

@Toflar
Member

Toflar commented Dec 15, 2022

I've released the changes to the DoctrineQueue in 1.5.0 (https://github.com/terminal42/escargot/releases/tag/1.5.0).
So technically, you can now require ^1.5 and use a schema listener that adds the $queue->getTableSchema() to it.
Until then I probably cannot release a new version with this new feature, as otherwise tl_crawl_queue would fail because it's lacking the new column required.
