What to do when we do not have canonical URL attribute? #7

Open
gagarine opened this issue Apr 10, 2017 · 4 comments

@gagarine
Contributor

As I understand it, we take the canonical URL from the page element (in HTML5, <link rel=canonical>) and use location.href as a fallback:

chrome.extension.sendRequest({'action': 'setCanonical', 'url': canonicalValue || location.href, 'title': title});

The thing is, URLs often have tracking parameters, and you end up with a URL like: http://www.example.com/?utm_source=adsite&utm_campaign=adcampaign&utm_term=adkeyword

Those tracking parameters change a lot, sometimes even for each user. So extra steps are needed to clean them out before URLs can be matched.

Are we taking care of that? I didn't see where in the JS (perhaps I just missed it). It can't be done on the server side because we hash the URL on the client side.

I think we should just create a list of parameters we filter out. Do you see a smarter way?
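To make the filter-list idea concrete, here is a minimal sketch, assuming the standard `URL` / `URLSearchParams` APIs are available in the content script. The `TRACKING_PARAMS` list and the `stripTrackingParams` name are hypothetical and not taken from the extension's code; the list itself is only a guess at what "well-known" could mean.

```javascript
// Hypothetical deny-list of tracking parameters (illustrative, not the project's list).
const TRACKING_PARAMS = new Set([
  'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content', 'gclid',
]);

// Return rawUrl with every parameter on the deny-list removed.
function stripTrackingParams(rawUrl) {
  const url = new URL(rawUrl);
  // Copy the keys first so we do not delete while iterating the live object.
  for (const key of Array.from(url.searchParams.keys())) {
    if (TRACKING_PARAMS.has(key.toLowerCase())) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}
```

The cleaned value could then stand in for `location.href` in the fallback, i.e. `canonicalValue || stripTrackingParams(location.href)`. One caveat: `URLSearchParams` re-serializes the remaining parameters, so their percent-encoding may change slightly, which matters if existing hashes were computed from the raw string.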

@Aegist
Collaborator

Aegist commented Apr 10, 2017

You're right. We just check for canonical, then default to whatever we have. We didn't try to do anything more complicated than that because the possible variations are practically unlimited. However, I agree, we should be able to pretty safely ignore common tracking parameters.

We could perhaps even ignore everything after the first occurrence of any known tracking parameter. The few cases where that is incorrect would probably be heavily outweighed by all of the links that would then match correctly, instead of being ruined by endlessly variable URLs.
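A rough sketch of that more aggressive variant, for comparison. It works on the raw string and cuts the query (and any fragment) at the first known tracking parameter; `KNOWN_TRACKING_PARAMS` and `truncateAtTrackingParam` are hypothetical names, and the list is an assumption, not something from the codebase.

```javascript
// Hypothetical list of known tracking parameters (illustrative only).
const KNOWN_TRACKING_PARAMS = new Set([
  'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content', 'gclid',
]);

// Drop the first known tracking parameter and everything after it.
// URLs without a known tracking parameter are returned unchanged.
function truncateAtTrackingParam(rawUrl) {
  const hashStart = rawUrl.indexOf('#');
  const queryStart = rawUrl.indexOf('?');
  // No query string, or the '?' only appears inside the fragment.
  if (queryStart === -1 || (hashStart !== -1 && hashStart < queryStart)) {
    return rawUrl;
  }
  const queryEnd = hashStart === -1 ? rawUrl.length : hashStart;
  const pairs = rawUrl.slice(queryStart + 1, queryEnd).split('&');
  const cut = pairs.findIndex(
    (pair) => KNOWN_TRACKING_PARAMS.has(pair.split('=')[0].toLowerCase())
  );
  if (cut === -1) {
    return rawUrl;
  }
  const base = rawUrl.slice(0, queryStart);
  return cut === 0 ? base : base + '?' + pairs.slice(0, cut).join('&');
}
```

With this approach, `http://www.example.com/page?id=42&utm_source=adsite&ref=home` becomes `http://www.example.com/page?id=42`, so a meaningful parameter that happens to follow a tracking one (here `ref`) is lost. That is the trade-off described above.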

@shanness
Collaborator

Yeah, this is a painful and error-prone solution, but I don't think there is much that can be done about it, although I agree with the idea of stripping off known junk like Google Analytics parameters. I can re-hash the DB fairly easily (I have a Python script to do that), so we could get the existing entries to match. Good call @gagarine !

@gagarine
Contributor Author

OK, let's try to remove the most well-known tracking parameters to mitigate this problem.

In the future, I think we could compare page content and analyse whether pages are similar to one another. This could even be used to detect when a page has been moved or duplicated on other websites. In other words, a duplicate detection and consolidation system, but one that keeps all the URL variants so they can still be matched against the hash sent by the plugin. That part belongs on the server side, though, and should run as some kind of background task.

@tomlutzenberger
Member
