What to do when we do not have canonical URL attribute? #7

gagarine · 2017-04-10T01:46:19Z

As I understand it, we take the canonical URL from the page element (<link rel=canonical> in HTML5) and use location.href as a fallback:

chrome.extension.sendRequest({'action': 'setCanonical', 'url': canonicalValue || location.href, 'title': title});

The thing is, URLs often carry tracking parameters, so you end up with a URL like: http://www.example.com/?utm_source=adsite&utm_campaign=adcampaign&utm_term=adkeyword

Those tracking parameters change a lot, sometimes even for each user. So extra steps are needed to clean them out before URLs can be matched.

Are we taking care of that? I didn't see where in the JS (perhaps I just missed it). It can't be done on the server side because we hash the URL on the client side.

I think we should just create a list of parameters we filter out. Do you see a smarter way?
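A minimal sketch of the filter-list idea, assuming the standard URL and URLSearchParams APIs are available in the extension. The function name stripTrackingParams and the contents of the list are hypothetical; only the utm_* names come from the UTM convention, and the real list would need to be agreed on.

```javascript
// Hypothetical filter list: the five conventional UTM parameter names.
// Anything beyond these would be a project decision, not something settled here.
const TRACKING_PARAMS = [
  'utm_source',
  'utm_medium',
  'utm_campaign',
  'utm_term',
  'utm_content',
];

// Return href with every listed tracking parameter removed.
// Other query parameters and the fragment are kept.
function stripTrackingParams(href) {
  const url = new URL(href);
  for (const name of TRACKING_PARAMS) {
    url.searchParams.delete(name);
  }
  return url.toString();
}

// Possible use at the call site quoted above (not run here):
// 'url': canonicalValue || stripTrackingParams(location.href)
```

With this list, the example URL above reduces to http://www.example.com/. One caveat: editing searchParams re-serializes the whole query string, so the encoding of the remaining parameters can change slightly, and hashes of previously stored URLs may no longer match.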

Aegist · 2017-04-10T04:29:41Z

You're right. We just check for canonical, then default to whatever we have. We didn't try to do anything more complicated than that because the possible variability is practically infinite. However, I agree, we should be able to pretty safely ignore common tracking parameters.

We could perhaps even ignore everything after the first occurrence of any known tracking parameter. The few instances where that is incorrect would probably be heavily outweighed by all of the accurately tracked links that would otherwise be ruined by infinitely variable URLs.
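A rough sketch of that more aggressive variant, for comparison. The name truncateAtFirstTrackingParam and the parameter list are hypothetical, and dropping the fragment is one reading of "everything after", not something decided here.

```javascript
// Hypothetical list of known tracking parameters (UTM names only).
const KNOWN_TRACKING_PARAMS = new Set([
  'utm_source',
  'utm_medium',
  'utm_campaign',
  'utm_term',
  'utm_content',
]);

// Keep query parameters only up to the first known tracking parameter,
// then discard that parameter and everything that follows it.
function truncateAtFirstTrackingParam(href) {
  const url = new URL(href);
  const kept = [];
  let truncated = false;
  for (const [name, value] of url.searchParams) {
    if (KNOWN_TRACKING_PARAMS.has(name)) {
      truncated = true;
      break;
    }
    kept.push([name, value]);
  }
  if (!truncated) {
    return url.toString(); // nothing to cut
  }
  url.search = new URLSearchParams(kept).toString();
  url.hash = ''; // assumption: the fragment also counts as "everything after"
  return url.toString();
}
```

The known failure mode shows up with a URL such as http://www.example.com/a?id=42&utm_source=x&page=3: the result is http://www.example.com/a?id=42, so the meaningful page=3 is lost along with the tracking parameter.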

shanness · 2017-04-10T05:02:04Z

Yeah, this is a painful and error-prone solution, but I don't think there is much that can be done about it, although I agree with the idea of stripping off known junk like Google Analytics stuff. I can re-hash the DB fairly easily (I have a Python script to do that), so we could get that to match. Good call, @gagarine!

gagarine · 2017-04-10T16:56:24Z

Ok, let's try to remove the very well-known tracking parameters to mitigate this problem.

I think in the future we can compare a page's content and analyse whether it is similar to others. This could even be used to detect when a page has moved or been duplicated on other websites. In other words, a duplicate detection and consolidation system, but one that keeps all the URL variants so it can still match the hash sent by the plugin. This part is a server-side thing, though, and should run as some kind of background task.
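To make the content-comparison idea a bit more concrete, here is a small sketch using word shingles and Jaccard similarity. That is just one common near-duplicate technique picked for illustration, not something decided in this thread. The function names, the shingle size of 3 and the 0.8 threshold are all hypothetical, and a real background task would also need text extraction and storage of the URL variants.

```javascript
// Split text into overlapping runs of `size` consecutive words ("shingles").
function shingles(text, size = 3) {
  const words = text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u) // split on anything that is not a letter or digit
    .filter(Boolean);
  const result = new Set();
  for (let i = 0; i + size <= words.length; i++) {
    result.add(words.slice(i, i + size).join(' '));
  }
  return result;
}

// Jaccard similarity of two sets: |intersection| / |union|, in [0, 1].
function jaccard(a, b) {
  let shared = 0;
  for (const item of a) {
    if (b.has(item)) shared++;
  }
  const union = a.size + b.size - shared;
  // No shingles at all: report no similarity rather than divide by zero.
  return union === 0 ? 0 : shared / union;
}

// Hypothetical decision rule: treat two pages as the same content
// when their similarity reaches the (arbitrary) threshold.
function looksLikeSamePage(textA, textB, threshold = 0.8) {
  return jaccard(shingles(textA), shingles(textB)) >= threshold;
}
```

Pages flagged this way could then be grouped under one record while every URL variant, and its hash, stays attached to that record, so a hash sent by the plugin can still be looked up directly.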

tomlutzenberger · 2017-04-10T18:01:16Z

https://en.wikipedia.org/wiki/UTM_parameters

tomlutzenberger added the data and js labels Apr 10, 2017

tomlutzenberger added this to the v1.0 milestone Apr 10, 2017

gagarine mentioned this issue Apr 13, 2017

Use DOI has URL if we have one #20

Open
