PHP class to filter URL duplicates, with integrated support for Yandex Clean-Param specifications.
What does it do?
It filters your URL lists so that any duplicate pages are removed.
What can I expect from filtering my URLs?
- You'll never have to reload duplicate information again.
- More efficient web crawling.
- Server load will decrease.
What is Clean-Param?
It's a robots.txt directive which describes dynamic parameters that do not affect the page content (e.g. identifiers of sessions, users, referrers, etc.). When added, it has a significant impact on the number of URLs considered duplicates.
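To illustrate the idea (this is a minimal sketch of the Clean-Param concept, not this library's internal implementation): URLs that differ only in the declared dynamic parameters collapse to the same canonical form, and are therefore duplicates of each other.

```php
<?php
// Sketch: strip the declared dynamic parameters, then compare URLs.
function canonicalize(string $url, array $cleanParams): string
{
    $parts = parse_url($url);
    $query = [];
    parse_str($parts['query'] ?? '', $query);
    foreach ($cleanParams as $param) {
        unset($query[$param]);     // drop parameters that don't affect content
    }
    ksort($query);                 // parameter order doesn't matter either
    return $parts['host'] . ($parts['path'] ?? '/')
        . ($query ? '?' . http_build_query($query) : '');
}

// Both URLs collapse to the same form when 'sid' is a Clean-Param:
var_dump(
    canonicalize('http://example.com/page?sid=123&id=7', ['sid'])
    === canonicalize('http://example.com/page?id=7&sid=999', ['sid'])
); // bool(true)
```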
The library is available as a Composer package. To install it via Composer, add the requirement to your composer.json file, like this:
{
"require": {
"VIPnytt/CleanParam-URL-Filter": "dev-master"
}
}
and then use Composer to load the library:
<?php
require_once('vendor/autoload.php');
$filter = new \vipnytt\CleanParamFilter($urls);
You can find out more about Composer here: https://getcomposer.org/
$filter = new \vipnytt\CleanParamFilter($urlArray);
// Optional: Add Clean-Param
$filter->addCleanParam($parameter, $path);
// List duplicates
print_r($filter->listDuplicate());
// List non-duplicates
print_r($filter->listApproved());
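Putting it together, here is a minimal end-to-end example. The URLs and the 'sid' parameter are made up for illustration:

```php
<?php
require_once('vendor/autoload.php');

$urlArray = [
    'http://example.com/page?id=7&sid=123',
    'http://example.com/page?id=7&sid=999',   // differs only in session id
    'http://example.com/other?id=8',
];

$filter = new \vipnytt\CleanParamFilter($urlArray);

// Declare 'sid' as a dynamic parameter for all paths under /
$filter->addCleanParam('sid', '/');

print_r($filter->listApproved());   // the unique pages
print_r($filter->listDuplicate()); // the session-id duplicate
```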
Pro tip: If you're going to filter tens of thousands of URLs (or even more), it is recommended to break the list down to a bare minimum. This can be done by grouping the URLs by domain (or even host), and then filtering each group individually. This greatly improves performance!
Getting a timeout error?
Reason: You're probably trying to filter thousands of URLs.
- It is recommended to break the list of URLs down to a bare minimum. This can be done by grouping the URLs by domain (or even host), and then filtering each group individually.
- Increase PHP's max execution time limit by calling set_time_limit(60);. Each call sets the time limit to 60 seconds and restarts the timeout counter from zero.
- If you're already looping through groups of URLs (as suggested), put set_time_limit(60); inside the loop, so that the timeout counter is restarted each time a new group of URLs is parsed.
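The grouping and timeout advice above can be sketched as follows. This assumes $urls is your flat list of URLs; the pattern itself is illustrative, not part of the library:

```php
<?php
require_once('vendor/autoload.php');

// Group the URLs by host, then filter each group on its own.
$groups = [];
foreach ($urls as $url) {
    $groups[parse_url($url, PHP_URL_HOST)][] = $url;
}

$approved = [];
foreach ($groups as $host => $groupUrls) {
    set_time_limit(60); // restart the timeout counter for each group
    $filter = new \vipnytt\CleanParamFilter($groupUrls);
    $approved = array_merge($approved, $filter->listApproved());
}
```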
Running out of memory?
Reason: You're probably trying to filter tens of thousands of URLs, maybe even more.
- At this point, you're required to break the list of URLs down to a bare minimum. This can be done by grouping the URLs by domain (or even better, host), and then filtering each group individually.
- Increase PHP's memory limit. This can be done by calling ini_set('memory_limit', '256M'); or by changing the memory_limit setting in your php.ini file.