https://www.bananas-playground.net/projekt/aranea
A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.
It starts with a given set of URLs and parses them for more URLs, which are stored and fetched as well. Execute: perl fetch.pl
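To illustrate the idea, here is a minimal fetch sketch in Perl using LWP::UserAgent and DBI. The table and column names (urls_to_fetch, fetched, body) are placeholders for this sketch, not aranea's actual schema:

    #!/usr/bin/perl
    # Minimal fetch sketch (illustrative only; not aranea's actual fetch.pl).
    # Assumes a SQLite table "urls_to_fetch" with columns id, url, fetched, body.
    use strict;
    use warnings;
    use DBI;
    use LWP::UserAgent;

    my $dbh = DBI->connect("dbi:SQLite:dbname=aranea.db", "", "", { RaiseError => 1 });
    my $ua  = LWP::UserAgent->new(timeout => 10, agent => 'aranea-sketch/0.1');

    # Take a batch of unfetched URLs and store the raw response bodies.
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, url FROM urls_to_fetch WHERE fetched = 0 LIMIT 50"
    );
    for my $row (@$rows) {
        my ($id, $url) = @$row;
        my $res = $ua->get($url);
        next unless $res->is_success;
        $dbh->do("UPDATE urls_to_fetch SET fetched = 1, body = ? WHERE id = ?",
                 undef, $res->decoded_content, $id);
    }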
Each stored result from a fetch call is then parsed for further URLs to follow. Execute: perl parse-results.pl
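The parsing step is essentially link extraction from the stored HTML. A rough sketch using the common CPAN modules HTML::LinkExtor and URI (aranea's parse-results.pl may do this differently):

    #!/usr/bin/perl
    # Illustrative link extraction; aranea's parse-results.pl may differ.
    use strict;
    use warnings;
    use HTML::LinkExtor;
    use URI;

    sub extract_urls {
        my ($html, $base_url) = @_;
        my @found;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            # Only follow regular anchor links.
            push @found, $attr{href} if $tag eq 'a' && defined $attr{href};
        });
        $parser->parse($html);
        $parser->eof;
        # Resolve relative links against the page's own URL.
        return map { URI->new_abs($_, $base_url)->canonical->as_string } @found;
    }

    # Example: my @urls = extract_urls($stored_body, 'https://example.org/');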
After a run, cleanup gathers all the unique domains into a table and removes URLs from the fetch table of which there are already enough. Execute: perl cleanup.pl
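The core of the cleanup idea is reducing each URL to its host and keeping only unique domains, roughly like this sketch (the input list is just example data):

    #!/usr/bin/perl
    # Illustrative domain de-duplication; not aranea's actual cleanup.pl.
    use strict;
    use warnings;
    use URI;

    my @urls = (
        'https://example.org/page/1',
        'https://example.org/page/2',
        'http://another.example.net/',
    );

    my %seen;
    for my $url (@urls) {
        my $host = eval { URI->new($url)->host } or next;  # skip unparsable URLs
        $seen{lc $host} = 1;
    }
    print "$_\n" for sort keys %seen;   # unique domains, one per line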
Either run fetch.pl, parse-results.pl and cleanup.pl manually in that order, or use aranea-runner with a cron job. The cron schedule depends on the number of URLs to be fetched and parsed: higher numbers need longer run times, so plan the schedule around that by running the Perl scripts manually first.
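An example crontab entry could look like the following; the schedule and the path are placeholders and should be adjusted to the amount of data being crawled:

    # Run the three steps in order once per day at 03:00 (placeholder schedule and path).
    0 3 * * * cd /path/to/aranea && perl fetch.pl && perl parse-results.pl && perl cleanup.pl

Once the run times are known, aranea-runner can be used in place of the chained perl calls.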
The table url_to_ignore contains a small number of domains and parts of domains which will be ignored. Adding a global spam list would be overkill. A good idea is to run the crawler with a DNS filter that has a good blocklist.
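Checking a URL against url_to_ignore boils down to a substring match on the host. A sketch of how that could look; the column name "pattern" is an assumption for this example, not the real schema:

    #!/usr/bin/perl
    # Illustrative ignore check; the url_to_ignore layout here is an assumption.
    use strict;
    use warnings;
    use DBI;
    use URI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=aranea.db", "", "", { RaiseError => 1 });

    # Load the ignore patterns once; they can be whole domains or parts of domains.
    my $patterns = $dbh->selectcol_arrayref("SELECT pattern FROM url_to_ignore");

    sub is_ignored {
        my ($url) = @_;
        my $host = eval { URI->new($url)->host } // return 0;
        return scalar grep { index(lc $host, lc $_) >= 0 } @$patterns;
    }

    # Example: next if is_ignored($url);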
The folder webroot contains a web interface which displays the gathered data and status. It does not provide a way to execute the crawler.
Want to contribute or found a problem?
See Contributing document: CONTRIBUTING.md
See uses document: USES