A small web crawler named aranea (Latin for spider). https://www.bananas-playground.net/projekt/aranea/

aranea

A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.

Fetch

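Crawling starts from a given set of URLs. Each URL is fetched and its result is stored; the stored results are later parsed for more URLs, which are stored and fetched in turn. Run the fetch step with `perl fetch.pl`.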

Each URL result (the stored response from the fetch call) is parsed for further URLs to follow. Run the parse step with `perl parse-results.pl`.
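The parsing step can be sketched as follows. This is a minimal illustration in Python, not the code in parse-results.pl (which is Perl); the names `LinkCollector` and `extract_urls` are hypothetical, and it assumes that only `href` attributes of `<a>` tags with an http or https scheme are of interest.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL the page came from.
                absolute = urljoin(self.base_url, value)
                # Skip mailto:, javascript: and other non-web schemes.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_urls(html, base_url):
    """Return all followable URLs found in a stored HTML result."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Resolving links against the page's own URL matters because many pages use relative links such as `/about`, which are not fetchable on their own.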

After a run, the cleanup step gathers all unique domains into a table. It also removes URLs from the fetch table once enough of them have already been collected. Run it with `perl cleanup.pl`.
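A rough sketch of what such a cleanup might look like, written in Python for illustration rather than taken from cleanup.pl. The function names are hypothetical, and the per-domain limit is only one possible reading of "enough"; the actual criterion lives in the project's code and configuration.

```python
from urllib.parse import urlparse


def unique_domains(urls):
    """Return the sorted set of host names found in a list of URLs."""
    domains = set()
    for url in urls:
        host = urlparse(url).hostname  # None for strings without a host
        if host:
            domains.add(host)
    return sorted(domains)


def trim_fetch_queue(urls, max_per_domain):
    """Keep at most max_per_domain URLs per host (assumed meaning of 'enough')."""
    kept = []
    counts = {}
    for url in urls:
        host = urlparse(url).hostname
        if not host:
            continue
        if counts.get(host, 0) < max_per_domain:
            counts[host] = counts.get(host, 0) + 1
            kept.append(url)
    return kept
```

In the real project both operations work on database tables; plain lists are used here only to keep the example self-contained.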

The table url_to_ignore contains a small number of domains and partial domain names that are ignored. Adding a global spam list would be overkill.
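The ignore check could work along these lines. This Python sketch assumes that entries from url_to_ignore are matched as case-insensitive substrings of the host name, which would cover both full domains and parts of domains; how the project actually matches them is not stated here, and `is_ignored` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_ignored(url, ignore_patterns):
    """Return True if the URL's host contains any ignore pattern."""
    host = urlparse(url).hostname or ""
    # Empty patterns are skipped, since "" is a substring of every host.
    return any(p and p.lower() in host for p in ignore_patterns)
```

For example, with the patterns `["facebook.com", "doubleclick"]`, both `https://www.facebook.com/page` and `https://ad.doubleclick.net/x` would be skipped, while `https://example.com/` would still be fetched.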

It is a good idea to run the crawler behind a DNS filter that has a good blocklist.