A small web crawler named aranea (Latin for spider). https://www.bananas-playground.net/projekt/aranea/
Repository contents:

- documentation/
- lib/
- storage/
- .gitignore
- CHANGELOG
- COPYING
- LICENSE
- README.md
- TODO
- VERSION
- cleanup.pl
- config.txt
- fetch.pl
- parse-results.pl
- setup.sql
The aim is to gather unique domains to show what is out there.
It starts with a given set of URLs and parses them for more URLs, which are stored and fetched in turn. -> fetch.pl
Each stored fetch result is parsed for further URLs to follow. -> parse-results.pl
After a run, cleanup gathers all unique domains into a table and removes URLs from the fetch table that have already been fetched enough times. -> cleanup.pl
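The fetch / parse / cleanup cycle above can be sketched roughly as follows. This is an illustrative Python outline, not the actual Perl implementation; all function names and the regex-based link extraction are assumptions for the sketch.

```python
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

# Naive href extractor; the real parse-results.pl may use a proper HTML parser
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def fetch(url):
    # fetch.pl step: download and store the page body for a known URL
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_results(base_url, html):
    # parse-results.pl step: extract absolute URLs from a stored result
    return {urljoin(base_url, href) for href in HREF_RE.findall(html)}

def cleanup(urls):
    # cleanup.pl step: reduce the fetched URLs to the set of unique domains
    return {urlparse(u).netloc for u in urls if urlparse(u).netloc}
```

Each round of the crawl would then feed the URLs produced by `parse_results` back into the fetch queue, while `cleanup` keeps the domain table unique.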
The table url_to_ignore holds a small set of domains and domain fragments that will be ignored.
Adding a global spam list would be overkill; a better approach is to run the crawler behind a DNS filter with a good blocklist.
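Matching against domains and partial domains could look like the sketch below. The entries and the substring-match rule are assumptions for illustration; the real url_to_ignore check lives in the Perl code and its database.

```python
from urllib.parse import urlparse

# Hypothetical url_to_ignore entries: full domains or domain fragments
URL_TO_IGNORE = ["ads.", "doubleclick.net", "example-spam.com"]

def is_ignored(url):
    # A URL is skipped when its host contains any ignored domain or fragment
    host = urlparse(url).netloc.lower()
    return any(part in host for part in URL_TO_IGNORE)
```

A substring match keeps the table small: one fragment like "ads." covers every ad subdomain without listing each host individually.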