A small web crawler named aranea (Latin for spider). https://www.bananas-playground.net/projekt/aranea/

Banana 18c49ef61f callback does work differently 1 week ago
documentation 32e5d2b1d9 transactions 1 week ago
lib 7271145682 license change and new config value 1 month ago
storage 24fb355861 fetch.pl 2 years ago
.gitignore 546a78d2ab ignore config file 1 week ago
CHANGELOG 36b6e40cf8 avoid big downloads 1 week ago
COPYING 7271145682 license change and new config value 1 month ago
LICENSE 7271145682 license change and new config value 1 month ago
README.md 5800264126 better readme 1 week ago
TODO 36b6e40cf8 avoid big downloads 1 week ago
VERSION 17aef3b5ab cleanup of the code and some paperwork 2 years ago
cleanup.pl 32e5d2b1d9 transactions 1 week ago
config.default.txt 36b6e40cf8 avoid big downloads 1 week ago
fetch.pl 18c49ef61f callback does work differently 1 week ago
parse-results.pl 32e5d2b1d9 transactions 1 week ago
setup.sql cfdca6000e project cleanup and updated project website links 2 years ago

README.md

aranea

https://www.bananas-playground.net/projekt/aranea

A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.

Fetch

Fetching starts with a given set of URLs, which are parsed for more URLs. Those are stored and fetched in turn. Run it with: perl fetch.pl
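The fetch-and-follow idea can be sketched in memory as below. This is only an illustration and not the code in fetch.pl: the function name crawl, its callback parameters, and the example URLs are assumptions made for this sketch, and the real scripts keep their state in database tables instead of Perl variables.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from fetch.pl: works through a queue of
# URLs and queues every newly discovered URL exactly once.
#   $seeds   array ref with the start URLs
#   $fetch   code ref returning the body for a URL, or undef on failure
#   $extract code ref returning the URLs found in a body
#   $limit   maximum number of fetch attempts for this run
sub crawl {
    my ($seeds, $fetch, $extract, $limit) = @_;
    my @queue = @$seeds;
    my %seen  = map { $_ => 1 } @queue;
    my @fetched;

    while (@queue && @fetched < $limit) {
        my $url  = shift @queue;
        my $body = $fetch->($url);
        push @fetched, $url;
        next unless defined $body;

        for my $found ($extract->($body)) {
            next if $seen{$found}++;    # already stored, skip it
            push @queue, $found;
        }
    }
    return @fetched;
}

# Usage with made-up pages instead of real HTTP requests.
my %pages = (
    'http://a.example/' => 'http://b.example/ http://c.example/',
    'http://b.example/' => 'http://a.example/',
);
my @order = crawl(
    ['http://a.example/'],
    sub { $pages{ $_[0] } },
    sub { split ' ', $_[0] },
    10,
);
print join("\n", @order), "\n";
```

Passing the fetcher in as a callback keeps the sketch runnable without network access; a real run would plug in an HTTP client there.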

Parse

Each URL result (the stored response from the fetch) is parsed for further URLs to follow. Run it with: perl parse-results.pl
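As a rough illustration of this parsing step, the sketch below pulls absolute http and https links out of href attributes with a regular expression. The function name extract_urls is an assumption for this example, and parse-results.pl may well work differently; a regex is also a simplification compared to a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from parse-results.pl: returns the
# absolute http(s) URLs found in quoted href attributes, each only once.
sub extract_urls {
    my ($html) = @_;
    my (%seen, @urls);

    while ($html =~ m{href\s*=\s*(["'])(https?://.+?)\1}gi) {
        my $url = $2;
        $url =~ s/#.*\z//s;                       # drop the fragment part
        push @urls, $url unless $seen{$url}++;    # keep first occurrence only
    }
    return @urls;
}

my $html = q{<a href="http://one.example/page#top">x</a> <a href="/relative">y</a>};
print "$_\n" for extract_urls($html);
```

Relative links such as /relative are skipped here on purpose, since resolving them needs the URL of the page they came from.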

Cleanup

After a run, cleanup gathers all the unique domains into a table. It also removes URLs from the fetch table once there are already enough of them. Run it with: perl cleanup.pl
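Gathering unique domains can be pictured as reducing a list of URLs to their distinct host names. The sketch below does this in memory; the function name unique_domains and the sample URLs are assumptions, and cleanup.pl itself works on database tables, so treat this only as an illustration of the idea.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from cleanup.pl: reduces a list of URLs
# to the sorted list of distinct, lower-cased host names.
sub unique_domains {
    my (@urls) = @_;
    my %domains;

    for my $url (@urls) {
        # scheme://[userinfo@]host[:port]/...  (only http and https)
        next unless $url =~ m{^https?://(?:[^/?#@]*@)?([^/?#:]+)}i;
        $domains{ lc($1) } = 1;
    }
    return sort keys %domains;
}

my @domains = unique_domains(
    'http://Example.com/a',
    'https://example.com:8080/b',
    'http://sub.example.org/?q=1',
);
print "$_\n" for @domains;
```

Using a hash for the host names is what makes each domain show up only once, no matter how many URLs point to it.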

Ignores

The table url_to_ignore holds a small number of domains and partial domains that are ignored. Adding a global spam list would be overkill.
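How such an ignore check might look is sketched below. It assumes that each entry is compared as a case-insensitive substring of the URL, which is a guess about how partial domains are meant to match; the function name is_ignored and the sample entries are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the URL contains one of the ignore entries.
# Substring matching is an assumption, not the documented behaviour.
sub is_ignored {
    my ($url, @ignore) = @_;
    my $lc_url = lc($url);

    for my $entry (@ignore) {
        return 1 if index($lc_url, lc($entry)) >= 0;
    }
    return 0;
}

# Made-up entries standing in for rows of url_to_ignore.
my @ignore = ('tracker.example', 'ads.');
for my $url ('https://www.Tracker.example/x', 'https://shop.example/') {
    printf "%s %s\n", (is_ignored($url, @ignore) ? 'skip ' : 'fetch'), $url;
}
```

A short list like this stays cheap to check on every URL, which fits the point that a full spam list would be overkill.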

It is a good idea to run the crawler behind a DNS filter that has a good blocklist.