aranea

https://www.bananas-playground.net/projekt/aranea

A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.

Fetch

The crawl starts with a given set of URLs and parses them for more URLs. Those are stored and fetched as well. -> fetch.pl
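The fetch loop described above can be pictured as a queue of URLs plus a set of URLs already seen. The following is a minimal Python sketch of that idea, not the code in fetch.pl: the function `crawl`, its parameters `fetch` and `extract_links`, and the `max_pages` limit are all hypothetical names chosen for illustration, and the sketch keeps everything in memory instead of a database.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=100):
    """Fetch each queued URL once and queue any newly discovered URLs.

    fetch(url) is assumed to return the page body, or None on failure.
    extract_links(url, body) is assumed to return an iterable of URLs.
    """
    queue = deque(seeds)
    seen = set(seeds)      # every URL that was ever queued
    stored = {}            # url -> body, stands in for the storage step

    while queue and len(stored) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue       # failed fetch: nothing to store or parse
        stored[url] = body
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

Passing `fetch` and `extract_links` in as arguments keeps the loop independent of how pages are downloaded or parsed, which mirrors the split between the fetch and parse steps.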

Parse

Each URL result (the stored response from the fetch) is parsed for further URLs to follow. -> parse-results.pl
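As an illustration of the parse step, here is a small Python sketch that pulls links out of a stored HTML result. It is an assumption about how such parsing could look, not what parse-results.pl does: the names `LinkExtractor` and `extract_links` are hypothetical, only `<a href>` attributes are considered, and anything that is not http or https is dropped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                absolute = urljoin(self.base_url, value)
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Resolving against the base URL matters because many pages link with relative paths such as `/about`, which are useless to a crawler until they are made absolute.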

Cleanup

After a run, cleanup gathers all unique domains into a table. It also removes surplus URLs from the fetch table, i.e. those of which there are already enough. -> cleanup.pl
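The two cleanup tasks can be sketched in Python as follows. This is only one possible reading of the description, not the logic of cleanup.pl: `unique_domains` and `trim_per_domain` are hypothetical names, and the per-domain `limit` is an assumption about what "enough" means, since the README does not define it.

```python
from urllib.parse import urlparse


def unique_domains(urls):
    """Return the sorted set of hostnames found in the given URLs."""
    domains = set()
    for url in urls:
        host = urlparse(url).hostname  # lowercased; None if there is no host
        if host:
            domains.add(host)
    return sorted(domains)


def trim_per_domain(urls, limit):
    """Keep at most `limit` URLs per hostname, in their original order.

    Assumption: "enough" is modeled as a fixed cap per domain.
    """
    kept = []
    counts = {}
    for url in urls:
        host = urlparse(url).hostname
        if not host:
            continue  # not a usable URL, drop it
        if counts.get(host, 0) < limit:
            counts[host] = counts.get(host, 0) + 1
            kept.append(url)
    return kept
```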

Ignores

The table url_to_ignore holds a small number of domains and parts of domains that are ignored. Adding a global spam list would be overkill.
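A check against such an ignore list could look like the Python sketch below. How aranea actually matches entries from url_to_ignore is not stated here, so this assumes a plain case-insensitive substring match on the hostname; the function name `is_ignored` and the example patterns are hypothetical.

```python
from urllib.parse import urlparse


def is_ignored(url, ignore_patterns):
    """Return True if any pattern occurs inside the URL's hostname.

    Assumption: patterns are full domains or fragments of domains and are
    matched as case-insensitive substrings.
    """
    host = urlparse(url).hostname or ""  # hostname is already lowercased
    return any(pattern.lower() in host for pattern in ignore_patterns)
```

Substring matching is simple and covers "parts of domains", but it is also broad: a pattern like `book.example` would match `facebook.example` as well, so short patterns should be used with care.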

It is a good idea to run it behind a DNS filter that has a good blocklist.