https://www.bananas-playground.net/projekt/aranea
A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.
It starts with a given set of URLs and parses them for more
URLs. The newly found URLs are stored and fetched as well. Run: `perl fetch.pl`
Each URL result (the stored result of a fetch) is parsed
for other URLs to follow. Run: `perl parse-results.pl`
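
To illustrate the parsing step, here is a minimal sketch of pulling URLs out of a fetched page. It is not taken from `parse-results.pl`: the helper name `extract_urls` is hypothetical, and the sketch assumes that only absolute `http`/`https` links in `href` attributes are of interest. A regular expression is used to keep the example dependency-free; a real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption, not the project's code):
# returns the unique absolute http(s) URLs found in href attributes,
# in order of first appearance.
sub extract_urls {
    my ($html) = @_;
    my (%seen, @urls);
    while ($html =~ m{href\s*=\s*["'](https?://[^"'\s<>]+)["']}gi) {
        my $url = $1;
        # Keep each URL only once.
        push @urls, $url unless $seen{$url}++;
    }
    return @urls;
}

my $page = '<a href="https://example.org/a">A</a> '
         . '<a href="/local">relative, skipped</a> '
         . '<a href="https://example.org/a">duplicate</a>';
print "$_\n" for extract_urls($page);
```

Relative links are skipped here on purpose; resolving them would need the URL of the page they were found on.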
After a run, cleanup gathers all the unique domains into
a table. It also removes URLs from the fetch table of which there are
already enough. Run: `perl cleanup.pl`
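
The idea of gathering unique domains can be sketched as follows. This is an illustration only, not the logic of `cleanup.pl`, which works on database tables: the helper name `unique_domains` is hypothetical, and the sketch assumes that a "domain" is the lowercased host part of an `http`/`https` URL.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption, not the project's code):
# reduces a list of URLs to the unique hosts they point at.
sub unique_domains {
    my (@urls) = @_;
    my (%seen, @domains);
    for my $url (@urls) {
        # Host = everything after scheme:// (and optional user info)
        # up to the next '/', ':', '?' or '#'.
        next unless $url =~ m{^https?://(?:[^/\@]*\@)?([^/:?#]+)}i;
        my $host = lc $1;
        push @domains, $host unless $seen{$host}++;
    }
    return @domains;
}

my @urls = (
    'https://Example.org/a',
    'http://example.org:8080/b?x=1',
    'https://other.example/',
    'mailto:someone@example.org',
);
print "$_\n" for unique_domains(@urls);
```

Non-HTTP entries such as the `mailto:` address are skipped rather than reported as errors.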
The table `url_to_ignore`
holds a small number of domains and partial domains which will be ignored.
Adding a global spam list would be overkill.
A better idea is to run the crawler behind a DNS filter that has a good blocklist.
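
As a rough illustration of how an ignore list made of domains and partial domains could be applied, the sketch below checks a host against a list of patterns. The helper name `is_ignored` and the plain substring matching are assumptions; how aranea actually matches entries from `url_to_ignore` is not described here.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption, not the project's code):
# true if the host contains any of the ignore patterns.
sub is_ignored {
    my ($host, @patterns) = @_;
    $host = lc $host;
    for my $pattern (@patterns) {
        # Case-insensitive substring match, so a partial domain
        # such as "ads." also covers "ads.example.org".
        return 1 if index($host, lc $pattern) >= 0;
    }
    return 0;
}

# Example patterns; in aranea these would come from url_to_ignore.
my @ignore = ('ads.', 'tracker.example');
for my $host ('ads.example.org', 'www.example.org') {
    print "$host: ", (is_ignored($host, @ignore) ? 'ignored' : 'kept'), "\n";
}
```

Substring matching is deliberately loose and can over-match (for example, `ads.` also hits `uploads.example.org`), which is one reason to keep such a list small.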