A small web crawler named aranea (Latin for spider). https://www.bananas-playground.net/projekt/aranea/
The aim is to gather unique domains to show what is out there.
It starts with a given set of URLs and parses them for more URLs, which are stored and fetched as well. -> fetch.pl
Each URL result (the stored response from the fetch) is parsed for further URLs to follow. -> parse-results.pl
After a run, cleanup gathers all unique domains into a table and removes URLs from the fetch table for domains that already have enough entries. -> cleanup.pl
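The parsing step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code in parse-results.pl; the names `LinkCollector` and `extract_links` are hypothetical. It assumes the stored result is an HTML document and that only `<a href>` links with an http or https scheme are worth following.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the document was fetched from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                # Skip mailto:, javascript: and similar schemes.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return all followable links found in one stored result (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Resolving each link against the page URL keeps relative links usable as new fetch candidates.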
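A rough sketch of what such a cleanup pass might look like, again in Python and not taken from cleanup.pl. The function names and the `max_per_domain` parameter are assumptions: the README does not say how "enough" is decided, so a simple per-domain cap is used here for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def unique_domains(urls):
    """Return the sorted set of hostnames seen in the given URLs."""
    hosts = {urlparse(url).hostname for url in urls}
    hosts.discard(None)  # URLs without a hostname contribute nothing
    return sorted(hosts)


def prune_fetch_queue(queue, max_per_domain):
    """Keep at most max_per_domain URLs per hostname (assumed rule)."""
    seen = Counter()
    kept = []
    for url in queue:
        host = urlparse(url).hostname
        if host is None:
            continue
        if seen[host] < max_per_domain:
            seen[host] += 1
            kept.append(url)
    return kept
```

Capping the queue per domain keeps a single large site from dominating later runs, which fits the stated aim of collecting domains rather than pages.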
The table url_to_ignore holds a small number of domains and partial domains which will be ignored.
Adding a global SPAM list would be overkill.
A better idea is to run it behind a DNS filter that has a good blocklist.
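How the ignore list is matched is not spelled out here. As a minimal sketch, assuming each entry is a plain substring that is compared against the hostname of a URL (the function name `is_ignored` and the sample entries are hypothetical, and the project itself is written in Perl):

```python
from urllib.parse import urlparse


def is_ignored(url, ignore_patterns):
    """True if any ignore entry occurs in the URL's hostname (assumed rule)."""
    host = urlparse(url).hostname or ""
    # hostname is already lowercased by urlparse; lowercase the entries too.
    return any(pattern.lower() in host for pattern in ignore_patterns)
```

For example, with the entries `["facebook.", "doubleclick"]`, a URL on `www.facebook.com` would be skipped while one on `example.org` would still be fetched.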