aranea - A small web crawler
Owner: Banana
https://www.bananas-playground.net/projekt/aranea
A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.
It starts with a given set of URLs and parses them for more URLs, which are stored and fetched in turn (a simplified sketch follows below).

    perl fetch.pl

Each stored fetch result is then parsed for further URLs to follow.

    perl parse-results.pl

After a run, the cleanup step gathers all unique domains into a table and removes URLs from the fetch table that are no longer needed.

    perl cleanup.pl
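For orientation, here is a minimal sketch of the fetch-and-extract idea in Perl. It is not the project's fetch.pl: the seed URL, the naive href regex, and the in-memory queue are assumptions, while the real scripts work against the database.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Hypothetical seed; the real crawler reads its URL queue from the database.
    my @queue = ('https://example.com/');
    my %seen;

    my $ua = LWP::UserAgent->new(timeout => 10, agent => 'aranea-sketch');
    while (my $url = shift @queue) {
        next if $seen{$url}++;            # skip URLs we already visited
        last if keys %seen > 50;          # keep the sketch bounded
        my $res = $ua->get($url);
        next unless $res->is_success;
        my $html = $res->decoded_content;
        # Naive href extraction; parse-results.pl is the real parsing step.
        while ($html =~ /href=["'](https?:\/\/[^"'\s]+)["']/gi) {
            push @queue, $1 unless $seen{$1};
        }
        print "$url\n";
    }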
Either run fetch.pl, parse-results.pl and cleanup.pl manually in that order, or use aranea-runner with a cron job. The cron schedule depends on the amount of URLs to be fetched and parsed: higher numbers need longer run times, so plan the schedule by first timing the Perl scripts manually.
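As an illustration, a possible crontab entry could look like the following. The path, interval, log file, and the way aranea-runner is invoked are all placeholders; pick an interval long enough for a full run to finish.

    # m h dom mon dow  command  (every 6 hours; adjust to your data volume)
    0 */6 * * *  cd /path/to/aranea && ./aranea-runner >> /var/log/aranea.log 2>&1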
The table url_to_ignore contains a small set of domains and domain fragments that will be ignored. Adding a global spam list would be overkill; a better approach is to run the crawler behind a DNS filter with a good blocklist.
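For example, ignoring an ad-serving domain could look like the statement below. The column name is a guess for illustration only, so check the actual schema before using it.

    -- Hypothetical column name; inspect the real url_to_ignore schema first.
    INSERT INTO url_to_ignore (url) VALUES ('doubleclick.net');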
The folder webroot contains a web interface that displays the gathered data and the crawler status. It does not provide a way to execute the crawler.
Want to contribute or found a problem?
See the contributing document: CONTRIBUTING.md
See the uses document: USES