aranea - A small web crawler
Owner: Banana
https://www.bananas-playground.net/projekt/aranea
A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.
It starts with a given set of URLs and parses them for more URLs, which are stored and fetched in turn (a simplified sketch follows below).

    perl fetch.pl

Each stored fetch result is then parsed for further URLs to follow.

    perl parse-results.pl

After a run, the cleanup step gathers all unique domains into a table and removes URLs from the fetch table that are no longer needed.

    perl cleanup.pl
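For orientation, here is a minimal sketch of the fetch-and-extract idea in Perl. It is not the project's fetch.pl: the seed URL, the naive href regex, and the in-memory queue are assumptions, while the real scripts work against the database.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Hypothetical seed; the real crawler reads its URL queue from the database.
    my @queue = ('https://example.com/');
    my %seen;

    my $ua = LWP::UserAgent->new(timeout => 10, agent => 'aranea-sketch');
    while (my $url = shift @queue) {
        next if $seen{$url}++;            # skip URLs we already visited
        last if keys %seen > 50;          # keep the sketch bounded
        my $res = $ua->get($url);
        next unless $res->is_success;
        my $html = $res->decoded_content;
        # Naive href extraction; parse-results.pl is the real parsing step.
        while ($html =~ /href=["'](https?:\/\/[^"'\s]+)["']/gi) {
            push @queue, $1 unless $seen{$1};
        }
        print "$url\n";
    }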
Either run fetch.pl, parse-results.pl and cleanup.pl manually in that order, or use aranea-runner with a cron job. The cron schedule depends on the amount of URLs to be fetched and parsed: higher numbers need longer run times, so plan the schedule by first timing the Perl scripts manually.
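As an illustration, a possible crontab entry could look like the following. The path, interval, log file, and the way aranea-runner is invoked are all placeholders; pick an interval long enough for a full run to finish.

    # m h dom mon dow  command  (every 6 hours; adjust to your data volume)
    0 */6 * * *  cd /path/to/aranea && ./aranea-runner >> /var/log/aranea.log 2>&1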
The table url_to_ignore contains a small set of domains and domain fragments that will be ignored. Adding a global spam list would be overkill; a better approach is to run the crawler behind a DNS filter with a good blocklist.
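For example, ignoring an ad-serving domain could look like the statement below. The column name is a guess for illustration only, so check the actual schema before using it.

    -- Hypothetical column name; inspect the real url_to_ignore schema first.
    INSERT INTO url_to_ignore (url) VALUES ('doubleclick.net');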
The folder webroot contains a web interface that displays the gathered data and the crawler status. It does not provide a way to execute the crawler.
Want to contribute or found a problem?
See the contributing document: CONTRIBUTING.md
See the uses document: USES