91.132.146.200 Git - aranea.git/summary
 
description: A small web crawler
owner: Banana
last change: Tue, 12 Nov 2024 07:18:49 +0000 (08:18 +0100)
readme

aranea

https://www.bananas-playground.net/projekt/aranea

A small web crawler named aranea (Latin for spider). The aim is to gather unique domains to show what is out there.

Fetch

The crawler starts with a given set of URLs, fetches them, and stores the results so they can be parsed for more URLs, which are fetched in turn. Execute: perl fetch.pl

Parse

Each stored result of a fetch call is parsed for further URLs to follow. Execute: perl parse-results.pl
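
The parse step can be pictured with a small shell sketch. It is not the code in parse-results.pl; `extract_urls` is a hypothetical helper, and it assumes a stored result is a plain file containing HTML. It only finds absolute http(s) URLs and ignores relative links.

```shell
# Sketch only: pull absolute http(s) URLs out of one stored page.
# extract_urls is a hypothetical helper, not part of aranea.
extract_urls() {
    # $1: path to a file holding the fetched page.
    # -o prints every match on its own line. The bracket expression stops
    # at quotes, whitespace and angle brackets, which usually end a URL in HTML.
    grep -oE "https?://[^\"'[:space:]<>]+" "$1" | LC_ALL=C sort -u
}
```

Calling `extract_urls page.html` prints each distinct URL once, which is the list a crawler would queue for the next fetch run.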

Cleanup

After a run, cleanup gathers all unique domains into a table. It also removes URLs from the fetch table that are no longer needed because enough have already been gathered. Execute: perl cleanup.pl
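
As a rough illustration of gathering unique domains, the sketch below reduces a list of URLs to distinct host names. It is an assumption about the idea, not what cleanup.pl does: aranea works on database tables, while this hypothetical `unique_domains` helper reads URLs from standard input and treats the full host name as the domain.

```shell
# Sketch only: reduce URLs (one per line on stdin) to unique host names.
# unique_domains is a hypothetical helper, not part of aranea.
unique_domains() {
    # 1st substitution: drop the scheme ("https://").
    # 2nd substitution: drop everything from the first "/", ":", "?" or "#",
    # which removes path, port, query and fragment.
    # Userinfo (user@host) and IPv6 literals are not handled.
    sed 's,^[a-zA-Z][a-zA-Z0-9+.-]*://,,; s,[/:?#].*$,,' \
        | tr 'A-Z' 'a-z' \
        | LC_ALL=C sort -u
}
```

Host names are lowercased before sorting so that `Example.org` and `example.org` count as one domain.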

Usage

Either run fetch.pl, parse-results.pl and cleanup.pl manually in that order, or use aranea-runner with a cron job. The cron schedule depends on the number of URLs to be fetched and parsed: higher numbers need longer run times. Run the perl files manually first to measure how long a run takes, then plan the schedule around that.
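
The manual order can be written down as a small wrapper. This is a sketch and not the contents of aranea-runner, which may work differently; `run_steps` is a hypothetical name, and it assumes the three scripts live in the current directory.

```shell
# Sketch only: run the three steps in the documented order.
# run_steps is a hypothetical helper; the real aranea-runner may differ.
run_steps() {
    # $1: command used to run each script (defaults to perl).
    runner=${1:-perl}
    for script in fetch.pl parse-results.pl cleanup.pl; do
        # Stop at the first failing step so later steps never work on
        # the output of a broken run.
        "$runner" "$script" || { echo "step failed: $script" >&2; return 1; }
    done
}
```

Timing one manual run, for example with `time run_steps`, gives a lower bound for the interval between cron runs.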

Ignores

The table url_to_ignore contains a small number of domains and partial domains that are ignored. Adding a global SPAM list would be overkill.

It is a good idea to run the crawler behind a DNS filter that has a good blocklist.
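
How a list of domains and partial domains can filter URLs is sketched below. It assumes the ignore entries are exported to a text file, one entry per line, and matched as plain case-insensitive substrings; how aranea matches entries from url_to_ignore is not stated here, and `filter_ignored` is a hypothetical helper.

```shell
# Sketch only: drop URLs that contain any ignore entry.
# filter_ignored is a hypothetical helper, not part of aranea.
filter_ignored() {
    # $1: file with one ignore entry per line; URLs are read from stdin.
    # -F treats entries as fixed strings, -i ignores case,
    # -v keeps only the lines that match no entry.
    # grep exits with 1 when nothing is left, which is not an error here.
    grep -v -i -F -f "$1" || [ "$?" -eq 1 ]
}
```

Substring matching also hits entries that appear in the path of a URL, so short entries can filter more than intended.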

Web interface

The folder webroot contains a web interface that displays the gathered data and the crawler status. It does not provide a way to execute the crawler.

Contribute

Want to contribute or found a problem?

See the contributing document: CONTRIBUTING.md

Uses

See the uses document: USES

shortlog
12 days ago  Banana  adding information about cron and local perl lib usage  [master]
12 days ago  Banana  better formatting
12 days ago  Banana  moved last.run file
12 days ago  Banana  fixed requirements. It was the wrong debian package.
12 days ago  Banana  adding code of conduct and contribution file
12 days ago  Banana  adding the empty log folder
12 days ago  Banana  Merge branch 'master' of ssh://91.132.146.200:443/home...
12 days ago  Banana  adding arane-runner and made some changes
2024-10-21  Banana  readme additons
2024-10-21  Banana  typo
2024-10-18  Banana  needs more commits
2024-10-18  Banana  missing;
2024-10-17  Banana  more stable connection...
2024-10-16  Banana  adding big results also to the failed ones
2024-10-13  Banana  revert to lower charset in db because mariadb does...
2024-10-13  Banana  config update and origin table
...
heads
12 days ago master