URLs. Stores them and fetches them too.
-> fetch.pl
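To make the fetch step concrete, here is a minimal sketch of the idea, assuming `LWP::UserAgent` is used for the HTTP calls; the real `fetch.pl` takes its URL list from the database and tracks failures, this only shows the fetch-and-store core.

```perl
# Minimal fetch-and-store sketch (illustrative, not the full fetch.pl).
use strict;
use warnings;
use LWP::UserAgent;

# id => url pairs; fetch.pl fills this from the url_to_fetch table instead.
my %urlsToFetch = (42 => 'https://example.org/');

my $ua = LWP::UserAgent->new(timeout => 10);
while (my ($id, $url) = each %urlsToFetch) {
	my $res = $ua->get($url);
	next unless $res->is_success;    # failed fetches would be flagged in the DB

	# store the decoded body so the parse step can read it later
	open(my $fh, '>:encoding(UTF-8)', "storage/$id.result")
		or die "Could not open file 'storage/$id.result' $!";
	print $fh $res->decoded_content();
	close($fh);
}
```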
-# Parse
+## Parse
Each URL result (the stored response from the fetch call) will be parsed
for further URLs to follow.
-> parse-results.pl
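As a rough illustration of the parse step, the sketch below pulls `href` values out of a stored result with a simple regex; the actual extraction and filtering in `parse-results.pl` may work differently.

```perl
# Illustrative sketch: collect candidate URLs from one stored fetch result.
use strict;
use warnings;

my $file = 'storage/42.result';    # example result file from the fetch step
open(my $fh, '<:encoding(UTF-8)', $file) or die "Could not open '$file' $!";
my $html = do { local $/; <$fh> };    # slurp the whole file
close($fh);

my @foundUrls;
while ($html =~ /href=["'](https?:\/\/[^"'\s]+)["']/gi) {
	push(@foundUrls, $1);    # parse-results.pl would queue these in url_to_fetch
}
print "$_\n" for @foundUrls;
```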
-# Cleanup
+## Cleanup
After a run, the cleanup step gathers all the unique domains into
a table and removes URLs from the fetch table of which there are
already enough.
-> cleanup.pl
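Conceptually the cleanup comes down to two statements, sketched below with DBI; the `unique_domain` table and its `url` column are assumptions for illustration, while `url_to_fetch` and `baseurl` appear in `cleanup.pl` itself.

```perl
# Sketch of the cleanup idea (table/column names partly assumed, see above).
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=aranea;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# gather every distinct domain into its own table
# (INSERT IGNORE assumes a unique key on unique_domain.url)
$dbh->do("INSERT IGNORE INTO `unique_domain` (`url`)
          SELECT DISTINCT `baseurl` FROM `url_to_fetch`");

# drop queued URLs for a domain that already has enough entries
my $sth = $dbh->prepare("DELETE FROM `url_to_fetch` WHERE `baseurl` = ?");
$sth->execute('https://example.org');
$sth->finish();
$dbh->disconnect();
```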
+
+# Ignores
+
+The table `url_to_ignore` contains a small number of domains and domain fragments which will be ignored.
+Adding a global spam list would be overkill.
+
+It is a good idea to run the crawler behind a DNS filter that has a good blocklist.
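Adding an ignore entry is just an insert into that table; a small sketch follows, in which the column name `searchfor` is an assumption and only the table name `url_to_ignore` comes from this README.

```perl
# Sketch: add a domain fragment to the ignore table (column name assumed).
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=aranea;host=localhost',
                       'user', 'password', { RaiseError => 1 });
my $sth = $dbh->prepare("INSERT INTO `url_to_ignore` (`searchfor`) VALUES (?)");
$sth->execute('doubleclick.net');    # URLs containing this fragment get skipped
$sth->finish();
$dbh->disconnect();
```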
push(@toBeDeletedFromFetchAgain, $baseUrl);
}
$query->finish();
+
sayYellow "Remove baseurls from url_to_fetch: ".scalar @toBeDeletedFromFetchAgain;
$queryStr = "DELETE FROM url_to_fetch WHERE `baseurl` = ?";
sayLog($queryStr) if $DEBUG;
Use `setup.sql` to create the `aranea` database and its tables: `mysql --user=user -p < setup.sql`
# Config
+
+Edit `config.txt` so that it at least matches the database server settings.
+
+Make sure the directory `storage` is writable.
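For orientation, `config.txt` could look roughly like the sketch below; only `FETCH_URLS_PER_RUN` is taken from `fetch.pl`, the other key names and values are illustrative assumptions.

```
# illustrative config.txt sketch; keys other than FETCH_URLS_PER_RUN are assumed
DB_HOST localhost
DB_NAME aranea
DB_USER user
DB_PASS password
FETCH_URLS_PER_RUN 500
```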
# Perl modules
+ [ConfigReader::Simple](https://metacpan.org/pod/ConfigReader::Simple)
++ [Data::Validate::URI](https://metacpan.org/pod/Data::Validate::URI)
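The modules listed here can be pulled from CPAN in the usual way, for example with `cpanm` (or the stock `cpan` client):

```bash
cpanm ConfigReader::Simple Data::Validate::URI
```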
my %urlsToFetch;
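# pick the next batch of URLs that are due to be fetched again, up to FETCH_URLS_PER_RUN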
my $query = $dbh->prepare("SELECT `id`, `url`
FROM `url_to_fetch`
- WHERE `last_fetched` < NOW() - INTERVAL 1 WEEK
+ WHERE `last_fetched` < NOW() - INTERVAL 1 MONTH
OR `last_fetched` IS NULL
AND `fetch_failed` = 0
LIMIT ".$config->get("FETCH_URLS_PER_RUN"));
push(@urlsFailed, $id);
next;
}
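# write the decoded response body to this URL's result file under storage/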
- open(my $fh, '>', "storage/$id.result") or die "Could not open file 'storage/$id.result' $!";
+ open(my $fh, '>:encoding(UTF-8)', "storage/$id.result") or die "Could not open file 'storage/$id.result' $!";
print $fh $res->decoded_content();
close($fh);
push(@urlsFetched, $id);