Motivation
I blogged about the motivation for this script here: Interesting website activities makes me worry about Drive-by Downloads
Background
I needed to check all the links on the site to see which ones might point to spamming sites or, worse, drive-by download sites. I found an existing script and then heavily modified it. The original script was checklinks.pl by Jim Weirich. I added:
- Traversal of documents by MIME type rather than by the .html extension
- Obeying the robots.txt file if one exists
- Signal handlers that dump the progress so far
- A configurable user agent
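The four additions above can be sketched roughly as follows. This is an illustrative outline using LWP::UserAgent and WWW::RobotRules (the module referenced below), not the actual code of the attached script; names like $agent_name and should_parse are my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use WWW::RobotRules;

# Configurable user agent: taken from the environment here for illustration.
my $agent_name = $ENV{CHECKLINKS_AGENT} || 'checklinks/1.0';
my $ua = LWP::UserAgent->new(agent => $agent_name);

# Obey robots.txt: feed each site's robots.txt into WWW::RobotRules,
# then consult it before fetching a URL.
my $rules = WWW::RobotRules->new($agent_name);

sub allowed {
    my ($url) = @_;
    return $rules->allowed($url);
}

# Signal handlers that dump the progress so far: SIGUSR1 reports and
# continues, SIGINT reports and exits.
my %visited;

sub dump_progress {
    print STDERR scalar(keys %visited), " URLs visited so far\n";
    print STDERR "  $_\n" for sort keys %visited;
}
$SIG{USR1} = \&dump_progress;
$SIG{INT}  = sub { dump_progress(); exit 1 };

# Traverse by MIME type, not extension: only parse responses whose
# Content-Type header says they are HTML.
sub should_parse {
    my ($response) = @_;
    my $type = $response->header('Content-Type') || '';
    return $type =~ m{^text/html};
}
```

Keying the traversal decision on the Content-Type header rather than the URL's extension matters because many dynamic pages (e.g. ending in .php or no extension at all) still serve HTML full of links.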
The file is attached to this page.
Theory of Operations
References
Similar ideas I've seen:
Stop spam flood attack with postfix and iptables
Perl/Apache: Parsing Apache HTTPD Logs with Perl Patterns
apache-tools
WWW::RobotRules - Perl module for obeying robot rules