About this post
ABOUT: This entry was posted January 2, 2007 at 8:58 p.m. It is 967 words long, which, in case you're curious, translates to about 27 inches.
SUMMARY: A plea for etiquette in journalistic Web scraping in which I advocate for transparency and courtesy.
Polite Web scraping
Note: This article was originally posted in December 2005
I've always thought of scraping entire Web databases as a last resort, useful when government agencies raise a fuss over data otherwise available online. There are good reasons for that. For one, Web scraping can be a tremendous strain on a Web server. For another, every scrape is recorded, quite obviously, in Web server logs. And because of those things and many others, running a scraping program over and over on an unsuspecting Web server really pisses site admins off.
As I've mentioned in previous posts, I think journalists' access to scrapable Web databases eventually will be shut down by angry Webmasters who are sick of being battered by thousands of automated requests. And that's why scrapers should be polite. There are dozens of ways to do this, but I'm going to suggest three basics: Be transparent, walk softly, and pay attention to the Webmaster's ground rules. Feel free to add your own.
Being transparent
Transparency is easy, mostly because it takes so much effort to hide your scrape that you won't gain much from trying. Most of the time, a Webmaster will notice an unnatural spike in site traffic, and he's going to investigate. Here's some important information he'll find in his server logs. This example is a record of me accessing car-chase.net from my home computer:
- IP address: 12.219.218.86
- When the site was accessed: [06/Apr/2005:18:54:38 -0700]
- What was accessed: GET /wordpress/wp-admin/post.php?action=edit&post=24
- The scraper's user agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2
The information that anyone can glean from this entry is revealing enough. In my case ...
A quick tracert of the IP address reveals who's hosting my Internet connection (Mediacom Cable). IP geolocation thinks I live in New York. I don't, but it is right sometimes, especially with company LANs. If I access from a company network, the log usually reveals my often-incriminating username (cdavis, perhaps). If I run the script off of my Web server, a WHOIS search can give you my name, phone number, etc. There's a lot more for anyone who wants to dig.
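To see how little work it takes a Webmaster to pull those fields out, here is a minimal sketch that splits one log line into its parts. It assumes the server writes Apache's "combined" log format. The HTTP version, status code, byte count and empty referer in the sample line are made up for illustration, and parse_log_line is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample line in Apache "combined" format. The IP, timestamp, request path
# and user agent come from the example above; "HTTP/1.1", 200, 5120 and the
# empty referer ("-") are invented placeholders.
my $sample = '12.219.218.86 - - [06/Apr/2005:18:54:38 -0700] '
           . '"GET /wordpress/wp-admin/post.php?action=edit&post=24 HTTP/1.1" '
           . '200 5120 "-" '
           . '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2"';

# Split one combined-format line into named fields.
# Returns a hash reference, or nothing if the line does not match.
sub parse_log_line {
    my ($line) = @_;
    my @fields = $line =~ m{
        ^(\S+) \s+ \S+ \s+ \S+ \s+     # client IP, then two fields we skip
        \[ ([^\]]+) \] \s+             # timestamp inside square brackets
        "([^"]*)" \s+                  # request line
        (\d+) \s+ (\S+) \s+            # status code and response size
        "([^"]*)" \s+ "([^"]*)" \s*\z  # referer and user agent
    }x or return;

    my %entry;
    @entry{qw(ip when request status bytes referer agent)} = @fields;
    return \%entry;
}

my $entry = parse_log_line($sample);
print "$_: $entry->{$_}\n" for qw(ip when request agent);
```

The point is less the regex than the result: the IP, the time, the page and the user agent are all sitting in plain text, one pattern match away.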
You could try hiding your IP, but it might get you in trouble if government agencies get suspicious. A much easier solution is letting Webmasters know exactly who you are and how to reach you if they have a problem. With Perl, that means altering the user agent variable in the two most popular scraping modules: LWP::UserAgent and its offshoot WWW::Mechanize.
LWP::UserAgent's user agent-definition method is invoked like so: $useragent->agent( "Agent name" ); where "Agent name" is the name of the user agent that will appear in the server logs. Most of the time, folks set this to emulate popular Web browsers like "Mozilla/5.0". Especially if you're scraping for a professional organization, I recommend you pass an argument that explains what you're all about:
$useragent->agent( "This script is run by Chase Davis of the ABC Times. Contact me at (555)-555-5555 if you have any questions" );
You can also use the $useragent->from() method to pass your e-mail address, but that won't show up in many server logs. The user agent won't show up in every log, either, but recording it has become much more standard.
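Put together, the two calls might look like the sketch below. It only builds the object and prints the headers it would send, so nothing touches the network. The helper name build_polite_agent and the e-mail address are hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that tells the Webmaster who is scraping.
sub build_polite_agent {
    my ($who, $email) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->agent($who);     # sent as the User-Agent request header
    $ua->from($email);    # sent as the From request header
    return $ua;
}

my $useragent = build_polite_agent(
    'This script is run by Chase Davis of the ABC Times. '
  . 'Contact me at (555)-555-5555 if you have any questions',
    'cdavis@example.com',    # hypothetical address
);

# No request is sent here; this only shows what would go out.
print 'User-Agent: ', $useragent->agent, "\n";
print 'From: ',       $useragent->from,  "\n";
```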
For WWW::Mechanize, which is based on LWP::UserAgent, you can pass the same user agent argument in the constructor for a mechanize object.
my $mech = WWW::Mechanize->new( agent=>"This script is run by ... etc." );
Naturally, you'll have to play full disclosure by ear. Sometimes you don't want the Webmaster knowing about you, and you're not going to be scraping long enough for him to notice. In those cases, consider hiding behind a browser.
Walking softly
Scraping bots can gum up Web servers if you launch them rapid-fire, especially during peak hours. Considering that, I usually run my scrapes late at night, when they won't slow down the site's casual users. Using the crontab, I usually set 2 to 6 a.m. as a good window. If it takes a couple nights, that's fine - at least I'm leaving the server alone during the day. I also use sleep statements.
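Cron can start the job at 2 a.m., but it won't stop a long scrape from running into the morning. One way to handle that, sketched here under the assumption that the script checks the clock itself, is a small guard the script can call before each batch. The name in_scrape_window and the crontab path are hypothetical.

```perl
use strict;
use warnings;

# True (1) when $hour (0-23) falls inside the window [$start, $end).
sub in_scrape_window {
    my ($hour, $start, $end) = @_;
    return ($hour >= $start && $hour < $end) ? 1 : 0;
}

# Hypothetical crontab entry that starts the job at 2 a.m. every night:
#   0 2 * * * perl /path/to/scrape.pl
my $hour = (localtime)[2];    # current hour in local time

if (in_scrape_window($hour, 2, 6)) {
    print "Inside the 2 to 6 a.m. window; OK to scrape.\n";
}
else {
    print "Outside the window; stopping until tomorrow night.\n";
}
```

Note that this uses the local time of the machine running the script. If the target server sits in another time zone, the window may need shifting.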
Unless you tell it otherwise, your script will scrape record after record as fast as it can. That's good for you - you'll be done faster - but the server won't be so happy. That's why you should always wait a few seconds before every iteration of your main scraping loop. This statement should do the job:
sleep(1 + int(rand(5)));
As you might guess, this causes the program to delay for a random 1 to 5 seconds, giving the server time to breathe. (A bare int(rand(5)) can return 0, which means no pause at all.)
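Here is a rough sketch of how that pause might sit inside a main scraping loop. The function names are hypothetical, and the fetch step is passed in as a code reference so the loop itself makes no assumptions about how pages are retrieved.

```perl
use strict;
use warnings;

# Pick a random whole number of seconds from $min to $max, inclusive.
sub polite_delay {
    my ($min, $max) = @_;
    return $min + int(rand($max - $min + 1));
}

# Run $fetch on each URL in turn, pausing between requests.
# $fetch is any code reference that takes a URL and returns a result.
sub scrape_politely {
    my ($urls, $fetch, $min, $max) = @_;
    my @results;
    for my $i (0 .. $#{$urls}) {
        push @results, $fetch->($urls->[$i]);
        # Pause before the next request, but not after the last one.
        sleep(polite_delay($min, $max)) if $i < $#{$urls};
    }
    return @results;
}

# Example call (commented out so the sketch does not touch the network;
# @record_urls is a hypothetical list of pages to fetch):
# my $ua    = LWP::UserAgent->new;
# my @pages = scrape_politely(\@record_urls, sub { $ua->get($_[0]) }, 1, 5);
```

Randomizing the pause, rather than sleeping a fixed number of seconds, spreads the requests out a little less mechanically. The minimum of 1 guarantees the server always gets at least some rest.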
Following the rules
I save this for last because it's the most obvious and least technical. Check your target site's robots.txt file (found at www.site.com/robots.txt) and terms of service. The robots.txt file tells automated Web spiders where they're not welcome, and not only should you check it out of courtesy, but you also might find something interesting. Take mine for example: You'll notice I'm hiding access to a directory called "topsecret", yet there isn't a link like that anywhere to be found. I wonder why that might be ...
In any case, if a Webmaster goes through the trouble of hiding something, chances are he'll notice if it's been scraped. Same if a site's terms of service, usually linked from the bottom of the homepage, explicitly prohibit "automated access". In that case, it's probably best to speak with the Webmaster before launching a huge scraping project. Webmasters can (and probably will) cut off your access if you don't. Not to mention the legal concerns.
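The robots.txt check can also be done in code rather than by eye. The sketch below uses the WWW::RobotRules module from CPAN, which parses robots.txt text and answers whether a given URL is allowed. The robots.txt body here is an invented stand-in built around the "topsecret" example, and the robot name and the helper may_fetch are hypothetical. In a real script you would download the file from the site first.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Return 1 if robots.txt lets the named robot fetch $target_url, else 0.
# $robots_url is where the robots.txt text came from; $robots_txt is its body.
sub may_fetch {
    my ($robot_name, $robots_url, $robots_txt, $target_url) = @_;
    my $rules = WWW::RobotRules->new($robot_name);
    $rules->parse($robots_url, $robots_txt);
    return $rules->allowed($target_url) > 0 ? 1 : 0;
}

# Invented robots.txt body for illustration only.
my $robots_txt = "User-agent: *\nDisallow: /topsecret/\n";
my $robots_url = 'http://www.site.com/robots.txt';

for my $path ('/index.html', '/topsecret/notes.html') {
    my $url     = "http://www.site.com$path";
    my $verdict = may_fetch('ABCTimesScraper/1.0', $robots_url, $robots_txt, $url)
                ? 'allowed'
                : 'disallowed';
    print "$url: $verdict\n";
}
```

This covers only robots.txt. Terms of service still have to be read by a human.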
As with anything, a little courtesy here will go a long way. I've come up with a few basics, but scrapers can do much more. What polite scraping tips would you suggest?
