Poor Man's Webspider
This blog post was originally written for Border Stylo. The best way to thank them for letting me republish it is to check out spire.io, kickass APIs for web and mobile apps.
Webspiders are fun, but the learning curve is awfully steep. Websites don't like crawlers stumbling about where they're not wanted, and barriers as simple as a login screen can stymie a beginner. Add in checks on user-agent strings and JavaScript-heavy links, and your weekend is over before you've gotten anything to work; so much for the side project. This blog post will show you how to turn your browser and LAMP server into a spider capable of taking you straight to the fun.
The Basic Idea
A Greasemonkey script will pull data off the pages we're interested in and send it to a PHP script. The PHP script will then tell the Greasemonkey script what to do next: either open an alert box telling the user something went wrong, or move on to another URL.
In this way, we can foist all the page-loading, cookie-storing, and JavaScript-handling chores onto Firefox and play with a DOM instead of a big string of HTML. To make the JavaScript bit even easier, we'll add jQuery to any page we visit.
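To make that concrete, here's a sketch of the round trip. GM_xmlhttpRequest is Greasemonkey's cross-domain request API; the endpoint URL and the {ok, next, error} reply shape are stand-ins of mine, not the repo's exact code:

```javascript
// Sketch of the Greasemonkey-to-PHP round trip. The endpoint URL and
// the {ok, next, error} reply shape are illustrative stand-ins.
function reportAndMoveOn(scrapedData) {
  GM_xmlhttpRequest({
    method: "POST",
    url: "http://localhost/eensyweensy/eensyweensy.php",
    headers: { "Content-Type": "application/json" },
    data: JSON.stringify(scrapedData),
    onload: function (response) {
      var reply = JSON.parse(response.responseText);
      if (reply.ok) {
        // Loading the next URL makes this script fire all over again.
        window.location.href = reply.next;
      } else {
        alert("eensyweensy: " + reply.error);
      }
    }
  });
}
```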
A Working Example
For a silly example, let's scrape the URLs of all the blog posts on Border Stylo's website and save them as a big text file with one URL per line.
The code for this example lives over at my GitHub project eensyweensy. Please take a look at README.md and head back here when you're done (I'll wait).
eensyweensy.user.js starts off with the following metadata block:
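(A representative block; the @include pattern and jQuery version here are stand-ins, so check the repo for the exact lines.)

```javascript
// ==UserScript==
// @name        eensyweensy
// @namespace   eensyweensy
// @description A poor man's webspider
// @include     http://borderstylo.com/posts*
// @require     http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
// ==/UserScript==
```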
The most important line here is the @include; this tells Greasemonkey which pages to run this script on. Check out the Metadata Block page at Greasespot for more details.
The JavaScript file is mostly the definition of Spider, and it's not until the lines after that that we get to the interesting bits:
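They boil down to something like this (a sketch: the backend URL and the CSS selector are stand-ins for the repo's real ones):

```javascript
// Illustrative sketch of the lines after the Spider definition.
var spider = new Spider("http://localhost/eensyweensy/eensyweensy.php");

// grabber() gets the page's jQuery object and returns an object literal
// describing what we scraped; the Spider serialises it and ships it off.
spider.grabber = function ($) {
  var urls = [];
  $(".post h2 a").each(function () {
    urls.push(this.href);
  });
  return { urls: urls };
};

// Scrape, report to the backend, and await marching orders.
spider.crawl();
```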
spider.grabber() is the most important part of the Greasemonkey script. It takes a jQuery $ as an argument, scrapes the page for the data we're interested in, and returns an object literal which will be serialised and sent to the PHP script.
eensyweensy.php is an ugly PHP script (aren't they all?), so I'll talk about the salient portions (lines 28-45 for the brave of heart) without the aid of a gist.
If we're not on the last page, we go to the next page (through some regex to move from /posts to /posts?page=1 and so on). This causes the Greasemonkey script to fire again, continuing the party. If we are on the last page, we jump to hampsterdance.com, which ends our particular run.
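For a taste without the full gist, that decision boils down to something like this (a sketch; the incoming JSON shape and variable names are mine, not the repo's):

```php
<?php
// Loose sketch of the "where to next?" logic; the incoming JSON shape
// (url, lastPage, urls) is illustrative, not eensyweensy.php's own.
$data = json_decode(file_get_contents('php://input'), true);

// Append the scraped URLs to output.txt, one per line.
file_put_contents('output.txt',
                  implode("\n", $data['urls']) . "\n",
                  FILE_APPEND);

if ($data['url'] === $data['lastPage']) {
    // Last page: send the spider somewhere harmless to end the run.
    $next = 'http://hampsterdance.com';
} elseif (preg_match('/page=(\d+)/', $data['url'], $m)) {
    // /posts?page=N becomes /posts?page=N+1 ...
    $next = preg_replace('/page=\d+/', 'page=' . ($m[1] + 1), $data['url']);
} else {
    // ... and bare /posts becomes /posts?page=1.
    $next = $data['url'] . '?page=1';
}

echo json_encode(array('ok' => true, 'next' => $next));
```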
So what did we win? The file output.txt has been generated and filled with URLs. Yay!
Where Do We Go From Here?
At this point, you should be itching to alter eensyweensy to go somewhere interesting and do something cool. Some prompts to get you started:
- For pages where a login is required, you can turn off Greasemonkey (by left-clicking the monkey icon), fill in the login credentials, and then turn Greasemonkey back on once you're inside.
- For more interesting storage options than text files, plug MySQL into PHP and party on.
- Feel free to switch out the backend language for something you like more. For example, the Perl, Python, or Ruby equivalents should be trivial to write (if you do port it, I'd love to add it to eensyweensy).
- The Greasemonkey script for eensyweensy is already set up to run in parallel on multiple computers. The backend would need to do some extra legwork to make sure it doesn't send two instances to the same URL, but that's it.
- My example just pulls one kind of data from one kind of URL, but you don't have to stop there. When I first came up with this idea, I had a script that scraped a popular auction site's closed auctions in two passes. The first pass grabbed the URLs of individual items from a series of search-results pages, and the second pass went to each of those URLs to gather more data. I'd recommend writing one Greasemonkey script for each kind of page you want to grab.
- Since the backend's sending JSON around anyway, there's no need to even define grabber on the frontend. You could write a more general @include that would run on any page, and send the definition of grabber along with each new URL (there's a rough sketch of this below). Greasespot has enough information about storing data between runs to get you started.
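Here's a rough sketch of those moving parts; the reply shape is invented, while GM_setValue and GM_getValue are Greasemonkey's real storage calls:

```javascript
// Rough sketch of shipping grabber definitions over the wire. The
// reply shape is invented; GM_setValue/GM_getValue persist strings
// between page loads, and eval() revives the function from its source.
function handleReply(reply) {
  GM_setValue("grabberSource", reply.grabber); // e.g. "function ($) { ... }"
  window.location.href = reply.next;
}

// On the next page load, rebuild grabber before crawling.
var source = GM_getValue("grabberSource", null);
if (source) {
  spider.grabber = eval("(" + source + ")");
  spider.crawl();
}
```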
Closing Remarks
My hope is that eensyweensy can serve as a gateway drug: it's not as cool, fast, or strong as its older brothers, but it should give you just enough fun to encourage you to stick with it. Basically, if you find yourself taking any of the embellishment prompts above seriously, you are probably ready to stop playing with a toy spider and move up to the big time.