Poor Man's Webspider

posted mar 1, 2010 at 3:00pm on borderstylo

This blog post was originally written for Border Stylo. The best way to thank them for letting me republish it is to check out spire.io, kickass APIs for web and mobile apps.

Webspiders are fun, but the learning curve is awfully steep. Websites don't like crawlers stumbling about where they're not wanted, and barriers as simple as a login screen can stymie a beginner. Add in checks on user-agent strings and javascript-heavy links, and your weekend is over before you've gotten anything to work. Side project over! This blog post will show you how to turn your browser and LAMP server into a spider capable of taking you straight to the fun.

The Basic Idea

A greasemonkey script will pull data off the pages we're interested in and send it to a php script. The php script will then tell the greasemonkey script what to do next: either open an alert box telling the user something went wrong, or move on to another url.

In this way, we can foist all the page loading/cookie-storing/javascript-handling chores onto firefox and play with a dom instead of a big string of html. To make the javascript bit even easier, we'll add jquery to any page we visit.
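
To make the "what to do next" step concrete, here is a minimal sketch of how the greasemonkey side might act on the php script's answer. The reply format (`status`, `next`, `message`) and the `browser` helper object are assumptions made up for this illustration; they are not taken from eensyweensy itself.

```javascript
// Hypothetical reply format (an assumption, not eensyweensy's actual protocol):
//   { status: 'ok', next: 'http://...' }       -> move on to another url
//   { status: 'error', message: 'what broke' } -> tell the user
function handleReply(reply, browser) {
  if (reply.status === 'ok' && reply.next) {
    browser.goTo(reply.next);
    return 'moved';
  }
  browser.warn(reply.message || 'Something went wrong.');
  return 'alerted';
}

// In a userscript, the browser hooks would simply be the real thing:
// handleReply(reply, {
//   goTo: function (url) { window.location.href = url; },
//   warn: function (text) { alert(text); }
// });
```

Passing the navigation and alert functions in, rather than calling them directly, keeps the decision logic easy to poke at outside the browser.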

A Working Example

For a silly example, let's scrape the urls of all the blog posts on Border Stylo's website and save them as a big text file with one url per line.

The code for this example lives over at my github project eensyweensy. Please take a look at README.md and head back here when you're done (I'll wait).

eensyweensy.user.js starts off with the following metadata block:

The most important line here is the @include; this tells greasemonkey which pages to run this script on. Check out the Metadata Block page at Greasepot for more details.
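
For readers who have never seen one, a metadata block generally looks something like the sketch below. Every value here (the name, the namespace, the example.com include pattern, and the jquery url in @require) is a placeholder chosen for illustration, not what eensyweensy.user.js actually contains.

```javascript
// ==UserScript==
// @name        eensyweensy (illustrative placeholder)
// @namespace   http://example.com/eensyweensy
// @include     http://example.com/posts*
// @require     https://code.jquery.com/jquery-3.7.1.min.js
// ==/UserScript==
```

The trailing `*` in the @include pattern is a wildcard, so the script would also fire on paginated urls under the same path.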

The javascript file is mostly the definition of Spider; the interesting bits only show up in the lines after it:

spider.grabber() is the most important part of the greasemonkey script. It takes a jquery $ as an argument, scrapes the page for the data we're interested in, and returns an object literal which will be serialised and sent to the php script.
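
A grabber along these lines might look like the following sketch. The selector `a.post-link` and the `urls` key are assumptions for illustration only; the real grabber in the repo may well differ.

```javascript
// Hypothetical grabber: takes a jquery-style $ and returns a plain object.
// The selector 'a.post-link' and the 'urls' key are assumptions for illustration.
function grabber($) {
  var urls = $('a.post-link').map(function () {
    // Inside .map(), `this` is the matched element.
    return this.href;
  }).get(); // .get() turns the jquery result into a plain array
  return { urls: urls };
}
```

Because the function only touches the page through the `$` it is handed, pointing it at a different site is mostly a matter of changing the selector.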

eensyweensy.php is an ugly php script (aren't they all?), so I'll talk about the salient portions (lines 28-45 for the brave of heart) without the aid of a gist.

If we're not on the last page, we go to the next page (through some regex to move from /posts to /posts?page=1 and so on). This causes the greasemonkey script to fire again, continuing the party. If we are on the last page, we jump to hampsterdance.com, which ends our particular run.
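
The real paging logic lives in eensyweensy.php. Purely as an illustration, the same idea is restated below in javascript; the function name and the exact regex are my own assumptions, not a transcription of the php.

```javascript
// Hypothetical helper: given the current listing url, work out the next one.
// Assumes /posts is followed by /posts?page=1, then /posts?page=2, and so on.
function nextPageUrl(url) {
  var match = url.match(/^(.*\/posts)(?:\?page=(\d+))?$/);
  if (!match) {
    return null; // not a listing page we recognise
  }
  var current = match[2] ? parseInt(match[2], 10) : 0;
  return match[1] + '?page=' + (current + 1);
}

// nextPageUrl('http://example.com/posts')        -> 'http://example.com/posts?page=1'
// nextPageUrl('http://example.com/posts?page=1') -> 'http://example.com/posts?page=2'
```

Deciding whether we're on the last page is a separate question that this sketch leaves out.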

So what did we win? The file output.txt has been generated and filled with urls. Yay!

Where Do We Go From Here?

At this point, you should be itching to alter eensyweensy to go somewhere interesting and do something cool. Some prompts to get you started:

Closing Remarks

My hope is that eensyweensy can serve as a gateway drug: it's not as cool, fast, or strong as its older brothers, but it should give you just enough fun to encourage you to stick with it. Basically, if you find yourself taking any of my advice in embellishments, you are probably ready to stop playing with a toy spider and move up to the big time.