Screen-scraping made... kind of easier, I guess
Screen scraping is a still fairly common technique in which a script takes the output of a web page (one that wasn’t designed to be consumed by anything but a person viewing it in a web browser), grabs the portions relevant to its purposes, and spits out just the parts it wants. In fact, one of the first plugins I wrote for the old moobot project was a Google plugin, and I thought I was so clever for writing it against the IE search results version of Google, since that was much lighter and easier to parse. It had no ads, no real formatting to speak of, and it was pretty much 80% data to 20% “other crap”, making it easy to scrape out the bits I wanted. Thankfully they started providing an API until, well, see the below post.
But Google has plenty of great web designers on staff, so its pages don’t have horribly malformed HTML or tons of inline style crap. Most other websites out there aren’t quite in the same boat. In fact, looking at the source of the vast majority of sites will make most competent web designers wince, if not cry. Even well-formed HTML is not that easy to parse, and folks often resort to imprecise string matching or nasty regular expressions to get the job done. If only there were a way to get that nasty HTML into a more nicely-parseable format…
Well, I saw this blog post on programming.reddit.com (highly recommended, btw), and apparently its author has set up a service that will fetch webpages and transmogrify them into either (presumably well-formed) XML or JSON, two flavors of output that have become popular with the rise of AJAX. Unfortunately, you still get a lot of crap because, hey, Garbage In, Garbage Out. But at least it’s crap in a somewhat prettier outfit. Or if not prettier, at least easier to dig through. However, take heed of the author’s plea at the end and don’t hammer the crap out of this.
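To give a rough idea of why well-formed XML is easier to dig through than raw HTML, here is a minimal Python sketch. Everything in it is an assumption for illustration: the sample markup is made up, I haven't shown (and don't know) what the service's actual output looks like, and `extract_links` is just a hypothetical helper name. The point is only that once the page is valid XML, a stock parser can walk it, with no string matching or regexes needed.

```python
import xml.etree.ElementTree as ET

# Made-up sample of a page after conversion to well-formed XML.
# This is NOT the real service's output, just a stand-in for the demo.
SAMPLE_XML = """
<html>
  <body>
    <div class="nav"><a href="/home">Home</a></div>
    <div class="results">
      <a href="http://example.com/one">First result</a>
      <a href="http://example.com/two">Second result</a>
    </div>
  </body>
</html>
"""


def extract_links(xml_text, css_class):
    """Return (href, text) pairs for every <a> inside <div class=css_class>."""
    root = ET.fromstring(xml_text.strip())
    links = []
    for div in root.iter("div"):
        # Skip the navigation and other junk; keep only the div we care about.
        if div.get("class") == css_class:
            for anchor in div.iter("a"):
                links.append((anchor.get("href"), (anchor.text or "").strip()))
    return links


result_links = extract_links(SAMPLE_XML, "results")
for href, text in result_links:
    print(href, text)
```

The navigation link gets skipped without any pattern matching at all, which is about as much "prettier outfit" as you can hope for. If the output were JSON instead, the same idea would apply with `json.loads` and plain dictionary and list lookups.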
