The Futility of Screen-Scraping

Martin’s article today about the Daily Express web site reminded me that it’s been some months since I looked at my list of Newspaper RSS feeds. As the list is created by screen-scraping the individual papers’ web sites, it’s no surprise that it all goes out of date as the sites are redesigned and updated.

And sure enough, it was a real mess. When I ran the program that generates the pages, about half of them were broken. But it wasn’t too serious, and after half an hour or so of tinkering with regular expressions, it all seemed to be working again.

But all in all, it’s a good lesson in why screen-scraping is a really bad idea. This would be far easier (in fact it would pretty much be unnecessary) if the papers took the next step and released OPML files of their feeds, rather than free-form web pages.
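
To show how little work a published OPML file would need, here is a minimal Python sketch. It is not the program that generates the list (that code isn't shown here), and the sample OPML, its feed names and the example.com URLs are all made up for illustration. It assumes the usual OPML convention of outline elements carrying an xmlUrl attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML document; the titles and URLs are invented for this example.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Example Paper feeds</title></head>
  <body>
    <outline text="News" type="rss" xmlUrl="http://example.com/news.rss"/>
    <outline text="Sport">
      <outline text="Football" type="rss" xmlUrl="http://example.com/football.rss"/>
    </outline>
  </body>
</opml>
"""


def feeds_from_opml(opml_text):
    """Return (title, feed URL) pairs for every outline that has an xmlUrl."""
    root = ET.fromstring(opml_text)
    feeds = []
    # iter() walks nested outlines too, in document order.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:  # category outlines without a feed URL are skipped
            feeds.append((outline.get("text", ""), url))
    return feeds


if __name__ == "__main__":
    for title, url in feeds_from_opml(SAMPLE_OPML):
        print(title, url)
```

There are no regular expressions to retune here: a site redesign can change the HTML as much as it likes, and as long as the OPML file stays where it is, the parsing doesn't need to change.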

Anyway, it’s all back again now. Please take a look and let me know if I’m missing anything obvious.

3 comments

Regexps? How retro. Template::Extract is my new best friend.

You’re right, of course. But this was a quick fix of an existing prototype, and I didn’t have time to reimplement it. When I _do_ reimplement it, I’ll definitely be looking at Template::Extract.

Ah, excellent web synergy. I was just trying to work out the quickest way of subscribing to whole bunches of newspaper RSS feeds in one go to do some stats on them, and voilà, there are your updated OPML files for them. Cheers :-)
