The Futility of Screen-Scraping

Martin’s article today about the Daily Express web site reminded me that it’s been some months since I looked at my list of Newspaper RSS feeds. As the list is created by screen-scraping the individual papers’ web sites, it’s no surprise that it all goes out of date as the sites are redesigned and updated.

And sure enough, it was a real mess. When I ran the program that generates the pages, about half of them were broken. Fortunately the breakage was shallow, and after half an hour or so of tinkering with regular expressions, it all seems to be working again.

But all in all, it’s a good lesson in why screen-scraping is a really bad idea. Maintaining this list would be far easier (in fact, my program would be pretty much unnecessary) if the papers took the next step and published OPML files of their feeds, rather than listing them on free-form web pages.
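To give a rough idea of the difference, here is a minimal Python sketch of what reading feeds from an OPML file could look like. It is not the program I actually use: the sample document, its titles and its example.com URLs are made-up placeholders, and extract_feeds is a hypothetical helper name. It assumes the common OPML convention that each feed is an outline element with an xmlUrl attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML document; the titles and URLs are placeholders,
# not any real newspaper's feeds.
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Example Paper Feeds</title></head>
  <body>
    <outline text="News" type="rss" xmlUrl="https://example.com/news/rss.xml"/>
    <outline text="Sport">
      <outline text="Football" type="rss" xmlUrl="https://example.com/sport/football/rss.xml"/>
    </outline>
  </body>
</opml>
"""


def extract_feeds(opml_text):
    """Return (title, feed URL) pairs for every outline that has an xmlUrl."""
    root = ET.fromstring(opml_text)
    feeds = []
    # iter() walks nested outlines too, so category groupings are handled.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:  # outlines without xmlUrl are just folders
            feeds.append((outline.get("text", ""), url))
    return feeds


if __name__ == "__main__":
    for title, url in extract_feeds(SAMPLE_OPML):
        print(title, url)
```

The point is that nothing here depends on how a page happens to be laid out. A site redesign that breaks a regular expression would leave a parser like this untouched, as long as the OPML file itself stays valid.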

Anyway, it’s all back again now. Please take a look and let me know if I’m missing anything obvious.

Join the conversation

3 Comments

  1. Ah, excellent web synergy. I was just trying to work out the quickest way of subscribing to whole bunches of newspaper RSS feeds in one go to do some stats on them, and voilà, there are your updated OPML files for them. Cheers :-)
