Guardian URLs

I’m a great believer in the idea that URLs should be permanent. When I publish something on the web then (hopefully) people link to it, and it would be nice to think that those links still work in five, ten or fifty years’ time. A few months ago I changed the URL scheme for davblog, but I ensured that the old-style URLs would redirect to the new ones.

Of course, this is a relatively small site. It has a couple of thousand entries. My fix to ensure that the old URLs still worked simply consisted of a few pages of Apache RedirectPermanent directives. If you’re dealing with a site that is larger and more important than this one, then the problems become far harder.
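For a small site, those pages of directives don’t need to be written by hand. Here is a minimal sketch, assuming you already have a mapping from old paths to new ones. The paths and the blog.example.com host are made up for illustration and are not the real davblog URLs; only the RedirectPermanent directive itself (from Apache’s mod_alias) is real.

```python
def redirect_directives(mapping, new_base):
    """Return one Apache RedirectPermanent line per old path.

    mapping  -- dict of old URL path -> new URL path (both start with "/")
    new_base -- scheme and host that the new paths live under
    """
    lines = []
    for old_path, new_path in sorted(mapping.items()):
        lines.append(f"RedirectPermanent {old_path} {new_base}{new_path}")
    return lines


# Hypothetical old and new paths, for illustration only.
mapping = {
    "/archives/000123.html": "/2005/01/example-entry/",
    "/archives/000124.html": "/2005/01/another-entry/",
}

for line in redirect_directives(mapping, "https://blog.example.com"):
    print(line)
```

The output can be pasted into the server configuration or an .htaccess file. This only works while the mapping is small enough to enumerate, which is exactly why the problem gets harder on a larger site.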

So it was nice to see Simon’s post pointing out that the Guardian had taken this problem seriously and had put some work into making sure that their old URLs still work correctly now they are in the process of switching to a new URL scheme. As an example, he links to an old blog entry which contains a link to

http://www.guardian.co.uk/religion/Story/0,2763,1382899,00.html

No prizes for guessing which CMS generated that nasty URL. Clicking on that URL now redirects you to the (far saner)

http://www.guardian.co.uk/world/2005/jan/04/religion.uk

And all is well with the world.

Well, almost. Digging around on some old (and rather embarrassing) web sites that I haven’t got round to taking down yet (because URLs are permanent!) I find this page (love that 1997 web design) which contains a number of links to Guardian web pages. Here’s an example:

http://politics.guardian.co.uk/News/1997_07_24/1997_07_24_29440.html

Clicking on that page leads to a shiny new “URL not found” page.

Which, I think, demonstrates a couple of interesting things. Firstly, at some point when the Guardian were moving from one CMS to another the permanence of the URLs wasn’t considered a high priority. There is no chain of redirection in place which converts this old URL to a newer style one. It looks like when the Guardian moved from this URL scheme, they broke all incoming links to their site. I wonder if that problem was even considered ten years ago.

Secondly, look at that really old URL. It’s not perfect by a long way but, to me, it looks a lot easier to understand than the first URL example above (the one generated by the CMS they are currently moving away from). There’s one “magic number” in it – 29440 is probably the article ID in some database – but you can work out the date that the article was published (24th July 1997) and the section it was in (Politics News). The other URL tells you that it points to a religious story, but those four numbers at the end make most of the URL completely meaningless.
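To show how much of that 1997 URL is machine-readable, here is a rough sketch that pulls it apart. The pattern is inferred from this single example URL, so it is an assumption about the scheme rather than a description of how the Guardian’s old system actually worked, and treating the final number as an article ID is only my guess. The function and field names are made up.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Pattern inferred from one example URL: /<section>/<YYYY_MM_DD>/<YYYY_MM_DD>_<id>.html
OLD_STYLE = re.compile(
    r"^/(?P<section>[^/]+)"
    r"/(?P<stamp>\d{4}_\d{2}_\d{2})"
    r"/(?P=stamp)_(?P<article_id>\d+)\.html$"
)


def parse_old_guardian_url(url):
    """Return the readable parts of an old-style URL, or None if it doesn't fit."""
    parts = urlparse(url)
    match = OLD_STYLE.match(parts.path)
    if match is None:
        return None
    year, month, day = (int(n) for n in match.group("stamp").split("_"))
    return {
        "site": parts.hostname.split(".")[0],   # first label of the host name
        "section": match.group("section"),
        "published": date(year, month, day),
        "article_id": int(match.group("article_id")),  # assumed to be a database ID
    }


info = parse_old_guardian_url(
    "http://politics.guardian.co.uk/News/1997_07_24/1997_07_24_29440.html"
)
print(info)
```

The same function returns None for the comma-laden CMS URL, because nothing in that path follows a date-based pattern.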

Working out a good URL scheme isn’t a trivial task. That’s particularly true for a complex site like the Guardian. I’m really glad to see that they are making great progress in this area. But it’s interesting to see that at some point in the history of their site their URL scheme took what seems to be a big step backwards. Presumably, switching to the CMS which produced those nasty URLs was seen as giving them many other advantages that outweighed the URL damage.

I wonder if there’s anyone around who remembers this change.

Update: Searching the Guardian site finds only one article that was published on July 24th 1997. And that doesn’t look at all like the one that I was trying to link to, which was apparently about student fees. So it appears that not only are the links broken, but that some of the content from that era is no longer available on the site.

Oh, and thanks to Robin for adding his comments. I was hoping that someone like him might drop by and chip in.

One comment

  1. I was there. I don’t remember the question of URL permanence being seriously considered. It would have been rather difficult to achieve at that stage, because there was no global article identifier that could have been used to correlate the article ID numbers between the old and new systems. (It presumably would have been possible to use some sort of textual-similarity heuristic to forge the association reasonably well, if it had been considered important enough.)

     As for Vignette… it’s easy to forget, now, that they once boasted of being the fastest-growing software company in the history of the world. They had a product that just about worked, and an ingenious business model that rendered the quality of the product largely irrelevant. Their strategy was to charge such high prices that the decision to purchase the product could only be made at the highest levels of an organisation, and to aim their marketing at the boardroom, rather than the techies. I doubt that the clarity of URL schemes played an important part in the decision-making process (with which I was not personally involved).
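The textual-similarity heuristic suggested in the comment above could look something like the following. This is only a sketch of one possible approach, using Python’s standard difflib module; it is not what anyone at the Guardian did, and the article IDs, sample text and 0.8 threshold are all invented for illustration.

```python
from difflib import SequenceMatcher


def best_match(old_text, new_articles, threshold=0.8):
    """Return the ID of the new article most similar to old_text.

    new_articles -- dict of new article ID -> article text
    threshold    -- minimum similarity ratio (0.0 to 1.0) to accept a match
    Returns None when nothing is similar enough.
    """
    best_id, best_score = None, 0.0
    for new_id, new_text in new_articles.items():
        score = SequenceMatcher(None, old_text, new_text).ratio()
        if score > best_score:
            best_id, best_score = new_id, score
    return best_id if best_score >= threshold else None


# Made-up sample data standing in for articles in the new system.
new_articles = {
    "a": "Students face new tuition fees from next year",
    "b": "Cricket: England win the test",
}

print(best_match("Students face new tuition fees from next year.", new_articles))
```

Comparing every old article against every new one is slow for a large archive, so a real migration would probably narrow the candidates first, for example by publication date.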
