Today Twitter got very excited about a story on the Independent web site. Actually, it wasn’t the story that got people excited, it was the URL that was being shared for the story. The story was some nonsense about Kate Middleton’s face being seen in a jelly bean. The URL was:
And if you click on it, sure enough, it takes you to the story on the Independent web site. Some people presented this as evidence of a joker (or, worse, a republican) taking control of the web site. But the actual explanation is a little more complex than that.
The real URL – the one that the Independent published on the site and in its web feed – was somewhat different. It was:
That seems far more reasonable, doesn’t it? (Well, of course, the story is still completely ridiculous, but we’ll ignore that.) So what was going on?
Well, if you look closely at both URLs you’ll see that the number at the end of them (2269573) is the same. That number is obviously the unique identifier for this story in the Independent’s database. That is the only information that the web site needs in order to present a visitor with the correct story. So the web site is being quite clever and ignoring any text that precedes that number. This means that you can put any text that you want in the URL and it will still work correctly as long as you have the correct identifier at the end. So the URL could just as easily have been one of these:
The slight problem that the Independent had was that the alternative version of the URL was being shared so widely that Google was ranking it higher than the official version. So when people were Googling for the “kate middleton jelly bean” story, Google was presenting them with the dodgy version of the URL.
So why do the Independent use such a clever system if it’s so open to abuse?
One reason is for search engine optimisation. As I said above, you only really need the unique identifier for the story in order to find it in the database. And that means that the URL can be simplified to:
But that doesn’t give Google much information about the content. So it’s generally considered good practice to have some text in the URL as well. And I suppose one of the simplest ways to implement that is to ignore everything in the URL except the last sequence of digits. That’s apparently what the Independent do.
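To make that concrete, here is a minimal Python sketch of the "ignore everything except the last sequence of digits" idea. It is only my guess at the general technique, not the Independent's actual code. The `news.example` host, the `STORIES` dictionary and the function names are all made up for illustration; only the identifier 2269573 comes from the real story.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-in for the newspaper's database, keyed by story identifier.
STORIES = {2269573: "Kate Middleton jelly bean story"}


def story_id_from_url(url):
    """Return the last run of digits in the URL path, or None if there is none."""
    path = urlparse(url).path
    digit_runs = re.findall(r"\d+", path)
    return int(digit_runs[-1]) if digit_runs else None


def lookup(url):
    """Find the story using only the identifier; the descriptive text is ignored."""
    return STORIES.get(story_id_from_url(url))


if __name__ == "__main__":
    # Any text before the identifier leads to the same story.
    print(lookup("https://news.example/news/some-official-headline-2269573.html"))
    print(lookup("https://news.example/news/any-text-you-like-2269573.html"))
    print(lookup("https://news.example/2269573.html"))
```

All three made-up URLs resolve to the same story, which is exactly what makes the prank possible.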
There’s an alternative approach: include both the text and the identifier in the URL, and only accept a URL as valid if both match exactly. But I can think of a good reason why that might not work for a newspaper web site. Sometimes newspapers change the headline on a story, and sometimes that change is for legal reasons. In cases like that you really don’t want to have the old headline left around in the URL. But with strict matching you don’t want to change the URL either, as any links to the original URL will no longer work. The Independent’s approach avoids that problem. You can change the headline (and, hence, the URL) as often as you like and everything will still work.
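Here is a rough Python sketch of the strict alternative, showing why a headline change is a problem for it. Everything in it is assumed for illustration: the URL shape (`/slug-id.html`), the `HEADLINES` dictionary, the headlines themselves and the `slugify` helper are hypothetical, not anything a real newspaper site is known to use.

```python
import re

# Hypothetical current headline for each story identifier.
HEADLINES = {2269573: "Original headline"}


def slugify(headline):
    """Turn a headline into lower-case, hyphen-separated URL text."""
    return re.sub(r"[^a-z0-9]+", "-", headline.lower()).strip("-")


def resolve_strict(path):
    """Return the story id only if both the text and the id match exactly."""
    match = re.fullmatch(r"/(?P<slug>[a-z0-9-]+)-(?P<id>\d+)\.html", path)
    if match is None:
        return None
    story_id = int(match.group("id"))
    headline = HEADLINES.get(story_id)
    if headline is None or match.group("slug") != slugify(headline):
        return None
    return story_id


if __name__ == "__main__":
    old_link = "/original-headline-2269573.html"
    print(resolve_strict(old_link))                       # matches the current headline
    print(resolve_strict("/any-text-2269573.html"))       # rejected: text does not match

    # The headline is changed, so the old link stops resolving.
    HEADLINES[2269573] = "Revised headline"
    print(resolve_strict(old_link))
    print(resolve_strict("/revised-headline-2269573.html"))
```

The prank URLs no longer work, but neither does any link that was shared before the headline was changed.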
Incidentally, whilst researching this post I found that the Daily Mail had written a rather gloating article about the Independent’s problems today. The URL for that article is:
What’s interesting to note is that the text portion of that link is just as flexible as the Independent link. I can change it to:
And everything still works correctly. The big difference between the two implementations is that the Mail version will redirect the browser to the canonical version of the URL whereas the Independent will leave the alternative URL in the browser address bar. I have to say that, in this case, I think the Daily Mail is right.
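The redirecting behaviour can be sketched in a few lines of Python. Again, this is an illustration under assumptions rather than a description of the Mail's real system: the `CANONICAL_SLUGS` dictionary, the URL shape and the choice of a 301 (permanent) redirect are all mine.

```python
import re

# Hypothetical canonical URL text for each story identifier.
CANONICAL_SLUGS = {2269573: "official-headline"}


def canonical_path(story_id):
    """Build the one official path for a story."""
    return f"/{CANONICAL_SLUGS[story_id]}-{story_id}.html"


def handle(path):
    """Return (status, redirect_target) for a requested path.

    The story is found from the last run of digits alone, but any path
    that is not the canonical one is redirected to it.
    """
    digit_runs = re.findall(r"\d+", path)
    if not digit_runs or int(digit_runs[-1]) not in CANONICAL_SLUGS:
        return (404, None)
    target = canonical_path(int(digit_runs[-1]))
    if path == target:
        return (200, None)      # already canonical: serve the story
    return (301, target)        # flexible text accepted, then replaced


if __name__ == "__main__":
    print(handle("/official-headline-2269573.html"))
    print(handle("/whatever-you-like-2269573.html"))
    print(handle("/no-identifier.html"))
```

Old links keep working after a headline change, but the alternative text never stays in the address bar, so there is much less incentive to share it.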
It’s not just newspapers that have this flexible approach to URLs. Amazon URLs have a flexible text section in them too. Each item that Amazon sells has a unique identifier, so the canonical Amazon URL looks like:
But whenever you see a URL on Amazon, they have added a descriptive text field:
But, as with the newspaper URLs, that text field can be changed to anything. It’s only the identifier that is required.
Hours of fun for all the family.
Your mission, should you choose to accept it, is to find other web sites where there’s an ignored text section in their URLs. Please post the best ones you find in the comments.
Bonus points for getting one of the papers to write about your prank.
Update: Here’s Independent editor Martin King’s take on the incident. He says that the system is used for exactly the two reasons I mentioned above – “The feature has search engine benefits but from an editorial perspective it enables us to change repeatedly a headline on a moving article.”