Categories
tech

Political Web Sites

It’s the BBC Mashed hack session this coming weekend, so I suppose lots of people are trying to think of a project to work on. I have an idea that involves UK political web sites.

I expect that most MPs have a web site. A far smaller number of them have a blog. Some of them use blog-like software to publish news and other similar web feeds. Part of what I want to do is to build a directory of those URLs. I can’t see any way to do this other than trying to track down each of the 646 MPs on the web and poking around on their sites to find all of the interesting URLs. I’m hoping I can get some help with that.

But there are also sites about MPs that are run by other organisations. These ones are easier to track down. For example, They Work For You has a page for every MP. The page for my MP is

http://www.theyworkforyou.com/mp/martin_linton/battersea

So that looks easy enough. You just use the name of the MP and the name of the constituency. Public Whip has a similar mechanism. My MP’s page on Public Whip is

http://www.publicwhip.org.uk/mp.php?mpn=Martin_Linton&mpc=Battersea
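
Just to make that concrete, here’s a rough Perl sketch of how I imagine building those two URLs from an MP’s name and constituency. The lower-casing and underscore rules are only my guesses from the two examples above, so treat this as a sketch rather than anything definitive.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Guessed normalisation: spaces become underscores, and
    # They Work For You seems to lower-case everything.
    sub mp_urls {
        my ($name, $constituency) = @_;

        (my $twfy_name = lc $name)         =~ s/\s+/_/g;
        (my $twfy_cons = lc $constituency) =~ s/\s+/_/g;
        (my $pw_name   = $name)            =~ s/\s+/_/g;
        (my $pw_cons   = $constituency)    =~ s/\s+/_/g;

        return (
            "http://www.theyworkforyou.com/mp/$twfy_name/$twfy_cons",
            "http://www.publicwhip.org.uk/mp.php?mpn=$pw_name&mpc=$pw_cons",
        );
    }

    print "$_\n" for mp_urls('Martin Linton', 'Battersea');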

Then there are the news organisations. The BBC has a page about my MP. It’s at

http://news.bbc.co.uk/1/shared/mpdb/html/35.stm

In this case there’s a magic number (35) and in order to construct these URLs for other MPs, you’d need to map these numbers to MPs (or constituencies).

The Guardian has two pages. One for the MP and one for the constituency.

http://politics.guardian.co.uk/person/0,,-3146,00.html
http://politics.guardian.co.uk/hoc/constituency/0,,-696,00.html

Again, each of those contains a magic number that you’d need a complete list of. And I assume (or, at least, hope) that these pages will one day be given shiny new URLs like other parts of the Guardian site.

It’s this second part of the problem that I want to concentrate on first. Building URLs to external sites programmatically. And this is where you, gentle reader, can help me. I have two questions that you might be able to answer.

  1. I’m sure I’m missing some external sites. To be honest, I haven’t really looked very hard yet. I’d be surprised if some of the other national papers didn’t have similar pages to the Guardian. Do you know of any other good sites that have pages dedicated to each individual MP or constituency?
  2. Do any of you work for organisations that publish these pages? If so, is there any chance that you could send me lists of the “magic numbers” that appear in your URLs? What I’d need is something like a CSV file (or whatever format is convenient for you) that maps your magic numbers to a recognisable name for an MP or a constituency. I can then map your data to my list of MPs. I know it’s a bit cheeky, but it doesn’t hurt to ask. (There’s a rough sketch just below this list of how I’d use a file like that.)
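
To show what I mean, here’s a rough sketch of how I’d use a file like that. The file name, the column order and the assumption that the BBC’s numbers map to constituencies are all invented for the example.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    # Invented example file: one line per constituency, "name,number".
    my $csv = Text::CSV->new({ binary => 1 });
    open my $fh, '<', 'bbc_constituencies.csv' or die $!;

    my %page_for;
    while (my $row = $csv->getline($fh)) {
        my ($constituency, $number) = @$row;
        $page_for{$constituency} =
            "http://news.bbc.co.uk/1/shared/mpdb/html/$number.stm";
    }
    close $fh;

    # With the right data, this should print the Battersea page URL.
    print $page_for{'Battersea'}, "\n";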

There’s still one part of the puzzle missing. A surprising number of people don’t know the name of their MP, or even of their constituency. So an important part of the system will be a search engine. At the very least, I’ll need the ability to convert a postcode to a constituency (or MP). All of the sites I mentioned above do this. It would be great if one of them made the look-up available as a web service.

Oh, and one last thing. If you’re going to Mashed and don’t have a project to work on and this sounds interesting to you, then please get in touch. Feel free to link up with me on the backnetwork site.

Update: If you listen, you can probably hear the sound of me kicking myself very hard because I forgot to check the They Work For You API before writing this entry. It does a lot of the kinds of things that I will need. There’s even a Perl module – which makes me very happy.
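
As far as I can tell, the postcode look-up I mentioned above is covered by the API’s getMP method. This is only a sketch – you need to register for your own API key, and the parameter names here are worth checking against the documentation before relying on them:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI;

    # Look up an MP from a postcode using the TheyWorkForYou API.
    # 'YOUR_API_KEY' is a placeholder and the postcode is just an example.
    my $uri = URI->new('http://www.theyworkforyou.com/api/getMP');
    $uri->query_form(
        key      => 'YOUR_API_KEY',
        postcode => 'SW12 9RW',
        output   => 'xml',
    );

    my $response = get($uri) or die "Lookup failed\n";
    print $response;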

Categories
web

Human Dinosaurs

Having just been saying how much I like the new Guardian URL scheme, it was interesting to see the URL for this article from today’s paper. The article is about some early hominan[1] remains that have been found in northern Spain. The URL is

http://www.guardian.co.uk/science/2008/mar/27/archaeology.dinosaurs

I can obviously see why it’s in the science section. And of course it’s about archaeology. But “dinosaurs”? What connection do hominans have with dinosaurs? They are separated in time by about sixty million years. URLs like these only work if the person assigning them has an understanding of the subject area.

And, of course, it’s too late to correct it now as URLs are permanent :-)

[1] I originally put “hominid” there, believing it to be the correct word. But according to Wikipedia, the definition of hominid has gradually changed to encompass all the great apes. Humans and their closely related species are now apparently described as hominans. That’s something new I’ve learned today.

Categories
web

Guardian URLs

I’m a great believer in the idea that URLs should be permanent. When I publish something on the web then (hopefully) people link to it, and it would be nice to think that those links will still work in five, ten or fifty years’ time. A few months ago I changed the URL scheme for davblog, but I ensured that the old-style URLs would redirect to the new ones.

Of course, this is a relatively small site. It has a couple of thousand entries. My fix to ensure that the old URLs still worked simply consisted of a few pages of Apache RedirectPermanent directives. If you’re dealing with a site that is larger and more important than this one, then the problems become far harder.
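
For anyone who hasn’t used them, those directives look something like this – one line per old URL, in the Apache configuration or a .htaccess file. The paths here are made up, not the real davblog ones:

    # One RedirectPermanent line for each old URL (example paths only).
    RedirectPermanent /old/some-entry.html http://blog.dave.org.uk/2005/01/some-entry.html
    RedirectPermanent /old/another-entry.html http://blog.dave.org.uk/2005/02/another-entry.html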

So it was nice to see Simon’s post pointing out that the Guardian had taken this problem seriously and had put some work into making sure that their old URLs still work correctly now that they are in the process of switching to a new URL scheme. As an example, he links to an old blog entry which contains a link to

http://www.guardian.co.uk/religion/Story/0,2763,1382899,00.html

No prizes for guessing which CMS generated that nasty URL. Clicking on that URL now redirects you to the (far saner)

http://www.guardian.co.uk/world/2005/jan/04/religion.uk

And all is well with the world.

Well, almost. Digging around on some old (and rather embarrassing) web sites that I haven’t got round to taking down yet (because URLs are permanent!) I find this page (love that 1997 web design) which contains a number of links to Guardian web pages. Here’s an example:

http://politics.guardian.co.uk/News/1997_07_24/1997_07_24_29440.html

Clicking on that link leads to a shiny new “URL not found” page.

Which, I think, demonstrates a couple of interesting things. Firstly, at some point when the Guardian were moving from one CMS to another the permanence of the URLs wasn’t considered a high priority. There is no chain of redirection in place which converts this old URL to a newer style one. It looks like when the Guardian moved from this URL scheme, they broke all incoming links to their site. I wonder if that problem was even considered ten years ago.

Secondly, look at that really old URL. It’s not perfect by a long way but, to me, it looks a lot easier to understand than the first URL example above (the one generated by the CMS they are currently moving away from). There’s one “magic number” in it – 29440 is probably the article ID in some database – but you can work out the date that the article was published (24th July 1997) and the section it was in (Politics News). The other URL tells you that it points to a religious story, but those four numbers at the end make most of the URL completely meaningless.

Working out a good URL scheme isn’t a trivial task. That’s particularly true for a complex site like the Guardian. I’m really glad to see that they are making great progress in this area. But it’s interesting to see that at some point in the history of their site their URL scheme took what seems to be a big step backwards. Presumably, switching to the CMS which produced those nasty URLs was seen as giving them many other advantages that outweighed the URL damage.

I wonder if there’s anyone around who remembers this change.

Update: Searching the Guardian site finds only one article that was published on July 24th 1997. And that doesn’t look at all like the one that I was trying to link to, which was apparently about student fees. So it appears that not only are the links broken, but that some of the content from that era is no longer available on the site.

Oh, and thanks to Robin for adding his comments. I was hoping that someone like him might drop by and chip in.

Categories
admin

Blog URLs

Argh. I’ve just noticed that, having rebuilt all of the entries in the blog, the URLs have all changed. For example, the URL of my entry about finding Perl programmers in London has changed from http://blog.dave.org.uk/archives/2007/02/perl-programmer.html to http://blog.dave.org.uk/2007/02/perl-programmer.html. They’ve dropped the “archives” directory. No doubt there’s a good reason for it, but until I put some redirections in place, I’m going to have lost all the Googlejuice that those entries used to have. And all the Technorati pings too.
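
Assuming Apache with mod_rewrite available, one rule in a .htaccess file at the root of the blog ought to cover all of the moved entries. This is only a sketch, not something I’ve actually put in place yet:

    # Permanently redirect the old /archives/... URLs to the new scheme.
    RewriteEngine On
    RewriteRule ^archives/(.*)$ /$1 [R=301,L]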

I’m beginning to wish I’d just stuck with the old version of the software.

I promise to get back to blogging about something other than my blog any day now.

Categories
tech

Using TinyURL

The BBC Backstage mailing list has briefly turned its attention from the iPlayer’s DRM and Ashley Highfield’s estimates of Linux usage and is actually having an interesting conversation about URL schemes.

This was all set off by an email sent out to participants in the BBC archive trial. The mail used TinyURL to shorten a couple of long URLs that it mentioned. The URLs in question were

Long-time readers might remember that I’m interested in the problems that people have with URLs, so you won’t be surprised that this discussion piqued my interest.

The first interesting point that was raised was that URL-shortening services like TinyURL can be used to disguise dubious addresses in a phishing attack. When clicking on links in mail it’s always a good idea to ensure that you know which site the link is taking you to. URL-shortening services prevent you from doing this, as the URL you see points to, for example, tinyurl.com. It’s unlikely, of course, that anyone wants to get your login details to the BBC archive trial, but it’s certainly a bad habit for an organisation like the BBC to be encouraging.

The main point, for me, of a URL shortening service is that it’s an easy way to share URLs from sites that have nasty addressing schemes which lead to unmanageably long URLs – like the URLs created by most e-commerce and content management systems (or, at least, most of the ones that I see being used). It’s just a fact of internet life that you often want to share a URL which is far too long for sane people to deal with. And URL-shortening services are perfect for cases like that. You can shorten a long URL to a short link that won’t get broken by your friends’ email program.

But I see that as a solution to a temporary problem. At some point in the future, we will no longer have unmanageably long URLs. Everyone designing URL schemes will understand how they should work and no-one will encode session information in URLs. Well, I can dream, can’t I?

More practically, URL-shortening is a solution to the problem of sharing problematic URLs when you have no control over the URL scheme in question. In other words, the problem of sharing other people’s URLs. If you’re trying to share one of your own URLs and you find yourself wanting to use a URL shortening service, then perhaps you should be reconsidering your URL scheme.

And that’s why I don’t think that the BBC should be using things like TinyURL. They shouldn’t need to, as they control the URLs that they are sharing. Personally I think that the two URLs in question are pretty good URLs. They are both easily readable and they aren’t too long. Oh, you can make picky suggestions for improving them (I’d want to lose the ‘2’ from ‘login2.shtml’ at least) but they are a vast improvement over most of the URLs you see out on the web. If, however, whoever composed the email thought that they were too long to include, then they should have fixed the URLs rather than resorting to TinyURL. I realise that in an organisation like the BBC getting the relevant web server configuration in place might take time, but that’s just another argument for getting your URL scheme right from the start.

Your URLs are your address on the web. They are how people find the information that you want to share with them. It’s well worth putting some effort into them.

Categories
tech

Sun RSS Feeds Broken

It’s been a while since the “sensational soaraway” Sun started publishing RSS feeds of their stories. I’m subscribed to a couple of their feeds (it keeps my blood pressure up) but I noticed a couple of hours ago that the feeds I was subscribed to no longer exist.

It seems that at some point in the last few weeks they have completely revamped all of their feeds. The details of the new feeds are on their site. Unfortunately the new feeds have been designed by someone who apparently knows very little about how RSS is supposed to work. The best example is that the links within the feeds are all relative instead of absolute – by which I mean that they don’t include the server address. For example, one story in the current news feed contains the URL:

  • /sol/homepage/sport/article420662.ece

where it should be the full URL

  • http://www.thesun.co.uk/sol/homepage/sport/article420662.ece

Relative links only work for links within the same site. RSS feeds are (almost by definition) supposed to be displayed on other sites and therefore relative links won’t work.
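
If you’re stuck consuming a feed like this, you can patch the links up yourself – Perl’s URI module will resolve a relative link against a base URL (I’m assuming the base from the example above) – but really the fix belongs in the feed:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;

    my $base = 'http://www.thesun.co.uk/';
    my $link = '/sol/homepage/sport/article420662.ece';

    # Resolve the relative link against the site's base URL.
    my $abs = URI->new_abs($link, $base);
    print "$abs\n";
    # http://www.thesun.co.uk/sol/homepage/sport/article420662.ece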

Having discovered this, I decided to check the feed with the online web feed validation tool (something that the developers really should have done for themselves) only to find that they really haven’t done very well at all.

Earlier this week, Martin pointed out that the Daily Star have also started to publish RSS feeds, so I was planning to do some work on my newspaper feeds page this week. Looks like I’ll have to do some work on the Sun section of that site as well.

Update: I was just looking at Martin’s post about the top 100 UK newspaper web feeds and I noticed that the most popular Sun feed (their news feed) had 12,000 subscribers (and that’s just in Google Reader). The figures are for the old feed. As the old Sun feed now just returns a 404 error, the Sun have potentially lost 12,000 readers. RSS feed addresses are as important as any other URL on your web site. They should be as permanent as you can possibly make them. If you change feed URLs for some reason then you should put redirections in place so that your old readers can still find you.

This change gives every impression of having been carried out by a complete amateur. I hope the Sun didn’t pay too much for it.

Categories
tech

Basic URL Advice

It’s about time for another look at some basic mistakes that people make on the internet. Today we’re going to be looking at URLs. It’s important to design a useful URL scheme for your web site. The easier your URLs are to understand, the more likely it is that people will share them with their friends.

The basic premise behind all of these ideas is that if you make it easier for people to link to specific parts of your site, they will do so, and those links will bring you more visitors. Of course this means that you’ll be encouraging people to visit your site using routes that don’t bring them through the front page. Some people have a problem with that. I believe that sites that don’t encourage that flexibility will slowly lose visitors to sites that do.

Domain names

We’ll start with the first part of a URL – the domain name. I’m not going to talk about registering a domain, I’ll assume that you’ve already done that. But how is your web server configured? Do you insist that people type ‘www’ at the start of your URL? And if so, why? There is no good reason why a visitor should type those extra three characters (four including the dot) each time they want to visit your site. It’s very simple to configure your web server to respond to both names.
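
In Apache, for example, it only takes a couple of directives in the virtual host. The domain and document root here are obviously just placeholders:

    <VirtualHost *:80>
        ServerName   example.com
        ServerAlias  www.example.com
        DocumentRoot /var/www/example
    </VirtualHost>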

Memorable

A good URL is memorable. If you have an article about sheep farming that has the URL http://your_domain.com/sheep_farming.html then people will be able to find it more easily than if the URL is http://your_domain.com/0,,1704174,00.html or something like that.

This often isn’t an easy problem to solve. In my experience all commercial Content Management Systems produce horrible URLs (which is why most newspaper sites have horrible URLs) and most blog software isn’t much better (I realise that the URLs on this site aren’t at all memorable – I have plans to improve that).

Simple

Part of being memorable is making your URLs as simple as possible. Amazon is an interesting example of getting this only half right. The URL for any product on Amazon is very simple. It looks like this: http://www.amazon.co.uk/exec/obidos/ASIN/0596004761. The only bit that changes is the number at the end, which identifies the particular product that we are looking at.

So far, so good. That’s the URL as you need it if you want to pass it on to someone else or to link to the product. But Amazon never shows you that URL. It always adds some tracking information to the end, so Amazon URLs always appear more complex than they need to be. Of course I don’t expect Amazon to take this advice and change the way their systems work overnight. This seems to be a good example of a company that is so successful that they can ignore good practice whenever they want.

Other good examples of overcomplex URLs are often found on mapping sites. Try searching for a postcode on Multimap. I just got back a URL containing 13 parameters. It looked like this (I’ve inserted spaces so that it wraps):

http://www.multimap.com/map/browse.cgi?client=public &search_result= &db=pc &cidr_client=none &lang= &keepicon=true &pc=SW129RW &advanced= &client=public &addr2= &quicksearch=sw12+9rw &addr3= &addr1=

A bit of experimentation revealed that only one of them was necessary:

http://www.multimap.com/map/browse.cgi?pc=SW129RW

Which of those two would you rather send to a friend?

Permanent

If you want people to link to your pages then you need to give them fixed places to link to. A few years ago I was trying to discuss a particular news story with some friends over email. The site in question (and I can’t remember what it was) had ten news stories on its front page. The newest story was always http://some_news_site.com/news1.html and so on through to news10.html. As a new story was published, all the existing stories had their URLs changed. It was difficult to hold a conversation about the story. It would have been impossible to link to it in a blog entry.

Not all permanence problems are so obvious. Some news sites have free access to stories for a few days after they are published but later move them behind a registration screen (or require payment). Others move stories to a different location after some time. If your content isn’t always available at the same URL then people won’t link to it. Of course, some newspapers might see that as an advantage.

That’s not to say that you shouldn’t also have transient URLs. It’s perfectly acceptable to have a latest.html link which always points to your latest article. But each article should also have a permanent URL and it should be easy to find out what that URL is.

Hackable

(And I’m using “hack” in the positive sense here.)

If your URLs reflect the structure of your site then people will be able to navigate round the site by editing the URL in the location bar of their browser. For example if /news/uk/politics/blair_resigns.html is the URL of a particular story, then /news/uk/politics should contain a list of current UK political news stories and /news/uk should contain a list of UK news.

One side-effect of this is that you need to work out all possible URLs that someone might try to visit and put something in place. You can’t assume that people will only visit URLs that you publish. In the previous example you might not ever publish a link to /news/uk but you still need to put some kind of content there as otherwise anyone trying to visit that URL will get either an error page or (probably worse) a list of files in that directory.
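
At the very least you should make sure that those intermediate URLs don’t fall through to a directory listing. In Apache that’s a couple of lines (the path is just an example), although you still need to put a real index page there:

    <Directory /var/www/news>
        Options -Indexes
        DirectoryIndex index.html
    </Directory>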

Anchors

It’s always worth adding internal anchors within your page so that people can link to specific sections of the page. For example, all comments on my blog have their own anchor so that you can link to them.
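
In HTML terms that just means giving each comment an id that can be used as the fragment in a link. The id and URL here are made up:

    <div id="comment-42">
      <p>The text of the comment goes here.</p>
    </div>

    <!-- Elsewhere, a link straight to that comment: -->
    <a href="http://example.com/some-entry.html#comment-42">this comment</a>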

Frames

Only one thing to say about frames – don’t use them. The URL for a frameset refers to its initial state. Once you’ve clicked on something and changed the view, it becomes impossible to construct a URL which will bring someone else back to that exact view.

This is part of an occasional series of articles about basic internet technologies. The previous articles in the series are Basic Password Handling and Basic Bulk Emailing.

Categories
tech

Hacking URLs

How do you surf the web? Chances are that you’re like most people and you just click on links to move from page to page. It seems that most people don’t use the location bar in their browser. That’s the text box near the top of your browser window that contains the URL (or, in plain English, the address) of the current web page. Even fewer people realise that they can edit that address and thereby go to different pages. For example, if I follow a link to http://example.com/some/interesting/page and I then want to see more of the site, I’ll often just edit the URL to remove “some/interesting/page” and end up at http://example.com/, which is hopefully the site’s main page.

For me, and most of my geekier friends, that’s a common part of our day. We’ll often poke around on sites like that. It’s not “hacking” (at least not in the nasty meaning of the word used by most mainstream media); it’s just curiosity.

But it looks like this has just become a potentially dangerous activity. On New Year’s Eve, Daniel Cuthbert was using the DEC web site to make a donation to the tsunami appeal. Something went wrong with his transaction, he became suspicious, and he began to think that the site might be a phishing site[1]. As a bit of a geek, he poked around on the site a bit to find out what was going on. After a couple of probes he gave up and thought no more of it.

But his probes had set off an intruder detection system and his actions were reported to the police. They were able to track him down using the details of his credit card and he was prosecuted under the Computer Misuse Act.

Here’s where it gets really surreal. Even though the judge accepted that there was no malicious intent in anything that Cuthbert had done, he said that he had no choice but to follow the letter of the law and to find Cuthbert guilty. He was fined £400 and ordered to pay £600 in costs. Full details of the case are here and there is comment from various security experts here.

I find this whole story incredible. There is now a precedent that says that any time you visit a web site in a way not foreseen by the site’s owners, you are liable to be prosecuted. And that might cost you £1,000. As someone who regularly “hacks” URLs, I now need to be a lot more careful about the sites that I visit. Any site could potentially be monitoring accesses and looking for unusual ones. Does this mean that every time I get a 404 error, I could get fined?

It also has potential impact on me as a site owner. All web sites come under attack. Every day my web servers get probed to see if they are running software that has security holes. I just shrug and ignore it. Should I report all of these to the police? Should I report all 404 errors to the police? Can the police handle the thousands of new reports they’ve just opened themselves up to each day? Haven’t they got more important things to do?

It just goes to show that laws which affect the ways that people use technology should really be written by people who understand that technology.

[1] A web site that pretends to be something it isn’t in order to get confidential information from visitors.

Update: More detail here and the original posting about the story (from January) is here.