A few days ago (5 May) the internet took a great leap forward. For the first time, top-level domain names (TLDs) are permitted in non-Latin scripts. In particular, ICANN has assigned three country codes: مصر (Egypt), السعودية (Saudi Arabia) and امارات (United Arab Emirates). They are the first country codes that are not two characters long (except the “cat” = Catalan anomaly), possibly because ICANN saw no need to maintain the restriction once it was branching out into other scripts.
These first ones are all in Arabic, which is written right to left. That means that when you see one in the address bar it appears the other way round from usual: after the http:// comes the TLD, then a dot, then the lower-level parts of the domain name in reverse order, still followed by the / and the directory path as usual, even if the path is in Arabic. Actually this is more logical all round and is how all URLs should have been, but it is too late for that now.
[I would like to have shown you examples directly here but my editor and WordPress don’t work well with these scripts—I will need to work on that.] A good place to look is the Wikipedia page towards the bottom.
The implementation in browsers seems to vary and may also depend on what the server does. The ICANN Arabic test page http://مثال.إختبار/ works well in Firefox (Mac and Win) and Safari (Mac and iPhone): the whole of the URL in the address bar after the http:// is in Arabic. In IE7 and IE8 (Win) the address shown in the top bar looks like random Latin characters. In the tests I have done, Safari always gets it right, Firefox sometimes and IE never; I would be interested to hear of other results. One example that doesn’t work well in Firefox is the Egyptian Ministry of Communications and Information Technology, http://وزارة-الأتصالات.مصر/ .
The code conversion is called Punycode and uses a rather strange algorithm to convert any Unicode text into ASCII. The result is pretty unreadable but has to exist because the DNS only allows ASCII, so Punycode allows domain names in any character set (and any mix) to be uniquely resolved. I don’t know if this is always the case but the ones I have seen all start “xn--”. I imagine that in time, when implementations are sorted out, this will become transparent to the user.
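For anyone who wants to see the conversion for themselves, here is a minimal sketch in Python. It assumes the standard library's built-in "idna" codec, which as far as I know follows the older IDNA 2003 rules, so it may not agree with every browser. The function names `to_ascii` and `to_unicode` are my own, and the Arabic text is written as Unicode escapes to avoid the editor problems mentioned above.

```python
# Sketch only: relies on Python's built-in "idna" codec (IDNA 2003 rules).
# to_ascii / to_unicode are illustrative names, not part of any standard API.

def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to the ASCII form that DNS sees."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


# The Egyptian TLD (meem, sad, reh), spelled out as escapes.
EGYPT = "\u0645\u0635\u0631"

if __name__ == "__main__":
    ascii_form = to_ascii(EGYPT)
    print(ascii_form)               # the "random Latin characters" form
    print(to_unicode(ascii_form))   # back to the Arabic original
```

Plain ASCII names such as example.com pass through unchanged; only labels containing non-ASCII characters are rewritten.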
One worrying security implication of these “foreign” character codes in URLs is that some letters look very similar to Western Latin ones. So if you see a familiar link, to your bank, say, it may not be quite what it seems. For example, if the “ο” in “www.llοydstsb.com” is actually a Greek omicron (which it is on this page), the fake address could direct you to a phishing site. It is possible that the behaviour of IE is deliberate to avoid this problem, but I somewhat doubt it.
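One rough way to spot this kind of trick is to check whether a single label mixes letters from more than one script. The sketch below is a heuristic of my own, not what any browser actually does: it takes the first word of each letter's Unicode character name (LATIN, GREEK, ARABIC and so on) as a stand-in for its script. The names `scripts_in` and `looks_mixed` are made up for illustration.

```python
import unicodedata

# Heuristic sketch: the first word of a character's Unicode name is used as
# an approximation of its script. Real browsers apply more elaborate rules.

def scripts_in(label: str) -> set[str]:
    """Return the approximate set of scripts used by the letters in a label."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()  # ignore digits and hyphens
    }


def looks_mixed(label: str) -> bool:
    """True if the label combines letters from more than one script."""
    return len(scripts_in(label)) > 1


if __name__ == "__main__":
    genuine = "lloydstsb"
    spoofed = "ll\u03bfydstsb"  # \u03bf is GREEK SMALL LETTER OMICRON
    print(genuine, scripts_in(genuine), looks_mixed(genuine))
    print(spoofed, scripts_in(spoofed), looks_mixed(spoofed))
```

A label written entirely in one script, Arabic included, passes this check, so it would not flag the new country codes themselves. It also would not catch a name spoofed entirely in a single look-alike script.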
[This post has been revised since I discovered how to insert the Arabic characters. I will write up how it is done later.] [Updated to include IE7 & iPhone]