Anne van Kesteren

IRI problems

8 March 2005

I think Internationalized Resource Identifiers (IRIs) are going to be part of the web, eventually. They are very useful for languages with words that use characters outside the US-ASCII range. To name a Dutch one: Café. Of course, IRIs are cool as well. However, there are some problems with deploying them. More specifically:

Your website had better be encoded in UTF-8. When you convert your IRIs to URIs for compatibility with current browsers, you must escape them as UTF-8. On non-UTF-8 websites you must make sure that you are not using the encoding of the website to transform your IRIs into URIs.
Your server needs to use a UTF-8 encoded file and directory system.
Your users’ browsers need to support it.

Encoding your website in UTF-8 might take some time, but the process is trivial. The server side is of course anything but trivial. You need to have access to the server. Not only that, if you are running Apache you need Apache 2.0 and mod_fileiri, which is still under development. I am not sure how this works out for other servers, but I guess it is difficult.

Oh, the browser. Some browsers seem to have some kind of support, at least for the domain name part. Implementing the other parts is harder, since for IRIs you should not look at the encoding of the page, which is what every browser has done until now. You need to treat every IRI as if it were UTF-8 encoded, not encoded in whatever happens to be the encoding of the page. And if you are converting your IRIs to URIs, you need to make sure that every escaped character is escaped as its UTF-8 bytes, not as its bytes in whatever encoding the particular page uses.
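As a rough illustration of that conversion, here is a minimal Python sketch. The helper name iri_to_uri and the example.org address are made up for this example. It only escapes the path, query and fragment, and it leaves the host name alone.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Reserved characters that already mean something in a URI, plus "%" so that
# existing escapes are not escaped a second time.
SAFE = "/:@!$&'()*+,;=?%"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: escape non-ASCII characters as UTF-8 bytes."""
    parts = urlsplit(iri)
    path = quote(parts.path, safe=SAFE, encoding="utf-8")
    query = quote(parts.query, safe=SAFE, encoding="utf-8")
    fragment = quote(parts.fragment, safe=SAFE, encoding="utf-8")
    # The host (netloc) is left as-is; host names are not percent-encoded.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


print(iri_to_uri("http://example.org/Café"))  # http://example.org/Caf%C3%A9

# Escaping with a legacy page encoding instead gives a different URI:
print(quote("Café", encoding="latin-1"))  # Caf%E9
```

The two results differ only in the bytes that get escaped, which is exactly why a browser or script that uses the page encoding ends up requesting a different resource.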

The best thing you can do right now is to make sure your pages use UTF-8. That is of course not strictly required, but it would ensure some interoperability with browsers. (Not that I would recommend using IRIs at all at the moment, except perhaps for IRIs that do not need to be retrieved, like the value of the atom:id element.)

Comments

Your IRIs must be encoded in UTF-8

I don't think so. Your IRIs must be Unicode. If the Unicode characters in question do not have representatives in the encoding you are using, you are free to use numeric character references (NCRs) to represent them. As far as I can tell, that works perfectly fine.
Posted by Jacques Distler at 6:31AM
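A small sketch of the NCR approach from the comment above, assuming a page that can only carry US-ASCII. The example.org address is invented for illustration.

```python
import html

iri = "http://example.org/Café"

# Characters the page encoding cannot represent become NCRs.
markup = iri.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(markup)  # http://example.org/Caf&#233;

# A parser turns the NCR back into the original Unicode character.
print(html.unescape(markup) == iri)  # True
```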
[off-topic] Wow, I thought I was on the wrong site for a moment. Nice looks, stylish!
Posted by Sjoerd Visscher at 6:33AM
I don't quite get why I should use UTF-8/unicode.
Unless I deliberately want to shake off legacy browsers, I think I should code my href as http://xn--hda.landsbank.fo/ rather than http://ð.landsbank.fo/?
Normal URI escaping does not work in host names. http://%F0.landsbank.fo/ results in "%F0.landsbank.fo" not found in Firefox, Opera ends up looking for ".landsbank.fo", and IE6 actually claimed to be looking for ð.landsbank.fo, but did not find it. That is because only "xn--hda.landsbank.fo" exists in DNS.
I don't know why we have to use different escaping for host names, but my guess is that "%F0" is a legal host name in its own right?
btw—café isn't Dutch any more than it is English? Isn't Dutch the only non-English language that can spell its native words in US-ASCII? But thanks for supporting us who really need IRIs.
Posted by Jan Egil Kristiansen at 4:18PM
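The host name escaping discussed above can be reproduced with Python's built-in "idna" codec, which implements the IDNA 2003 rules. This is only a sketch, and host_to_ascii is a made-up helper name.

```python
def host_to_ascii(host: str) -> str:
    """Hypothetical helper: convert each label of a host name to ASCII."""
    # Labels that contain non-ASCII characters get the "xn--" Punycode form;
    # plain ASCII labels are passed through unchanged.
    return host.encode("idna").decode("ascii")


print(host_to_ascii("ð.landsbank.fo"))  # xn--hda.landsbank.fo
```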
The user probably should have the right fonts on their system as well. Consider a Japanese IRI. This will contain various kanji. I guess the link will still work without the right fonts installed, but the user will see a load of boxes in the address. (And there are at least 2000 possible kanji in Japanese. Then there is Chinese, with many thousand characters possible!) Also think of Korean, or Arabic, etc, each with a unique set of characters.
And then how do we represent these links accurately on a page?
Posted by Chris Hester at 6:45PM
Jacques, you are right. I should have said URIs there.
Jan, IRIs cover more than just IDNs. Dutch also has words such as financiën. We need more than US-ASCII.
Chris, I believe all of those characters are covered in Unicode and therefore UTF-8 and most browsers support that fine.
Posted by Anne at 7:09PM
Jan, by the way, such URIs should work as long as you are using UTF-8 characters. See also: URL encoding code.
Posted by Anne at 7:17PM
Thank you, I learned something today, too: If I encode "ð" as two UTF-8 bytes ("Ã°") and then escape those bytes as "%C3%B0", my URL becomes http://%C3%B0.landsbank.fo, which actually works in Opera. But not in Firefox and not in IE, so I think I will wait.
And while the UTF-8 character escaping http://landsbank.fo/test/%C3%B0.html works as well as the byte escaping http://landsbank.fo/test/%F0.html on my IIS server at work, the Apache running my hobby site http://heima.olivant.fo/~egilstro/ei%F0i.html does not UTF-8 decode http://heima.olivant.fo/~egilstro/ei%C3%B0i.html.
Posted by Jan Egil Kristiansen at 8:03PM
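The two escapings in the comment above can be compared in a few lines of Python. This is a sketch of the byte-level difference only; it makes no claim about how IIS or Apache handle it internally.

```python
from urllib.parse import quote, unquote

name = "eiði.html"

utf8_escaped = quote(name, encoding="utf-8")
latin1_escaped = quote(name, encoding="latin-1")
print(utf8_escaped)    # ei%C3%B0i.html
print(latin1_escaped)  # ei%F0i.html

# The two UTF-8 bytes of "ð", misread as ISO-8859-1, show up as "Ã°".
print("ð".encode("utf-8").decode("latin-1"))  # Ã°

# Decoding with the wrong assumption does not give back the file name.
print(unquote(utf8_escaped, encoding="latin-1"))  # eiÃ°i.html
print(unquote(latin1_escaped, encoding="utf-8") == name)  # False
```

A server has to pick one interpretation of the escaped bytes when it maps them to a file name, so a file stored under one encoding is not found when the request uses the other.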
That is probably because your Apache does not use UTF-8 for the file system as it should. (See the post.)
Posted by Anne at 8:09PM
I might not be understanding this completely, but would IRIs that allow more than just US-ASCII also require people whose keyboards only have English characters to type the non-US-ASCII characters? Take the example Jan gave earlier:
http://ð.landsbank.fo/
Would I be required to type the ð even when my keyboard doesn't support it?
Posted by Dustin Wilson at 12:05AM
Yes. That is probably issue number 4, although that is probably a browser UI issue as well.
Posted by Anne at 2:08AM
I don't know if typing is much of a problem. If I tell you about eiði.com here, you do not type that. You copy it, Google for it, and Google's link will be to http://xn--eii-4ma.com/ or http://heima.olivant.fo/~egilstro/ei%F0i.html. Both work, even in IE.
If I talk to a foreigner on the phone, I guess I'd have to ask him to google for ISBN 1-880654-11-3 . Or read him NATO style Xray November Dash Dash Echo India India Dash Four Mike Alpha dot com.
Posted by Jan Egil Kristiansen at 4:47PM