Unicode issues (UTF-8 vs. Punycode)

gst
Posts: 16
Joined: Sun May 15, 2011 3:53 pm

Unicode issues (UTF-8 vs. Punycode)

Post by gst »

I'm currently unsure how to treat the values inside the individual fields:
http://dot-bit.org/Domain_names wrote:The IDNA standard encoding is used for internationalized domain names. This means that Unicode names need to be converted to punycoded lowercase ASCII before registering, otherwise they might not be accepted by DNS bridges and proxies. Also, names containing dots (.), slashes (/) or other non-standard symbols will not be accepted.
My issue here is the "might". IMO this field should contain a value in exactly one well-defined encoding; otherwise there will be all kinds of potential problems in the future. In addition, even if we specify a given charset, there will be the following problems unless we sufficiently restrict the allowed values:
  • If we use Punycode (IDNA standard) for domains but still allow values outside the ASCII range, two distinct names might be displayed in the same way. For example, http://namecoin.bitcoin-contact.org/ seems to display Unicode (UTF-8) values directly and converts Punycode to Unicode. Therefore, one given domain name can be encoded in two different ways (Punycode and UTF-8). If we use Punycode for names, I think there is no valid use case for strings outside the ASCII/Punycode range, and the spec should therefore state that such invalid strings must be ignored.
  • If we used Unicode/UTF-8 for the encoding instead of Punycode, we would have to ensure that only normalized Unicode strings are stored and accepted. Otherwise, it would be possible to register the same domain name multiple times, as certain distinct Unicode sequences represent the same characters (see http://en.wikipedia.org/wiki/Unicode_equivalence).
  • A similar problem also occurs with the currently used Punycode if we don't normalize the strings with RFC 3491 Nameprep (see http://en.wikipedia.org/wiki/Nameprep). Again, it would be good to specify that only Nameprep-normalized strings may be used as names and that all non-normalized names must be ignored by software using the entries.
  • Even when taking care of all those problems, there are still issues with IDN homograph attacks (see http://en.wikipedia.org/wiki/IDN_homograph_attack). Some registries take care of this by restricting the range of allowed Unicode characters to those used in their respective countries. For a registry where we don't want to restrict the values to a specific charset, I think there is no good way to defend against those attacks (except for completely disabling Unicode/Punycode).
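To make the first two points concrete, here is a minimal sketch using only the Python standard library. The variable and function names are made up for illustration, and plain NFKC normalization is used as a stand-in for the normalization step of Nameprep, not as a full Nameprep implementation.

```python
import unicodedata

# Two spellings of "é": precomposed U+00E9 versus "e" + combining acute U+0301.
composed = "\u00e9"
decomposed = "e\u0301"

# They render identically but are different code point sequences and
# different UTF-8 byte strings, so a naive registry sees two distinct names.
print(composed == decomposed)                            # False
print(composed.encode("utf-8"), decomposed.encode("utf-8"))


def normalized(name):
    """Apply NFKC, the normalization form that Nameprep builds on."""
    return unicodedata.normalize("NFKC", name)


# After normalization both spellings collapse to the same string.
print(normalized(composed) == normalized(decomposed))    # True

# If raw UTF-8 and Punycode are both accepted, the same name also has
# two byte-level spellings: the UTF-8 bytes above and this ASCII form.
ace = "xn--" + composed.encode("punycode").decode("ascii")
print(ace)                                               # xn--9ca
```

Any of these spellings could be registered separately unless the spec names exactly one of them as valid.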
Personally, I think that the most elegant solution would be to store normalized/nameprep'ed UTF-8 values as the domain name (and ignore names that aren't normalized) and let DNS frontends handle the conversion to Punycode. My reason for this is that UTF-8 is basically the standard for handling Unicode, and that Punycode is just a workaround for representing the same data in the DNS namespace. So we would have a domain name that is basically a nameprep'ed Unicode string, and this domain name would be encoded in the most suitable way for a given transport medium (UTF-8 for the block chain, Punycode for DNS).

Of course the same issue does not only apply to the domain names directly, but also to the subdomains stored in the value field.
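A rough sketch of what this proposal could look like in a frontend, assuming Python's built-in IDNA 2003 support (the `encodings.idna` module and the `idna` codec). The function names `is_prepped` and `to_dns_name` are hypothetical and not part of any existing Namecoin tool.

```python
import encodings.idna as idna


def is_prepped(label):
    """Return True if the label is already in Nameprep form.

    Hypothetical acceptance rule: a stored name is valid only if running
    Nameprep over it changes nothing.
    """
    try:
        return idna.nameprep(label) == label
    except UnicodeError:
        # Nameprep rejects prohibited characters by raising UnicodeError.
        return False


def to_dns_name(name):
    """Convert a stored Unicode name (possibly with subdomains) to ASCII for DNS."""
    return name.encode("idna").decode("ascii")


print(is_prepped("\u00e9"))          # True: already normalized and lowercase
print(is_prepped("\u00c9"))          # False: uppercase, Nameprep would fold it
print(is_prepped("e\u0301"))         # False: decomposed, Nameprep would compose it
print(to_dns_name("www.\u00e9"))     # www.xn--9ca
```

The same check would have to run on every subdomain label stored in the value field, since `to_dns_name` converts each dot-separated label independently.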

gsan
Posts: 19
Joined: Wed May 11, 2011 10:49 pm

Re: Unicode issues (UTF-8 vs. Punycode)

Post by gsan »

Thank you for the research.

The "might" clause should of course be removed from the proposal; it's a leftover from the transition phase. We already announced that the domain names should be nameprep'ed and punycoded. This should be done using libidn (e.g. the idn command-line tool), since most online tools don't do this kind of conversion properly, and people seem to be following this. Better tutorials are needed in the wiki, and I'm guessing we will have online tools that prep names and check for collisions (and maybe even register names) pretty soon.

The reasons ASCII is preferred rather than UTF-8 were:
  • Currently, there are no tools to check for collisions, so it seemed better to let people encode the name and check for it themselves. We could require users to convert the name back to UTF-8, but that seemed too involved. Also, the user could misjudge the situation and register a name directly without checking. An example is the case of "Ⓑ.bit", which maps to "b.bit" when encoded, which isn't at all obvious.
  • The conversion would need to be done in each implementation (DNS bridges, ncproxy, etc.), including for sub-domain names. Since nameprep converts a name to its proper form, a double check would be necessary to determine whether the name was prepped properly before accepting it. Keeping things simpler for these implementations makes sense. For display, of course, the opposite conversion is needed (from the ASCII-compatible encoding to Unicode).
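The "Ⓑ.bit" case and the double check can be sketched with Python's built-in `idna` codec (IDNA 2003 with Nameprep). This is only an illustration under that assumption, not what libidn or any bridge actually runs, and the three function names are hypothetical.

```python
def register_form(user_input):
    """Nameprep and punycode a name the way a registration helper might."""
    return user_input.encode("idna").decode("ascii")


def is_canonical_ace(stored):
    """Double check: a stored ASCII name must survive a decode/encode round trip."""
    if stored != stored.lower():
        return False
    try:
        unicode_form = stored.encode("ascii").decode("idna")
        return unicode_form.encode("idna").decode("ascii") == stored
    except UnicodeError:
        # Non-ASCII input or a label that does not round-trip.
        return False


def display_form(stored):
    """Convert the stored ASCII-compatible form back to Unicode for display."""
    return stored.encode("ascii").decode("idna")


print(register_form("\u24b7.bit"))        # b.bit  (circled B collapses to plain b)
print(is_canonical_ace("xn--9ca.bit"))    # True
print(is_canonical_ace("B.bit"))          # False: not lowercase
print(display_form("xn--1xa.bit"))        # π.bit
```

With names stored in ASCII, a bridge only needs something like `is_canonical_ace` before accepting an entry, while GUI code calls something like `display_form`.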
I agree that keeping names in UTF-8 looks more elegant, but it doesn't make a whole lot of difference. name_scan output currently doesn't even display Unicode properly. The beautifying code will always have to do some kind of conversion.

Either way, we currently don't have many registered international names, so we could still change this. More comments are welcome.

What about the IDN homograph attacks? What can we do about it? This is something to be considered for the GUI tools, but that wouldn't actually solve the "attack" part. :)
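For the GUI side, one possible heuristic is to warn when a single label mixes letters from several scripts. The sketch below guesses the script from the first word of each character's Unicode name, which is a crude assumption rather than a proper script lookup, and the function names are made up. It does not catch names written entirely in one look-alike script, so it only reduces the problem.

```python
import unicodedata


def scripts_in(label):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def looks_mixed_script(label):
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in(label)) > 1


spoof = "p\u0430ypal"    # second letter is CYRILLIC SMALL LETTER A
print(scripts_in(spoof) == {"LATIN", "CYRILLIC"})   # True
print(looks_mixed_script(spoof))                    # True
print(looks_mixed_script("paypal"))                 # False
```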

khal
Site Admin
Posts: 708
Joined: Mon May 09, 2011 5:09 pm
os: linux

Re: Unicode issues (UTF-8 vs. Punycode)

Post by khal »

After testing both methods in my script, I prefer to use only ASCII in names (also in order to avoid collisions). Namecoin could display the ASCII form plus an optional Unicode form for names under the "d/" namespace, but registrars will do that anyway.

All of the following domains are registered in their ASCII form in Namecoin (none of them shows anything in a browser for now):
π.bit
¢.bit
é.bit
ϾϿ.bit
·.bit
Equivalent ASCII forms, in no particular order: "d/xn--8a", "d/xn--1xa", "d/xn--9ca", "d/xn--uba", "d/xn--tzac".

Another way to find the ASCII form of one of these domains: type it into your browser, and it will be converted.
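The pairing between the two lists can also be reproduced offline. The sketch below uses Python's raw `punycode` codec and adds the "xn--" prefix by hand; it deliberately skips Nameprep, so it is not a substitute for libidn when preparing a name for registration. The helper name `ace_key` is hypothetical.

```python
def ace_key(name):
    """Build the Namecoin key for a single-label name using raw Punycode only."""
    return "d/xn--" + name.encode("punycode").decode("ascii")


for name in ["π", "¢", "é", "ϾϿ", "·"]:
    print(f"{name}.bit -> {ace_key(name)}")

# π.bit -> d/xn--1xa
# ¢.bit -> d/xn--8a
# é.bit -> d/xn--9ca
# ϾϿ.bit -> d/xn--tzac
# ·.bit -> d/xn--uba
```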

