http://dot-bit.org/Domain_names wrote:
The IDNA standard encoding is used for internationalized domain names. This means that Unicode names need to be converted to punycoded lowercase ASCII before registering, otherwise they might not be accepted by DNS bridges and proxies. Also, names containing dots (.), slashes (/) or other non-standard symbols will not be accepted.

My issue here is the "might". IMO this field should contain a value in exactly one well-defined encoding, otherwise there will be all kinds of potential problems in the future. In addition, even if we specify a given charset without sufficiently restricting the allowed values, there will be the following problems:
- If we use Punycode (IDNA standard) for domains but still allow values outside the ASCII range, two distinct names might be displayed in the same way. For example, http://namecoin.bitcoin-contact.org/ seems to display Unicode (UTF-8) values directly and converts Punycode to Unicode. Therefore, one given domain name can be encoded in two different ways (Punycode and UTF-8). If we use Punycode for names, I think that there is no valid use case for strings outside the ASCII/Punycode range, and that the spec should therefore state that such invalid strings must be ignored.
- If we used Unicode/UTF-8 for the encoding instead of Punycode, we would have to ensure that only normalized Unicode strings are stored and accepted. Otherwise, it would be possible to register the same domain name multiple times, because certain distinct Unicode sequences represent essentially the same characters (see http://en.wikipedia.org/wiki/Unicode_equivalence).
- A similar problem also occurs with the currently used Punycode if we don't normalize the strings with RFC 3491 Nameprep (see http://en.wikipedia.org/wiki/Nameprep). Again, it would be good to specify that only Nameprep-normalized strings may be used as names and that all non-normalized names must be ignored by software using the entries.
- Even when taking care of all those problems, there are still issues with IDN homograph attacks (see http://en.wikipedia.org/wiki/IDN_homograph_attack). Some registries take care of this by restricting the range of allowed Unicode characters to the characters used in the given countries. For a registry where we don't want to restrict the values to a specific character set, I think that there is no good way to defend against those attacks (except for completely disabling Unicode/Punycode).
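
To make the equivalence and homograph points above concrete, here is a small sketch using only Python's standard `unicodedata` module. The example strings are my own and purely illustrative. Note that it uses NFC for simplicity, whereas Nameprep itself is based on NFKC plus case folding.

```python
import unicodedata

def nfc(s: str) -> str:
    """Normalize a string to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

# Unicode equivalence: "é" as one precomposed code point (U+00E9)
# versus "e" followed by a combining acute accent (U+0301).
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)            # False: different code points
print(nfc(precomposed) == nfc(decomposed))  # True: same string once normalized

# Homograph: Latin "a" (U+0061) versus Cyrillic "а" (U+0430).
# The two strings look alike, but normalization does not unify them.
latin = "paypal"
mixed = "p\u0430yp\u0430l"

print(nfc(latin) == nfc(mixed))             # False: still two distinct names
```

So normalization takes care of the first problem (the same name stored twice), but it does nothing for the second one, which is why I think homographs need a separate policy decision.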
Of course, these issues apply not only to the domain names themselves but also to the subdomains stored in the value field.
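
As a rough sketch of the rule I am proposing (accept only canonical lowercase ASCII/Punycode and ignore everything else), something like the following could be run on each name, and likewise on each subdomain label. The function name `is_canonical_name` and the exact label pattern are my assumptions, not part of any spec. It relies on Python's built-in `idna` codec, which implements the IDNA 2003 rules (RFC 3490 with Nameprep).

```python
import re

# Assumed label syntax (not from the spec): lowercase letters, digits and
# inner hyphens, 1 to 63 characters, no dots or other symbols.
_LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")

def is_canonical_name(name: str) -> bool:
    """Return True only for lowercase ASCII/Punycode names that
    survive a decode/encode round trip through the IDNA codec."""
    if not _LABEL_RE.fullmatch(name):
        return False  # raw Unicode, uppercase, dots, slashes, ...
    try:
        # Decoding applies the IDNA ToUnicode checks; re-encoding
        # applies Nameprep and Punycode again.
        decoded = name.encode("ascii").decode("idna")
        return decoded.encode("idna").decode("ascii") == name
    except UnicodeError:
        return False  # malformed or non-normalized Punycode

print(is_canonical_name("xn--bcher-kva"))  # True  (Punycode for "bücher")
print(is_canonical_name("b\u00fccher"))    # False (raw Unicode is rejected)
print(is_canonical_name("Example"))        # False (not lowercase)
print(is_canonical_name("a.b"))            # False (dots are not allowed)
```

With a check like this, every name has exactly one accepted spelling, so a client never has to guess whether a stored value is UTF-8 or Punycode. It does not address homographs.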