Hello,
I noticed the following while reading an article on UTF-8
(http://www.cl.cam.ac.uk/~mgk25/unicode.html) and toying with the format
with ruby:
"An important note for developers of UTF-8 decoding routines: For security
reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer
than necessary to encode a character. For example, the character U+000A
(line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but
not in any of the following five possible overlong forms:
0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A"
So I tried to see what ruby's (1.6.7) unpack("U*") did:
irb(main):019:0> str = "\xc0\x8a"
"\300\212"
irb(main):020:0> str.unpack("U*")
[10]
irb(main):021:0> str = "\xfc\x80\x80\x80\x80\x8a"
"\374\200\200\200\200\212"
irb(main):022:0> str.unpack("U*")
[10]
irb(main):023:0>
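For what it's worth, the other overlong forms from the quote above seem to behave the same way. A quick sketch of my own (only the two cases above actually checked on 1.6.7):

overlong = [
  "\xc0\x8a",
  "\xe0\x80\x8a",
  "\xf0\x80\x80\x8a",
  "\xf8\x80\x80\x80\x8a",
  "\xfc\x80\x80\x80\x80\x8a"
]
overlong.each do |s|
  # on 1.6.7 these appear to come back as [10]; a strict decoder
  # would reject them instead
  p s.unpack("U*")
end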
It seems to me that UTF-8 is just a way of encoding multi-byte characters as a
stream of bytes, but it looks like the Unicode standard places further
restrictions on which byte sequences are legal and which are not.
The article mentioned above has some info on how to recognize these invalid
sequences:
"Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests
that look only for the shortest possible encoding. All overlong UTF-8
sequences start with one of the following byte patterns:
1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)
Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as
well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data.
UTF-8 decoders should treat them like malformed or overlong sequences for
safety reasons."
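Just as a sketch of my own (not from the article, and only lightly tested), the
leading-byte patterns above could be checked in ruby before trusting unpack,
e.g. something like:

# Sketch: true if the bytes begin with one of the overlong lead-in
# patterns quoted above (bytes is an array of integers, e.g. from
# str.unpack("C*")).
def overlong_start?(bytes)
  b0, b1 = bytes[0], bytes[1]
  return false if b0.nil?
  return true if (b0 & 0xfe) == 0xc0                 # 1100000x
  return false if b1.nil?
  return true if b0 == 0xe0 && (b1 & 0xe0) == 0x80   # 11100000 100xxxxx
  return true if b0 == 0xf0 && (b1 & 0xf0) == 0x80   # 11110000 1000xxxx
  return true if b0 == 0xf8 && (b1 & 0xf8) == 0x80   # 11111000 10000xxx
  return true if b0 == 0xfc && (b1 & 0xfc) == 0x80   # 11111100 100000xx
  false
end

p overlong_start?("\xc0\x8a".unpack("C*"))   # true
p overlong_start?("\x0a".unpack("C*"))       # false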
Regards,
Paul