UTF-8 "bug": not in accordance with the unicode-3 specs

Hello,

I noticed the following while reading an article on UTF-8
(http://www.cl.cam.ac.uk/~mgk25/unicode.html) and toying with the format
in Ruby:

"An important note for developers of UTF-8 decoding routines: For security
reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer
than necessary to encode a character. For example, the character U+000A
(line feed) must be accepted from a UTF-8 stream only in the form 0x0A, but
not in any of the following five possible overlong forms:

0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A"

So I tried to see what Ruby's (1.6.7) unpack("U*") did:

irb(main):019:0> str = "\xc0\x8a"
"\300\212"
irb(main):020:0> str.unpack("U*")
[10]
irb(main):021:0> str = "\xfc\x80\x80\x80\x80\x8a"
"\374\200\200\200\200\212"
irb(main):022:0> str.unpack("U*")
[10]
irb(main):023:0>
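
To illustrate why both inputs come out as 10, here is a minimal sketch of a lenient two-byte decoder. The name naive_decode2 is made up for this example and is not how unpack is implemented; it only shows that keeping the payload bits, without asking whether a shorter encoding exists, maps the overlong form to the same code point.

```ruby
# Hypothetical naive decoder for a two-byte sequence (illustration only).
# It keeps the payload bits and never checks for a shorter encoding.
def naive_decode2(str)
  b0, b1 = str.bytes
  ((b0 & 0x1F) << 6) | (b1 & 0x3F)  # 5 bits from the lead byte, 6 from the trail byte
end

naive_decode2("\xC0\x8A")  # 10, the same value as the single byte 0x0A
naive_decode2("\xC3\xA9")  # 233 (U+00E9), a legitimate two-byte form
```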

Seems to me UTF-8 is just a way of encoding multi-byte characters with a
stream of bytes, but it looks like the Unicode standard has some further
restrictions on what is a legal stream and what is not.

The mentioned article has some info on how to recognize these invalid
sequences:

"Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests
that look only for the shortest possible encoding. All overlong UTF-8
sequences start with one of the following byte patterns:
1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)

Also note that the code positions U+D800 to U+DFFF (UTF-16 surrogates) as
well as U+FFFE and U+FFFF must not occur in normal UTF-8 or UCS-4 data.
UTF-8 decoders should treat them like malformed or overlong sequences for
safety reasons."
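
The byte patterns quoted above can be turned into a check fairly directly. The following is only a sketch: overlong_utf8_start? is a hypothetical helper, not something Ruby provides, and it looks only at the first two bytes of a sequence. It does not validate the continuation bytes that follow.

```ruby
# Hypothetical helper (not part of Ruby): true when the string starts with
# one of the overlong lead patterns listed in the quoted article.
def overlong_utf8_start?(str)
  b0 = str.getbyte(0)   # getbyte returns nil when the index is out of range
  b1 = str.getbyte(1)
  return false if b0.nil?
  return true if (b0 & 0xFE) == 0xC0      # 1100000x
  return false if b1.nil?
  case b0
  when 0xE0 then (b1 & 0xE0) == 0x80      # 11100000 100xxxxx
  when 0xF0 then (b1 & 0xF0) == 0x80      # 11110000 1000xxxx
  when 0xF8 then (b1 & 0xF8) == 0x80      # 11111000 10000xxx
  when 0xFC then (b1 & 0xFC) == 0x80      # 11111100 100000xx
  else false
  end
end

overlong_utf8_start?("\xC0\x8A")      # overlong form of U+000A
overlong_utf8_start?("\xE0\xA0\x80")  # shortest form of U+0800, not overlong
```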

Regards,
Paul

Hi,

Seems to me UTF-8 is just a way of encoding multi-byte characters with a
stream of bytes, but it looks like the Unicode standard has some further
restrictions on what is a legal stream and what is not.

Array#pack("U") and String#unpack("U") are a UTF-8 packer and unpacker, so
they should not be bound by the additional restrictions of the Unicode
standard. But I will add checks for malformed UTF-8 and redundant
UTF-8 sequences (warnings with the -w option).

I think checks for surrogate pair values (from U+D800 to U+DFFF)
should be done in a different layer, one that handles Unicode semantics.
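
As a rough sketch of what such a separate layer might look like, the check can run on code points after decoding rather than inside the unpacker. The helper name forbidden_code_point? is an assumption made for this example; the ranges come from the article quoted earlier in the thread.

```ruby
# Hypothetical post-decoding check (illustration only): flags the code
# points the quoted article says must not occur in UTF-8 or UCS-4 data.
def forbidden_code_point?(cp)
  (0xD800..0xDFFF).cover?(cp) || cp == 0xFFFE || cp == 0xFFFF
end

code_points = [0x41, 0xD800, 0xFFFE, 0x10000]
rejected = code_points.select { |cp| forbidden_code_point?(cp) }
# rejected holds 0xD800 and 0xFFFE
```

Keeping this apart from the byte-level decoder means the decoder stays a plain UTF-8 unpacker, which matches the split described above.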

						matz.

In message “UTF-8 “bug”: not in accordance with the unicode-3 specs” on 02/11/29, “Paul Melis” paul@floorball.nl writes:

In article 1038560163.981998.11208.nullmailer@picachu.netlab.jp,
matz@ruby-lang.org (Yukihiro Matsumoto) writes:

Array#pack("U") and String#unpack("U") are a UTF-8 packer and unpacker, so
they should not be bound by the additional restrictions of the Unicode
standard. But I will add checks for malformed UTF-8 and redundant
UTF-8 sequences (warnings with the -w option).

Is there a way to detect the warnings from a Ruby script?


Tanaka Akira

Hi,

In message “Re: UTF-8 “bug”: not in accordance with the unicode-3 specs” on 02/12/02, Tanaka Akira akr@m17n.org writes:

Array#pack("U") and String#unpack("U") are a UTF-8 packer and unpacker, so
they should not be bound by the additional restrictions of the Unicode
standard. But I will add checks for malformed UTF-8 and redundant
UTF-8 sequences (warnings with the -w option).

Is there a way to detect the warnings from a Ruby script?

Currently, no. I probably have to design a warning API.

						matz.

In article 1038817117.674266.30323.nullmailer@picachu.netlab.jp,
matz@ruby-lang.org (Yukihiro Matsumoto) writes:

Is there a way to detect the warnings from a Ruby script?

Currently, no.

I see.

I probably have to design a warning API.

Agreed. Without such an API, an application cannot check for the security issue.
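
In the absence of a warning API, one possible workaround is to swap $stderr for a StringIO while the suspect call runs. This is only a sketch under a loud assumption: it catches messages that Ruby writes through the $stderr object (as Kernel#warn does), and it may miss warnings the interpreter writes some other way. capture_warnings is a hypothetical helper name.

```ruby
require "stringio"

# Hypothetical helper: runs the block with $stderr redirected to a StringIO
# and returns whatever was written to it. Assumes the warning in question
# is emitted through $stderr, which is not guaranteed for every warning.
def capture_warnings
  original = $stderr
  $stderr = StringIO.new
  yield
  $stderr.string
ensure
  $stderr = original   # always restore the real stream
end

messages = capture_warnings { warn "redundant UTF-8 sequence" }
suspicious = !messages.empty?
```

A script could then treat any non-empty result as a reason to reject the input, though that is a coarse test compared with a proper API.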

Note that I found a problem with the current checks: "\xc0\x81" is
malformed, but unpack does not warn about it:

% ./ruby -w -e 'p "\xc0\x81".unpack("U")'
[1]


Tanaka Akira