Byte-stream parsing in Ruby

So, I’ve a problem. I’m using ncurses (or possibly not; I might just
`STDIN.read(1)` or something, we’ll see) to grab byte-level input from
the terminal. The purpose is to catch and handle control characters in
a text-mode application, such as “meta-3” or “control-c.”

Currently, I have a really ugly method that manually parses UTF-8 and
ASCII directly in my Ruby source; however, this is extremely slow, and
seems quite a bit like overkill. After all, with 1.9’s wonderfully
robust `Encoding` support, it seems silly to duplicate all that
byte-parsing work that *must* be going on somewhere in Ruby already.

Here’s the gist of my current method (forgive the horrendous code,
please! I fully intended to get rid of it right from the start, so…);
what follows is a simplified sketch of the hand-rolled approach, not
the real thing:
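
    # Simplified sketch of the hand-rolled UTF-8 framing, not the
    # actual code: the lead byte tells you how many bytes a character
    # should occupy.
    def utf8_expected_length(lead)
      case lead
      when 0x00..0x7F then 1   # plain ASCII
      when 0xC2..0xDF then 2   # two-byte sequence
      when 0xE0..0xEF then 3   # three-byte sequence
      when 0xF0..0xF4 then 4   # four-byte sequence
      else nil                 # continuation byte or invalid lead
      end
    end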

The goal is to devise some method by which I can:

1) Determine whether or not an `Array` of so-far-received bytes yet
forms a valid `String` in a given `Encoding` (I can get the intended
input `Encoding` by way of a simple `Encoding.find('locale')`, so we’re
always in the know as to which `Encoding` the incoming bytes are
intended to be)
2) Once we know the `Array` instance containing the relevant bytes
forms a valid `String`, convert it into a `String` and further
store/cache/process it in some way.

Yes, this means that the `String` will almost always be one character
long; I am uninterested in parsing lengths of characters out of the
input stream, and can deal with that later. At the moment, I very
simply want to ensure that I can retrieve, in real time, the latest
character entered at the terminal, as a `String`, in any `Encoding`.
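
In pseudo-Ruby, the interface I’m imagining is something like this
(`ByteDecoder` and `handle` are made-up names, just to pin the idea
down):

    # Hypothetical interface, not real code: push bytes in one at a
    # time, and get a String back once they form a valid character
    # (nil until then).
    decoder = ByteDecoder.new(Encoding.find('locale'))
    while (byte = STDIN.read(1))
      if (char = decoder.push(byte))
        handle char   # a complete character, as a String
      end
    end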

Any help would be much appreciated; I’ve been banging my head against
this on-and-off for weeks! (-:

···


Elliott Cable wrote:

The goal is to devise some method by which I can:

1) Determine whether or not an `Array` of so-far-received bytes yet
forms a valid `String` in a given `Encoding`

"über".bytes.to_a

=> [195, 188, 98, 101, 114]

a = "\xc3".force_encoding("UTF-8")

=> "\xC3"

a.valid_encoding?

=> false

a << "\xbc"

=> "ü"

a.valid_encoding?

=> true

···


Brian Candler wrote:

Elliott Cable wrote:

1) Determine whether or not an `Array` of so-far-received bytes yet
forms a valid `String` in a given `Encoding`

>> a = "\xc3".force_encoding("UTF-8")
=> "\xC3"
>> a.valid_encoding?
=> false
>> a << "\xbc"
=> "ü"
>> a.valid_encoding?
=> true

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(`Fixnum`) bytes onto the string?

···


Elliott Cable wrote:

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(`Fixnum`) bytes onto the string?

hex_str = "\\x%x" % 195
puts hex_str

--output:--
\xc3

···


Elliott Cable wrote:

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(`Fixnum`) bytes onto the string?

If you're doing STDIN.read(1), you get a String. Just use << to
concatenate, or the 2-argument form of read() where you supply a buffer
to append to.

If you are forced to use Fixnum, then try Integer#chr:

str = ""

=> ""

str << 255.chr

=> "\xFF"

Warning: you have to deal with all the (undocumented) ruby-1.9 encoding
stupidity. Have fun guessing the behaviour of each of the methods. e.g.

str = "hello"

=> "hello"

str.encoding

=> #<Encoding:UTF-8>

str << 255.chr

=> "hello\xFF"

str.encoding

=> #<Encoding:ASCII-8BIT>

Surprised that the encoding changed? This means that:

>> str.valid_encoding?
=> true

until you do:

str.force_encoding("UTF-8")

=> "hello\xFF"

str.valid_encoding?

=> false

Now have a guess what happens if you try to append another byte. Go on.

>> str.encoding
=> #<Encoding:UTF-8>
>> str << 250.chr
Encoding::CompatibilityError: incompatible character encodings: UTF-8
and ASCII-8BIT
  from (irb):24
  from /usr/local/bin/irb19:12:in `<main>'

Haha, fooled you. You thought it was safe to append a non-UTF8 character
to a UTF8 string (after all, you did before quite happily), but this
time you get an exception. So now you have to do:

str.force_encoding("ASCII-8BIT")

=> "hello\xFF"

str << 250.chr

=> "hello\xFF\xFA"

str.force_encoding("UTF-8")

=> "hello\xFF\xFA"

str.valid_encoding?

=> false

This is why I hate ruby 1.9.
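
If you must live with it, the dance can at least be wrapped up in one
place. A sketch (the helper name is mine):

  # Append one Fixnum byte to a buffer and report whether the buffer
  # now holds a complete, valid UTF-8 sequence. Forcing to ASCII-8BIT
  # first means << can never raise; force back to UTF-8 before the
  # validity check.
  def append_byte(buffer, byte)
    buffer.force_encoding("ASCII-8BIT")
    buffer << byte.chr
    buffer.force_encoding("UTF-8")
    buffer.valid_encoding?
  end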

Regards,

Brian.

P.S. The above example was with ruby 1.9.2 r23158 under Linux with UTF8
locale. Behaviour may or may not be different with other 1.9.x versions
and/or under different locale settings.

···


7stud -- wrote:

Elliott Cable wrote:

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(`Fixnum`) bytes onto the string?

hex_str = "\\x%x" % 195
puts hex_str

--output:--
\xc3

That is not exactly ideal; it builds the literal four-character text
`\xc3` rather than appending the actual byte. Is there a cleaner way?

···


Incidentally, I needed to do something similar in ruby-1.8 recently, and
it was very straightforward.

  require 'iconv'

  def is_utf8?(str)
    Iconv.iconv('UTF-8','UTF-8',str)
    true
  rescue Iconv::IllegalSequence
    false
  end
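
For example:

  is_utf8?("\xE2\x98\x83")  # => true  (a complete UTF-8 sequence)
  is_utf8?("\xFF")          # => false (illegal byte)

One caveat, if memory serves: iconv reports a merely *truncated*
sequence as Iconv::InvalidCharacter rather than IllegalSequence, so a
streaming version probably wants to rescue Iconv::Failure, the common
parent, instead.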

···


Brian Candler wrote:

Haha, fooled you. You thought it was safe to append a non-UTF8 character
to a UTF8 string (after all, you did before quite happily), but this
time you get an exception. So now you have to do:

str.force_encoding("ASCII-8BIT")

=> "hello\xFF"

str << 250.chr

=> "hello\xFF\xFA"

str.force_encoding("UTF-8")

=> "hello\xFF\xFA"

str.valid_encoding?

=> false

This is why I hate ruby 1.9.

I don't think that's a valid UTF-8 byte sequence...

Incidentally, I needed to do something similar in ruby-1.8 recently, and
it was very straightforward.

def is_utf8?(str)
   Iconv.iconv('UTF-8','UTF-8',str)
   true
rescue Iconv::IllegalSequence
   false
end

Oh, I see there's another tool; let's try it!

$ cat conv.rb
str = "\xFF\xFA"

require 'iconv'

converted = Iconv.iconv 'UTF-8', 'UTF-8', str

puts converted
$ ruby -v conv.rb
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]
conv.rb:6:in `iconv': "\377\372" (Iconv::IllegalSequence)
  from conv.rb:6

Ok, so it's not valid. Let's get a valid byte sequence...

$ cat conv.rb
str = "\xE2\x98\x83"

require 'iconv'

converted = Iconv.iconv 'UTF-8', 'UTF-8', str

puts converted
$ ruby conv.rb
☃

Ok, so that works!

Now let's use 1.9's built-in encoding stuff with our valid byte sequence:

$ cat conv.rb
# encoding: utf-8
str = "hello "
p :encoding => str.encoding
str << 0xE2.chr
str << 0x98.chr
str << 0x83.chr

puts str
$ ruby19 conv.rb
{:encoding=>#<Encoding:UTF-8>}
hello ☃

huh, it worked fine.

So you're mad that Ruby doesn't let you shoot yourself in the foot?

···


Eric Hodel wrote:

=> "hello\xFF\xFA"

str.valid_encoding?

=> false

This is why I hate ruby 1.9.

I don't think that's a valid UTF-8 byte sequence...

That's the whole point. The OP wanted to append bytes to a string, and
detect whether the resulting string was a valid set of complete UTF-8
codepoints, or whether it was necessary to wait for more byte(s) for it
to become complete.

Ruby 1.9's valid_encoding? method seems to do that for you - except that
all the automagical and undocumented mutation of Strings gets in the
way. Sometimes, ruby lets you concatenate an arbitrary byte to a UTF-8
string without an exception; sometimes it does not. It appears this is
something to do with the concept of "compatible encodings".

Now let's use 1.9's built-in encoding stuff with our valid byte
sequence:

$ cat conv.rb
# encoding: utf-8
str = "hello "
p :encoding => str.encoding
str << 0xE2.chr
str << 0x98.chr
str << 0x83.chr

puts str
$ ruby19 conv.rb
{:encoding=>#<Encoding:UTF-8>}
hello ☃

huh, it worked fine.

Yes, but you forgot to add another

  p :encoding => str.encoding

to the end. This shows that the string's encoding has magically mutated
without a by-your-leave.

So now to test whether the encoding is valid or not, you have to mutate
the string back again:

  str.force_encoding("UTF-8")
  puts "is valid" if str.valid_encoding?

OK, then what happens if you concatenate another byte?

  str << 0xFF.chr # boom

Argh, you need to mutate it back to ASCII-8BIT first.

So you're mad that Ruby doesn't let you shoot yourself in the foot?

I'm mad that Ruby has behaviour which is (a) undocumented, and (b) IMO
just plain stupid, and you have to expend ridiculous effort both to
understand it and to work around it.

I'm actually attempting to document it in my spare time, in the form of
a Test::Unit script. It looks like I'm going to have over 200
assertions. This is time I should probably have spent migrating code to
Erlang - which incidentally has a very sensible proposal for Unicode
handling.

Thank goodness for those people maintaining 1.8.6 and related forks like
Ruby Enterprise Edition.

Regards,

Brian.

···



Eric Hodel wrote:

I don't think that's a valid UTF-8 byte sequence...

Brian Candler wrote:

That's the whole point. The OP wanted to append bytes to a string, and
detect whether the resulting string was a valid set of complete UTF-8
codepoints, or whether it was necessary to wait for more byte(s) for it
to become complete.

Ruby 1.9's valid_encoding? method seems to do that for you - except that
all the automagical and undocumented mutation of Strings gets in the
way.

I'm pretty sure I document all the behavior we've seen in this thread (and much more), in this single article on my blog:

I'm really not sure why you seem totally unwilling to count my articles as a valid source of information after all this time. They continually explain what you say is unexplained. I've asked you in the past to list what they don't cover, but aside from the C API side of things (which I admit I don't cover) you're just all out of excuses. I assume you simply have no desire to read them. Fair enough, but hopefully others do. I feel that means we should list them as an available resource.

I'm not sure what "automagical" means in this context either, but I don't feel it's a good description. I assume "auto" is for "automatic." Is Ruby automatically changing the Encoding? I don't think so. The programmer is asking Ruby to add two Strings with different Encodings. Ruby could just say no, but in this case there is a way it can be done, so it makes the choice, assuming that's what you wanted.

I guess "magical" may just mean you don't understand what's happening here. I do though, so there's certainly a process we can break down and understand.
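
Concretely, you can ask Ruby which rule it applied (a quick sketch;
assumes a UTF-8 source encoding):

  Encoding.compatible?("hello", 0xFF.chr)  # => #<Encoding:ASCII-8BIT>
  Encoding.compatible?("héllo", 0xFF.chr)  # => nil

When 7-bit-safe text meets binary data there is a compatible result,
ASCII-8BIT; once the UTF-8 String contains non-ASCII characters, there
is none, and << raises.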

Now let's use 1.9's built-in encoding stuff with our valid byte
sequence:

$ ruby19 conv.rb
{:encoding=>#<Encoding:UTF-8>}
hello ☃

huh, it worked fine.

Yes, but you forgot to add another

p :encoding => str.encoding

to the end. This shows that the string's encoding has magically mutated
without a by-your-leave.

That's not true. You asked Ruby to combine those Strings of differing
Encodings. You gave your permission.

So now to test whether the encoding is valid or not, you have to mutate
the string back again:

str.force_encoding("UTF-8")
puts "is valid" if str.valid_encoding?

OK, then what happens if you concatenate another byte?

str << 0xFF.chr # boom

Argh, you need to mutate it back to ASCII-8BIT first.

As always, you are just not explaining what these examples show. The str variable contains some UTF-8 content. There is another String involved here though and we should examine its Encoding:

>> 0xFF.chr.encoding
=> #<Encoding:ASCII-8BIT>

So what you are really asking Ruby to do is to combine data in two different Encodings. There is a way to do that here, thanks to Ruby's concept of compatible Encodings. Given that, the conversion is made. If you had wanted to keep that data in UTF-8, you should have added more UTF-8 bytes to it:

>> ("abc".force_encoding("UTF-8") << 0xFF.chr.force_encoding("UTF-8")).encoding
=> #<Encoding:UTF-8>

There's no magic here. It's a process. We can explain it. I have.

James Edward Gray II

···


I've briefly read sections 8 to 11 again.

Where does it say that String#<< can now raise an exception, and under
what circumstances? Ah, I finally found it, right at the end of the
*comments* at the bottom of section 8, added a month after initial
publication. (+)

Where does it say that the encoding of a String can change when you
concatenate another string onto it?

By "undocumented" I mean: I expect to type "ri String#<<" and see an
accurate description of what String#<< does, including which
combinations of inputs are valid and which are not, and which attributes
of the String may mutate based on the input supplied.

Regards,

Brian.

(+) There is a warning in the string *comparisons* section saying that,
basically, the rules are too complicated to understand, so you should
always ensure that two strings are in the same encoding before comparing
them. Arguably you could say the same applies to any other operation
which takes two strings.
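
For instance (a sketch):

  u = "ü"                                # UTF-8
  b = u.dup.force_encoding("ASCII-8BIT") # same bytes, tagged binary
  u == b                                 # => false
  u.bytes.to_a == b.bytes.to_a           # => true

Same bytes, but the encodings are incomparable, so the Strings compare
unequal.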

But this to me shows the whole exercise is futile. If, in order to write
a valid program, you need to ensure that all strings are in the same
encoding, then there should be a global flag which sets the encoding. If
I cannot predict what will happen when string A (encoding X) encounters
string B (encoding Y), and I have to keep forcing the encodings to X,
then there's no benefit in having the capability for strings to carry
about their own encodings.

And in many apps, the encoding information is carried "out of band"
anyway: for example: in HTTP or MIME, the encoding info is in a
Content-Type: header.

···


I think you have a misconception about what #force_encoding does. It does not do any conversion. Use Encoding::Converter for that.

While #force_encoding does approximately what you want in the examples you've shown (ASCII, binary data and UTF-8 encodings) it won't work when you're reading one multibyte encoding (say, Shift-JIS from an IO) and adding it to another multibyte encoding (say, a UTF-8 String). You'll only end up with garbage if you don't use a converter.
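
Something along these lines, for example (a sketch; sjis_bytes stands
in for data read from the IO):

  # Transcode Shift_JIS input into a UTF-8 buffer. The converter keeps
  # state between calls, so it can be fed a stream incrementally.
  ec = Encoding::Converter.new('Shift_JIS', 'UTF-8')
  utf8_buffer = ''.force_encoding('UTF-8')
  utf8_buffer << ec.convert(sjis_bytes)  # raises on invalid input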

For 1.9, I don't think io.read(1) is correct. #getc is better since it'll read what you want:

$ cat file
π
$ irb19
irb(main):001:0> open 'file' do |io| p io.getc end
"π"
=> "π"
irb(main):002:0> open 'file' do |io| io.set_encoding 'binary'; p io.getc end
"\xCF"
=> "\xCF"

Even for control characters:

$ ruby19 -e 'p $stdin.getc'
^I
"\t"
$

···

On Jul 23, 2009, at 10:02, Brian Candler wrote:

If I cannot predict what will happen when string A (encoding X) encounters
string B (encoding Y), and I have to keep forcing the encodings to X,
then there's no benefit in having the capability for strings to carry
about their own encodings.

···

Brian Candler wrote:

Where does it say that String#<< can now raise an exception, and under
what circumstances?

Quoting from the page I linked to in my last message:
It's probably worth mentioning that it is possible for a transcoding operation to fail with an error. For example:

$ cat transcode.rb
# encoding: UTF-8
utf8 = "Résumé…"
latin1 = utf8.encode("ISO-8859-1")
$ ruby transcode.rb
transcode.rb:3:in `encode': "\xE2\x80\xA6" from UTF-8 to ISO-8859-1
(Encoding::UndefinedConversionError)
  from transcode.rb:3:in `<main>'

Naturally this fails because "…" is not a valid character in Latin-1.

Ah, I finally found it, right at the end of the
*comments* at the bottom of section 8, added a month after initial
publication. (+)

What does how long it took me to write the content have to do with anything? I added that comment to cover some items you had mentioned I had overlooked. Now it's invalid because it took me a while???

Where does it say that the encoding of a String can change when you
concatenate another string onto it?

Quoting from the same page:

One thing that may help a little in normalizing your data is Ruby's
concept of compatible Encodings. Here's an example of checking and
taking advantage of compatible Encodings:

# data in two different Encodings
p ascii_my # >> "My "
puts ascii_my.encoding.name # >> US-ASCII
p utf8_resume # >> "Résumé"
puts utf8_resume.encoding.name # >> UTF-8
# check compatibility
p Encoding.compatible?(ascii_my, utf8_resume) # >> #<Encoding:UTF-8>
# combine compatible data
my_resume = ascii_my + utf8_resume
p my_resume # >> "My Résumé"
puts my_resume.encoding.name # >> UTF-8
In this example I had data in two different Encodings, US-ASCII and UTF-8. I asked Ruby if the two pieces of data were compatible?(). Ruby can respond to that question in one of two ways. If it returns false, the data is not compatible and you will probably need to transcode at least one piece of it to work with the other. If an Encoding is returned, the data is compatible and can be concatenated resulting in data with the returned Encoding. You can see how that played out when I combined these Strings.

(+) There is a warning in the string *comparisons* section saying that,
basically, the rules are too complicated to understand, so you should
always ensure that two strings are in the same encoding before comparing
them. Arguably you could say the same applies to any other operation
which takes two strings.

But this to me shows the whole exercise is futile.

But you should be doing the exact same thing in Ruby 1.8, which I understand you believe to be a superior system. If you are going to have two pieces of data interact, it just makes sense that they will pretty much always need to be the same kinds of data.

If, in order to write a valid program, you need to ensure that all strings are in the same encoding, then there should be a global flag which sets the encoding.

Like -E and -U in Ruby 1.9? (-E sets the default external encoding for
the process; -U makes the default internal encoding UTF-8.)

And in many apps, the encoding information is carried "out of band"
anyway: for example: in HTTP or MIME, the encoding info is in a
Content-Type: header.

Yeah, that's why a global switch won't really save you from doing your job. You need to read that header, and treat the content accordingly.

James Edward Gray II

···


Thanks to everybody involved here, I now have a great solution that
works really well. I also ended up using EventMachine to get the
individual bytes from the keyboard; it’s a lot more efficient. Here’s
my final solution, in case anybody’s interested:

    require 'eventmachine'

    module Handler
      def initialize
        @buffer = ""
      end

      # EventMachine hands over raw bytes; tag them with the terminal’s
      # locale Encoding before buffering.
      def receive_data byte
        byte.force_encoding Encoding.find('locale')
        @buffer << byte
        check_buffer
      end

      private
        # Once the accumulated bytes form a valid character in the
        # locale Encoding, hand it off (here, just print it) and start
        # a fresh buffer.
        def check_buffer
          if @buffer.valid_encoding?
            p @buffer
            @buffer = ""
          end
        end
    end

    EM.run{ EM.open_keyboard Handler }
