How to convert string to binary and back in Ruby 1.9?

I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1. The result should be some garbage
string, which I need for debugging purposes. For the sake of an
example, my UTF-8 string is "помоник" (Russian for "helper"). After
looking at the documentation, it seemed like String#force_encoding
would do what I need.

But when I go to irb, I get this:

irb(main):060:0> "помоник".encoding
=> #<Encoding:UTF-8>
irb(main):061:0> "помоник".bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
186]
irb(main):062:0> "помоник".force_encoding("ISO-8859-1")
=> "помоник"
irb(main):063:0> "помоник".force_encoding("ISO-8859-1").encoding
=> #<Encoding:ISO-8859-1>
irb(main):064:0> "помоник".force_encoding("ISO-8859-1").bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
186]

So apparently, it changes the encoding, leaves the bytes unchanged,
but also leaves the decoded characters unchanged? Is this a bug or
what?

Note also:

irb(main):066:0> "помоник".encode('BINARY')
Encoding::UndefinedConversionError: "\xD0\xBF" from UTF-8 to
ASCII-8BIT
  from (irb):66:in `encode'
  from (irb):66
  from /usr/local/bin/irb:12:in `<main>'

So apparently in Ruby 1.9, binary isn't really binary?

I banged my head for a while, and then tried it in python3.
Completely easy:

>>> 'помоник'
'помоник'
>>> 'помоник'.encode('utf_8')
b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'
>>> 'помоник'.encode('utf_8').decode('latin_1')
'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'
>>> 'помоник'.encode('utf_8').decode('latin_1').encode('latin_1')
b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'

So can I do the same thing in Ruby 1.9? How do I deal with binary
data? How do I convert a string to a manageable byte sequence? Is
there a way to turn an array of bytes into a string of a specified
encoding?

AFAIK String#force_encoding doesn't re-encode the string; it just changes
the encoding the string is tagged with and leaves the bytes alone.

On the other hand, #encode does transcode the bytes, and it fails if the
conversion is not possible.
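
A minimal sketch of that difference, using the string from the question. The variable names are only for illustration, and the values in the comments are what current Rubies give; 1.9.1 should behave the same way here.

```ruby
# encoding: UTF-8
str = "помоник"                      # 7 characters, 14 bytes in UTF-8

# force_encoding relabels the existing bytes (it mutates the receiver,
# so work on a copy). No byte is changed.
relabelled = str.dup.force_encoding("ISO-8859-1")
same_bytes = (relabelled.bytes.to_a == str.bytes.to_a)   # true
relabelled_length = relabelled.length                     # 14, one character per byte

# encode transcodes: it tries to express the same characters in the target
# encoding. Cyrillic does not exist in ISO-8859-1, so this raises.
begin
  str.encode("ISO-8859-1")
  encode_failed = false
rescue Encoding::UndefinedConversionError
  encode_failed = true
end
```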

--
Iñaki Baz Castillo <ibc@aliax.net>

Joe wrote:

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1.

UTF-8 is a binary encoding of Unicode codepoints, so it's a sequence of
binary bytes by definition. And you get the same as your Python code:

irb(main):001:0> 'помоник'
=> "помоник"
irb(main):002:0> 'помоник'.bytes.each { |x| print "%02x " % x }
d0 bf d0 be d0 bc d0 be d0 bd d0 b8 d0 ba => "помоник"
irb(main):004:0> 'помоник'.force_encoding("BINARY")
=> "\xD0\xBF\xD0\xBE\xD0\xBC\xD0\xBE\xD0\xBD\xD0\xB8\xD0\xBA"

I think what's confusing you is this:

irb(main):005:0> str = 'помоник'
=> "помоник"
irb(main):006:0> str.force_encoding("ISO-8859-1")
=> "помоник"

Here, Ruby is doing something strange. The string is tagged as a
sequence of ISO-8859-1 characters, but this sequence of bytes is being
squirted as-is to a UTF-8 terminal, and so the UTF-8 terminal is
displaying them as the original characters.

You can get the behaviour you want like this, by transcoding to UTF-8:

irb(main):009:0> str.encode("UTF-8")
=> "Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº"
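
Put together, the Python one-liner from the question could be sketched in Ruby as below. This is only an illustration of the relabel-then-transcode idea; the variable names are made up.

```ruby
# encoding: UTF-8
str = "помоник"

# Step 1: relabel the UTF-8 bytes as ISO-8859-1 (no bytes change).
# Step 2: transcode that "Latin-1 text" to UTF-8 so a UTF-8 terminal
#         shows one accented character per original byte.
garbage = str.dup.force_encoding("ISO-8859-1").encode("UTF-8")
# garbage is now 14 characters (28 bytes): the mojibake form of the word

# The trip is reversible: transcode back to ISO-8859-1, then relabel as UTF-8.
restored = garbage.encode("ISO-8859-1").force_encoding("UTF-8")
```

The code points of `garbage` are exactly the byte values of the original string, which is what makes it useful for debugging.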

Given that irb is running in a UTF-8 environment, it is arguable that
STDOUT should have an external encoding of UTF-8, which means text
should be transcoded to UTF-8 automatically.

That is, you can also get the behaviour you want from this standalone
program:

# encoding: UTF-8
STDOUT.set_encoding "UTF-8" # << THE MAGIC BIT

str = 'помоник'
str.force_encoding("ISO-8859-1")
puts str

It seems inconsistent to me that STDOUT doesn't get its
external_encoding set automatically.
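
The effect of an external encoding on output can be checked without a terminal. The sketch below writes the same relabelled string to a pipe twice, once with no external encoding on the write end and once with UTF-8; `bytes_written` is a hypothetical helper written for this example, not something from the standard library.

```ruby
# encoding: UTF-8
# Hypothetical helper: write `string` to a pipe whose write end optionally
# has an external encoding, and return the bytes that actually came out.
def bytes_written(string, external_encoding = nil)
  reader, writer = IO.pipe
  reader.binmode                     # read raw bytes, no decoding
  writer.set_encoding(external_encoding) if external_encoding
  writer.write(string)
  writer.close
  reader.read.bytes.to_a
ensure
  reader.close if reader && !reader.closed?
  writer.close if writer && !writer.closed?
end

str = "über".dup.force_encoding("ISO-8859-1")   # bytes C3 BC 62 65 72, relabelled

raw        = bytes_written(str)            # no external encoding: bytes pass through
transcoded = bytes_written(str, "UTF-8")   # ISO-8859-1 -> UTF-8 on the way out
```

With no external encoding the five original bytes come out untouched, which is why a UTF-8 terminal still shows the original characters. With UTF-8 set, each Latin-1 character is transcoded, so seven bytes come out.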

So apparently in Ruby 1.9, binary isn't really binary?

Correct. In Ruby 1.9, binary is ASCII. I hate this.
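
A small sketch of what that means in practice, as far as I can tell: transcoding to BINARY only succeeds for ASCII characters, while relabelling as BINARY always succeeds. The variable names are illustrative.

```ruby
# encoding: UTF-8
# ASCII-only text can be transcoded to BINARY (ASCII-8BIT).
ascii_as_binary = "helper".encode("BINARY")

# Anything outside ASCII has no defined conversion, so encode raises.
begin
  "помоник".encode("BINARY")
  conversion_failed = false
rescue Encoding::UndefinedConversionError
  conversion_failed = true
end

# To treat a string as raw bytes, relabel it instead of transcoding it.
raw = "помоник".dup.force_encoding("BINARY")
raw_length = raw.length              # 14: every byte counts as one character
```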

I have documented a lot of the gory details at

Thanks for bringing another anomaly to my attention.

···

--
Posted via http://www.ruby-forum.com/.

Joe wrote:

I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1.

What does it mean to "turn a string from binary to ISO-8859-1"?

'помоник'.encode('utf_8').decode('latin_1')

'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'

What Python does here is encode the string (from its internal Unicode format) to a UTF-8 byte string and then convert it back into its internal Unicode format (interpreting the bytes as a Latin-1 string). Finally it writes the result to the console, which means that it converts it once more (probably to UTF-8) for the Mac OS Terminal. This is an important point to keep in mind.

So this is quite similar to what Ruby does, except that Ruby makes no conversion to a general internal format and no special conversion for the terminal.

If you wrote the results out to a file, you would get the same result.

Regards, R.

OK, so String#force_encoding just changes the encoding, but does not
alter the string. But how does it decide to print as the same
sequence of Cyrillic characters, when it thinks its encoding is
ISO-8859-1? How does ruby1.9 decide what characters to display when
printing a String? Surely it must adhere to the encoding of that
String? Is ruby storing the ISO-8859-1 encoded string as a sequence
of unicode characters, or what?

This seems crazy to me.

OK, so maybe String#force_encoding is crazy and broken or just won't
be able to do what I want. Your suggestion was that String#encode is
the method for changing the string. Of course I tried that one, and
it errors because there is no Cyrillic alphabet in ISO-8859-1.

Is there really no way to go from bytes to string? That's all I want!

···

On Sep 1, 4:18 pm, Iñaki Baz Castillo <i...@aliax.net> wrote:

On Wednesday, 2 September 2009, Joe wrote:

On the other hand, #encode does change the encoding, and it fails if the
conversion is not possible.

--
Iñaki Baz Castillo <i...@aliax.net>

I think I understand it now. The following was confusing me initially:

    >> str = "über"
    => "über"
    >> str.force_encoding("ISO-8859-1")
    => "über"
    >> str = "groß"
    => "groß"
    >> str.force_encoding("ISO-8859-1")
    => "gro�\x9F"

It appears this is just an artefact of String#inspect. String#inspect
"knows" that \x80 to \x9F are not printable characters in ISO-8859-1, so
converts them to the backslash hex form. This breaks the UTF-8 display
by splitting the character, but of course only for strings which contain
bytes in that range.
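
The byte values explain why only one of the two strings is affected. A quick check (the range constant and variable names are mine):

```ruby
# encoding: UTF-8
# In ISO-8859-1, bytes 0x80..0x9F are control characters, not printable ones.
LATIN1_CONTROL_RANGE = (0x80..0x9F)

uber_bytes  = "über".bytes.to_a    # [195, 188, 98, 101, 114]   ü = C3 BC
gross_bytes = "groß".bytes.to_a    # [103, 114, 111, 195, 159]  ß = C3 9F

uber_hits  = uber_bytes.select  { |b| LATIN1_CONTROL_RANGE.include?(b) }   # []
gross_hits = gross_bytes.select { |b| LATIN1_CONTROL_RANGE.include?(b) }   # [159]
```

Only "groß" contains a byte (0x9F) in the unprintable range, so only its inspect output gets an escape in the middle of what was a two-byte UTF-8 character.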

You still get the string displayed as UTF-8 using puts without inspect:

    >> puts str
    groß
    => nil

It works if you set the encoding for STDOUT inside irb, in which case
you'll get everything transcoded to your terminal's character set.

    >> STDOUT.set_encoding "locale"
    => #<IO:<STDOUT>>
    >> str = "über"
    => "über"
    >> str.force_encoding("ISO-8859-1")
    => "Ã¼ber"
    >> puts str
    Ã¼ber
    => nil

OK, I found the Array#pack method. At first glance, it seemed to be
exactly what I was looking for. I could do str.bytes.to_a to turn a
String into raw bytes, and Array#pack will turn them right back into a
String.

But go to

http://ruby-doc.org/core-1.9/classes/Array.html

The method is missing from the 1.9 documentation. Has it been
deprecated? The 1.8 documentation doesn't help much, because it seems
the function is entirely unaware of the String encoding.
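
For what it's worth, a minimal sketch of that round trip. Array#pack with "C*" does not know about text encodings: it returns a string tagged as binary, and force_encoding is then used to say how the bytes should be read. The variable names are only for illustration.

```ruby
# encoding: UTF-8
# The byte array from the irb session at the top of the thread.
bytes = [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208, 186]

# "C*" packs every element as one unsigned byte; the result is tagged
# ASCII-8BIT (BINARY).
packed = bytes.pack("C*")

# Relabel copies of the packed string with whatever encoding is wanted.
as_utf8   = packed.dup.force_encoding("UTF-8")        # the original 7-character word
as_latin1 = packed.dup.force_encoding("ISO-8859-1")   # 14 Latin-1 characters
```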

I guess Ruby's m17n is brand spanking new, and it shows, huh? I'm
finding it pretty frustrating. :(

···

On Sep 1, 4:53 pm, Joe <ziggur...@gmail.com> wrote:

Is there really no way to go from bytes to string? That's all I want!

Brian Candler wrote pretty thorough documentation of 1.9's M17N on GitHub (candlerb/string19: runnable documentation of Ruby 1.9's M17N properties). There are also multiple sources of documentation on the subject at Gray Soft (Edward Gray) and elsewhere.

I'm also more comfortable with how 1.8 behaves but then again I'm a newbie here.

Patrick

···

On Sep 2, 2009, at 2:55 AM, Joe wrote:

I'm finding it pretty frustrating. :(

It is, especially as Ruby 1.8 behaviour is less annoying IMHO in this
regard.

On Sep 1, 6:38 pm, Joe <ziggur...@gmail.com> wrote:

The method is missing from the 1.9 documentation. Has it been
deprecated?

I don't believe so. I don't know why it's not in the docs there, but
it's in my local ri:

Slim2:~ phrogz$ ri -T Array#pack

------------------------------------------------------------- Array#pack
     arr.pack ( aTemplateString ) -> aBinaryString

     From Ruby 1.9.1
------------------------------------------------------------------------
     Packs the contents of _arr_ into a binary sequence according to the
     directives in _aTemplateString_ (see the table below) Directives
     ``A,'' ``a,'' and ``Z'' may be followed by a count, which gives the
     width of the resulting field. The remaining directives also may
     take a count, indicating the number of array elements to convert.
     If the count is an asterisk (``+*+''), all remaining array elements
     will be converted. Any of the directives ``+sSiIlL+'' may be
     followed by an underscore (``+_+'') to use the underlying
     platform's native size for the specified type; otherwise, they use
     a platform-independent size. Spaces are ignored in the template
     string. See also +String#unpack+.

        a = [ "a", "b", "c" ]
        n = [ 65, 66, 67 ]
        a.pack("A3A3A3")   #=> "a  b  c  "
        a.pack("a3a3a3")   #=> "a\000\000b\000\000c\000\000"
        n.pack("ccc")      #=> "ABC"

     Directives for +pack+.

      Directive    Meaning
      ---------------------------------------------------------------
          @     |  Moves to absolute position
          A     |  arbitrary binary string (space padded, count is width)
          a     |  arbitrary binary string (null padded, count is width)
          B     |  Bit string (descending bit order)
          b     |  Bit string (ascending bit order)
          C     |  Unsigned byte (C unsigned char)
          c     |  Byte (C char)
          D, d  |  Double-precision float, native format
          E     |  Double-precision float, little-endian byte order
          e     |  Single-precision float, little-endian byte order
          F, f  |  Single-precision float, native format
          G     |  Double-precision float, network (big-endian) byte order
          g     |  Single-precision float, network (big-endian) byte order
          H     |  Hex string (high nibble first)
          h     |  Hex string (low nibble first)
          I     |  Unsigned integer
          i     |  Integer
          L     |  Unsigned long
          l     |  Long
          M     |  Quoted printable, MIME encoding (see RFC2045)
          m     |  Base64 encoded string (see RFC 2045, count is width)
                |  (if count is 0, no line feed are added, see RFC 4648)
          N     |  Long, network (big-endian) byte order
          n     |  Short, network (big-endian) byte-order
          P     |  Pointer to a structure (fixed-length string)
          p     |  Pointer to a null-terminated string
          Q, q  |  64-bit number
          S     |  Unsigned short
          s     |  Short
          U     |  UTF-8
          u     |  UU-encoded string
          V     |  Long, little-endian byte order
          v     |  Short, little-endian byte order
          w     |  BER-compressed integer
          X     |  Back up a byte
          x     |  Null byte
          Z     |  Same as ``a'', except that null is added with *
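
Two directives from the table are the ones relevant to this thread: "C" works on bytes and "U" works on Unicode code points. A short sketch of how they differ, with String#unpack as the inverse operation (variable names are mine):

```ruby
# encoding: UTF-8
word = "помоник"

# "U*" unpacks Unicode code points: one integer per character.
code_points = word.unpack("U*")      # [1087, 1086, 1084, 1086, 1085, 1080, 1082]

# "C*" unpacks unsigned bytes: one integer per byte (same as word.bytes.to_a).
byte_values = word.unpack("C*")      # 14 integers, starting 208, 191, ...

# Packing with "U*" yields a UTF-8 string; packing with "C*" yields a
# binary string that still has to be relabelled with force_encoding.
from_code_points = code_points.pack("U*")
from_bytes       = byte_values.pack("C*")
```

So "U*" is convenient when the data is already a list of code points, while "C*" plus force_encoding is the general way to go from raw bytes to a string in a chosen encoding.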