I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.
I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1. The result should be some garbage
string, which I need for debugging purposes. For the sake of an
example, my UTF-8 string is "помоник" (Russian for "helper"). After
looking at the documentation, it seemed like String#force_encoding
would do what I need.
But when I go to irb, I get this:
irb(main):060:0> "помоник".encoding
=> #<Encoding:UTF-8>
irb(main):061:0> "помоник".bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
186]
irb(main):062:0> "помоник".force_encoding("ISO-8859-1")
=> "помоник"
irb(main):063:0> "помоник".force_encoding("ISO-8859-1").encoding
=> #<Encoding:ISO-8859-1>
irb(main):064:0> "помоник".force_encoding("ISO-8859-1").bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
186]
So apparently, it changes the encoding, leaves the bytes unchanged,
but also leaves the decoded characters unchanged? Is this a bug or
what?
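One way I could test whether the characters really are unchanged is
sketched below. It assumes (and I have not confirmed this) that irb
just writes the raw bytes out and my UTF-8 terminal is what renders
them as Cyrillic. The variable names are only for illustration.

```ruby
# encoding: utf-8
# Sketch: does force_encoding change the characters, or only the label?
s     = "помоник"                               # 7 characters, 14 bytes in UTF-8
latin = s.dup.force_encoding("ISO-8859-1")      # dup so s keeps its UTF-8 label

puts s.length                                   # character count under UTF-8
puts latin.length                               # character count under ISO-8859-1
puts latin.bytes.to_a == s.bytes.to_a           # the bytes themselves are untouched

# Transcoding the relabelled string to UTF-8 should make the garbage
# visible on a UTF-8 terminal, since each Latin-1 character gets re-encoded.
garbage = latin.encode("UTF-8")
puts garbage
puts garbage.bytesize
```

If the two lengths differ (7 versus 14), the decoded characters did
change and only the display is misleading.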
Note also:
irb(main):066:0> "помоник".encode('BINARY')
Encoding::UndefinedConversionError: "\xD0\xBF" from UTF-8 to
ASCII-8BIT
from (irb):66:in `encode'
from (irb):66
from /usr/local/bin/irb:12:in `<main>'
So apparently in Ruby 1.9, binary isn't really binary?
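For comparison, here is a small sketch of the two calls side by side.
My working assumption is that encode tries to convert characters (and
has no mapping for them in BINARY), while force_encoding only
relabels the existing bytes.

```ruby
# encoding: utf-8
# Sketch: encode("BINARY") versus force_encoding("BINARY") on the same string.
s = "помоник"

encode_failed = false
begin
  s.encode("BINARY")                     # asks for a character conversion
rescue Encoding::UndefinedConversionError => e
  encode_failed = true
  puts "encode failed: #{e.message}"
end

raw = s.dup.force_encoding("BINARY")     # same bytes, new label, no conversion
puts raw.encoding
puts raw.bytesize
puts raw.length                          # in BINARY, one byte counts as one character
```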
I banged my head for a while, and then tried it in Python 3.
Completely easy:
>>> 'помоник'
'помоник'
>>> 'помоник'.encode('utf_8')
b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'
>>> 'помоник'.encode('utf_8').decode('latin_1')
'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'
>>> 'помоник'.encode('utf_8').decode('latin_1').encode('latin_1')
b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'
So can I do the same thing in Ruby 1.9? How do I deal with binary
data? How do I convert a string to a manageable byte sequence? Is
there a way to turn an array of bytes into a string of a specified
encoding?
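To make the question concrete, this is roughly the round trip I am
after, sketched in Ruby. It assumes Array#pack with "C*" and
String#force_encoding are the right tools. I am not sure this is the
intended way to do it, and the variable names are just for
illustration.

```ruby
# encoding: utf-8
# Sketch: string -> byte array -> binary string -> ISO-8859-1 -> back to UTF-8.
original = "помоник"

bytes = original.bytes.to_a                    # array of 14 integers
raw   = bytes.pack("C*")                       # pack each integer as one unsigned byte
latin = raw.dup.force_encoding("ISO-8859-1")   # reinterpret the same bytes as Latin-1

# Going back: take the bytes again and label them as UTF-8.
back = latin.bytes.to_a.pack("C*").force_encoding("UTF-8")

puts raw.encoding
puts latin.length
puts back == original
puts back.valid_encoding?
```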