Converting between ASCII-8BIT and UTF-8

I have an issue where I have a string of arbitrary data in Ruby, which
is encoded using ASCII-8BIT (the output from File.binread) that needs to
be sent to another system running Python. On the receiving end it
expects the data to be UTF-8 encoded [1].

On the Ruby side I've tried such things as:

  asciidata = File.binread('/path/to/file')

  data = asciidata.encode('UTF-8')

also:

  data = asciidata.dup.encode('UTF-8')

While the resulting string claims to be valid UTF-8 encoded data,
the receiver says it's not and chokes with:

'utf8' codec can't decode byte 0x9b in position 10: invalid start byte

I'm a bit lost on this topic. How do I get Ruby to verify that the data
is properly encoded as UTF-8 and, if not, transcode it before sending?

[1] Actually, our wire protocol requires such data to be UTF-8 encoded

···

--
Darryl L. Pierce <mcpierce@gmail.com>
http://mcpierce.blogspot.com/
Famous last words:
   "I wonder what happens if we do it this way?"

Hi Darryl,

"Darryl L. Pierce" <mcpierce@gmail.com> writes:

> I have an issue where I have a string of arbitrary data in Ruby, which
> is encoded using ASCII-8BIT (the output from File.binread) that needs to
> be sent to another system running Python. On the receiving end it
> expects the data to be UTF-8 encoded [1].

ASCII-8BIT is not really an “encoding”. It means that you have binary
data, i.e. raw bytes, not something text-based. Hence, transcoding from
ASCII-8BIT (whose alias, by the way, is BINARY) is not a meaningful operation.

If you can guarantee you have only 7-bit characters in your ASCII-8BIT
string, i.e. real 7-bit ASCII, then you can be happy: any 7-bit ASCII
string is automatically a valid UTF-8 string, because the first 128 code
points of UTF-8 are identical to 7-bit ASCII. In that case no
transcoding is necessary.
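
For example, something like this (an untested sketch) would pass
straight through:

  data = "plain ASCII".force_encoding("ASCII-8BIT")
  data.ascii_only?                      # => true, only 7-bit bytes
  utf8 = data.force_encoding("UTF-8")   # retag only, no byte changes
  utf8.valid_encoding?                  # => true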

The way you have phrased your question suggests that you receive
arbitrary binary data, e.g. that you are reading in image files like
JPEG. So you are basically asking, “how do I transcode a JPEG image to
UTF-8?”. It should be obvious that these are two entirely different
concepts.

If you aren’t reading in JPEG files with your Ruby program, but rather
textual data, then it will be encoded in some encoding, maybe Windows-1252,
which you can then tell Ruby by using a construct such as this one:

  data = File.open("yourfile", "r:Windows-1252"){|f| f.read}

Ruby itself cannot know the encoding of textual input unless you tell it
the encoding as shown above. Once you have obtained a correctly tagged
string, you can then transcode that to UTF-8:

  utf8 = data.encode("UTF-8")

As a helpful piece of extra information, you can use
String#valid_encoding? to test if the string you have is entirely valid
in the encoding it has been tagged with.
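
For example:

  "abc".valid_encoding?    # => true
  "\x9b".valid_encoding?   # => false (the very byte from your error message)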

Does this clear things up for you?

Vale,
Quintus

···

--
Blog: http://www.quintilianus.eu

I will reject HTML emails.     | Ich akzeptiere keine HTML-Nachrichten.
Use GnuPG for mail encryption: | GnuPG für Mail-Verschlüsselung:
http://www.gnupg.org           | The GNU Privacy Guard

No. It means you CAN have binary data. It doesn't mean you do. It is a meaningful operation if the data being re-encoded is meaningful to the new encoding.
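
For instance (untested), a BINARY-tagged string whose bytes happen to be
valid UTF-8 re-encodes just fine, while one that isn't raises:

  "hello".force_encoding('ASCII-8BIT').encode('UTF-8')
  # => "hello" -- the bytes are meaningful in UTF-8

  "\x9b".force_encoding('ASCII-8BIT').encode('UTF-8')
  # raises Encoding::UndefinedConversionError -- 0x9b means nothing on its own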

···

On Nov 4, 2014, at 12:12, Quintus <quintus@quintilianus.eu> wrote:

> ASCII-8BIT is not really an “encoding”. It means that you have binary
> data, i.e. raw bytes, not something text-based. Hence, transcoding from
> ASCII-8BIT (whose alias, by the way, is BINARY) is not a meaningful operation.

Hi Darryl,

"Darryl L. Pierce" <mcpierce@gmail.com> writes:

> I have an issue where I have a string of arbitrary data in Ruby, which
> is encoded using ASCII-8BIT (the output from File.binread) that needs to
> be sent to another system running Python. On the receiving end it
> expects the data to be UTF-8 encoded [1].

> ASCII-8BIT is not really an “encoding”. It means that you have binary
> data, i.e. raw bytes, not something text-based. Hence, transcoding from
> ASCII-8BIT (whose alias, by the way, is BINARY) is not a meaningful operation.
>
> If you can guarantee you have only 7-bit characters in your ASCII-8BIT
> string, i.e. real 7-bit ASCII, then you can be happy: any 7-bit ASCII
> string is automatically a valid UTF-8 string, because the first 128 code
> points of UTF-8 are identical to 7-bit ASCII. In that case no
> transcoding is necessary.
>
> The way you have phrased your question suggests that you receive
> arbitrary binary data, e.g. that you are reading in image files like
> JPEG. So you are basically asking, “how do I transcode a JPEG image to
> UTF-8?”. It should be obvious that these are two entirely different
> concepts.

Yeah, sorry. Please don't let my example confuse things. I was simply
loading a binary file from disk as a way of getting a bunch of random
data. A bad example on my part.

What I have is this: my project has the concept of a message and a
messenger that can send/receive messages. The wire protocol requires
that any string of data be UTF-8 encoded.

In the Message class we have body= and body for setting and getting the
body of the message. The user has the ability to set the format of the
message body and then set the content, and one such format is named
"DATA" for arbitrary data which must be UTF-8 encoded.

> If you aren’t reading in JPEG files with your Ruby program, but rather
> textual data, then it will be encoded in some encoding, maybe Windows-1252,
> which you can then tell Ruby by using a construct such as this one:
>
>   data = File.open("yourfile", "r:Windows-1252"){|f| f.read}
>
> Ruby itself cannot know the encoding of textual input unless you tell it
> the encoding as shown above. Once you have obtained a correctly tagged
> string, you can then transcode that to UTF-8:
>
>   utf8 = data.encode("UTF-8")
>
> As a helpful piece of extra information, you can use
> String#valid_encoding? to test if the string you have is entirely valid
> in the encoding it has been tagged with.
>
> Does this clear things up for you?

It makes my head ache, for certain. :-)

Given the above parameters for our project, what I'm thinking is that I
could either 1) check the value passed into body= and, if it's not
UTF-8 encoded, raise an exception telling the caller to pass in a
properly encoded value, or 2) find a way to transcode the data to UTF-8.

My preference would be for (1) and to put the burden on the caller to
ensure their data is fit for our wire protocol. I don't want to have to
do any transcoding unless it's baked into the language.
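
Something like this sketch is what I have in mind for (1) (untested; the
class and method names are the ones I described above):

  class Message
    attr_reader :body

    # Option 1: reject instead of transcoding.
    def body=(value)
      unless value.dup.force_encoding('UTF-8').valid_encoding?
        raise ArgumentError, 'message body must be valid UTF-8'
      end
      @body = value
    end
  end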

···

On Tue, Nov 04, 2014 at 09:12:18PM +0100, Quintus wrote:

--
Darryl L. Pierce <mcpierce@gmail.com>

Famous last words:
   "I wonder what happens if we do it this way?"

Quoting Darryl L. Pierce (mcpierce@gmail.com):

What I have is this: my project has the concept of a message and a
messenger that can send/receive messages. The wire protocol requires
that any string of data be UTF-8 encoded.

Strings are just sequences of bits, like other data. But while with
binary, ASCII and the older per-country encodings you take the bits one
byte at a time, in UTF-8 you may have to take them anywhere from one to
four bytes at a time. And there are illegal sequences.

In Ruby, since 1.9, a string is always associated with an encoding;
since 2.0 the default source encoding is UTF-8.

  s = 'ábc'
  p s.encoding   # => #<Encoding:UTF-8>

There are two main operations you can do with your string regarding
encoding. Either you can transcode:

  s.encode!('ISO-8859-1')

In this case, the first character (lower case 'a' with acute accent,
represented as the two bytes \xC3\xA1 in UTF-8) is changed to its
equivalent in ISO-8859-1, the single byte \xE1.
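
You can see the change by looking at the bytes (using the non-mutating
encode on a fresh copy here, so that s keeps its UTF-8 bytes for the
examples below):

  t = 'ábc'.encode('ISO-8859-1')
  p t.bytes   # => [225, 98, 99]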

The second operation allows you to keep the exact sequence of bytes,
but tells the system that it has to interpret the string with another
encoding:

  s.force_encoding('ISO-8859-1')

In this case, the same bytes, interpreted as ISO-8859-1, will read

  Ã¡bc

because in ISO-8859-1 \xC3 is A with tilde, and \xA1 is the inverted
exclamation mark.

At any time, you can inspect the exact bytes that make up a string.
For the UTF-8 version of s:

  b = s.bytes
  p b   # => [195, 161, 98, 99]

In your case, since you receive stuff, it should be the other party's
responsibility to make sure the strings are proper UTF-8. What you
should not do is mangle them. If I were you, in order not to make a
mistake I'd take the input as an array of bytes and write a method
something like this:

def massage_input(array)
  # Pack the raw bytes into a BINARY string, then tag it as UTF-8.
  s = array.pack('C*').force_encoding('UTF-8')
  unless s.valid_encoding?
    raise ArgumentError, 'input is not valid UTF-8'  # or complain some other way
  end
  s
end

and then make sure I do not modify the string I receive anymore.
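
For example, with the bytes of a valid UTF-8 'ábc' (values just for
illustration):

  p massage_input([195, 161, 98, 99])   # => "ábc"
  massage_input([0x9b])                 # raises ArgumentError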

Carlo

···

Subject: Re: Converting between ASCII-8BIT and UTF-8
  Date: Tue 04 Nov 14 05:22:25PM -0500

--
  * Se la Strada e la sua Virtu' non fossero state messe da parte,
* K * Carlo E. Prelz - fluido@fluido.as che bisogno ci sarebbe
  * di parlare tanto di amore e di rettitudine? (Chuang-Tzu)

Okay, so after this discussion and a little reading I think I understand
the issue a bit better.

So that said, I now have an extension to the issue. To solve our problem
we peek at the string's String#encoding value:

1) if it says it's UTF-8 we treat it as such, otherwise
2) we retag it as UTF-8 with value.force_encoding('UTF-8') and
   check the valid_encoding? result, otherwise
3) we treat it as a binary string.
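
In code, that decision tree is roughly this (untested sketch, and the
method name is just for illustration):

  def coerce_body(value)
    return value if value.encoding == Encoding::UTF_8 && value.valid_encoding?
    retagged = value.dup.force_encoding('UTF-8')
    return retagged if retagged.valid_encoding?
    value.dup.force_encoding('ASCII-8BIT')  # treat it as a binary string
  end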

This works for us. But the problem is: how do we do this on Ruby 1.8 (we
support 1.8.7 and up)?

···

On Tue, Nov 04, 2014 at 09:12:18PM +0100, Quintus wrote:

--
Darryl L. Pierce <mcpierce@gmail.com>

Famous last words:
   "I wonder what happens if we do it this way?"