ASCII representation of Unicode string?

darren_kirby · 22 June 2006 20:33

Hello all.

I am unpacking some unicode strings from a binary file. I have a string like:

"W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"

and I need to turn it into:

"WM/TrackNumber"

When I 'puts' the string it prints fine but I need to assign it to a variable,
and when I try something like:

require 'jcode'
$KCODE = 'UTF8'
s.each_char { |ch| print ch }

it will print each char but return the original unicode string. And when I
try:

n = ""
s.each_char { |ch| n += ch }

The entire unicode char is being added to n.

So can I extract an ascii representation of this string? I will admit I don't
know the first thing about unicode and I may be totally lost here...

Thanks for consideration.

-d

···

--
darren kirby :: Part of the problem since 1976 :: http://badcomputer.org
"...the number of UNIX installations has grown to 10, with more expected..."
- Dennis Ritchie and Ken Thompson, June 1972

Phillip_Hutchings · 22 June 2006 20:57

That's not UTF-8; it's UTF-16 little-endian without a BOM. If you
know the string is pure ASCII that just happens to be UTF-16 encoded, you
can simply do s.gsub(/\000/,''), but this will corrupt any character
outside the 7-bit ASCII range.
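A minimal sketch of the trade-off described above, assuming Ruby 2.0 or later (the thread predates it; `String#b`, `force_encoding` and `valid_encoding?` are not available in 1.8). The helper name `strip_nuls` and the "café" input are made up for illustration and are not from the original file.

```ruby
# Naive approach from the post above: drop every NUL byte.
# Only safe when the UTF-16LE data holds nothing but 7-bit ASCII.
def strip_nuls(bytes)
  bytes.b.delete("\0")
end

# Sample bytes from the question.
sample = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"
p strip_nuls(sample)   # the ASCII letters survive, plus the stray \003 byte

# Hypothetical non-ASCII input: "café" encoded as UTF-16LE.
cafe = "caf\u00e9".encode("UTF-16LE")
damaged = strip_nuls(cafe).force_encoding("UTF-8")
p damaged.valid_encoding?   # false: a lone 0xE9 byte is not valid UTF-8
```

So the NUL-stripping shortcut is fine for tag names like the one in the question, but it silently damages anything outside ASCII.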

--
Phillip Hutchings
http://www.sitharus.com/

Chris16 · 22 June 2006 21:12

darren kirby wrote:

Hello all.

I am unpacking some unicode strings from a binary file. I have a string
like:

"W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"

and I need to turn it into:

"WM/TrackNumber"

could try:

myAscii = s.unpack('U'*s.length).select{|x| x >46}.collect{|x|
x.chr}.to_s

···

--
Posted via http://www.ruby-forum.com/.

Paul_Battley · 23 June 2006 11:31

I am unpacking some unicode strings from a binary file. I have a string like:

"W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"

and I need to turn it into:

"WM/TrackNumber"

...

So can I extract an ascii representation of this string? I will admit I don't
know the first thing about unicode and I may be totally lost here...

Here's a reliable way to do it with Iconv:

require 'iconv'
s = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"
ic = Iconv.new("US-ASCII//IGNORE", "UTF-16LE")
p (ic.iconv(s+' '))[0..-2] # => "WM/TrackNumber"

Paul.

Julian_Julik_Tarkhan · 23 June 2006 15:35

What you have looks like UTF-16. Your best bet is to push it through Iconv and convert it to UTF-8.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

darren_kirby · 22 June 2006 21:32

quoth the Chris Hulan:

darren kirby wrote:
> Hello all.
>
> I am unpacking some unicode strings from a binary file. I have a string
> like:
>
> "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"
>
> and I need to turn it into:
>
> "WM/TrackNumber"

could try:

myAscii = s.unpack('U'*s.length).select{|x| x >46}.collect{|x|
x.chr}.to_s

Hello,

This is fairly close to something I was trying earlier, except more elegant.
This seems to be munging my spaces though, as I take it the space code is
less than 46? Have to look for a chart.

Is unpacking as UTF-8 going to be problematic if the string is actually UTF-16,
as Phillip points out?

I will play some more, thanks guys...
-d

Nobuyoshi_Nakada1 · 23 June 2006 01:21

Hi,

At Fri, 23 Jun 2006 06:12:21 +0900,
Chris Hulan wrote in [ruby-talk:198599]:

darren kirby wrote:
> Hello all.
>
> I am unpacking some unicode strings from a binary file. I have a string
> like:
>
> "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"
>
> and I need to turn it into:
>
> "WM/TrackNumber"
>

could try:

myAscii = s.unpack('U'*s.length).select{|x| x >46}.collect{|x|
x.chr}.to_s

Shorter:

s.unpack("v*").pack("U*") #=> "WM/TrackNumber\000"

Or to keep trailing odd byte,

(s+"\0").unpack("v*").pack("U*") #=> "WM/TrackNumber\000\003"

Note this isn't aware of surrogate pairs.
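To illustrate the surrogate-pair caveat, here is a rough sketch that combines pairs by hand before packing. It is not from the thread: the method name `utf16le_to_utf8` is hypothetical, and it assumes Ruby 1.9 or later for the usage lines (the `\u{...}` escape and `String#encode`).

```ruby
# Decode UTF-16LE bytes to a UTF-8 string, combining surrogate pairs by hand.
# A trailing odd byte is ignored, since unpack("v*") only consumes whole pairs.
# An unpaired surrogate is passed through unchanged, which yields invalid UTF-8.
def utf16le_to_utf8(bytes)
  units = bytes.unpack("v*")   # 16-bit little-endian code units
  codepoints = []
  i = 0
  while i < units.length
    hi = units[i]
    lo = units[i + 1]
    if hi.between?(0xD800, 0xDBFF) && lo && lo.between?(0xDC00, 0xDFFF)
      # High and low surrogate: rebuild the code point above U+FFFF.
      codepoints << 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
      i += 2
    else
      codepoints << hi
      i += 1
    end
  end
  codepoints.pack("U*")
end

# U+1F600 needs a surrogate pair in UTF-16.
smiley = "\u{1F600}".encode("UTF-16LE")
p utf16le_to_utf8(smiley) == "\u{1F600}"   # true
```

For strings that stay within the Basic Multilingual Plane, this behaves the same as the plain unpack("v*").pack("U*") one-liner.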

···

--
Nobu Nakada

darren_kirby · 23 June 2006 14:23

quoth the Paul Battley:

> So can I extract an ascii representation of this string? I will admit I
> don't know the first thing about unicode and I may be totally lost
> here...

Here's a reliable way to do it with Iconv:

require 'iconv'
s = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"
ic = Iconv.new("US-ASCII//IGNORE", "UTF-16LE")
p (ic.iconv(s+' '))[0..-2] # => "WM/TrackNumber"

Hi Paul,

This seems to be working quite nicely, after playing around for a bit. A few
of my test files were throwing "Iconv::InvalidCharacter" errors on some
strings, but when I change the "(s+' ')" to "(s)" it works fine. Then, of
course, the strings that originally worked start throwing the error. So, I do
this:

begin
  textString = @ic.iconv(data+' ')[0..-2]
rescue
  textString = @ic.iconv(data)[0..-2]
end

Yessir, I am really just mashing code together until I see the results I am
looking for...

I wonder though, the docs lead me to believe the iconv library is UNIX only.
Is this true? I really need a cross-platform solution, but don't have a win32
box to try on...

Thanks very much,

Paul.

-d

Phillip_Hutchings · 22 June 2006 21:36

>
> could try:
>
> myAscii = s.unpack('U'*s.length).select{|x| x >46}.collect{|x|
> x.chr}.to_s

Hello,

This is fairly close to something I was trying earlier, except more elegant.
This seems to be munging my spaces though, as I take it the space code is
less than 46? Have to look for a chart.

Is unpacking as UTF-8 going to be problematic if the string is actually UTF-16,
as Phillip points out?

I will play some more, thanks guys...
-d

Spaces are 32 if I recall my ASCII. UTF-16 with no non-ASCII
characters is essentially an ASCII string with NULLs every other byte,
it's quite obvious. UTF-8 with only ASCII just looks like ASCII.
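A small sketch of that byte layout, assuming Ruby 1.9 or later for `String#encode` and `String#bytes`; the helper `bytes_in` and the text "A b" are invented for the example.

```ruby
# Return the raw bytes of +text+ in the given encoding.
def bytes_in(text, encoding)
  text.encode(encoding).bytes
end

p " ".ord                       # 32, the ASCII code for a space
p bytes_in("A b", "US-ASCII")   # [65, 32, 98]
p bytes_in("A b", "UTF-8")      # [65, 32, 98] (identical for pure ASCII)
p bytes_in("A b", "UTF-16LE")   # [65, 0, 32, 0, 98, 0]
```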

···

--
Phillip Hutchings
http://www.sitharus.com/

Paul_Battley · 23 June 2006 22:14

This seems to be working quite nicely, after playing around for a bit. A few
of my test files were throwing "Iconv::InvalidCharacter" errors on some
strings, but when I change the "(s+' ')" to "(s)" it works fine. Then, of
course, the strings that originally worked start throwing the error. So, I do
this:

Sorry, I translated that code from somewhere else, but forgot that
UTF-16 needs an even number of bytes. The fact that it worked as
advertised was serendipity rather than good judgement! The trouble
with Iconv's //IGNORE flag is that it doesn't ignore trailing errors;
you can get around this by adding a valid codepoint at the end, and
removing it after conversion. Adding a valid byte <128 gets around
this for UTF-8 input, but only worked for your example as it had an
odd number of input bytes. For UTF-16 (LE or BE) without surrogates,
this will work:

t = ic.iconv(s[0,s.length/2*2])

although a more general solution that should also handle surrogates is this:

t = ic.iconv(s[0,s.length/2*2]+"\000\000")[0..-2]

Finally, your input string has a trailing null; a regexp-based
solution is probably the most reliable way to remove this:

t.sub!(/\x00$/, '')
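For readers on Ruby 2.0 or later, where the iconv library is no longer bundled, roughly the same steps can be sketched with `String#encode`. This swaps in a different mechanism from the Iconv calls above, and the method name `utf16le_to_ascii` is hypothetical. It also uses `\z` instead of `$`, since `$` in Ruby matches at the end of any line rather than only at the end of the string.

```ruby
# Convert UTF-16LE bytes to ASCII, dropping anything ASCII cannot represent.
def utf16le_to_ascii(bytes)
  # UTF-16 needs an even number of bytes, so discard a trailing odd byte.
  even = bytes.byteslice(0, bytes.bytesize / 2 * 2)
  text = even.force_encoding("UTF-16LE")
             .encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
  # Remove one or more trailing NULs; \z anchors at the very end of the string.
  text.sub(/\x00+\z/, "")
end

s = "W\000M\000/\000T\000r\000a\000c\000k\000N\000u\000m\000b\000e\000r\000\000\000\003"
p utf16le_to_ascii(s) == "WM/TrackNumber"   # true
```

The empty `replace:` string plays the role of Iconv's //IGNORE here: characters with no ASCII equivalent are dropped rather than raising an error.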

I wonder though, the docs lead me to believe the iconv library is UNIX only.
Is this true? I really need a cross-platform solution, but don't have a win32
box to try on...

It's definitely possible to use iconv on Windows, but it wasn't in the
one-click installer until 1.8.4, I believe.

Paul.

darren_kirby · 22 June 2006 21:42

quoth the Phillip Hutchings:
<snip>

Spaces are 32 if I recall my ASCII. UTF-16 with no non-ASCII
characters is essentially an ASCII string with NULLs every other byte,
it's quite obvious. UTF-8 with only ASCII just looks like ASCII.

Thanks Phillip,

I changed the 46 to 32 in Chris' code and it seems to be working fine for my
test files here. Will have to do more testing to see if it will be a suitable
permanent solution...
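One thing to watch with the threshold: a comparison written as `x > 32` still excludes 32 itself, so spaces would only be kept with `x >= 32` (or `x > 31`). A hedged sketch of a filter that keeps the printable ASCII range follows. It reads the data as 16-bit little-endian units with `unpack("v*")` rather than the `'U'` directive used earlier, and the method name `printable_ascii` is made up for the example.

```ruby
# Keep only printable ASCII (space through tilde) from UTF-16LE bytes.
def printable_ascii(bytes)
  bytes.unpack("v*")                         # 16-bit little-endian code units
       .select { |u| u.between?(32, 126) }   # 32 is the space itself
       .map(&:chr)
       .join
end

p printable_ascii("T\000r\000 \0001\000\000\000\003")   # "Tr 1"
```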

Thanks again guys,
-d

darren_kirby · 24 June 2006 20:36

quoth the Paul Battley:

Sorry, I translated that code from somewhere else, but forgot that
UTF-16 needs an even number of bytes. The fact that it worked as
advertised was serendipity rather than good judgement! The trouble
with Iconv's //IGNORE flag is that it doesn't ignore trailing errors;
you can get around this by adding a valid codepoint at the end, and
removing it after conversion. Adding a valid byte <128 gets around
this for UTF-8 input, but only worked for your example as it had an
odd number of input bytes. For UTF-16 (LE or BE) without surrogates,
this will work:

t = ic.iconv(s[0,s.length/2*2])

although a more general solution that should also handle surrogates is
this:

t = ic.iconv(s[0,s.length/2*2]+"\000\000")[0..-2]

This ^^^ is working perfectly for all my test files now...thank you.

Finally, your input string has a trailing null; a regexp-based
solution is probably the most reliable way to remove this:

t.sub!(/\x00$/, '')

> I wonder though, the docs lead me to believe the iconv library is UNIX
> only. Is this true? I really need a cross-platform solution, but don't
> have a win32 box to try on...

It's definitely possible to use iconv on Windows, but it wasn't in the
one-click installer until 1.8.4, I believe.

Ok, good. I can live with that.

Thank you very much for the help,

Paul.

-d
