Converting to UCS-2 or UTF-16 for use by a C extension

Wincent_Colaiuta · 7 June 2007 20:05

I'm working on a C extension that embeds an ANTLR parser, and I need
to convert a Ruby input string into UCS-2 or possibly UTF-16 encoding.

I've got a working implementation but I suspect that it is flawed and
just wanted to ask if this is the right way to do it. The basic idea
is as follows (in pseudo-code):

// 1. unpack to array of UTF8 characters
utf8 = input.unpack("C*");

// 2. repack
packed = utf8.pack("U*");

// 3. convert using Iconv
ucs2 = Iconv.iconv("UCS-2", "UTF-8", packed).first

// 4. freeze
ucs2.freeze

// 5. get pointer, and length (in 16 bit words)
pointer = StringValuePtr(ucs2); // this bit in C
count = ucs2.length / 2;

// 6. hand off to the parser...
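
As a minimal sketch of what steps 1 and 2 do at the byte level (the helper names byte_round_trip and codepoint_round_trip are hypothetical, invented for illustration): unpack("C*") yields raw bytes rather than characters, so repacking them with "U*" treats each byte as its own code point. The second helper uses "U*" in both directions for comparison.

```ruby
# Hypothetical helpers; only String#unpack and Array#pack are real API.

# Steps 1 and 2 as written above: read bytes, write code points.
def byte_round_trip(str)
  str.unpack("C*").pack("U*")
end

# Variant that decodes UTF-8 code points before repacking them.
def codepoint_round_trip(str)
  str.unpack("U*").pack("U*")
end

euro = "\xE2\x82\xAC"   # the three UTF-8 bytes of the euro sign

p euro.unpack("C*")                        # the raw input bytes
p byte_round_trip(euro).unpack("C*")       # six bytes: every input byte re-encoded separately
p codepoint_round_trip(euro).unpack("C*")  # the original three bytes again
```

For input that is already valid UTF-8, the "U*" round trip returns the same bytes and the "C*" variant alters them; neither one converts a string from some other encoding into UTF-8.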

My doubts are basically as follows:

- I'm doing the unpack/repack because I am not sure that my string is
encoded internally as UTF-8... it *seems* to be, because if I type a
string like "€" in irb then I can see that it's composed of three
bytes in UTF-8 ("\342\202\254")

- Is it in UTF-8 only because my system's locale is set that way?
Might it be different on other people's machines? (And if so, how
would I find out what the encoding is?)

- In the case that the encoding is *not* UTF-8, does my "round-trip"
unpack/pack trick actually get it into UTF-8? (I don't think it will!
In which case the round-trip is a waste of time.)

- And once I've got the String in UCS-2, does StringValuePtr give me
access to the raw UCS-2 encoded data like I think it does? (seems to)

- Does calling length on the UCS-2 encoded string always give the
result in bytes? (I am almost certain that it does)

- Is there some more elegant way to get an arbitrary Ruby string into
UCS-2 so that it can be handed off to the C parser?
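
One possible shape for the conversion and the unit count, sketched under the assumption of Ruby 1.9 or later, where String#encode and String#bytesize are available (on 1.8, Iconv plays the role of encode, and length counts bytes). to_utf16_units is a hypothetical helper name, and UTF-16LE is chosen here only so that no byte order mark is emitted; the byte order the parser expects is not stated above.

```ruby
# Assumes Ruby 1.9+; to_utf16_units is a hypothetical name, not an existing API.
# Returns the frozen UTF-16LE string and its length in 16-bit units.
def to_utf16_units(str)
  utf16 = str.encode("UTF-16LE", "UTF-8").freeze  # source encoding stated explicitly
  [utf16, utf16.bytesize / 2]                     # bytes halved gives 16-bit units
end

euro = "\xE2\x82\xAC"        # U+20AC, inside the Basic Multilingual Plane
clef = [0x1D11E].pack("U")   # U+1D11E, outside it

_, euro_units = to_utf16_units(euro)
_, clef_units = to_utf16_units(clef)

puts euro_units   # 1
puts clef_units   # 2 (a surrogate pair, which UCS-2 cannot represent)
```

The first element is the string whose bytes would be read with StringValuePtr on the C side. On 1.9 and later, length counts characters rather than bytes, so bytesize is the call that matches the byte count.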

Cheers,
Wincent
