A Code Point's Tale: There and Back Again

Terry_Michaels · 30 April 2011 04:12

This is probably obvious in the docs and I'm just missing it, but here
goes: So, I see there is str.each_codepoint, which I want to use in a
function to convert Unicode Strings to a list of Unicode code points.
But what can I do if I have a list of Unicode code points and want to
convert them back into a String?

--
Posted via http://www.ruby-forum.com/.

xcr_xcr · 30 April 2011 04:34

I hope this is what you are looking for:
http://ruby-unicode.rubyforge.org/doc/

Markus_Fischer · 30 April 2011 08:59

Hi,

On 30.04.2011 06:12, Terry Michaels wrote:

This is probably obvious in the docs and I'm just missing it, but here
goes: So, I see there is str.each_codepoint, which I want to use in a
function to convert Unicode Strings to a list of Unicode code points.
But what can I do if I have a list of Unicode code points and want to
convert them back into a String?

I think you can use Array#pack for that:

$ irb
ruby-1.9.2-p180 :001 > "f뀀oöbß".each_codepoint.to_a
=> [102, 45056, 111, 246, 98, 223]
ruby-1.9.2-p180 :002 > "f뀀oöbß".each_codepoint.to_a.pack("U*")
=> "f뀀oöbß"

cheers
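
For reference, the Array#pack approach above can be wrapped into a round trip. This is only a minimal sketch: the helper names to_codepoints and from_codepoints are made up for the example, and it assumes the input string is UTF-8. String#unpack with "U*" is used as the inverse of pack; for UTF-8 input it should give the same list as each_codepoint.to_a.

```ruby
# Hypothetical helper names, chosen for this example only.

# String -> list of code points (assumes str is UTF-8 encoded).
def to_codepoints(str)
  str.unpack("U*")
end

# List of code points -> UTF-8 string. The "U" directive packs each
# integer as one UTF-8 encoded character.
def from_codepoints(codes)
  codes.pack("U*")
end

original = "f\u00F6b\u00DF"          # "f", o-umlaut, "b", sharp s
codes    = to_codepoints(original)
restored = from_codepoints(codes)

p codes                              # => [102, 246, 98, 223]
p restored == original               # => true
```

Because the template starts with "U", the packed string comes back tagged as UTF-8, so no force_encoding call is needed afterwards.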

7stud · 1 May 2011 02:24

Terry Michaels wrote in post #995906:

This is probably obvious in the docs and I'm just missing it, but here
goes: So, I see there is str.each_codepoint, which I want to use in a
function to convert Unicode Strings to a list of Unicode code points.
But what can I do if I have a list of Unicode code points and want to
convert them back into a String?

# encoding: UTF-8
# That magic comment tells Ruby to treat string literals in this source
# file, like the one below, as UTF-8 encoded.

str = "\xE2\x82\xAC\xE2\x82\xAC"

codes = str.each_codepoint.to_a

p codes
puts codes.map {|code| code.chr(Encoding::UTF_8) }.join(" ")

--output:--
[8364, 8364]
€ €

(You should see two euro symbols as the last line of output.)
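
The example above joins with a space for display. To get the original string back exactly, join with no separator. A small sketch, assuming the list holds valid Unicode code points; the second variant relies on String#<< treating an Integer argument as a code point in the receiver's encoding.

```ruby
codes = [8364, 8364]   # U+20AC EURO SIGN, twice

# Integer#chr with an explicit encoding turns one code point into a
# one-character string; join with no separator glues them back together.
joined = codes.map { |code| code.chr(Encoding::UTF_8) }.join

# Alternative: append each Integer to an empty UTF-8 string.
# String#<< interprets an Integer as a code point.
appended = codes.inject(String.new.force_encoding(Encoding::UTF_8)) do |acc, code|
  acc << code
end

p joined == "\u20AC\u20AC"   # => true
p appended == joined         # => true
```

Passing the encoding to chr matters here: without an argument, chr may raise a RangeError for code points above 255, depending on how the default internal encoding is set.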

I don't know where you are getting your string, but if it arrives
tagged with some other encoding, you can always relabel it like this:

str = "\xE2\x82\xAC\xE2\x82\xAC"
str.force_encoding("UTF-8")

codes = str.each_codepoint.to_a

p codes
puts codes.map {|code| code.chr(Encoding::UTF_8) }.join(" ")

--output:--
[8364, 8364]
€ €

(You should see two euro symbols as the last line of output.)
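
To make it clearer what force_encoding does: it only changes the encoding tag on the string, not the bytes. A rough sketch, assuming the bytes really are valid UTF-8 (the variable names are just for illustration). The binary string is built with pack("C*") so the result does not depend on the source file's encoding.

```ruby
# Three raw bytes (the UTF-8 encoding of the euro sign), tagged as binary.
raw = [0xE2, 0x82, 0xAC].pack("C*")

# On a binary string each byte counts as its own "code point".
p raw.each_codepoint.to_a             # => [226, 130, 172]

# Relabel a copy as UTF-8. The bytes stay the same, only the tag changes.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)

p utf8.bytes.to_a == raw.bytes.to_a   # => true
p utf8.valid_encoding?                # => true
p utf8.each_codepoint.to_a            # => [8364]
```

If the bytes were not valid UTF-8, valid_encoding? would return false, so checking it before calling each_codepoint is a cheap safeguard.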

7stud · 1 May 2011 03:22

Maybe each_char() will work for you? Take a look at the following code.

str = "\xE2\x82\xAC\xE2\x82\xAC"
puts str.encoding

str.force_encoding("UTF-8")
puts str.encoding

chars = str.each_char.to_a
p chars

puts chars[0].encoding

puts chars.join

--output:--
ASCII-8BIT
UTF-8
["\u20AC", "\u20AC"]
UTF-8
€€

(You should see two euro symbols as the last line of output.)

The output suggests that a string literal containing \u Unicode escapes
is given a UTF-8 encoding by default. And that seems to be the case:

str = "\u20AC\u20AC"
puts str.encoding

--output:--
UTF-8
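
Tying each_char back to the original question: each character's code point is available through String#ord, so the two views of the string can be checked against each other. A short sketch, assuming a UTF-8 string built from \u escapes as above.

```ruby
str = "\u20AC\u20AC"

chars = str.each_char.to_a        # one-character strings
codes = str.each_codepoint.to_a   # integers

p str.encoding == Encoding::UTF_8 # => true
p chars.map(&:ord) == codes       # => true
p chars.join == str               # => true
```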

7stud · 1 May 2011 02:41

7stud -- wrote in post #996022:

Terry Michaels wrote in post #995906:

This is probably obvious in the docs and I'm just missing it,

You will never learn Ruby's Unicode handling by reading the docs. Head
over to James Edward Gray II's website (Gray Soft) for some lessons.

Someone else blogged in great detail about all the intricacies of ruby
unicode and its problems, but I can't find the link now.
