Converting UTF-8 to entities like &#x525B;

Jian_Lin1 · 9 May 2009 12:04

I was trying to convert UTF-8 content into a series of entities like
&#x525B; so that whatever the page encoding is, the characters would
show...

so I used something like this:
<%
begin
t = ''
s = Iconv.conv("UTF-32", "UTF-8", some_utf8_string)

  s.scan(/(.)(.)(.)(.)/) do |b1, b2, b3, b4|
    t += ("&#x" + "%02X" % b3.ord) + ("%02X" % b4.ord) + ";"
  end
rescue => details
  t = "exception " + details.message
end
%>

<%= t %>

but some characters get converted, and some don't. Is it true that
(.)(.)(.)(.) will not necessarily match 4 bytes at a time?

At first, I was going to use

s = Iconv.conv("UTF-16", "UTF-8", some_utf8_string)

but then I found that UTF-16 is also variable length... so I used UTF-32
instead, which is fixed length. The UTF-8 string I have is just the
Basic Multilingual Plane... so it should all be in the 0x0000 to 0xFFFF
range in Unicode.
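For anyone wanting to check the fixed-width reasoning above, here is a minimal sketch. It assumes Ruby 1.9 or later and uses String#encode in place of Iconv, and the helper name `widths` is made up for illustration. It reports how many bytes one character inside the Basic Multilingual Plane and one character outside it take in each encoding.

```ruby
# Byte widths of a single character in three encodings.
# Assumption: Ruby 1.9+ (String#encode); `widths` is a hypothetical helper.
def widths(str)
  %w[UTF-8 UTF-16BE UTF-32BE].map { |enc| str.encode(enc).bytesize }
end

puts widths("\u525B").inspect     # U+525B, inside the BMP:   [3, 2, 4]
puts widths("\u{1F600}").inspect  # U+1F600, outside the BMP: [4, 4, 4]
```

So UTF-16 only becomes variable length for characters outside the BMP, while UTF-32 stays at 4 bytes per code point in both cases.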

--
Posted via http://www.ruby-forum.com/.

Forum · 9 May 2009 12:13

Sorry for a rather superficial answer, but can you use the unicode
switch for regexen in your Ruby version? That seems to be the problem.

Robert

--
Si tu veux construire un bateau ...
Ne rassemble pas des hommes pour aller chercher du bois, préparer des
outils, répartir les tâches, alléger le travail… mais enseigne aux
gens la nostalgie de l’infini de la mer.

If you want to build a ship, don’t herd people together to collect
wood and don’t assign them tasks and work, but rather teach them to
long for the endless immensity of the sea.

--
Antoine de Saint-Exupéry

Wisccal_Wisccal · 13 May 2009 06:55

Jian Lin wrote:

I was trying to convert UTF-8 content into a series of entities like
&#x525B; so that whatever the page encoding is, the characters would
show...

If you are on 1.9, you could use String#codepoints. Something similar
to:

'威斯加的中文很不好'.codepoints.to_a.map {|e| "&#x#{e.to_s(16)};"}

=> ["&#x5a01;", "&#x65af;", "&#x52a0;", "&#x7684;", "&#x4e2d;",
"&#x6587;", "&#x5f88;", "&#x4e0d;", "&#x597d;"]

HTH
威斯加
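
For Rubies without String#codepoints, a similar result should be possible with String#unpack and the "U" directive, which decodes UTF-8 into code points. This is only a sketch: the helper name `to_hex_entities` is hypothetical, and leaving ASCII characters unescaped is a choice made here, not something from the thread.

```ruby
# Hypothetical helper (not from the thread): turn every non-ASCII
# character of a UTF-8 string into a hexadecimal character reference.
def to_hex_entities(utf8_string)
  utf8_string.unpack("U*").map { |cp|
    # Plain ASCII is passed through unchanged; everything else is escaped.
    cp < 128 ? cp.chr : format("&#x%X;", cp)
  }.join
end

puts to_hex_entities("abc \u525B")   # prints: abc &#x525B;
```

Note that this does not escape `&` or `<`, so the usual HTML escaping is still needed on top of it.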


Jian_Lin1 · 9 May 2009 12:28

Robert Dober wrote:

sorry for a quite superficial answer, but can you use the unicode
switch for regexen in your Ruby Version. This seems to be the problem.

Robert

it really might be the 0 that is choking the regular expression match...
if i use

s.scan(/(.)(.)(.)(.)/s)

then it works better but still not all characters are converted...

by the way, I have a solution using byte processing ... in the next post


Jian_Lin1 · 9 May 2009 12:40

Robert Dober wrote:

sorry for a quite superficial answer, but can you use the unicode
switch for regexen in your Ruby Version. This seems to be the problem.

Robert

by the way... Robert... what is the regexen? is it the regular
expression modifier? I'd like it to match absolutely anything
(newline, 0, etc)... but seems like there is no match

Jian_Lin1 · 13 May 2009 09:51

Wisccal Wisccal wrote:

If you are on 1.9, you could use String#codepoints. Something similar
to:

'威斯加的中文很不好'.codepoints.to_a.map {|e| "&#x#{e.to_s(16)};"}

=> ["&#x5a01;", "&#x65af;", "&#x52a0;", "&#x7684;", "&#x4e2d;",
"&#x6587;", "&#x5f88;", "&#x4e0d;", "&#x597d;"]

HTH
威斯加

that's really cool... Wisccal, how do you know Chinese?


Jian_Lin1 · 9 May 2009 12:29

this works:

but i am sure there are more elegant solutions.

<%
begin
t = ''
s = Iconv.conv("UTF-32", "UTF-8", some_utf8_string)

  (s.length / 4).times do |i|
    b3 = s[i*4 + 2]
    b4 = s[i*4 + 3]
    t += ("&#x" + "%02X" % b3) + ("%02X" % b4) + ";"
  end
rescue => details
  t = "exception " + details.message
end
%>

<%= t %>
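
The loop above keeps only the low two bytes of each group, so it is limited to the Basic Multilingual Plane, and it relies on whatever byte order the converter picks for plain "UTF-32". A possible alternative, sketched here assuming Ruby 1.9 or later, is to ask for "UTF-32BE" explicitly and let String#unpack read each 4-byte group as one integer. String#encode stands in for Iconv.conv, and the method name `entities_via_utf32` is made up for illustration.

```ruby
# Sketch assuming Ruby 1.9+; String#encode stands in for Iconv.conv.
# "UTF-32BE" fixes the byte order and emits no byte-order mark.
def entities_via_utf32(utf8_string)
  utf8_string.encode("UTF-32BE")
             .unpack("N*")                        # one 32-bit big-endian integer per code point
             .map { |cp| format("&#x%04X;", cp) } # at least four hex digits, more if needed
             .join
end

puts entities_via_utf32("\u525B\u4E0A")   # prints: &#x525B;&#x4E0A;
```

Because the whole 32-bit value is used, a character outside the BMP such as U+1F600 would come out as `&#x1F600;` rather than being truncated.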


Rick_DeNatale1 · 9 May 2009 13:30

I'm pretty sure that Robert used regexen as the geeky way of pluralizing regex.

The unicode switch (the u regular expression option) forces the string
being matched to be interpreted as Unicode (UTF-8); otherwise the regex
uses the encoding of the source file containing the regular expression.

e.g. /./u

If you want . to match newlines you want the m (multi-line) option.
Normally . will match anything BUT a newline; m changes this.

irb(main):001:0> "a\nb".match(/a.b/)
=> nil
irb(main):002:0> "a\nb".match(/a.b/m)
=> #<MatchData:0x6a248>
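
The newline rule may be one reason why only some characters were converted in the original scan: UTF-32 data can contain the byte 0x0A inside a code point. The sketch below is an illustration under assumptions, not a diagnosis of the original Ruby 1.8 setup. It assumes Ruby 2.0 or later (for String#b), uses String#encode instead of Iconv, and the helper name `four_byte_groups` is hypothetical.

```ruby
# Count how many 4-byte groups a regex finds in raw bytes.
# Assumption: Ruby 2.0+ (String#b returns a binary copy of the string).
def four_byte_groups(bytes, regex)
  bytes.scan(regex).length
end

# U+4E2D encodes as 00 00 4E 2D; U+4E0A encodes as 00 00 4E 0A,
# and 0x0A is the newline byte.
utf32 = "\u4E2D\u4E0A".encode("UTF-32BE").b

puts four_byte_groups(utf32, /..../n)    # 1: the group ending in 0x0A is skipped
puts four_byte_groups(utf32, /..../mn)   # 2: with m, dot also matches 0x0A
```

The n option keeps the regex byte-oriented, so each dot consumes exactly one byte of the binary string.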

--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Jian_Lin1 · 9 May 2009 13:42

Rick Denatale wrote:

On Sat, May 9, 2009 at 8:40 AM, Jian Lin <winterheat@gmail.com> wrote:

(newline, 0, etc)... but seems like there is no match

I'm pretty sure that Robert used regexen as the geeky way of pluralizing
regex.

The unicode switch (a u regular expression option) forces the use of
unicode to interpret the string being matched, otherwise it uses
whatever the encoding of the source file containing the regular
expression.

e.g. /./u

If you want . to match newlines you want the m (multi-line) option.
Normally . will match anything BUT a new line, m changes this.

irb(main):001:0> "a\nb".match(/a.b/)
=> nil
irb(main):002:0> "a\nb".match(/a.b/m)
=> #<MatchData:0x6a248>

aha... here i just want to match 4 bytes at a time, no matter what the
bytes are. Using "m" won't do it... the "u" would be helpful if i
match one UTF-8 character at a time and then process it... right now i
actually convert it all at once to UTF-32 and then process it... so I
wonder if there is a way to match 4 bytes at a time.
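
The other route mentioned above, matching one UTF-8 character at a time with the u option and skipping the UTF-32 step, could look roughly like this. It is a sketch assuming Ruby 1.9 or later, and the method name `entities_per_character` is made up for illustration.

```ruby
# Sketch assuming Ruby 1.9+: /./mu matches one whole UTF-8 character
# (including a newline, because of m), whatever its length in bytes.
def entities_per_character(utf8_string)
  utf8_string.scan(/./mu).map { |ch|
    # unpack("U") decodes the single character into its code point.
    format("&#x%X;", ch.unpack("U").first)
  }.join
end

puts entities_per_character("\u9ABC\n")   # prints: &#x9ABC;&#xA;
```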


Forum · 10 May 2009 09:07

I'm pretty sure that Robert used regexen as the geeky way of pluralizing regex.

I plead guilty your honor

7stud · 9 May 2009 20:10

Jian Lin wrote:

Rick Denatale wrote:

(newline, 0, etc)... but seems like there is no match

I'm pretty sure that Robert used regexen as the geeky way of pluralizing
regex.

The unicode switch (a u regular expression option) forces the use of
unicode to interpret the string being matched, otherwise it uses
whatever the encoding of the source file containing the regular
expression.

e.g. /./u

If you want . to match newlines you want the m (multi-line) option.
Normally . will match anything BUT a new line, m changes this.

irb(main):001:0> "a\nb".match(/a.b/)
=> nil
irb(main):002:0> "a\nb".match(/a.b/m)
=> #<MatchData:0x6a248>

aha... here i just want to match 4 bytes at a time, no matter what the
bytes are. Using "m" won't do it... the "u" would be helpful if i
match one UTF-8 character at a time and then process it... right now i
actually convert it all at once to UTF-32 and then process it... so I
wonder if there is a way to match 4 bytes at a time.

So what's the problem? A dot matches any byte (with the 'm' switch).
Make a regex with four dots:

/..../

or

/.{4}/


7stud · 9 May 2009 20:15

7stud -- wrote:
Whoops. With the 'm' switch:

/..../m

or

/.{4}/m


Jian_Lin1 · 9 May 2009 20:36

7stud -- wrote:

7stud -- wrote:
Whoops. With the 'm' switch:

/..../m

or

/.{4}/m

the problem is that some characters are converted to the correct
&#x9ABC; etc, but some characters are not... you can try if you want...
just go to Google News and get a China, Taiwan, or HK news headline.


7stud · 9 May 2009 23:29

Jian Lin wrote:

7stud -- wrote:

7stud -- wrote:
Whoops. With the 'm' switch:

/..../m

or

/.{4}/m

the problem is that some characters are converted to the correct
&#x9ABC; etc, but some characters are not... you can try if you want...
just go to Google News and get a China, Taiwan, or HK news headline.

Then why do you insist that you are trying to match any 4 bytes?


Jian_Lin1 · 10 May 2009 02:29

7stud -- wrote:

Jian Lin wrote:

7stud -- wrote:

7stud -- wrote:
Whoops. With the 'm' switch:

/..../m

or

/.{4}/m

the problem is that some characters are converted to the correct
&#x9ABC; etc, but some characters are not... you can try if you want...
just go to Google News and get a China, Taiwan, or HK news headline.

Then why do you insist that you are trying to match any 4 bytes?

no... the program converts the UTF-8 string into UTF-32, so that each
character (code point) is 4 bytes long. Then the program processes
the result 4 bytes at a time, which is why it scans 4 bytes at a
time.

