All-
I’m trying to convert some characters from a text file that are
double-byte characters from Czech and Polish into HTML entities. I have
a Ruby script that I’m trying to do this with, but when I’m scanning,
I’m getting several characters, instead of just one. Can somebody help
me out here? I imagine I’m not doing this terribly efficiently anyhow,
so any optimizations would be appreciated. Here’s the text I’m
converting:
Aguarda carregamento da página....
And I’m doing this for the conversion:
str = "Aguarda carregamento da página…"
str.scan(/./mu) { |c|
  v = c.unpack("U.")[0]
  if (v > 255)
    out += "&##{v};"
  else
    out += c
  end
}
puts str
The output is:
Aguarda carregamento da p᧩na…
Which is obviously not correct. When I do a “p c” inside the String#scan
block, it shows the script is grabbing all characters individually (as
it should), except “ági”, which it grabs together. The output from that
is: “\341gi”
I don’t know enough about internationalization to know any meaning to
this, but I need to figure it out ASAP.
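(An aside, as a sketch not part of the original post: \341 is 0xE1, the ISO-8859-1 byte for “á” — a strong hint the input is Latin-1 rather than UTF-8. Treating each byte as a Latin-1 character gives the conversion directly; this uses modern Ruby string methods, not the 1.6/1.7 API of this thread.)

```ruby
# Sketch (modern Ruby): \341 is 0xE1, the ISO-8859-1 byte for "a-acute",
# so the input here is Latin-1 bytes. Emit an entity for every byte
# above 127, and the byte itself otherwise.
latin1 = "Aguarda carregamento da p\xE1gina".b  # raw Latin-1 bytes

out = latin1.bytes.map { |b| b > 127 ? "&##{b};" : b.chr }.join
# => "Aguarda carregamento da p&#225;gina"
```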
Any help would be much obliged.
Cheers,
bs.
Excuse me. That’s not Polish, it’s Portuguese, but I’m having problems
with characters from Portuguese, Spanish, Czech and Polish.
Cheers,
bs.
Ben Schumacher wrote:
···
Hello –
Aguarda carregamento da página…
And I’m doing this for the conversion:
str = "Aguarda carregamento da página…"
str.scan(/./mu) { |c|
  v = c.unpack("U.")[0]
  if (v > 255)
    out += "&##{v};"
  else
    out += c
  end
}
puts str
The output is:
Aguarda carregamento da p᧩na…
Hmmm… I get an error about trying to call + on nil. But with a little
tweaking (like initializing “out”), I get what you’re describing.
I’m not an encoding expert, but here’s perhaps the beginnings of
a way to do it. (I’ve changed 255 to 127, since the á is 225.)
str = "Aguarda carregamento da página…"
out = str.unpack("C*").map { |c|
  c > 127 ? "&##{c};" : c.chr
}.join
puts out
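(A sketch, not part of David’s post: unpack("C*") walks single bytes, which is right for Latin-1 input. If the string really were UTF-8, the same shape works with the "U*" directive, which decodes whole codepoints instead.)

```ruby
# Sketch: same shape as the byte-wise version above, but "U*" decodes
# UTF-8 codepoints, so a multi-byte character becomes one integer.
str = "Aguarda carregamento da p\u00E1gina"  # genuine UTF-8 input
out = str.unpack("U*").map { |c| c > 127 ? "&##{c};" : c.chr }.join
# => "Aguarda carregamento da p&#225;gina"
```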
David
···
On Sat, 12 Oct 2002, Ben Schumacher wrote:
–
David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav
matz-
It should be UTF-8. The actual output comes from a .csv that is saved
from an Excel spreadsheet. Since translations are coming from overseas
from non-technical people (translators), this was the only way to send
the data. When I save the CSV, OpenOffice.org adds a little 3-byte
identifier to the beginning of the file which I believe is what makes
the file behave as if encoded in UTF-8, correct? (This is probably the
wrong list for this question.)
Perhaps it’s the Ruby CSV module I’m using from RAA that is converting
it to ISO-8859-1(5?).
Hmm… sorry about leaving details out of earlier message, wanted to get
it out… had a meeting to run off to.
Any ideas? It appears I am having better luck when running this on my
Windows box, but it still doesn’t seem to be working 100% of the time.
Thanks for your help… I’ll take another shot at making sure the input
is UTF-8. I guess this means I may have to parse the CSV myself, which
I’m not really looking forward to.
Cheers,
bs.
Yukihiro Matsumoto wrote:
···
Hi,
In message “UTF-8 Character Conversion to HTML” on 02/10/12, Ben Schumacher ben@blahr.com writes:
Aguarda carregamento da página…
You have to check the encoding of your input. Is it really in
UTF-8? If it’s not (I suspect it’s in ISO-8859-1), you should convert
it to UTF-8 before applying a UTF-8 regexp.
matz.
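(A sketch of that conversion step — in 2002 the usual tool was the Iconv extension; modern Ruby’s String#encode, shown here, does the same job:)

```ruby
# Sketch: declare the bytes' real encoding, then transcode to UTF-8
# before using any UTF-8-aware regexp on the string.
latin1 = "p\xE1gina".b                       # bytes as read from the file
utf8   = latin1.force_encoding("ISO-8859-1").encode("UTF-8")
# utf8.scan(/./) now yields "á" as one character, not the "\341gi" clump
```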
Hi,
From: Ben Schumacher [mailto:ben@blahr.com]
Sent: Saturday, October 12, 2002 6:02 AM
When I save the CSV, OpenOffice.org adds a little 3-byte identifier to
the beginning of the file which I believe is what makes the file behave
as if encoded in UTF-8, correct?
Is it 0xef, 0xbb, 0xbf? Then, yes it should be UTF-8.
But the OpenOffice.org Calc/1.0.1 Japanese version I use seems
not to write a BOM at the beginning of its UTF-8 encoded CSV
export file.
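(A sketch of checking for, and stripping, that three-byte signature before handing the data to a CSV parser:)

```ruby
# Sketch: the UTF-8 BOM is the byte sequence EF BB BF. Detect it at
# the start of the data and slice it off if present.
BOM  = "\xEF\xBB\xBF".b
data = BOM + "Aguarda carregamento".b        # e.g. bytes read from file

has_bom = data.start_with?(BOM)
data = data.byteslice(3..-1) if has_bom
```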
Can you send me a sample BOMed UTF-8 file which the CSV module
must parse?
Regards,
// NaHi