All-
I’m trying to convert some characters from a text file that are
double-byte characters from Czech and Polish into HTML entities. I have
a Ruby script that I’m trying to do this with, but when I’m scanning,
I’m getting several characters, instead of just one. Can somebody help
me out here? I imagine I’m not doing this terribly efficiently anyhow,
so any optimizations would be appreciated. Here’s the text I’m
converting:
Aguarda carregamento da página....
And I’m doing this for the conversion:
str = "Aguarda carregamento da página…"
str.scan(/./mu) { |c|
  v = c.unpack("U.")[0]
  if (v > 255)
    out += "&##{v};"
  else
    out += c
  end
}
puts str
The output is:
Aguarda carregamento da p᧩na…
Which is obviously not correct. When I do a “p c” inside the String#scan
block, it shows the script is grabbing all characters individually (as
it should), except “ági”, which it grabs together. The output from that
is: “\341gi”
I don’t know enough about internationalization to know any meaning to
this, but I need to figure it out ASAP.
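(An aside, as a sketch not part of the original post: \341 is 0xE1, the ISO-8859-1 byte for “á” — a strong hint the input is Latin-1 rather than UTF-8. Treating each byte as a Latin-1 character gives the conversion directly; this uses modern Ruby string methods, not the 1.6/1.7 API of this thread.)

```ruby
# Sketch (modern Ruby): \341 is 0xE1, the ISO-8859-1 byte for "a-acute",
# so the input here is Latin-1 bytes. Emit an entity for every byte
# above 127, and the byte itself otherwise.
latin1 = "Aguarda carregamento da p\xE1gina".b  # raw Latin-1 bytes

out = latin1.bytes.map { |b| b > 127 ? "&##{b};" : b.chr }.join
# => "Aguarda carregamento da p&#225;gina"
```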
Any help would be much obliged.
Cheers,
bs.
Excuse me. That’s not Polish, it’s Portuguese, but I’m having problems
with characters from Portuguese, Spanish, Czech and Polish.
Cheers,
bs.
Ben Schumacher wrote:
···
Hello –
Aguarda carregamento da página…
And I’m doing this for the conversion:
str = "Aguarda carregamento da página…"
str.scan(/./mu) { |c|
  v = c.unpack("U.")[0]
  if (v > 255)
    out += "&##{v};"
  else
    out += c
  end
}
puts str
The output is:
Aguarda carregamento da p᧩na…
Hmmm… I get an error about trying to call + on nil. But with a little
tweaking (like initializing “out”), I get what you’re describing.
I’m not an encoding expert, but here’s perhaps the beginnings of
a way to do it. (I’ve changed 255 to 127, since the á is 225.)
str = "Aguarda carregamento da página…"
out = str.unpack("C*").map { |c|
  c > 127 ? "&##{c};" : c.chr
}.join
puts out
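(A sketch, not part of David’s post: unpack("C*") walks single bytes, which is right for Latin-1 input. If the string really were UTF-8, the same shape works with the "U*" directive, which decodes whole codepoints instead.)

```ruby
# Sketch: same shape as the byte-wise version above, but "U*" decodes
# UTF-8 codepoints, so a multi-byte character becomes one integer.
str = "Aguarda carregamento da p\u00E1gina"  # genuine UTF-8 input
out = str.unpack("U*").map { |c| c > 127 ? "&##{c};" : c.chr }.join
# => "Aguarda carregamento da p&#225;gina"
```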
David
···
On Sat, 12 Oct 2002, Ben Schumacher wrote:
–
David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav
matz-
It should be UTF-8. The actual output comes from a .csv that is saved
from an Excel spreadsheet. Since translations are coming from overseas
from non-technical people (translators), this was the only way to send
the data. When I save the CSV, OpenOffice.org adds a little 3-byte
identifier to the beginning of the file which I believe is what makes
the file behave as if encoded in UTF-8, correct? (This is probably the
wrong list for this question.)
Perhaps it’s the Ruby CSV module I’m using from RAA that is converting
it to ISO-8859-1(5?).
Hmm… sorry about leaving details out of earlier message, wanted to get
it out… had a meeting to run off to.
Any ideas? It appears I am having better luck when running this on my
Windows box, but it still doesn’t seem to be working 100% of the time.
Thanks for your help… I’ll take another shot at making sure the input
is UTF-8. I guess this means I may have to parse the CSV myself, which
I’m not really looking forward to.
Cheers,
bs.
Yukihiro Matsumoto wrote:
···
Hi,
In message “UTF-8 Character Conversion to HTML” on 02/10/12, Ben Schumacher ben@blahr.com writes:
Aguarda carregamento da página…
You have to check the encoding of your input. Is it really in
UTF-8? If it’s not (I suspect it’s in ISO-8859-1), you should convert
it to UTF-8 before applying a UTF-8 regexp.
matz.
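(A sketch of that conversion step — in 2002 the usual tool was the Iconv extension; modern Ruby’s String#encode, shown here, does the same job:)

```ruby
# Sketch: declare the bytes' real encoding, then transcode to UTF-8
# before using any UTF-8-aware regexp on the string.
latin1 = "p\xE1gina".b                       # bytes as read from the file
utf8   = latin1.force_encoding("ISO-8859-1").encode("UTF-8")
# utf8.scan(/./) now yields "á" as one character, not the "\341gi" clump
```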
Hi,
From: Ben Schumacher [mailto:ben@blahr.com]
Sent: Saturday, October 12, 2002 6:02 AM
When I save the CSV, OpenOffice.org adds a little 3-byte identifier to
the beginning of the file which I believe is what makes the file behave
as if encoded in UTF-8, correct?
Is it 0xef, 0xbb, 0xbf? Then, yes it should be UTF-8.
But the OpenOffice.org Calc/1.0.1 Japanese version I use seems
not to write a BOM at the beginning of its UTF-8 encoded CSV
export file.
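(A sketch of checking for, and stripping, that three-byte signature before handing the data to a CSV parser:)

```ruby
# Sketch: the UTF-8 BOM is the byte sequence EF BB BF. Detect it at
# the start of the data and slice it off if present.
BOM  = "\xEF\xBB\xBF".b
data = BOM + "Aguarda carregamento".b        # e.g. bytes read from file

has_bom = data.start_with?(BOM)
data = data.byteslice(3..-1) if has_bom
```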
Can you send me a sample BOMed UTF-8 file which the CSV module
must parse?
Regards,
// NaHi