How to convert the charset of texts in an Excel file that has multi-language text and charsets?

Hello all,

I want to use Ruby to read an Excel file's content and convert it to
UTF-8.
However, the file contains text in many different languages, such as
Greek, Japanese, Korean, Russian and so on.
So I use Iconv to convert it to UTF-8.
I searched the internet, and some articles said the default charset of Excel is
UTF-16LE.
So I use the code below:

Iconv.conv("UTF-8","UTF-16",$excel.Cells(row,col).value.to_s)

And the contents of the Excel file are (each line is a cell):

----------------------------------------
(Please wait)
(Veuillez attendre)
(Bitte warten)
(Espere un momento)
(Attendere, prego)
(Even geduld aub)
(Ждите)
(Aguarde)
(잠시만 기다려주십시오)

------------------------------------------
After I run it, I get an error:
in `conv': ")" (Iconv::InvalidCharacter)

It seems that in UTF-16 the ( is not '('???
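
One possible reading of this error, offered as an assumption and not confirmed anywhere in this thread, is that the cell value does not reach Ruby as UTF-16 at all. If single-byte text is treated as UTF-16, every two bytes are paired into one character, and an odd-length string leaves its final ")" without a partner. The sketch below reproduces that situation. It assumes Ruby 1.9 or later, where String#encode takes the place of Iconv, and the helper name utf16le_to_utf8 is made up for illustration.

```ruby
# Assumption: Ruby 1.9 or later (String#encode instead of Iconv).
# Returns the UTF-8 text, or a short description of the failure.
def utf16le_to_utf8(bytes)
  bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
rescue Encoding::InvalidByteSequenceError => e
  "failed: #{e.message}"
end

# Genuine UTF-16LE data converts cleanly.
real_utf16 = "(Ждите)".encode("UTF-16LE")
puts utf16le_to_utf8(real_utf16)

# Single-byte text wrongly labelled as UTF-16LE: 13 bytes, so the
# trailing ")" has no second byte and the conversion stops there.
puts utf16le_to_utf8("(Please wait)")
```

If the second call fails on the last byte, the data was probably never UTF-16 in the first place.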

Then I changed 'UTF-16' to 'GB2312' (the default charset of my
system), but it cannot convert the Korean characters correctly. All the Korean
characters became ???
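
A Chinese code page has no room for Hangul, so any text that passes through it loses the Korean characters for good, while Cyrillic survives. The sketch below illustrates this under two assumptions: Ruby 1.9 or later, and GBK standing in for the Windows simplified Chinese code page. The helper name to_code_page is hypothetical.

```ruby
# Assumption: Ruby 1.9 or later; GBK stands in for the system code page.
def to_code_page(text, code_page)
  # Characters the target code page cannot represent become "?".
  text.encode(code_page, undef: :replace, replace: "?")
end

korean  = "(잠시만 기다려주십시오)"
russian = "(Ждите)"

as_gbk = to_code_page(korean, "GBK")
puts as_gbk.bytes.include?(63)          # 63 is the byte value of "?"

# Converting back cannot restore what was replaced.
puts as_gbk.encode("UTF-8") == korean

# Cyrillic is part of GBK, so the Russian cell round-trips.
puts to_code_page(russian, "GBK").encode("UTF-8") == russian
```

Once a "?" has been written, no later conversion can recover the original character, so the loss has to be avoided at the point where the text is first read.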

I use Ruby 1.8.6 on WinXP Sp3.

How can I resolve this?

Many thanks,

Nan

-------- Original Message --------

Date: Mon, 15 Sep 2008 17:49:15 +0900
From: "Wu Nan" <i.wunan+rubymail@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: How to convert the charset of texts in an Excel file that has multi-language text and charsets?

Dear Nan,

After some searching, I found that there is a special encoding for Korean characters, EUC-KR.
I managed to convert your Korean text from UTF-8 to EUC-KR, write it to a file and display it correctly in Firefox, once
the right encoding is set in the Preferences (EUC-KR in this case, but I can also display Korean text in UTF-8).

So I think you'll be successful if you make sure you convert from EUC-KR to UTF-8 for the Korean, and from the appropriate source encoding to UTF-8 for everything else.
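
As a rough sketch of that conversion, assuming Ruby 1.9 or later (String#encode in place of Iconv) and a hypothetical helper named legacy_to_utf8, the bytes are first labelled with their real source encoding and then transcoded:

```ruby
# Assumption: Ruby 1.9 or later. The helper name is made up for illustration.
def legacy_to_utf8(bytes, source_encoding)
  # Label the raw bytes with the encoding they are really in,
  # then transcode them to UTF-8.
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Build some EUC-KR sample data from a UTF-8 literal.
euc_kr_bytes = "잠시만 기다려주십시오".encode("EUC-KR")
puts euc_kr_bytes.bytesize

# Convert it back to UTF-8.
puts legacy_to_utf8(euc_kr_bytes, "EUC-KR")
```

This only works when the bytes really are EUC-KR. It cannot help if the characters have already been replaced by "?" before the conversion runs.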

Best regards,

Axel


Hello Axel,

Many thanks for your answer,

I just tested it again: I dumped the original text and displayed it as integers.
I found that all the Korean characters became '???' as soon as they were read
out of Excel.
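
A byte dump of this kind can be sketched as follows. The helper name byte_dump is hypothetical, and the Korean sample assumes a UTF-8 source file on Ruby 1.9 or later.

```ruby
# Show every byte of a string as a decimal integer.
def byte_dump(text)
  text.unpack("C*").join(" ")
end

# Text that has already been lost: each "?" is the single byte 63.
puts byte_dump("???")   # 63 63 63

# An intact Korean syllable in UTF-8 occupies three bytes.
puts byte_dump("잠")    # 236 158 160
```

If the dump shows nothing but 63 where Korean text should be, the characters were replaced before Ruby received the string, and no Iconv call afterwards can bring them back.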

I attached the test code and a test Excel file. The Excel file contains
only one text.

Do you have any idea what the reason is?

test.rb (567 Bytes)

test.xls (13.5 KB)


-------- Original Message --------

Date: Mon, 15 Sep 2008 18:50:03 +0900
From: "Wu Nan" <i.wunan+rubymail@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: How to convert the charset of texts in an Excel file that has multi-language text and charsets?

Dear Nan,

Right now I am not on Windows, so in order to check whether the problem is with Windows or with
Ruby, I'd suggest you try the following (which works on Ubuntu with your data).

Install the parseexcel gem (http://raa.ruby-lang.org/project/parseexcel/).

Check what the following script gives on your test.xls file (shamelessly adapted from the website mentioned
above).

require "rubygems"
require "parseexcel"
require "iconv"

# Your first step is always reading in the file.
# That gives you a workbook object, which has one or more worksheets,
# just like in Excel you have the possibility of multiple worksheets.
workbook = Spreadsheet::ParseExcel.parse("/home/axel/Desktop/test.xls")

# Usually, you want the first worksheet:
worksheet = workbook.worksheet(0)
p worksheet

# Now you can iterate over all rows, skipping the first number of
# rows (in case you know they just contain column headers).
skip = 0
worksheet.each(skip) { |row|
  # A row is actually just an Array of Cells.
  first_cell = row.at(0)
  p 'first'
  p first_cell

  # How you get data out of the cell depends on what datatype you
  # expect. If you expect a String, you can pass an encoding and
  # (iconv required) the content of the cell will be converted.
  str = first_cell.to_s('EUC-KR')
  p str

  # Note: mode "w" overwrites the file on every row, which is fine
  # for a test sheet that contains a single text.
  File.open("textexcel.html", "w") { |f| f.puts str }
}

I could open the file textexcel.html and the Korean characters displayed correctly (now in EUC-KR, but you can
convert these to UTF-8, at least on Ubuntu).
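
Converting such an EUC-KR file to UTF-8 could look roughly like the sketch below. It assumes Ruby 1.9 or later, where File.open accepts an external and an internal encoding in the mode string. The helper name transcode_file and the output file name are hypothetical, and the sample input is generated in a temporary directory so the sketch runs on its own.

```ruby
require "tmpdir"

# Assumption: Ruby 1.9 or later. Reads src in the given encoding and
# writes the same text to dst as UTF-8.
def transcode_file(src, dst, from)
  # Mode "r:EUC-KR:UTF-8" means: bytes on disk are EUC-KR,
  # strings handed to the program are UTF-8.
  text = File.open(src, "r:#{from}:UTF-8") { |f| f.read }
  File.open(dst, "w:UTF-8") { |f| f.write(text) }
end

Dir.mktmpdir do |dir|
  src = File.join(dir, "textexcel.html")
  dst = File.join(dir, "textexcel_utf8.html")

  # Stand-in for the file written by the script above.
  File.open(src, "wb") { |f| f.write("잠시만 기다려주십시오\n".encode("EUC-KR")) }

  transcode_file(src, dst, "EUC-KR")
  puts File.read(dst, encoding: "UTF-8")
end
```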

Best regards,

Axel


Hello Axel,

Thank you very much.

I'll try it.

BR
Nan
