Ruby1.9 Encoding

Hey, guys!

I've just started learning Ruby from Python and recently posted a
question here that was promptly and effectively answered (thanks,
Glenn!), so I decided to come here once more. I hope I'll be able to
answer some of the questions on my own soon. ;-)

Here it is.

I'm writing a wrapper for a Korean Morphological Parser that only
works with EUC-KR encoding and has some trouble with longer texts. So,
first, I have to preprocess the input text to divide it into
sentences, remove Unicode characters which are not related to Korean
and save them for later reinsertion in the postprocessing stage.
This has worked out wonderfully and, I should say, easier than what
I'd done in Python.

The problem I'm having now is converting the string from UTF-8 (I'm
running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running the
Morphological Parser, reading its output and processing it. I have to
parse a whole bunch of data "spidered" from the internet, and it works
great until the encoder comes across a Korean typo. To make this point
clearer: the EUC-KR encoding does not cover all the possible
combinations of initial and final consonants + vowels, as Unicode
does. Unicode has 11,172 code points for precomposed hangul syllables,
whereas EUC-KR covers only 2,350 of them. Most of the extra characters
are not used in everyday life, but they can still show up as
abbreviations, slang, smileys or typos.
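
A rough way to measure that coverage gap from Ruby itself is sketched below. It assumes Ruby's EUC-KR transcoder follows KS X 1001, which defines 2,350 precomposed syllables; the constant and helper names are made up for illustration.

```ruby
# Unicode's precomposed hangul syllables occupy U+AC00..U+D7A3:
# 19 initial consonants x 21 vowels x 28 finals (including "no final").
HANGUL_SYLLABLES = (0xAC00..0xD7A3)

# Hypothetical helper: true if the single character can be converted to EUC-KR.
def euc_kr_encodable?(char)
  char.encode("EUC-KR")
  true
rescue Encoding::UndefinedConversionError
  false
end

# [cp].pack("U") builds a one-character UTF-8 string from a code point.
covered = HANGUL_SYLLABLES.count { |cp| euc_kr_encodable?([cp].pack("U")) }

puts "Syllables in Unicode:  #{HANGUL_SYLLABLES.size}"
puts "Encodable in EUC-KR:   #{covered}"
```

Every syllable for which the helper returns false is one that a plain txt.encode("EUC-KR") would reject.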

In Python, I would simply throw these chars away but I really didn't
manage to understand the "Ruby encoding way"... I know I'm missing
something, but I can't seem to find enough info around... Google
doesn't seem to know much of this either... So, I'm coming here to ask
for your enlightenment, dear rubyist friends!

The part of my code which deals with this is as follows:

  def run(txt)
    txt = txt.encode("EUC-KR")
    kts_file = Tempfile::new('kts_text')
    kts_file = open(kts_file.path, "w:EUC-KR")
    kts_file << "#{txt}\n"
    kts_file.close
    cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
    IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
  end

I found something about "ignoring" the non-existent codepoints, but it
doesn't work... I'm even thinking that my Ruby installation might have
gotten corrupted somehow... Every time I think I did it right, I still
get The Exception popping up on the screen...

Thanks for your patience reading this looong post.

Juliano

-------- Original Message --------

Date: Thu, 10 Sep 2009 18:20:06 +0900
From: "Juliano 준호" <jjunho@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Ruby1.9 Encoding

Dear Juliano,

a disclaimer first: I know no Korean, so what's below might not work.

I've had to do some coding to resolve Arabic ligatures (combinations
of two letters) recently. Similar to what you describe, there is usually
no need for a special combined form, and unfortunately the same word is
sometimes spelled one way and sometimes the other, which produces a
list of duplicate words.

I used a list of Unicode characters with names of the individual characters
to solve that problem.

You might find the table on this page useful:

http://www.kfunigraz.ac.at/~katzer/korean_hangul_unicode.html

I don't know if that list is exhaustive, but you could try converting
each of the syllables listed there individually from Unicode to EUC-KR
and, where that fails, decide what to do with that particular
combination of signs based on its Latin transcription, building a
transform hash for these encodings yourself.
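
Such a transform hash could be plugged into String#encode through its :fallback option, assuming a Ruby version whose String#encode supports that option. The table contents, the "?" default and the method name below are hypothetical placeholders, not a real transliteration scheme.

```ruby
# Hypothetical table: syllables EUC-KR cannot represent, mapped to a
# hand-written Latin stand-in. The entry is illustrative only.
FALLBACKS = {
  "뷁" => "[bwelk]"
}

# Convert to EUC-KR; characters without a mapping in the target encoding
# are looked up in FALLBACKS, and anything unknown becomes "?".
def to_euc_kr(text)
  text.encode("EUC-KR", fallback: ->(char) { FALLBACKS.fetch(char, "?") })
end

converted = to_euc_kr("한뷁글")
puts converted.encode("UTF-8")
```

Compared with silently dropping characters, this keeps a visible marker in the text, which may make it easier to put the original syllable back during postprocessing.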

There might also be some locale or OS-related problems with Iconv::IGNORE .
There's some discussion of this here :

Best regards,

Axel

Hi,

In message "Re: Ruby1.9 Encoding" on Thu, 10 Sep 2009 18:20:06 +0900, Juliano 준호 <jjunho@gmail.com> writes:

The part of my code which deals with this is as follows:

def run(txt)
   txt = txt.encode("EUC-KR")
   kts_file = Tempfile::new('kts_text')
   kts_file = open(kts_file.path, "w:EUC-KR")
   kts_file << "#{txt}\n"
   kts_file.close
   cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
   IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
end

I found something about "ignoring" the non-existent codepoints, but it
doesn't work... I'm even thinking that my Ruby installation might have
gotten corrupted somehow... Every time I think I did it right, I still
get The Exception popping up on the screen...

I had some difficulty seeing your intention from the code. Could you
show us the exception messages you've got?
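
One way to collect that information is to rescue the conversion error and print what it reports. This is only a sketch: the sample string is made up, and "뷁" stands in for any syllable that EUC-KR is assumed not to contain.

```ruby
text = "한뷁글"   # made-up sample; "뷁" is assumed to be missing from EUC-KR
failure = nil

begin
  text.encode("EUC-KR")
rescue Encoding::UndefinedConversionError => e
  failure = e
  # The exception object knows which character could not be converted
  # and which encodings were involved.
  warn "#{e.class}: #{e.message}"
  warn "offending character: #{e.error_char.inspect}"
  warn "conversion: #{e.source_encoding_name} -> #{e.destination_encoding_name}"
end
```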

              matz.

Hey, guys!

I've just started learning Ruby from Python and recently posted a
question here that was promptly and effectively answered (thanks,
Glenn!), so I decided to come here once more. I hope I'll be able to
answer some of the questions on my own soon. ;-)

Welcome to Ruby.

The problem I'm having now is converting the string from UTF-8 (I'm
running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running the
Morphological Parser, reading its output and processing it. I have to
parse a whole bunch of data "spidered" from the internet, and it works
great until the encoder comes across a Korean typo. To make this point
clearer: the EUC-KR encoding does not cover all the possible
combinations of initial and final consonants + vowels, as Unicode
does. Unicode has 11,172 code points for precomposed hangul syllables,
whereas EUC-KR covers only 2,350 of them. Most of the extra characters
are not used in everyday life, but they can still show up as
abbreviations, slang, smileys or typos.

In Python, I would simply throw these chars away but I really didn't
manage to understand the "Ruby encoding way"...

I think we can throw them away in Ruby too. See below.

I know I'm missing something, but I can't seem to find enough info around... Google
doesn't seem to know much of this either...

I wrote a lot about Ruby's encoding engine on my blog:

The part of my code which deals with this is as follows:

def run(txt)
   txt = txt.encode("EUC-KR")

Try replacing the above line with:

   txt = txt.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")

   kts_file = Tempfile::new('kts_text')
   kts_file = open(kts_file.path, "w:EUC-KR")
   kts_file << "#{txt}\n"
   kts_file.close
   cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
   IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
end
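
As a small self-contained illustration of what those options do: undef: covers characters that are valid in the source but have no counterpart in EUC-KR, invalid: covers byte sequences that are broken in the source, and replace: "" drops them. The sample string is made up, and "뷁" is assumed to be a syllable missing from EUC-KR.

```ruby
text = "한뷁글"   # made-up sample; "뷁" is assumed to be missing from EUC-KR

# Convert, silently dropping whatever EUC-KR cannot represent.
lossy = text.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")

# Convert back to UTF-8 to inspect what survived.
round_trip = lossy.encode("UTF-8")

puts lossy.encoding
puts round_trip
```

Leaving out replace: "" makes Ruby insert its default replacement character (a question mark for non-Unicode targets such as EUC-KR) instead of dropping the character.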

Hope that helps.

James Edward Gray II

On Sep 10, 2009, at 4:20 AM, Juliano 준호 wrote:

My interpretation of the code is:

  def run(txt)
    # translate the string into EUC-KR encoding
    txt = txt.encode("EUC-KR")

    # Create a temp file to store the data and
    # write it to the file, using the EUC-KR encoding
    kts_file = Tempfile::new('kts_text')
    kts_file = open(kts_file.path, "w:EUC-KR")
    kts_file << "#{txt}\n"
    kts_file.close

    # Run ktspell, feeding it the data from the file
    cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"

    # Read the result and translate it into UTF-8
    IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
  end

I don't know much about 1.9's encodings, but I can suggest a more Rubyish way of writing the method:

require 'tempfile'

def run(txt)
   euc_txt = txt.encode("EUC-KR")

   # Declared before the block so the result is still visible after it
   processed_utf_txt = nil

   # Tempfile::new actually returns a filehandle, so there's no need to re-open
   # the file based on the path. Since you seem to want to open the file,
   # write to it, use the data, and then immediately do away with the file,
   # the 'open' block form is probably more appropriate
   Tempfile::open("kts_text") do |kts_file|

     # The more common way to write a newline-terminated string to a file is
     # file.puts(foo) rather than file << "#{foo}\n"
     kts_file.puts(euc_txt)

     # Flush the buffer so ktspell sees the data when it reads the file
     kts_file.flush
     cmd = "ktspell < #{kts_file.path}"

     # You could do all the rest on one line as you did, but this lets you
     # look at the data first using 'p processed_euc_txt' or something
     processed_euc_txt = IO::popen(cmd, "r:EUC-KR").read

     # Again, a temporary variable to let you see if the data looks right.
     processed_utf_txt = processed_euc_txt.encode("UTF-8")
   end

   processed_utf_txt
end
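
Putting the suggestions from this thread together, a sketch could look like the following. The method name run_lossy and the command parameter are hypothetical; the parameter exists only so the pipeline can be tried with a stand-in such as cat when ktspell is not installed.

```ruby
require "tempfile"

# Hypothetical variant of run: drops characters EUC-KR cannot represent,
# feeds the text to `command` through a temp file, and returns UTF-8.
def run_lossy(txt, command = "ktspell")
  euc_txt = txt.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")

  Tempfile.open("kts_text") do |kts_file|
    kts_file.binmode          # write the EUC-KR bytes untouched
    kts_file.puts(euc_txt)
    kts_file.flush            # make sure the child process sees the data

    # Read the command's output as EUC-KR, then convert it back to UTF-8.
    output = IO.popen("#{command} < #{kts_file.path}", "r:EUC-KR") { |io| io.read }
    return output.encode("UTF-8")
  end
end
```

With cat standing in for the parser, run_lossy("한뷁글", "cat") simply echoes back the text that survived the conversion, which is a quick way to check the encoding round trip before involving ktspell.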

Ben

On Sep 10, 2009, at 09:39, Yukihiro Matsumoto wrote:

I had some difficulty seeing your intention from the code. Could you
show us the exception messages you've got?