Separate Chinese and English! with Ruby

Nanyang_Zhan1 · 7 May 2007 07:39

Don't get me wrong; I just want to know how to separate English
words from a string with Ruby.
There are strings (UTF-8 encoded) that record people's names,
like:

摩根·弗里曼 Morgan Freeman
布鲁斯·威利斯 Bruce Willis
李小明 Lee xiao ming
These strings contain a Chinese name (without spaces between characters),
followed by a space and then an English name,

or
Frank Darabont
just an English name.

Could you give me an idea of how to separate out the Chinese characters (if
any)?

···

--
Posted via http://www.ruby-forum.com/.

akbarhome · 7 May 2007 09:15

a = File.open('a.txt')
a.each {|x| puts x.split(' ', 2) }
Output:
摩根·弗里曼
Morgan Freeman
布鲁斯·威利斯
Bruce Willis
李小明
Lee xiao ming

Mariusz_Pekala · 7 May 2007 10:04

Maybe a regexp similar to
/^([^qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP ]+)/
would help?

Does [a-zA-Z] include Chinese characters? In Polish locale it includes
Polish non-ASCII characters, so I guess it might include Chinese ones.

I guess you want to split a given string into words (separated by spaces),
and then check whether the first word starts with or includes at least one
Chinese character.
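The split-then-check idea above can be sketched as follows. This is only a sketch, assuming Ruby 1.9 or later (where regexps understand Unicode script properties such as \p{Han}) and a UTF-8 source file; split_name is a hypothetical helper name, not something from the thread. In those Ruby versions [a-zA-Z] is a plain code point range, so it matches neither Chinese characters nor accented letters like Ô.

```ruby
# Hypothetical helper (not from the thread): split a name line at the first
# space, but only when the first word contains at least one Han character.
def split_name(line)
  line = line.strip
  first, rest = line.split(' ', 2)
  if first.to_s =~ /\p{Han}/
    [first, rest]   # [Chinese name, English name]
  else
    [nil, line]     # no Chinese part, keep the whole line
  end
end

p split_name("摩根·弗里曼 Morgan Freeman")
p split_name("Frank Darabont")
p split_name("Ôkami: chi")

# [a-zA-Z] does not match a Chinese character or an accented letter.
p "中" =~ /[a-zA-Z]/
p "Ô" =~ /[a-zA-Z]/
```

Splitting at the first space keeps the middle dot inside the Chinese name, which a character-by-character filter would not.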

Harry3 · 7 May 2007 10:20

Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

akbarhome · 7 May 2007 09:20

Sorry. Fixed version:
a.each {|x|
  # Ruby 1.8: x[0] is the first byte as an integer; above 128 means a non-ASCII lead byte
  if x[0].to_i > 128 then
    puts x.split(' ', 2)
  else
    puts x
  end
}

This code is quick and dirty.

Nanyang_Zhan1 · 7 May 2007 12:22

Harry Kakueki wrote:

Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

Thanks, Kakueki, but:
(The code below was tested in the Ruby on Rails console.)

str1 = "中文 English Words"

=> "中文 English Words"

str2 = "Ôkami: chi"

=> "Ôkami: chi"

t = str2.split(//).partition { |x| x=~/[a-z]|[A-Z]/}

=> [["k", "a", "m", "i", "c", "h", "i"], ["Ô", ":", " "]]

p t[0].join

"kamichi" # I want all the non-Chinese characters to remain.
=> nil

t = str1.split(//).partition { |x| x=~/[a-z]|[A-Z]/}

=> [["E", "n", "g", "l", "i", "s", "h", "W", "o", "r", "d", "s"], ["中",
"文", " ", " "]]

p t[0].join

"EnglishWords" # no space
=> nil

Harry Kakueki wrote:

Or this

str.split(//).partition {|x| x.length == 1 }

Harry

this time spaces are kept:

t = str1.split(//).partition {|x| x.length == 1 }

=> [[" ", "E", "n", "g", "l", "i", "s", "h", " ", "W", "o", "r", "d",
"s"], ["中", "文"]]

t[0].join

=> " English Words"

t = str2.split(//).partition {|x| x.length == 1 }

=> [["k", "a", "m", "i", ":", " ", "c", "h", "i"], ["Ô"]]

t[0].join

=> "kami: chi"

I think "Ô" may behave just like the Chinese characters here (it is also
longer than one byte), so it is hard to keep it with the rest.
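Why "Ô" ends up on the Chinese side can be seen by looking at byte sizes. The following is a sketch only, assuming Ruby 1.9 or later, where length counts characters and bytesize counts bytes (in the transcripts above, length evidently counts bytes). In UTF-8, "Ô" takes two bytes while the Chinese characters in these examples take three. partition_by_bytes is a hypothetical name for illustration.

```ruby
# Show how many UTF-8 bytes each sample character occupies.
["E", "Ô", "é", "中", "文"].each do |ch|
  puts "#{ch}: #{ch.bytesize} byte(s)"
end

# Hypothetical helper: treat characters of three or more bytes as "wide"
# and everything else (ASCII and two-byte accented letters) as "narrow".
def partition_by_bytes(str)
  wide, narrow = str.chars.partition { |ch| ch.bytesize >= 3 }
  [wide.join, narrow.join]
end

p partition_by_bytes("中文 English Words")
p partition_by_bytes("Ôkami: chi")
```

This is a rough heuristic rather than a test for Chinese: Japanese kana, the euro sign and many punctuation marks also take three bytes.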

Nanyang_Zhan1 · 7 May 2007 10:17

Akbar Home wrote:

Sorry. Fixed version:
a.each {|x|
   if x[0].to_i > 128 then
     puts x.split(' ', 2)
   else
     puts x
    end
}

This code is quick and dirty.

Thanks.
But I was wrong. The strings contain more than just Chinese and English
characters. Now I see characters like Ô, é, á... If x starts with one of
these, x[0] > 128 just as it is for Chinese, but I only want to separate the Chinese.

So do you know exactly what range of values Chinese characters will
return? Or can you tell me where I can find this kind of information?

Harry3 · 7 May 2007 11:15

Or this

str.split(//).partition {|x| x.length == 1 }

Harry

akbarhome · 7 May 2007 11:35

These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
CJK Unicode Tables

should get you done.

ustr
=> +"摩根·弗里曼"
irb(main):027:0> ustr[0]
=> U+6469 <CJK Ideograph>
irb(main):028:0> format "%X", ustr[0].to_i.to_s
=> "6469"
irb(main):029:0>

Gary_Thomas · 7 May 2007 19:12

I believe the range is (in hex) 3400 to 9FA5.

Cheers

Gary
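A range check on code points could look like the sketch below. It assumes Ruby 1.9 or later (String#ord, String#chars) and uses the block boundaries from the Unicode code charts, U+3400 to U+4DBF for CJK Unified Ideographs Extension A and U+4E00 to U+9FFF for CJK Unified Ideographs; those boundaries are an assumption beyond the figure given above. The names CJK_RANGES, cjk_ideograph? and separate are made up for illustration.

```ruby
# Block boundaries as listed in the Unicode code charts (assumption, see above).
CJK_RANGES = [
  0x3400..0x4DBF, # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF  # CJK Unified Ideographs
].freeze

# True when the single-character string ch falls inside one of the ranges.
def cjk_ideograph?(ch)
  cp = ch.ord
  CJK_RANGES.any? { |range| range.cover?(cp) }
end

# Hypothetical helper: collect ideographs on one side, everything else on the other.
def separate(str)
  chinese, other = str.chars.partition { |ch| cjk_ideograph?(ch) }
  [chinese.join, other.join.strip]
end

p separate("李小明 Lee xiao ming")
p separate("Ôkami: chi")
```

Note that the middle dot in names such as 摩根·弗里曼 is U+00B7, not an ideograph, so a per-character filter like this moves it to the non-Chinese side.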

James_Britt · 7 May 2007 12:31

You could identify the encoding or just convert the text to Unicode, then check whether the characters fall into a particular Unicode range; that will identify them.
One shortcut is checking for leading zeros in the Unicode character's code.
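As a rough illustration of that suggestion, String#unpack with the "U*" directive decodes a UTF-8 string into its code points. In the sketch below, code_points and leading_zeros? are hypothetical helper names, and reading "leading zeros" as "the four-digit hex code starts with 00" is one interpretation of the shortcut, not something the post spells out.

```ruby
# Decode a UTF-8 string into an array of Unicode code points.
def code_points(str)
  str.unpack("U*")
end

# One reading of the "leading zeros" shortcut: code points below U+0100
# print as U+00xx, which covers ASCII and Latin-1 letters such as Ô, é, á.
def leading_zeros?(code_point)
  format("%04X", code_point).start_with?("00")
end

# "\u00D4" is Ô and "\u4E2D" is 中, written as escapes to avoid ambiguity.
code_points("\u00D4\u4E2D").each do |cp|
  puts format("U+%04X leading zeros: %s", cp, leading_zeros?(cp))
end
```

The shortcut only separates Latin-1 from everything else; Greek, Cyrillic or Japanese kana would still need a proper range check.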

Nanyang_Zhan1 · 7 May 2007 12:34

Akbar Home wrote:

These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
CJK Unicode Tables

should get you done.

str1 = "中文 English Words"

=> "中文 English Words"

str1[0]

=> 228

str2 = "Ôkami: chi"

=> "Ôkami: chi"

str2[0]

=> 195

str3 = "English Words"

=> "English Words"

str3[0]

=> 69

If only I knew at which numbers Chinese characters start and end...

Nanyang_Zhan1 · 7 May 2007 12:43

John Joyce wrote:

You could identify the encoding or just make it unicode, then check
if the characters fall into a range in unicode, that will identify them.
One shortcut is checking for leading zeros in the unicode character's
code.

John Joyce, thank you for your explanation.
Now I get akbarhome's idea. So I need to download the Unicode lib from here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/197946
Then convert the strings into Unicode, and then compare the characters
with the CJK Unicode Tables.

Yes, it must work!

But look at this:

str1 = "中文 English Words"

=> "中文 English Words"

str1[0]

=> 228

str2 = "Ôkami: chi"

=> "Ôkami: chi"

str2[0]

=> 195

str3 = "English Words"

=> "English Words"

str3[0]

=> 69

Maybe there are numbers that are specific to Chinese.
If only I knew at which numbers Chinese characters start and end, there
would be a much simpler solution.

James_Britt · 7 May 2007 13:18

Yes, that's pretty much how Unicode is supposed to work.
In theory, you could even take a sample range of characters to guess the document's language.
The problem is that Unicode allows multilingual documents, which in some cases is difficult because of fonts and system implementations.
But yes, you're on the right track now (IMHO).

And yes, the overhead will be greater, but that's just a fact of Unicode and large character sets like Chinese and Japanese.
You will also want to check which Chinese!
Chinese is split into two (politically safe) names: Traditional and Simplified.
If you were doing Japanese text, separating English or other Western languages wouldn't be so easy, since Japanese essentially includes a number of other languages' character sets in its Unicode set and in everyday usage.

Nanyang_Zhan1 · 7 May 2007 16:26

John Joyce wrote:

And yes, the overhead will be greater, but that's just a fact of
Unicode and large character sets like Chinese and Japanese.
You will also want to check which Chinese!
Chinese is split into two (politically safe) names: Traditional and
Simplified.
If you were doing Japanese text, separating English or other Western
languages wouldn't be so easy, since Japanese essentially includes a
number of other languages' character sets in its Unicode set and in
everyday usage.

You are right. And besides the characters, there is a different set of
punctuation marks!

So, don't you think there is a doc about the range of numbers string[0]
returns for a given language?

I wonder what those numbers mean...

Zev_Blut · 8 May 2007 06:48

If the goal is to separate the western languages from the Japanese
Kanji and Kana, then it appears to not be too bad when using a lib
like this:

http://raa.ruby-lang.org/project/moji/

http://gimite.net/gimite/rubymess/moji.html

Zev

James_Britt · 7 May 2007 17:09

There is a doc.
Go to
www.unicode.org
There should be a PDF (many, actually).
I don't know if the two main Chinese sets are encoded as different ranges or simply declared in some way.
In general, in Unicode a character is the same character even when it appears in a different language.
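That last point can be checked from Ruby. The sketch below assumes Ruby 1.9 or later, whose regexps support the \p{Han} script property; HAN_ONLY and the sample strings are chosen here for illustration. Traditional and Simplified forms live in the same unified blocks, and an ideograph used in Japanese text is the same code point as in Chinese text, so a script test cannot tell these apart.

```ruby
# Matches strings made up entirely of Han ideographs.
HAN_ONLY = /\A\p{Han}+\z/

samples = {
  "traditional form" => "漢字",
  "simplified form"  => "汉字",
  "Japanese kana"    => "かな"
}

samples.each do |label, text|
  puts "#{label}: #{text} Han only? #{!!(text =~ HAN_ONLY)}"
end
```

Both ideograph samples match while the kana sample does not, so telling Traditional from Simplified, or Chinese from Japanese kanji, needs more than a code point range.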

James_Britt · 7 May 2007 20:03

NZ,
You might want to check the RubyGems gem unihan
At the command line type:
gem list --remote uni
and it will show up.
then
gem install unihan --include-dependencies

I haven't checked it out yet, but after installing it, check the documentation.
It seems to be an API to the Unihan online database.
Could be quite useful.

John Joyce

James_Britt · 7 May 2007 20:39

NZ
Another English site on Unicode that may be easier to understand (it was for me):
http://www.alanwood.net/unicode/index.html

There must surely be some docs in Chinese somewhere.
I know that here in Japan there are many books on the subject (in Japanese), since computer science in Japan deals with it a lot.
I've been interested in this subject myself, but it is a big one.
Unicode.org published the print version of 5.0, and I have browsed the book in the bookstore; it is worth checking out. Maybe a nearby university library would have it too.

It certainly seems like a point where a compiled language, such as C, would be helpful.
Most interpreted languages are only reaching partial Unicode support now because of the overhead of processing many languages, the sheer volume of material to deal with, and the various algorithms necessary for languages whose writing depends on context (Arabic, Hebrew, Indic languages, etc.).

Perhaps Perl, Ruby, Python, and PHP should get hooks from Apple and Microsoft to help these languages be more productive by using their implementations.

eden · 8 May 2007 06:44

There is documentation:

ri String#

Although it is a little vague about what "character code" means. By
default (in ruby 1.8.x) the number returned by some_string[i] is a
fixnum in the range [0,255] -- even for UTF-8 encoded strings. Ruby
will just treat the string as a string of 8-bit bytes and give you
back whatever byte you asked for.

irb(main):001:0> s = "大智若愚"
=> "\345\244\247\346\231\272\350\213\245\346\204\232"
irb(main):002:0> s[0]
=> 229
irb(main):003:0> s.length
=> 12
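To connect those byte values to Unicode code points, here is a sketch that decodes one three-byte UTF-8 sequence by hand. It assumes Ruby 1.9 or later (String#bytes, String#bytesize, String#ord); decode_three_byte is a hypothetical helper that only handles the three-byte form 1110xxxx 10xxxxxx 10xxxxxx.

```ruby
# Combine the payload bits of a three-byte UTF-8 sequence into a code point:
# 4 bits from the lead byte, 6 bits from each continuation byte.
def decode_three_byte(bytes)
  b0, b1, b2 = bytes
  ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
end

s = "大智若愚"
first_char_bytes = s.bytes.first(3)   # 229, 164, 167 as in the transcript above

puts "characters: #{s.length}, bytes: #{s.bytesize}"
puts format("decoded by hand: U+%04X", decode_three_byte(first_char_bytes))
puts format("String#ord:      U+%04X", s[0].ord)
```

So a number like 229 is only the lead byte of the sequence. For the main CJK block (U+4E00 to U+9FFF) the lead byte falls between 228 and 233, but that alone is a rough hint, since lead byte 228 also begins other code points from U+4000 upward.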
