I do a lot of work with a mix of English and Korean UTF-8 text. I need a regular expression to match both English and Korean (multi-byte) text. Unfortunately, this text is frequently "decorated" with lots of punctuation that causes downstream processing problems.
My primary goal is to remove most (or all) punctuation from the UTF-8 text, while retaining the words and whitespace.
When I use \w, I don't match the Korean text.
/[\w\s]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion ">
/[\w\s]+/.match '배속 부드러운 슬로모션 slow motion'
=> #<MatchData " ">
Is there some way to capture arbitrary combinations of Korean and English text, ignoring punctuation, without resorting to the .* construct?
In the past, I have used .gsub to remove punctuation characters. Is this really the best/only way? If so, can I consolidate a chain of punctuation-removing gsubs into a single gsub?
Thank you
Have you tried using POSIX character classes?
$ irb
irb(main):001:0> /[\w\s]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion ">
irb(main):002:0> /[[:alnum:][:blank:]]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion 배속 부드러운 슬로모션">
irb(main):003:0> /[[:alnum:][:blank:]]+/.match 'slow motion 배속 부드러운 슬로모션!'
=> #<MatchData "slow motion 배속 부드러운 슬로모션">
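On the gsub question: the same bracket classes can be negated, so a chain of punctuation-removing gsubs could probably collapse into one call. This is only a rough sketch, assuming UTF-8 strings and that "punctuation" means anything that is neither alphanumeric nor whitespace (so underscores and symbols are dropped too). The helper name strip_punctuation is made up for illustration.

```ruby
# Hypothetical helper, not from the original thread.
# Assumes the input string is UTF-8 encoded.
def strip_punctuation(text)
  # Delete every character that is neither alphanumeric nor whitespace.
  text.gsub(/[^[:alnum:][:space:]]/, '')
end

puts strip_punctuation('"slow motion" 배속, 부드러운 (슬로모션)!')
# prints: slow motion 배속 부드러운 슬로모션
```

If some punctuation has to survive (hyphens, say), it can be added inside the negated class instead of chaining another gsub.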
Hope this helps,
Mike
On Dec 26, 2018, at 9:08 AM, RRRoy BBBean <rrroybbbean@gmail.com> wrote:
In the past, I have used .gsub to remove punctuation characters. Is this really the best/only way? If so, can I consolidate a chain of punctuation-removing gsubs into a single gsub?
Thank you
--
Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/
The "`Stok' disclaimers" apply.