Regular Expression UTF-8 Alphanumeric+Whitespace Match, Ignore Embedded Punctuation


(RRRoy BBBean) #1

I do a lot of work with a mix of English and Korean UTF-8 text. I need a regular expression to match both English and Korean (multi-byte) text. Unfortunately, this text is frequently "decorated" with lots of punctuation that causes downstream processing problems.

My primary goal is to remove most (or all) punctuation from the UTF-8 text, while retaining the words and whitespace.

When I use \w, I don't match the Korean text.

/[\w\s]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion ">

/[\w\s]+/.match '배속 부드러운 슬로모션 slow motion'
=> #<MatchData " ">

Is there some way to capture arbitrary combinations of Korean and English text, ignoring punctuation, without resorting to the .* construct?

In the past, I have used .gsub to remove punctuation characters. Is this really the best/only way? If so, can I consolidate a chain of punctuation-removing gsubs into a single gsub?

Thank you :slight_smile:


(Mike Stok) #2

Have you tried using POSIX character classes? Unlike \w, they match non-ASCII letters in a UTF-8 string:

$ irb
irb(main):001:0> /[\w\s]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion ">
irb(main):002:0> /[[:alnum:][:blank:]]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion 배속 부드러운 슬로모션">
irb(main):003:0> /[[:alnum:][:blank:]]+/.match 'slow motion 배속 부드러운 슬로모션!'
=> #<MatchData "slow motion 배속 부드러운 슬로모션">
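
On the question of folding a chain of gsubs into one: a negated bracket expression built from the same POSIX classes may be enough. The following is only a sketch, not something tested against your data. The helper name strip_punctuation and the sample string are made up for illustration, and it assumes you want to keep every kind of whitespace ([:space:]) rather than only spaces and tabs ([:blank:]).

```ruby
# Hypothetical helper (name invented for this example).
# Deletes every run of characters that is neither alphanumeric nor
# whitespace, so English and Korean words and the spaces between them survive.
def strip_punctuation(text)
  cleaned = text.gsub(/[^[:alnum:][:space:]]+/, '')
  # Collapse any doubled spaces left behind and trim the ends.
  cleaned.squeeze(' ').strip
end

# Made-up sample input with mixed ASCII and non-ASCII punctuation.
sample = '«slow motion» 배속, 부드러운... 슬로모션!!'
puts strip_punctuation(sample)
```

With that sample the result should be "slow motion 배속 부드러운 슬로모션". Note that deleting punctuation also joins hyphenated words ("well-known" becomes "wellknown"); if that is a problem, replacing with ' ' instead of '' is one alternative.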

Hope this helps,

Mike

On Dec 26, 2018, at 9:08 AM, RRRoy BBBean <rrroybbbean@gmail.com> wrote:

In the past, I have used .gsub to remove punctuation characters. Is this really the best/only way? If so, can I consolidate a chain of punctuation-removing gsubs into a single gsub?

Thank you :slight_smile:

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.