Regular Expression UTF-8 Alphanumeric+Whitespace Match, Ignore Embedded Punctuation

RRRoy_BBBean · 26 December 2018 14:08

I do a lot of work with a mix of English and Korean UTF-8 text. I need a regular expression to match both English and Korean (multi-byte) text. Unfortunately, this text is frequently "decorated" with lots of punctuation that causes downstream processing problems.

My primary goal is to remove most (or all) punctuation from the UTF-8 text, while retaining the words and whitespace.

When I use \w, I don't match the Korean text.

/[\w\s]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion ">

/[\w\s]+/.match '배속 부드러운 슬로모션 slow motion'
=> #<MatchData " ">

Is there some way to capture arbitrary combinations of Korean and English text, ignoring punctuation, without resorting to the .* construct?

In the past, I have used .gsub to remove punctuation characters. Is this really the best/only way? If so, can I consolidate a chain of punctuation-removing gsubs into a single gsub?

Thank you

Mike_Stok1 · 26 December 2018 14:19

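Have you tried using POSIX character classes? In Ruby, \w only matches ASCII word characters by default, whereas POSIX bracket expressions such as [[:alnum:]] also match non-ASCII letters and digits, including Hangul.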

$ irb
irb(main):001:0> /[\w\s]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion ">
irb(main):002:0> /[[:alnum:][:blank:]]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion 배속 부드러운 슬로모션">
irb(main):003:0> /[[:alnum:][:blank:]]+/.match 'slow motion 배속 부드러운 슬로모션!'
=> #<MatchData "slow motion 배속 부드러운 슬로모션">
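
On the question of consolidating a chain of gsubs: one possible approach is to negate the same POSIX classes and delete everything that is neither alphanumeric nor whitespace in a single pass. This is only a minimal sketch, assuming the input is UTF-8 encoded and that "punctuation" means anything outside those two classes; the helper name strip_punctuation is made up for illustration.

```ruby
# Hypothetical helper (the name is an assumption, not from the thread).
# Removes every character that is neither alphanumeric nor whitespace,
# so letters and digits from any script, including Hangul, are kept.
def strip_punctuation(text)
  text.gsub(/[^[:alnum:][:space:]]/, '')
end

# Mixed English/Korean input with assorted punctuation.
puts strip_punctuation('"slow motion" 배속!! (부드러운) 슬로모션...')
```

Note that this also removes symbols such as underscores and hyphens, so hyphenated words are joined together. If that is not wanted, the characters to keep can be added to the negated class.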

Hope this helps,

Mike

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.
