I do a lot of work with a mix of English and Korean UTF-8 text. I need a regular expression to match both English and Korean (multi-byte) text. Unfortunately, this text is frequently "decorated" with lots of punctuation that causes downstream processing problems.
My primary goal is to remove most (or all) punctuation from the UTF-8 text, while retaining the words and whitespace.
When I use \w, the Korean text is not matched; only the ASCII words and the whitespace are:
/[\w\s]+/.match 'slow motion 배속 부드러운 슬로모션'
=> #<MatchData "slow motion ">
/[\w\s]+/.match '배속 부드러운 슬로모션 slow motion'
=> #<MatchData " ">
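For comparison, here is a minimal sketch of the same match using Unicode property classes instead of \w. It assumes Ruby 1.9 or later, where \w covers only [a-zA-Z0-9_] but \p{L} (any letter) and \p{N} (any number) are Unicode-aware. The helper name first_word_run is hypothetical, chosen only for illustration.

```ruby
# Hypothetical helper: returns the first run of letters, digits, and
# whitespace in the text, or nil if there is none.
# \p{L} matches letters in any script (including Hangul),
# \p{N} matches digits in any script, \s matches ASCII whitespace.
def first_word_run(text)
  match = /[\p{L}\p{N}\s]+/.match(text)
  match && match[0]
end

puts first_word_run('slow motion 배속 부드러운 슬로모션')
puts first_word_run('배속 부드러운 슬로모션 slow motion')
```

With this character class, both strings should match in full rather than stopping at the boundary between English and Korean.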
Is there some way to capture arbitrary combinations of Korean and English text, ignoring punctuation, without resorting to the .* construct?
In the past, I have used .gsub to remove punctuation characters. Is this really the best/only way? If so, can I consolidate a chain of punctuation-removing gsubs into a single gsub?
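To make the "single gsub" idea concrete, here is one possible sketch, not necessarily the best approach. Instead of listing punctuation characters one by one, it uses a negated character class that deletes everything that is not a letter, a number, or whitespace. It assumes Ruby 1.9 or later with UTF-8 strings; the method name strip_punctuation is hypothetical.

```ruby
# Hypothetical helper: removes every character that is not a letter
# (\p{L}), a number (\p{N}), or ASCII whitespace (\s) in one pass.
# Note that this also removes symbols such as $ or +, not only
# punctuation, which may or may not be what is wanted.
def strip_punctuation(text)
  text.gsub(/[^\p{L}\p{N}\s]/, '')
end

puts strip_punctuation('"slow" motion, 배속! (부드러운) 슬로모션?')
# prints: slow motion 배속 부드러운 슬로모션
```

One caveat with this sketch: characters are deleted rather than replaced, so punctuation that sits between two words without spaces (for example a hyphen) will join them. Substituting a space instead of an empty string would avoid that, at the cost of possible double spaces.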