How to split(//) with respect to bigraphs?

Pavel_Smerk · 2 August 2006 17:55

And one more question:

In Czech, c followed by h is considered (for sorting etc.) a single character/grapheme, ch. I need to split a string into single characters while respecting this absurd convention.

In Perl I can write

split /(?<=(?![Cc][Hh]).)/, $string

and it works fine.

Unfortunately, Ruby does not implement/support this "zero-width positive look-behind assertion", so the question is how can one efficiently split the string in Ruby?

Thanks,

P.

Collins_Justin · 2 August 2006 18:04

Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

-Justin
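
A minimal sketch of the same idea, assuming the empty strings are unwanted: the capturing group makes split keep each matched "ch", and reject drops the empty fields that appear when the string starts with "ch" or contains two digraphs in a row. The method name split_digraphs is made up for illustration and does not come from the thread.

```ruby
# split_digraphs is a hypothetical helper name, not part of Ruby or the thread.
def split_digraphs(str)
  # The group ([Cc][Hh]) is included in split's result; the empty
  # alternative splits everywhere else, one character at a time.
  # reject removes the empty fields split leaves next to a digraph.
  str.split(/([Cc][Hh])|/).reject(&:empty?)
end

p split_digraphs("check czech")
p split_digraphs("cHeck czeCh")
```

Joining the returned pieces gives back the original string, so nothing is lost by the split.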

Christian_Neukirche1 · 2 August 2006 18:49

string.scan(/ch|./i)

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Paul_Battley · 2 August 2006 18:28

Or use scan:

str.scan(/(?:ch)|./i)

You might still have a problem with other characters, though,
depending on the encoding and normalisation.

Paul.
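
To make the encoding and normalisation caveat concrete, here is a hedged sketch. It assumes a UTF-8 string and a Ruby recent enough to provide String#unicode_normalize and the \X (grapheme cluster) regexp escape; neither was available when this thread was written. The helper name czech_letters is hypothetical.

```ruby
# czech_letters is a hypothetical helper name used only for illustration.
# Assumes UTF-8 input and a Ruby that supports unicode_normalize and \X.
def czech_letters(str)
  # Compose "c" + combining caron into a single code point first, then
  # take "ch" as one unit and any other grapheme cluster as one unit.
  str.unicode_normalize(:nfc).scan(/ch|\X/i)
end

decomposed = "c\u030Cech"        # "čech" with č written as c + U+030C
p decomposed.scan(/ch|./i).size  # the plain pattern counts the caron separately
p czech_letters(decomposed).size # č, e, ch
```

Without the normalisation step, the plain /ch|./i pattern treats the combining caron as a character of its own.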

Pavel_Smerk · 2 August 2006 19:05

Stupid question. One should not insist on a word-for-word translation when rewriting code from Perl to Ruby.

The solution can be, e.g., scan(/[cC][hH]|./):

irb(main):001:0> "cHeck czeCh".scan(/[cC][hH]|./)
=> ["cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

Justin Collins wrote:

Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

The scan version is slightly better, as it never returns an empty string. Thanks anyway, of course.

But where is this feature of split documented? http://www.rubycentral.com/ref/ref_c_string.html#split does not mention that split returns not only the delimited substrings but also the groups captured by the regexp match.

Regards,

P.

Pavel_Smerk · 2 August 2006 19:05

Paul Battley wrote:

Or use scan:

str.scan(/(?:ch)|./i)

Yes, the use of scan occurred to me in the meantime too. Why the (?:)? str.scan(/ch|./i) does exactly the same, doesn't it?

Thank you,

P.
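
A small check of that claim, offered as a sketch rather than a proof: for a few sample strings (chosen arbitrarily here), the grouped and ungrouped patterns give identical results. The last line shows why a capturing group would not be interchangeable: with groups in the pattern, scan returns the captures instead of the whole match.

```ruby
# The sample strings are arbitrary illustrations, not taken from the thread.
samples = ["cHeck czeCh", "chata", "Mnich"]

all_equal = samples.all? do |s|
  # Non-capturing group versus no group at all.
  s.scan(/(?:ch)|./i) == s.scan(/ch|./i)
end
p all_equal

# With a capturing group, scan yields one array of captures per match,
# and the capture is nil whenever the "." alternative matched.
captured = "cha".scan(/(ch)|./i)
p captured
```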

Collins_Justin · 2 August 2006 19:21

Pavel Smerk wrote:

But where is this feature of split documented? http://www.rubycentral.com/ref/ref_c_string.html#split does not mention that split returns not only the delimited substrings but also the groups captured by the regexp match.

Regards,

P.

As far as I can see, it's not in the documentation. I found it by accident. But, yes, the scan method is better.

-Justin

Morton_Goldberg · 2 August 2006 23:28

It's in Dave Thomas' Pickaxe book. Under String#split he writes:

"If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern
matches a zero-length string, str is split into individual characters. If pattern includes
groups, these groups will be included in the returned values."

Then he gives the following example:

"a@1bb@2ccc".split(/@(\d)/) => ["a", "1", "bb", "2", "ccc"]

Regards, Morton
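
The three behaviours described in that passage can be compared side by side. This is only an illustrative sketch; the variable names are made up, and the first string is the one from the book's example.

```ruby
text = "a@1bb@2ccc"

# Without a group, the whole delimiter (including the digit) is dropped.
without_group = text.split(/@\d/)

# With a group, the captured digit is included in the returned values.
with_group = text.split(/@(\d)/)

# A pattern that matches a zero-length string splits into characters.
characters = "abc".split(//)

p without_group
p with_group
p characters
```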

On Aug 2, 2006, at 3:05 PM, Pavel Smerk wrote:

But where is this feature of split documented? http://www.rubycentral.com/ref/ref_c_string.html#split does not mention that split returns not only the delimited substrings but also the groups captured by the regexp match.

Paul_Battley · 2 August 2006 19:09

Yeah, there's no need for the (?: ... ). I started off thinking it was
more complicated than it was, and forgot to take that out. I really
need a regexp refactoring tool.

Paul.

On 02/08/06, Pavel Smerk <smerk@fi.muni.cz> wrote:

Yes, the use of scan strikes me in the meantime too. Why (?:)?
str.scan(/ch|./i) does exactly the same, doesn't it?

Dave_Howell · 2 August 2006 21:43

Oh, my gosh. If only you'd posted this little tidbit two days ago, I'd have saved a couple hours of code-wrangling.

For sorting purposes, I needed to turn something like
one-and.two@three.net
into
net.three@two.one-and

I started with str.split(/[.]|@/), but then I'd lose where the @ went. I tried turning it into
["one-and", ".", "two", "@", "three", ".", "net"]
so I could .reverse that, but without positive look-behind, I couldn't find any way to detect the break *after* the dot except with \w, which would also trigger after the hyphen.

After hours of work, I ended up with something that was not only long and confusing, involving .collect and an inner search loop and other stuff, but when I brought it back up to check it for this email message, I discovered that it didn't even actually work correctly.

And all along, all I needed to do was change
str.split(/[.]|@/).reverse.join
into
str.split(/([.]|@)/).reverse.join

Dang. And thanks!
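
For reference, the address reversal described above can be sketched as a small method. The name reverse_address is hypothetical, and the character class [.@] is just a shorter spelling of [.]|@.

```ruby
# reverse_address is a hypothetical helper name used for illustration.
def reverse_address(address)
  # The capturing group keeps each "." and "@" as its own element,
  # so reversing the array also puts the separators in reversed order.
  address.split(/([.@])/).reverse.join
end

p "one-and.two@three.net".split(/([.@])/)
p reverse_address("one-and.two@three.net")
```

Applying the method twice returns the original address, which is a quick way to check that no separator was lost.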

On Aug 2, 2006, at 12:21, Justin Collins wrote:

Pavel Smerk wrote:

But where is this feature of split documented? http://www.rubycentral.com/ref/ref_c_string.html#split does not mention that split returns not only the delimited substrings but also the groups captured by the regexp match.

Regards,

P.

As far as I can see, it's not in the documentation. I found it by accident. But, yes, the scan method is better.
