Split without lookbehind

Hi

I'm trying to split a path (like a URL, but not one) into bits on the '/' character.
Problem is, I want to be able to escape '/' in the names of bits by doubling
the character to '//', e.g.

'foo/bar // baz/qux'
=> ["foo", "bar / baz", "qux"]

In a perlish regex I could use zero-width assertions either side thus
string.split(/(?<!\/)\/(?!\/)/)

but there's no lookbehind in Ruby - I wonder if someone could suggest a neat
ruby alternative.

thanks
alex

On 10/2/05, Alex Fenton <alex@deleteme.pressure.to> wrote:
> but there's no lookbehind in Ruby - I wonder if someone could suggest a neat
> ruby alternative.

With oniguruma this works:

'foo/bar // baz/qux'.split(/(?<!\/)\/(?!\/)/)
# => ["foo", "bar // baz", "qux"]

I am not sure exactly how you want it split.
How about this one?

'a/b//c///d////e/////f'.split(/\/(?!\/)/)
# => ["a", "b/", "c//", "d///", "e////", "f"]

--
Simon Strandgaard
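
For readers on Ruby 1.9 or later, where the bundled regex engine supports lookbehind, the split above runs without installing anything extra. A minimal sketch that also collapses the doubled slashes afterwards (the helper name `split_path` is made up for illustration and is not from the thread):

```ruby
# Hypothetical helper: split on a '/' that has no '/' on either side,
# then turn each remaining '//' back into a single literal '/'.
def split_path(path)
  path.split(%r{(?<!/)/(?!/)}).map { |part| part.gsub('//', '/') }
end

p split_path('foo/bar // baz/qux')
# => ["foo", "bar / baz", "qux"]
```

Note that in a run of three or more slashes every slash has a slash next to it, so the pattern never splits there; such input needs one of the other approaches in this thread.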

Don't know if your case is loose enough to allow for a hack like this, but maybe it will give you ideas:

irb(main):001:0> "foo/bar // baz/qux".gsub("//", "\0").split("/").map { |e| e.gsub("\0", "/") }
=> ["foo", "bar / baz", "qux"]

If that doesn't work, it's probably time to break out StringScanner...

James Edward Gray II
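
A rough sketch of what the StringScanner route might look like, assuming '//' always means an escaped slash and is consumed before a lone '/' is considered. The method name `split_escaped` is hypothetical:

```ruby
require 'strscan'

# Hypothetical helper: walk the string once, building up the elements.
def split_escaped(path)
  ss = StringScanner.new(path)
  parts = [String.new]
  until ss.eos?
    if ss.scan(%r{//})      # escaped slash: keep one literal '/'
      parts.last << '/'
    elsif ss.scan(%r{/})    # lone slash: start a new element
      parts << String.new
    else                    # run of ordinary characters
      parts.last << ss.scan(%r{[^/]+})
    end
  end
  parts
end

p split_escaped('foo/bar // baz/qux')
# => ["foo", "bar / baz", "qux"]
p split_escaped('qux///jorb')
# => ["qux/", "jorb"]
```

Unlike the placeholder trick, this needs no character that is guaranteed to be absent from the input, and it resolves an odd run of slashes by pairing from the left.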

Lookbehind is what you want; until we get it in core ruby, I've provided an example of another way to get the above. I've used commas instead of / just to help make the regexp easier to understand - substitute with "\/" as appropriate:

str = 'foo,bar ,, baz,qux,,,jorb,jing,,blat'
out = []
str.scan( /(.+?[^,],{2}*)(?:,(?!,)|$)/ ){ |a,b|
  out << a.gsub( ',,', ',' )
}
p out
#=> ["foo", "bar , baz", "qux,", "jorb", "jing,blat"]

Thanks all for your suggestions. I had forgotten scan was handy in situations like
this - I ended up going with a solution similar to Gavin's (again using commas
to avoid toothpick-itis)

str.scan( /(?:[^,]|,,)+/ )

Also helped me see that there are some tricky edge-cases where the separator
character is at the beginning or end of an element - I'm going to prohibit this in
the application.

cheers
alex
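
To make those edge cases concrete, here is a small sketch of the scan approach with the unescaping step added. The `elements` helper is a name invented for this example; the pattern is the one from the message above.

```ruby
# Hypothetical helper: ',,' is an escaped comma, a lone ',' separates elements.
def elements(str)
  str.scan(/(?:[^,]|,,)+/).map { |s| s.gsub(',,', ',') }
end

p elements('foo,bar ,, baz,qux')
# => ["foo", "bar , baz", "qux"]

# An odd run of commas is paired from the left, so the element before it
# gets the escaped comma and the last comma acts as the separator.
p elements('qux,,,jorb')
# => ["qux,", "jorb"]

# Empty elements vanish, because the pattern needs at least one character.
p elements(',a,')
# => ["a"]
```

So an element that starts with the separator character cannot be told apart from one that follows an element ending in it, which is a reasonable argument for prohibiting both.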

Whenever I find myself about to do something like my earlier example (pre-allocating `out` and appending to it inside the #scan block), I say to myself:

"Hey, buddy, pre-allocating an array and shoving stuff onto it in a block is neat as an exercise of the closure, but you should be using something like #map."

Unfortunately, it would appear that #scan doesn't automagically map the returned value from each iteration to an array. Man, wouldn't that be nice?

Following is my hackish attempt to make a String#scan_and_map function that does the above.

A few questions for the gurus:
a) Is there a better way to deal with bol? with StringScanner? (Boy, it'd be nice if there was a Regexp#uses_bol_at_start_of_match? method.)

b) Is there a clean way to tell the 'arity' of a regexp (how many captures it has, at max)? (Boy, it'd be nice if there was a Regexp#arity method.)

c) Without knowing the arity, is there a clean/fast way to gather all the 1..n submatches held in StringScanner? (Boy, it'd be nice if StringScanner gave you access to an array of subcaptures as a single property. And if it set the $1..$9 vars.)

require 'strscan'
class String
   def scan_and_map( regexp )
     # A naive check for beginning of line
     use_bol = regexp.inspect =~ /\/(?:\((?:\?:)?)*\^/

     # A naive check for sub-expression groups
     # Will fail for unescaped ( inside [], for example
     use_groups = regexp.inspect =~ /(\^|[^\\])\\{2}*\(/

     results = []
     ss = StringScanner.new( self )
     while !ss.eos?
       ss.scan_until( regexp ) unless ss.match?( regexp )
       if use_bol and not ss.bol?
         ss.pos += 1
       else
         result = ss.scan( regexp )
         if use_groups
           result = (1..9).to_a.map{ |i| ss[i] }
         end
         results << yield( result )
       end
     end
     results
   end
end

str = 'foo,bar ,, baz,qux,,,jorb,jing,,blat'
p str.scan_and_map( /(.+?[^,],{2}*)(?:,(?!,)|$)/ ){ |saved,others|
   saved
}
#=> ["foo", "bar ,, baz", "qux,,", "jorb", "jing,,blat"]
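
On question (b), one possible way to count capture groups without inspecting the regexp source is sketched below. `capture_count` is a hypothetical helper, not a core method; it only relies on `Regexp.union` and `MatchData#captures`, which are core.

```ruby
# Hypothetical helper: number of capture groups in a regexp.
def capture_count(regexp)
  # The empty second alternative guarantees a match against '', and the
  # MatchData still has one slot per group of the original pattern.
  Regexp.union(regexp, //).match('').captures.size
end

p capture_count(/(.+?[^,](?:,,)*)(?:,(?!,)|$)/)  # => 1
p capture_count(/(\d+)-(\d+)/)                   # => 2
p capture_count(/abc/)                           # => 0
```

This is a sketch for plain numbered groups; patterns that mix named and unnamed groups follow Ruby's usual rule that unnamed groups are not captured when named ones are present.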

Gavin Kistner wrote:
> str = 'foo,bar ,, baz,qux,,,jorb,jing,,blat'
> out = []
> str.scan( /(.+?[^,],{2}*)(?:,(?!,)|$)/ ){ |a,b|
>   out << a.gsub( ',,', ',' )
> }
> p out
> #=> ["foo", "bar , baz", "qux,", "jorb", "jing,blat"]

str = 'foo,bar ,, baz,qux,,,jorb,jing,,blat'
p str.scan( /(.+?[^,],{2}*)(?:,(?!,)|$)/ ).map{|x|
  x.first.gsub( /,,/, "," ) }

You can avoid toothpick-itis by using a custom regexp delimiter. E.g.

irb(main):001:0> "this/is/a/path".split(%r{/})
=> ["this", "is", "a", "path"]

instead of

irb(main):002:0> "this/is/a/path".split(/\//)
=> ["this", "is", "a", "path"]

You can even do

irb(main):003:0> "this/is/a/path".split(%r / )
=> ["this", "is", "a", "path"]

or even if there are no spaces in the regexp

irb(main):004:0> "this/is/a/path".split %r /
=> ["this", "is", "a", "path"]

(but that is dangerous because you need a trailing space and was done
more for fun)

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/
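
Applied to the original problem, the %r{} form means the slash version of the scan pattern needs no backslashes at all. A small sketch, assuming the scan-based pattern discussed earlier in the thread:

```ruby
path = 'foo/bar // baz/qux'

# With / delimiters every slash in the pattern has to be escaped ...
toothpicks = /(?:[^\/]|\/\/)+/
# ... with %r{} the same pattern reads much like the comma version.
clean = %r{(?:[^/]|//)+}

p path.scan(clean) == path.scan(toothpicks)
# => true
p path.scan(clean).map { |s| s.gsub('//', '/') }
# => ["foo", "bar / baz", "qux"]
```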

> Following is my hackish attempt to make a String#scan_and_map
> function that does the above.

OK, that was dumb. On the way to work I realized that the above can far
more correctly be implemented simply as:

class String
  def scan_and_map( regexp )
    results = []
    scan( regexp ){ |args|
      results << yield( args )
    }
    results
  end
end

str = 'foo,bar ,, baz,qux,,,jorb,jing,,blat'
p str.scan_and_map( /(.+?[^,],{2}*)(?:,(?!,)|$)/ ){ |saved,others|
  saved
}

#=> ["foo", "bar ,, baz", "qux,,", "jorb", "jing,,blat"]

"Phrogz" <gavin@refinery.com> writes:
> str = 'foo,bar ,, baz,qux,,,jorb,jing,,blat'
> p str.scan_and_map( /(.+?[^,],{2}*)(?:,(?!,)|$)/ ){ |saved,others|
>   saved
> }

p str.scan( /(.+?[^,],{2}*)(?:,(?!,)|$)/ ).map{ |saved,others|
  saved
}

Trivialized?

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Oh hell. Well done. :)

I didn't realize that it returned something different with no block at
all. Nice work, ye who wrote #scan. :)
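
For anyone else who missed it, a short sketch of the three shapes String#scan can take (the sample strings here are made up for illustration):

```ruby
str = 'foo,bar ,, baz,qux'

# No block, no groups: an array of the matched strings.
p str.scan(/(?:[^,]|,,)+/)
# => ["foo", "bar ,, baz", "qux"]

# No block, with groups: an array of arrays, one inner array per match.
pairs = 'a=1,b=2'.scan(/(\w)=(\d)/)
p pairs
# => [["a", "1"], ["b", "2"]]

# Which is why a plain #map with destructuring does the job.
p pairs.map { |key, value| "#{key}:#{value}" }
# => ["a:1", "b:2"]

# With a block, scan yields each match and returns the receiver itself.
result = str.scan(/[^,]+/) { |match| match }
p result.equal?(str)
# => true
```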