Searching/regex

Hello
How can I get the 3rd (nth) occurrence of a match?

a='http...x1x\nhttp.......x2x\nhttp........x3x'
if a=~/((http.*?(x.x)).*?){1,}/ then found=\g3 # WRONG
# result should be 'x3x' and 'http.......x3x'
btw: the nth match of my regex

Thanks for your help
B.

If the “\n” are meant to represent newlines then you could take advantage of ^ matching at the start of a line, if that’s important:

    19:48 $ pry
    [1] pry(main)> a = "http...x1x\nhttp.......x2x\nhttp........x3x"
    => "http...x1x\nhttp.......x2x\nhttp........x3x"
    [2] pry(main)> if (matches = a.scan(/^http.*?(x.x)/)) && matches.size > 2
    [2] pry(main)* puts "found #{matches[2]}"
    [2] pry(main)* end
    found ["x3x"]
    => nil

Otherwise:

    [3] pry(main)> a = 'http...x1x\nhttp.......x2x\nhttp........x3x'
    => "http...x1x\\nhttp.......x2x\\nhttp........x3x"
    [4] pry(main)> a.scan(/http.*?(x.x)/).map(&:first)[2]
    => "x3x"

might be interesting.

The approach I would take would depend on the nature of the data and what kind of efficiency concerns you have. For reasonably sized strings, String#scan will build an array of matches which you can use.
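A small sketch of what each step of `a.scan(...).map(&:first)[2]` produces may help here. When the pattern contains a capture group, `String#scan` returns one inner array of captures per match, so `.first` is applied to each inner array and `[2]` then picks the third result. The extra outer group in the last part is an assumption added for illustration, to get both the whole match and the inner capture:

```ruby
a = "http...x1x\nhttp.......x2x\nhttp........x3x"

# With one capture group, scan returns an array of one-element arrays.
captures = a.scan(/http.*?(x.x)/)
p captures              # [["x1x"], ["x2x"], ["x3x"]]

# map(&:first) unwraps each inner array; [2] then picks the third element.
firsts = captures.map(&:first)
p firsts                # ["x1x", "x2x", "x3x"]
p firsts[2]             # "x3x"

# Assumption for illustration: wrap the whole pattern in a second group so
# that each match yields both the full text and the inner capture.
pairs = a.scan(/(http.*?(x.x))/)
full, inner = pairs[2]
p full                  # "http........x3x"
p inner                 # "x3x"
```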

Hope this helps,

Mike

···

On Jan 2, 2016, at 7:30 PM, A Berger <aberger7890@gmail.com> wrote:

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.

Hi
Thanks!
To understand that in detail: is .first called on each match, and then the
3rd array element selected?

For extracting some URLs that's a nice solution, but is there a way to avoid
evaluating all matches? (With 1 million matches, the match-data array would
grow...)
Thanks

···

On 03.01.2016 at 02:01, "Mike Stok" <mike@stok.ca> wrote:

If you have millions of matches then you might consider stream processing, which allows you to make use of laziness.

An example of doing this might be:

a = <<ETX
doesn't start with http
http...x1x
http.......x2x
http.....yxy
http........x3x
http...x4x
ETX

def nth_match(enumerable, n, pattern)
  enumerable
    .map { |l| pattern.match(l) }   # MatchData or nil for each line
    .reject(&:nil?)                 # keep only the lines that matched
    .map { |m|
      puts "retrieving #{m[1].inspect}" # just here to show when it is called
      m[1]
    }
    .drop(n - 1)                    # skip the first n - 1 captures
    .first
end

puts "got #{nth_match(a.lines, 3, /http.*?(x.x)/)}"
puts "got #{nth_match(a.lines.lazy, 3, /http.*?(x.x)/)}"

__END__

If you run that you should see that the version with .lazy does not retrieve x4x.

You should try benchmarking various approaches to see what works best for you in terms of time and memory use for real representative data. The lazy enumerators are not free in terms of performance, but have the advantage of working well on long (or infinite) streams.
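As a rough starting point for such a comparison, here is a minimal timing sketch. The generated test data, the `elapsed` helper and the choice of 200,000 lines are all assumptions made for illustration; real measurements should use representative data, and this sketch only measures time, not memory:

```ruby
# Hypothetical test data: 200_000 lines, each containing exactly one match.
data = Array.new(200_000) { |i| "http...x#{i % 10}x" }.join("\n")
pattern = /http.*?(x.x)/
n = 3

# Hypothetical helper: runs the block and returns [result, seconds taken].
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Eager: scan collects every match before the nth one is picked.
eager, eager_time = elapsed { data.scan(pattern).map(&:first)[n - 1] }

# Lazy: lines are matched one at a time and processing stops after the nth.
lazy, lazy_time = elapsed do
  data.each_line.lazy
      .map { |line| pattern.match(line) }
      .reject(&:nil?)
      .map { |m| m[1] }
      .drop(n - 1)
      .first
end

puts "eager: #{eager.inspect} in #{eager_time.round(4)}s"
puts "lazy:  #{lazy.inspect} in #{lazy_time.round(4)}s"
```

Both variants should return the same capture; which one is faster will depend on how early the nth match occurs and how large the input is.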

Hope this helps,

Mike

···

On Jan 3, 2016, at 8:19 AM, A Berger <aberger7890@gmail.com> wrote:
On 03.01.2016 at 02:01, "Mike Stok" <mike@stok.ca> wrote:
> > On Jan 2, 2016, at 7:30 PM, A Berger <aberger7890@gmail.com> wrote:

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.

> If you run that you should see that the version with .lazy does not
> retrieve x4x.

That code isn't even syntactically correct.

> You should try benchmarking various approaches to see what works best for
> you in terms of time and memory use for real representative data. The lazy
> enumerators are not free in terms of performance, but have the advantage of
> working well on long (or infinite) streams.

We do not even know where the data comes from. So far we have only seen a
single String instance as the source. (Btw, \n probably does not work as
intended in single quotes.)
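A short sketch of the quoting difference, using made-up strings for illustration: in single quotes `\n` stays a backslash followed by the letter n, while in double quotes it becomes a newline character.

```ruby
single = 'a\nb'   # backslash and "n" are kept literally
double = "a\nb"   # \n is interpreted as a newline

p single.length       # 4
p double.length       # 3
p single.lines.size   # 1
p double.lines.size   # 2
```

This matters for patterns anchored with ^, which only match at the start of the string or after a real newline.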

For a String a lazy approach to find the nth occurrence of a match would be:

def lazy_match(input, rx, n)
  raise ArgumentError, "Invalid n: #{n.inspect}" unless n > 0

  input.scan rx do |m|
    n -= 1
    return m if n == 0
  end

  nil
end

irb(main):020:0> lazy_match a, /http.*?(x.x)/, 3
=> ["x3x"]
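The original question asked for both the capture and the whole matched text. One possible variation, sketched here under assumptions, steps through the string with `Regexp#match` and a start offset, and returns the MatchData of the nth match. The name `nth_match_data` is hypothetical, and the handling of empty matches is a simplification that may differ from what `String#scan` does:

```ruby
# Hypothetical helper: returns the MatchData of the nth match, or nil.
def nth_match_data(input, pattern, n)
  raise ArgumentError, "Invalid n: #{n.inspect}" unless n.is_a?(Integer) && n > 0

  pos = 0
  match = nil
  n.times do
    match = pattern.match(input, pos)
    return nil if match.nil?
    # Continue after this match; step one extra character past an empty
    # match so the loop always makes progress.
    pos = match.end(0) == match.begin(0) ? match.end(0) + 1 : match.end(0)
  end
  match
end

a = "http...x1x\nhttp.......x2x\nhttp........x3x"
m = nth_match_data(a, /http.*?(x.x)/, 3)
p m[0]   # "http........x3x"
p m[1]   # "x3x"
p nth_match_data(a, /http.*?(x.x)/, 4)   # nil
```

Like the block form of scan above, this stops as soon as the nth match is found and does not build an array of all matches.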

Kind regards

robert

···

On Sun, Jan 3, 2016 at 5:18 PM, Mike Stok <mike@stok.ca> wrote:

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can -
without end}
http://blog.rubybestpractices.com/