Regexp Question: Checking for [joe][/joe] pairs

Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)

···

--
Posted via http://www.ruby-forum.com/.

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Joe Peck wrote:

···

Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)

Hi Joe --

Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)

   require 'test/unit'

   class JoeTest < Test::Unit::TestCase
     def setup
       @re = /(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/
     end

     def test_ok
       assert("[joe]abc[/joe]".match(@re))
     end

     def test_broken
       # ??? <--- fill in the blank :slight_smile:
     end
   end

David

···

On Thu, 21 Dec 2006, Joe Peck wrote:

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black\)
    aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)

Joe Peck wrote:

Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)

--
Posted via http://www.ruby-forum.com/\.

@x = "[joe] [/joe] [joe][/joe] [joe] foo [/joe]"
count = @x.scan(/\[joe\](.*?)\[\/joe\]/m).flatten.
  reject{|s| ""==s}.size
p count

Daniel Finnie wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.

···

--
Posted via http://www.ruby-forum.com/\.

Hi --

···

On Thu, 21 Dec 2006, Daniel Finnie wrote:

Joe Peck wrote:

Hey, I've got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

\d is part of \w, so [\s\w] would be OK. But . is very different. It
does not include newline (by default), and *does* include punctuation.

David

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black\)
    aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)

The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached.
Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a
closing one.

···

--
Posted via http://www.ruby-forum.com/.

Hi --

···

On Thu, 21 Dec 2006, Joe Peck wrote:

Daniel Finnie wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.

That's because {1,3} doesn't mean there can't be another. Usually
you'd anchor it or surround it with something else, like:

   /Xa{1,3}X/

so it can't keep repeating.

David

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black\)
    aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)

The problem is that the Regexp is not anchored to the start and ends of the string.

/^(?!\[joe\])*.?(\[joe\].+?\[\\joe\]){1,3}(?!\[joe\])*.?$/

Joe Peck wrote:

···

Daniel Finnie wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of my problem. It seems to accept @x even if it contains more than 3 [joe][/joe] pairs.

Joe Peck wrote:

The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached. Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a closing one.
  

I missed the beginning of this thread, but if I recall correctly from my course on formal languages, this sort if thing can't be done with regular expressions.

Regular expressions can be used to test whether a string belongs to a certain regular language, which is a subset of all possible languages (where a language is a set of strings). Regular expressions are equivalent to finite state automata in this respect. Since a finite state automata can only be in a finite number of states. You'd like to match a possibly infinitely large number of [joe][/joe] pairs. The FSA would need a new state for every extra [joe] it reads to remember it still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren't keen on matching this sort of stuff. Stacks on the other hand seem to be custom designed for these purposes.

A.

Joe Peck wrote:

The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached.
Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a
closing one.

If I am reading your specs correctly:

  /^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/

cheers,
andrew

···

--
Posted via http://www.ruby-forum.com/\.

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You'd like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren't keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.

It doesn't sound like Chinese :slight_smile:

If wouldn't have to be an infinite amount of states. Let's say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I'll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

···

--
Posted via http://www.ruby-forum.com/\.

Hi --

Joe Peck wrote:

The problem is I don't want it to accept things like:
"[joe] hello [joe] how are [/joe] you"
where there are two opening tags before a closing tag is reached. Similarly, I don't want to accept something like:
"hey [joe] it's hot today[/joe] where [joe] is the ac"
where there is one correct pair but then an opening tag without a closing one.

I missed the beginning of this thread, but if I recall correctly from my course on formal languages, this sort if thing can't be done with regular expressions.

Regular expressions can be used to test whether a string belongs to a certain regular language, which is a subset of all possible languages (where a language is a set of strings). Regular expressions are equivalent to finite state automata in this respect. Since a finite state automata can only be in a finite number of states. You'd like to match a possibly infinitely large number of [joe][/joe] pairs. The FSA would need a new state for every extra [joe] it reads to remember it still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren't keen on matching this sort of stuff. Stacks on the other hand seem to be custom designed for these purposes.

If you're using scan, though, doesn't that mean that you're not really
trying to match one string to the regex, but rather a series of
strings? That means that the state machine gets completely restarted
as the scan method goes along the string. I think that's a different
situation. You're not really saying: match all the pairs; you're
saying: find the first substring that has a matching pair, then
discard it and don't worry about backtracking through it; etc.

David

···

On Fri, 22 Dec 2006, Arne Brasseur wrote:

--
Q. What's a good holiday present for the serious Rails developer?
A. RUBY FOR RAILS by David A. Black (http://www.manning.com/black\)
    aka The Ruby book for Rails developers!
Q. Where can I get Ruby/Rails on-site training, consulting, coaching?
A. Ruby Power and Light, LLC (http://www.rubypal.com)

Andrew Johnson wrote:

Joe Peck wrote:
> The problem is I don't want it to accept things like:
> "[joe] hello [joe] how are [/joe] you"
> where there are two opening tags before a closing tag is reached.
> Similarly, I don't want to accept something like:
> "hey [joe] it's hot today[/joe] where [joe] is the ac"
> where there is one correct pair but then an opening tag without a
> closing one.

If I am reading your specs correctly:

  /^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/

[
  "good [joe] [/joe] [joe] [/joe]",
  "bad [/joe] [joe] [/joe]",
  "bad [joe] [/joe] [/joe]"
].each { |s|
  p s =~
/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
}

To my knowledge, you can't do this with Ruby's current regexp engine,
though it is possible with Perl and .NET. Both of those languages
support something roughly analogous to a stack, within the expression.
I don't think Ruby 1.8's regexp engine is powerful enough to handle
this, but I would be happy to be proven wrong.

It's worth remembering that what we call 'regular expressions' these
days don't actually match the formal definition of that term, and are
much more powerful in some ways.

···

On 12/21/06, Joe Peck <joe@notsleepy.com> wrote:

> Regular expressions can be used to test whether a string belongs to a
> certain regular language, which is a subset of all possible languages
> (where a language is a set of strings). Regular expressions are
> equivalent to finite state automata in this respect. Since a finite
> state automata can only be in a finite number of states. You'd like to
> match a possibly infinitely large number of [joe][/joe] pairs. The FSA
> would need a new state for every extra [joe] it reads to remember it
> still needs to consume a matching [/joe] for it.
>
> If this sounds like Chinese, just remember regexpes aren't keen on
> matching this sort of stuff. Stacks on the other hand seem to be custom
> designed for these purposes.
>
> A.
It doesn't sound like Chinese :slight_smile:

If wouldn't have to be an infinite amount of states. Let's say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I'll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

Joe Peck wrote:

It doesn't sound like Chinese :slight_smile:

If wouldn't have to be an infinite amount of states. Let's say these are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds [/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again in this state, then fails. If it finds [/joe], increments count by 1 and moves to state 1.

If count goes above 3, fails.

But maybe I'll use something besides a regexp, although I thought there would be a pretty easy way to do it.

You're right, the infinte amount of states is when you have nesting, e.g. parentheses in programming languages.

I just came up with this solution, but it looks like that's not what you're after.

count=0
'[joe] hello [joe] how are [/joe] you'.scan(/\[joe\]|\[\/joe\]/).each do

m>

    if m =~ /[joe]/
        count+=1
    else
        count-=1
    end
end

if count!=0
    puts "Your joes dont match up!"
end

A.

Joe Peck wrote:

> Regular expressions can be used to test whether a string belongs to a
> certain regular language, which is a subset of all possible languages
> (where a language is a set of strings). Regular expressions are
> equivalent to finite state automata in this respect. Since a finite
> state automata can only be in a finite number of states. You'd like to
> match a possibly infinitely large number of [joe][/joe] pairs. The FSA
> would need a new state for every extra [joe] it reads to remember it
> still needs to consume a matching [/joe] for it.
>
> If this sounds like Chinese, just remember regexpes aren't keen on
> matching this sort of stuff. Stacks on the other hand seem to be custom
> designed for these purposes.
>
> A.
It doesn't sound like Chinese :slight_smile:

If wouldn't have to be an infinite amount of states. Let's say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I'll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

--
Posted via http://www.ruby-forum.com/\.

If a regular expression can't do it, does that mean we can't use
a regular expression?

No. We'll still use a regexp and add some code to help it.

If all the pairs are matched, then after partitioning and zipping
we wind up with the original pairs.

[
  "ok [joe] ok [/joe] right",
  "ok [joe] [/joe] [joe] foo [/joe]",
  "bad [joe] [/joe] foo [joe]",
  "bad [joe] [/joe] foo [/joe]",
  "bad [joe] [joe] foo [/joe]",
  "bad [joe] [joe] ",
  "bad [/joe] [joe] ",
  "bad [/joe] [/joe] "
].each { |s|
  ary = s.scan( %r{\[/?joe\]} )
  p ary
  if ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)
}.flatten
    puts "good\n"
  else
    puts "bad\n"
  end
}

--- output -----
["[joe]", "[/joe]"]
good
["[joe]", "[/joe]", "[joe]", "[/joe]"]
good
["[joe]", "[/joe]", "[joe]"]
bad
["[joe]", "[/joe]", "[/joe]"]
bad
["[joe]", "[joe]", "[/joe]"]
bad
["[joe]", "[joe]"]
bad
["[/joe]", "[joe]"]
bad
["[/joe]", "[/joe]"]
bad

William James wrote:

Andrew Johnson wrote:

If I am reading your specs correctly:

  /^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/

[
  "good [joe] [/joe] [joe] [/joe]",
  "bad [/joe] [joe] [/joe]",
  "bad [joe] [/joe] [/joe]"
].each { |s|
  p s =~
/^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
}

Quite right:

[
  "good [joe] [/joe] [joe] [/joe]",
  "bad [/joe] [joe] [/joe]",
  "bad [joe] [/joe] [/joe]"
].each { |s|
  p s if s =~
/^(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*$/
}

cheers
andrew

···

--
Posted via http://www.ruby-forum.com/\.

Andrew Johnson wrote:

William James wrote:
> Andrew Johnson wrote:
>> If I am reading your specs correctly:
>>
>> /^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
>
> [
> "good [joe] [/joe] [joe] [/joe]",
> "bad [/joe] [joe] [/joe]",
> "bad [joe] [/joe] [/joe]"
> ].each { |s|
> p s =~
> /^(((?!\[joe\]).)*(\[joe\]((?!\[joe\]).)+\[\/joe\])){1,3}((?!\[joe\]).)*$/
> }

Quite right:

[
  "good [joe] [/joe] [joe] [/joe]",
  "bad [/joe] [joe] [/joe]",
  "bad [joe] [/joe] [/joe]"
].each { |s|
  p s if s =~
/^(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*$/
}

cheers
andrew

Yours is faster for very short strings; longer strings allow the array
method to pull ahead.

require 'benchmark'

$n = 10_000
$strings = [
  "good [joe] Wasn't that what [i]he[/i] was seeking? [/joe]
     [joe] Can't you [b]see[/b] that? [/joe]",
  "bad was Peck's boy [/joe] [joe] But he'll never know. [/joe]",
  "bad to the bone [joe] Or will he?! [/joe] mish mash mush
   Marching on Tom Tidler's ground fatigues me. [/joe]",
  "bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]",
  "bad: too few"
]

def regexp
  $regexp_good = 0
  $n.times{ $strings.each { |s|
    $regexp_good += 1 if s =~
/\A(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*\Z/m
  } }
end
def array
  $array_good = 0
  $n.times{ $strings.each { |s|
    ary = s.scan( %r{\[/?joe\]} )
    if [2,4,6].include?(ary.size) and
      ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)}.
        flatten
      $array_good += 1
    end
  } }
end

Benchmark.bmbm do |x|
  x.report("regexp") { regexp }
  x.report("array") { array }
end

puts ; p $regexp_good, $array_good

Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

             user system total real
regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000

William James wrote:
[snip]

Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

             user system total real
regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000

The regex engine makes a difference in this case -- ruby1.8.5 + the
oniguruma engine gives:

  Rehearsal ------------------------------------------
  regexp 2.740000 0.010000 2.750000 ( 2.960385)
  array 3.120000 0.000000 3.120000 ( 3.120808)
  --------------------------------- total: 5.870000sec

               user system total real
  regexp 2.750000 0.000000 2.750000 ( 2.743031)
  array 3.140000 0.000000 3.140000 ( 3.137098)

  10000
  10000

cheers,
andrew

···

--
Posted via http://www.ruby-forum.com/\.