Rubys Regular Expression Engine

Some weeks ago I sent a message here, "Onigurama - Problem with Subexpression Call", sent on Montag, 27. Juni
2005 23:22 - unfortunately there was no comment from someone. So I try to ask the major part again.

If I'm correcly informed, the Regular Expression Engine used by Ruby will be Onigurama, starting in Ruby 1.9.
Where is the right place to sent Problem Reports or Change Recommendations against Onigurama? The Engine
itself is not only used in the Ruby environment, so I don't know what to do with Problems that have to be
addressed to Onigurama in the Ruby environment.

***** Problem Report *****

Onigurama does not recognize left recursion correctly.

As I wrote in my earlier message

···

-----------------------------------------------------------
orgstring= "5-((3+4)*5)+6+xx"
pattern = /^(?<bal>[^()]*(\(\g<bal>\)[^()]*)*)$/

puts "\nMuster: #{pattern.inspect}\nString: '#{orgstring}'"
if (res = orgstring.match(pattern)) then
  puts "Match: '#{res}'"
else
  puts "+++++ No Match"
end
-----------------------------------------------------------

produces the Error Message

-----------------------------------------------------------
never ending recursion: /^(?<bal>[^()]*(\(\g<bal>\)[^()]*)*)$/
-----------------------------------------------------------

which is definitely wrong. After using a useless alternative in this expression it works fine

-----------------------------------------------------------
orgstring= "5-((3+4)*5)+6+xx"
pattern = /^(?<bal>[^()]*((x|(\(\g<bal>\)))[^()]*)*)$/

puts "\nMuster: #{pattern.inspect}\nString: '#{orgstring}'"
if (res = orgstring.match(pattern)) then
  puts "Match: '#{res}'"
else
  puts "+++++ No Match"
end
-----------------------------------------------------------

Can anyone give me a hint where to place things related to Ruby/Onigurama?

Best regards, Wolfgang

--
Wolfgang Nádasi-Donner
wonado@donnerweb.de

Wolfgang N�dasi-Donner wrote:
           ^-- Sorry, but your missing proper encoding information.

Some weeks ago I sent a message here, "Onigurama - Problem with
Subexpression Call", sent on Montag, 27. Juni 2005 23:22 -
unfortunately there was no comment from someone. So I try to ask the
major part again.

It's easier to track if you point to the number of the message instead.
It can be found at http://www.ruby-talk.org/\.

Onigurama does not recognize left recursion correctly.

Why should it? (See below.)

As I wrote in my earlier message

-----------------------------------------------------------
orgstring= "5-((3+4)*5)+6+xx"
pattern = /^(?<bal>[^()]*(\(\g<bal>\)[^()]*)*)$/

You make my eyes bleed... Please don't use regular expressions for
matching parentheses, it simply can't be done using regular expressions
(well, .NET-"regular expressions" can, and perhaps there's some support
for it in Oniguruma as well, but definitely not in the way you're doing
it.).

Use a context free grammar for that instead. You can use a tool like
racc to do that.

puts "\nMuster: #{pattern.inspect}\nString: '#{orgstring}'"
if (res = orgstring.match(pattern)) then
  puts "Match: '#{res}'"
else
  puts "+++++ No Match"
end
-----------------------------------------------------------

produces the Error Message

-----------------------------------------------------------
never ending recursion: /^(?<bal>[^()]*(\(\g<bal>\)[^()]*)*)$/
-----------------------------------------------------------

which is definitely wrong.

No, it's definitely right. Your trying to match a <bal> inside a <bal>,
right? That sounds like infinite recursion to me,
        nikolai

···

--
Nikolai Weibull: now available free of charge at http://bitwi.se/\!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

snip >>>>>

Can anyone give me a hint where to place things related to Ruby/Onigurama?

snap >>>>>

Btw, this major question is still open.

Best regards, Wolfgang Nadasi-Donner

snip >>>>>

Onigurama does not recognize left recursion correctly.

Why should it? (See below.)

snap >>>>>

Because it is described in the Onigurama documentation

snip >>>>>

As I wrote in my earlier message

-----------------------------------------------------------
orgstring= "5-((3+4)*5)+6+xx"
pattern = /^(?<bal>[^()]*(\(\g<bal>\)[^()]*)*)$/

You make my eyes bleed... Please don't use regular expressions for
matching parentheses, it simply can't be done using regular expressions
(well, .NET-"regular expressions" can, and perhaps there's some support
for it in Oniguruma as well, but definitely not in the way you're doing
it.).

snap >>>>>

It is normal usage like the bal-Pattern in Snobol4. I agree that this is not good for fixed programms, but my
usage of Ruby is mainly interactive data analysis (now done using IRB, later using a special shell). It is
very helpful in this case,

snip >>>>>

-----------------------------------------------------------
never ending recursion: /^(?<bal>[^()]*(\(\g<bal>\)[^()]*)*)$/
-----------------------------------------------------------

which is definitely wrong.

No, it's definitely right. Your trying to match a <bal> inside a <bal>,
right? That sounds like infinite recursion to me,

snap >>>>>

Sorry, strong disagreement. The part "(\(\g<bal>\)[^()]*)*" will either recognize nothing or it must consume a
"("-Character before coming to the recursive "\b<bal>"-Part.

Best regards, Wolfgang Nadasi-Donner

Wolfgang N�dasi-Donner wrote:

> > Onigurama does not recognize left recursion correctly.

> Why should it? (See below.)

Because it is described in the Onigurama documentation

Hm, OK, sorry. It works in 3.8.5
(サービス終了のお知らせ) it seems. Btw, can't that
regular expression be simplified a bit? The two * could be turned into
a single ? and the [^()] can be turned into [^(] and [^)], or?,
        nikolai

···

--
Nikolai Weibull: now available free of charge at http://bitwi.se/\!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}

snip >>>>>

Hm, OK, sorry. It works in 3.8.5
(サービス終了のお知らせ) it seems. Btw, can't that
regular expression be simplified a bit? The two * could be turned into
a single ? and the [^()] can be turned into [^(] and [^)], or?,
        nikolai

snip >>>>>

I really do think so - but I'm starting experiments with Ruby/Onigurama now. I try to find some basic building
blocks that can be used interactively by '#{patternvar}'.

The above Regex now works (it is still unreadable), but a lot has to be done. Some things related to the
produced MatchData object are still unclear for me if one is using named groups. The Example

Code >>>>>>>>>>>>>>>

orgstring= <<EOT
firstproc(p1,p2,p3(p31,p32),p4(p41,p42(p421,p422),p43),p5)
nochneproc ( einfach, nur , ein, paar, parameter )
dieletzte ( auch, mit ( fast ) , allem )
EOT

pattern1 = /(?<n>\w+)\s*(?<p>\((?<bal>[^()]*?((x|(\(\g<bal>\)))[^()]*?)*?)\))/
pattern2 = /[(,]\s*(?<bal>[^()]*?((x|(\(\g<bal>\)))[^()]*?)*?)(?=\s*[,)])/

orgstring.scan(pattern1) do
  puts "\n---- Name: '#{$~[1]}'"
  $~[2].scan(pattern2) do
    puts "Parameter: '#{$~[1]}'"
  end
end

produces

Data >>>>>>>>>>>>>>>>

---- Name: 'firstproc'
Parameter: 'p1'
Parameter: 'p2'
Parameter: 'p3(p31,p32)'
Parameter: 'p4(p41,p42(p421,p422),p43)'
Parameter: 'p5'

---- Name: 'nochneproc'
Parameter: 'einfach'
Parameter: 'nur'
Parameter: 'ein'
Parameter: 'paar'
Parameter: 'parameter'

---- Name: 'dieletzte'
Parameter: 'auch'
Parameter: 'mit ( fast )'
Parameter: 'allem'

It's still an early step.

Best regards, Wolfgang Nadasi-Donner

snip >>>>>

.... Btw, can't that
regular expression be simplified a bit? The two * could be turned into
a single ? and the [^()] can be turned into [^(] and [^)], or?

snap >>>>>

It depends on the definition of 'balanced' needed. If [^(] is used the string 'abcde(fgh' will be recognized
as balanced.

The two * are necessary if one needs to match balanced strings, independant of its structure, e.g.
'abvv(d(jhj)kjj)hd(jhb(j))hdjh'. This form has an advantage too. it is related to Friedl's loop enrolling
structure "<normal>(<special><normal>*)*' and avoids endless matching.

Much more complicated is the question "greedy or not greedy" in the recursion.

Best regards, Wolfgang Nadasi-Donner

Wolfgang N�dasi-Donner wrote:

> .... Btw, can't that regular expression be simplified a bit? The
> two * could be turned into a single ? and the [^()] can be turned
> into [^(] and [^)], or?

It depends on the definition of 'balanced' needed. If [^(] is used the
string 'abcde(fgh' will be recognized as balanced.

I assume you mean 'abcde)fgh'.

The two * are necessary if one needs to match balanced strings,
independant of its structure, e.g. 'abvv(d(jhj)kjj)hd(jhb(j))hdjh'.
This form has an advantage too. it is related to Friedl's loop
enrolling structure "<normal>(<special><normal>*)*' and avoids endless
matching.

Hm, yes that's true,
        nikolai

···

--
Nikolai Weibull: now available free of charge at http://bitwi.se/\!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}