Defining regexp's and variables set by them

Sometimes I get in a situation where I have a case statement
with several when-clauses, each of which is a regular
expression. Some of those regular-expressions may be
rather complicated. If that case statement is going to be
processed many times, then I like to define objects via
Regexp.new, and use those pre-compiled objects in the
when clauses:

    # Single quotes keep the backslash in \d; in double quotes it would be lost.
    re_simple = Regexp.new('check (\d+)')
    ...
    very_huge_file.each_line { |aline|
        case aline
            when re_simple
                check_number = $1
            when re_other
       ...

The downside to this is that the regular-expression is now
defined somewhere "far away" from the when-clause that
uses it. After some testing I may need to make changes to
the regexp, such as:

    re_simple = Regexp.new('(check|test) (\d+)')

The thing is, by changing 'check' to '(check|test)', I have to
remember that the when clause also needs to change from
referencing $1 to referencing $2. Note that in this case I do
not *care* whether it matched 'check' as opposed to 'test'.
Either word is acceptable to me, so I do not need the actual
value of $1 for anything.
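
A minimal sketch of the renumbering problem described above. The sample line and the variable names `re_before`, `re_after`, `m_before` and `m_after` are only for illustration; they are not from the real script.

```ruby
line = "check 41 passed"

# The pattern before and after adding the (check|test) alternation.
re_before = Regexp.new('check (\d+)')
re_after  = Regexp.new('(check|test) (\d+)')

m_before = re_before.match(line)
m_after  = re_after.match(line)

p m_before[1]   # the number is in group 1
p m_after[1]    # group 1 now holds the keyword
p m_after[2]    # the number has moved to group 2
```

Any code that still reads group 1 after the change silently gets the keyword instead of the number.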

I was kinda wondering if it would make sense for ruby to
support something like:

    re_simple = Regexp.new('check (\d+)') { cnum = $1 }

so I could then have the when clause say:

            when re_simple
                check_number = cnum

I realize this is a trivial example, but as the expressions get
more involved, and the case-statement has many when's, it
would be nice if I could have the compiled regular-expression
set values on variable names that *I* pick, in addition to the
standard values in a MatchData object.

Another thing that this might let me do, is something like:

            when re_simple, re_other, re_yetmore
                check_number = cnum

where the different regular-expressions may find 'cnum' in
different positions in the string ($1 vs $2 vs $3), and yet
they could all be processed in the same when-clause.

Or is there already some way to do this?
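
For readers on Ruby 1.9 or later: named capture groups, which did not exist when this thread was written, seem to cover much of this. The sketch below is only an illustration. The three patterns, the constant names and the `check_number` method are made up; the one thing they share is a group called `cnum`, which sits at a different position in each pattern.

```ruby
# Hypothetical patterns: "cnum" is group 1, 2 and 3 respectively.
RE_SIMPLE  = /check (?<cnum>\d+)/
RE_OTHER   = /(?<kind>test|probe) (?<cnum>\d+)/
RE_YETMORE = /(?<host>\w+): (?<kind>\w+) step (?<cnum>\d+)/

def check_number(aline)
  case aline
  when RE_SIMPLE, RE_OTHER, RE_YETMORE
    # $~ is the MatchData of whichever pattern matched; look the group up by name.
    $~[:cnum].to_i
  end
end

p check_number("check 12")
p check_number("probe 7")
p check_number("web1: build step 33")
p check_number("nothing here")
```

Because the when clause asks for the group by name, adding or removing other groups in a pattern does not change the code that uses it.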

···

--
Garance Alistair Drosehn = drosihn@gmail.com
Senior Systems Programmer or gad@FreeBSD.org
Rensselaer Polytechnic Institute; Troy, NY; USA

Garance A Drosehn wrote:

The thing is, by changing 'check' to '(check|test)', I have to
remember that the when clause also needs to change from
referencing $1 to referencing $2. Note that in this case I do
not *care* whether it matched 'check' as opposed to 'test'.
Either word is acceptable to me, so I do not need the actual
value of $1 for anything.

Hi Garance,

You can prevent the grouped expression from creating a
back-reference by using the ?: extension ...

s = 'start: check 41 is less than ...'

re_simple = /check (\d+)/
s =~ re_simple
p [$1, $2] #-> ["41", nil]

re_simple = /(?:check|test) (\d+)/
s =~ re_simple
p [$1, $2] #-> ["41", nil]

HTH,

daz

If that case statement is going to be
processed many times, then I like to define objects via
Regexp.new, and use those pre-compiled objects in the
when clauses:

    re_simple = Regexp.new('check (\d+)')
    ...
    very_huge_file.each_line { |aline|
        case aline
            when re_simple
                check_number = $1
            when re_other
       ...

Why? I don't think you have to do this to avoid recompiling each time.
Ruby should compile it once when the program is first parsed, and then
recompiles are not needed (unless your regexp has an interpolation).

just so long as you do this:

  when /some_rex/

and not this:

  when Regexp.new("some_rex")
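
One way to see the difference, as a rough sketch: a literal without interpolation hands back the same Regexp object every time the line runs, while `Regexp.new` builds a fresh object on each call. The method names `literal_rex` and `constructed_rex` are made up for this illustration, and the check assumes Ruby 1.9 or later.

```ruby
def literal_rex
  /some_rex/                 # literal: built once when the code is parsed
end

def constructed_rex
  Regexp.new("some_rex")     # constructor call: runs every time
end

same_literal     = literal_rex.equal?(literal_rex)
same_constructed = constructed_rex.equal?(constructed_rex)

p same_literal
p same_constructed
```

`equal?` compares object identity, so it shows whether a new object was created, not whether the two patterns are equivalent.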

I was kinda wondering if it would make sense for ruby to
support something like:

    re_simple = Regexp.new('check (\d+)') { cnum = $1 }

Ah! a man after my own heart. I think this would be just lovely. It's
never quite so simple, tho. What about, eg re_simple.to_s?

You might want to look at my Reg library/pattern matching language. If
I ever finish the required features, it will support things like what
you have above. Not quite the same syntax, it'd look more like:

  re_simple=/check (\d+)/>>BR[1]

and then

  case str
  when re_simple then check_number=str

Ok, that probably makes no sense to anyone but me yet.

···

On 7/31/05, Garance A Drosehn <drosihn@gmail.com> wrote:

You could define it like this:

bschroed@black:~/svn/projekte/ruby-things$ cat regexp_data.rb
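class DataRegexp < Regexp
  def initialize(regexp, &block)
    @block = block
    super(regexp)
  end

  # Run the block against each successful match so it can fill in userdata.
  def match(str)
    result = super(str)
    return nil unless result   # no match: nothing to decorate
    # Give this one MatchData object its own userdata hash.
    class << result
      def userdata
        @userdata ||= {}
      end
    end
    @block[result] if @block
    result
  end
end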

re_simple = DataRegexp.new('check (\d+)') { | mdata |
  mdata.userdata[:check_number] = mdata[1].to_i if mdata
}

if match = re_simple.match("Something")
  puts "Something matched"
end

if match = re_simple.match("check 12")
  puts "Checking #{match.userdata[:check_number]}"
end
bschroed@black:~/svn/projekte/ruby-things$ ruby regexp_data.rb
Checking 12

regards,

Brian
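
One caveat when combining this with a case statement: as far as I can tell, `when` calls `Regexp#===`, which does not go through an overridden `match`, so the block above would not run there. A possible workaround is a small wrapper that defines `===` itself. This is only a sketch under that assumption; `TaggedRegexp` and the two sample patterns are hypothetical, not part of Ruby or of the code above.

```ruby
# TaggedRegexp is a hypothetical helper for illustration only.
class TaggedRegexp
  attr_reader :userdata

  def initialize(regexp, &block)
    @regexp   = regexp
    @block    = block
    @userdata = nil
  end

  # case/when calls ===, so do the match and run the block here.
  def ===(str)
    mdata = @regexp.match(str)
    @userdata = mdata ? @block.call(mdata) : nil
    !mdata.nil?
  end
end

# The number is group 1 in one pattern and group 2 in the other.
re_simple = TaggedRegexp.new(/(?:check|test) (\d+)/) { |m| { cnum: m[1].to_i } }
re_other  = TaggedRegexp.new(/run (\w+) as (\d+)/)   { |m| { cnum: m[2].to_i } }

results = ["check 12", "run job as 7", "nothing"].map do |line|
  case line
  when re_simple then re_simple.userdata[:cnum]
  when re_other  then re_other.userdata[:cnum]
  end
end

p results
```

The trade-off is that the result of the last match lives in the pattern object, so this is not safe to share between threads.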

···

On 01/08/05, Garance A Drosehn <drosihn@gmail.com> wrote:

On 8/1/05, Garance A Drosehn <drosihn@gmail.com> wrote:
>
> The downside to this is that the regular-expression is now
> defined somewhere "far away" from the when-clause that
> uses it. After some testing I may need to make changes to
> the regexp, such as:
>
> re_simple = Regexp.new('(check|test) (\d+)')
>
> The thing is, by changing 'check' to '(check|test)', I have to
> remember that the when clause also needs to change from
> referencing $1 to referencing $2. [...]
>
> I was kinda wondering if it would make sense for ruby to
> support something like:
>
> re_simple = Regexp.new('check (\d+)') { cnum = $1 }
>
> so I could then have the when clause say:
>
> when re_simple
> check_number = cnum
>
> I realize this is a trivial example, but as the expressions get
> more involved, and the case-statement has many when's, it
> would be nice if I could have the compiled regular-expression
> set values on variable names that *I* pick, in addition to the
> standard values in a MatchData object.

I thought about this some more after going home and getting
some sleep... One obvious question is what would be the
scope of the commands inside the { ...code-fragment...}. It
also occurred to me that I sometimes make a match, and
then I pass around the resulting MatchData object to other
methods, and *they* do things based on info in MatchData.

So, I came up with this idea:

Allow MatchData to include some user-settable value,
which would initially be set to 'nil' at the time of the match.
And then support:

    re_simple = Regexp.new('check (\d+)') { |mdata|
        mdata.userdata = mdata[1]
    }

or:

    re_simple = Regexp.new('check (\d+)') { |mdata|
        mdata.userdata = Hash.new
        mdata.userdata["cnum"] = mdata[1]
        mdata.userdata["otherval"] = mdata[7]
    }

That way, all the variables that the user is setting will
be tied to the appropriate MatchData object.

I almost think I could implement this by creating my own
subclasses for Regexp and MatchData...

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

just so long as you do this:

  when /some_rex/

or

   when /some_#{rex}/o

and not this:

  when Regexp.new("some_rex")

+--- Kero ------------------------- kero@chello@nl ---+

all the meaningless and empty words I spoke |
                      Promises -- The Cranberries |

+--- M38c --- http://members.chello.nl/k.vangelder ---+

Ah. That's one of those things which didn't sink in when I
first read about it, since I didn't need it at the time. I wish I
had paid better attention to it! Thanks.

···

On 8/1/05, daz <dooby@d10.karoo.co.uk> wrote:

You can prevent the grouped expression from creating a
back-reference by using the ?: extension ...


A few of the regexp's are based on global options, which is
to say a regexp would be constant for any one run of the
program, but it is built from the value of other variables. I
don't do that very often, but sometimes I do.

This is what the /o regexp option is for; it forces the regexp to be
compiled only once, even if it has interpolations.
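
A small sketch of what "only once" means in practice. The interpolated value is captured the first time the literal is evaluated and later values are ignored, which is fine for a per-run global option but surprising otherwise. The method name `option_rex` and the words passed to it are made up for this example.

```ruby
def option_rex(word)
  /^#{word}\b/o    # interpolated and compiled on the first call only
end

first  = option_rex("check")
second = option_rex("test")    # "test" is ignored; the cached regexp comes back

p first.source
p second.source
p first.equal?(second)
```

So /o suits a value that is fixed for the whole run, and is the wrong tool if the interpolated value can change between calls.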

Does
ruby keep the compiled-code for a method after the method
is finished?

Uhhh, it's nowhere near that fancy. Ruby is a fairly traditional
interpreter, without even bytecode compilation. That's why it's so
slow.

But the main reason I like to split things up is that the
regexp's involved in my real-world example are rather
complicated. I'd like to have one section of code which
defines the regexp's, and comments why they are the
way they are.

But you were just saying you don't like the regexp distant from its use...

What about it? My program isn't doing to_s on any regexp's
which it defines, so I don't understand the significance of your
question...

Maybe not, but if it's to be a general solution, you need to handle
all this stuff. If you're just going to use it in your own program,
then that's fine but I thought you were talking about something more
general-purpose.

You may be calling Regexp#to_s without knowing it if you do
something like this:

rex1=/bar/
rex2=/foo#{rex1}baz/

The interpolation calls to_s.
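
To make that concrete, here is a short sketch using the same two regexps. `Regexp#to_s` wraps the pattern in a group that records its option flags, and that wrapped text is what ends up inside the outer regexp.

```ruby
rex1 = /bar/
rex2 = /foo#{rex1}baz/

p rex1.to_s      # "(?-mix:bar)"
p rex2.source    # "foo(?-mix:bar)baz"
p rex2 =~ "foobarbaz"
```

Anything attached to a regexp object beyond its pattern and flags, such as a block, is not carried across by this kind of interpolation.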

···

On 8/1/05, Garance A Drosehn <drosihn@gmail.com> wrote:

> Does ruby keep the compiled-code for a method after
> the method is finished?

Uhhh, it's nowhere near that fancy. Ruby is a fairly traditional
interpreter, without even bytecode compilation. That's why it's
so slow.

That is what I expected. So I'm back to wishing to have one
method which creates all the Regexp.new's as @@variables,
and then I can reference those compiled Regexp's in other
methods for that class.

> But the main reason I like to split things up is that the
> regexp's involved in my real-world example are rather
> complicated. I'd like to have one section of code which
> defines the regexp's, and comments why they are the
> way they are.

But you were just saying you don't like the regexp distant
from its use...

Almost. I was saying that I *wanted* to have them distant,
because the result is more readable (for what I'm doing, IMO),
and for the efficiency benefit. This may seem weird, but most
of the regexp's that I'm talking about are three or four full lines
long, complete with a few regexp tricks that take a few more
lines of comments to explain what the regexp is doing. The
case-statement is *much* more readable if the regexp's are
separated from the case statement.
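
A sketch of how that separation could look, assuming the patterns live in constants and use the /x (extended) flag so the explanations sit inside the regexp itself. The class `LogScanner`, the constant `RE_CHECK`, the group name `cnum` and the pattern are all hypothetical, not taken from the real script.

```ruby
class LogScanner
  # Defined once, away from the case statement, with its own commentary.
  RE_CHECK = /
    (?:check|test)   # either keyword is acceptable, and is not captured
    \s+              # whitespace must be spelled out under the x flag
    (?<cnum>\d+)     # the number we care about
  /x

  def check_number(line)
    case line
    when RE_CHECK then $~[:cnum].to_i
    end
  end
end

scanner = LogScanner.new
p scanner.check_number("test 41")
p scanner.check_number("unrelated line")
```

Constants are visible from every method of the class, so there is no need for @@variables or a setup method that has to run first.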

But there is a downside from doing that, so I am looking
for ideas on how I might eliminate that downside.

> What about it? My program isn't doing to_s on any regexp's
> which it defines, so I don't understand the significance of your
> question...

Maybe not, but if it's to be a general solution, you need to handle
all this stuff. If you're just going to use it in your own program,
then that's fine but I thought you were talking about something
more general-purpose.

The more general the solution, the better! :)

You may be calling Regexp#to_s without knowing it if you
do something like this:

rex1=/bar/
rex2=/foo#{rex1}baz/

The interpolation calls to_s.

Ah. I don't do that much, but I can understand why that would
be important. Your reply and the other replies in this thread
have given me quite a few good suggestions to think about.
Very instructive. Thanks!

···

On 8/2/05, Caleb Clausen <vikkous@gmail.com> wrote:

On 8/1/05, Garance A Drosehn <drosihn@gmail.com> wrote:


Does ruby keep the compiled-code for a method after
the method is finished?

Uhhh, it's nowhere near that fancy. Ruby is a fairly traditional
interpreter, without even bytecode compilation. That's why it's
so slow.

That is what I expected. So I'm back to wishing to have one
method which creates all the Regexp.new's as @@variables,
and then I can reference those compiled Regexp's in other
methods for that class.

Performance does not differ much between a regexp in place and a regexp compiled once and stored in a variable or constant (assuming no interpolation is used or interpolation with "o" is used - otherwise both scenarios have different semantics anyway and can't be compared). Ruby *has* been optimized to make in place regexps efficient - there is no recompilation of the regexp on every pass.

You can try it out with the attached script. Over several invocations, sometimes one variant is faster and sometimes the other:

                 user     system      total        real
direct       0.312000   0.000000   0.312000 (  0.305000)
compiled     0.313000   0.000000   0.313000 (  0.306000)

                 user     system      total        real
direct       0.313000   0.000000   0.313000 (  0.319000)
compiled     0.297000   0.000000   0.297000 (  0.305000)

Almost. I was saying that I *wanted* to have them distant,
because the result is more readable (for what I'm doing, IMO),
and for the efficiency benefit.

As I said there is no such thing as an efficiency benefit in using "remote" regexps.

This may seem weird, but most
of the regexp's that I'm talking about are three or four full lines
long, complete with a few regexp tricks that take a few more
lines of comments to explain what the regexp is doing. The
case-statement is *much* more readable if the regexp's are
separated from the case statement.

I'd stick with the readability argument and forget about the performance here. The question is, does the code become more readable by moving the regexps out of the case statement? I don't know your code but I'd say it's not automatically so.

Kind regards

    robert

···

Garance A Drosehn <drosihn@gmail.com> wrote:

On 8/2/05, Caleb Clausen <vikkous@gmail.com> wrote:

On 8/1/05, Garance A Drosehn <drosihn@gmail.com> wrote:

Sorry, I forgot the attachment. Here's the script:

    robert

require 'benchmark'

REPEAT = 10000

RX = /foo/

TEXT = <<EOS
akdnhkaj dahdk ahda da#dada
da
dopakdjalkjdlak djadklasd
adasklfoodköasjhdjkasdha
dadjkadjkashdjkasd#
aajdhkasjdjkfooashd
aldaksjhdjasd
EOS

Benchmark.bmbm 10 do |b|
  b.report "direct" do
    REPEAT.times { TEXT.scan(/foo/o) {|m| m + "x"} }
  end

  b.report "compiled" do
    REPEAT.times { TEXT.scan(RX) {|m| m + "x"} }
  end
end

···


> Almost. I was saying that I *wanted* to have them distant,
> because the result is more readable (for what I'm doing, IMO),
> and for the efficiency benefit.

As I said there is no such thing as an efficiency benefit in using
"remote" regexps.

Ah, okay. I guess I was reading too much into the word "compiled",
such that I thought it would be significantly faster.

> This may seem weird, but most
> of the regexp's that I'm talking about are three or four full lines
> long, complete with a few regexp tricks that take a few more
> lines of comments to explain what the regexp is doing. The
> case-statement is *much* more readable if the regexp's are
> separated from the case statement.

I'd stick with the readability argument and forget about the performance
here. The question is, does the code become more readable by moving
the regexps out of the case statement? I don't know your code but I'd
say it's not automatically so.

In the script that I am working on right now, it is definitely more
readable. But in most scripts I write, the readability is probably
about the same either way. It wouldn't surprise me if readability
was usually better with regexp's in the case statement that uses
them, especially if they are all single-line regexp's.

I first wrote this script with the regexp's in place, and that was
getting too messy (IMO). So I've now redone them with the
regexp's separate. I might do some performance comparison
of the two versions once I'm done. But I doubt that will be very
accurate, because I am changing so many other things at the
same time.

···

On 8/4/05, Robert Klemme <bob.news@gmx.net> wrote:

Garance A Drosehn <drosihn@gmail.com> wrote:
