Iterating through a string and removing leading characters

This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

   * ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
   * Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

Randy Kramer

In data 3/18/2005, "Randy Kramer" <rhkramer@gmail.com> ha scritto:

This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

  * ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
  * Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.

Randy Kramer

E

Randy Kramer wrote:

   * ?? (Is there any method which chops the first character from the string? How efficient is that (especially compared to the method described in the next bullet).

str.slice!(0)

I think this is actually fairly efficient, but I suppose there's a memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove all of them after the iteration has ended with a Regexp?

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag news:200503181244.16009.rhkramer@gmail.com...

This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

  * ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
  * Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

I'd bet that this approach is slower than a pure regexp based approach. If you cannot stuck all exact regexps into one (see below) then maybe some form of stripped regexps might help. For example:

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/
# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps rx_all is the most efficient one I'm sure.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

What does "fairly large" mean? I would try to start with stucking *all* these regexps into one - if the rx engine does not choke on that regexp I'd assume that this is the most efficient way to do it, as then you have the best ratio of machine code to ruby interpretation. Maybe you just show us all these regexps so we can better understand the problem.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

Now I'm getting really curios. Care to post some more details?

Kind regards

    robert

Florian Gross <flgr@ccan.de> writes:

Randy Kramer wrote:

   * ?? (Is there any method which chops the first character from
the string? How efficient is that (especially compared to the
method described in the next bullet).

str.slice!(0)

I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?

Better use String#delete! for that, no?

···

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

"Robert Klemme" <bob.news@gmx.net> writes:

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
news:200503181244.16009.rhkramer@gmail.com...

This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from
beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of
several
REs on the next part of the string. I'm looking to find a very
efficient way
to do this.

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

I'd bet that this approach is slower than a pure regexp based
approach. If you cannot stuck all exact regexps into one (see below)
then maybe some form of stripped regexps might help. For example:

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/
# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This reminds me of my old idea for matching against multiple regexp
by automatically combining them into larger, alternating regexps
thereby reducing the number of matches from n to log2(n)...
Anyone willing to code that?

To match t against A, B, C, D, E, F, G and H, you first match t
against A|B|C|D|E|F|G|H, then against A|B|C|D, then against A|B (or
E>F), then against A, C, E or G. Would be of great use for strscan,
too.

···

    robert

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Thanks to all who replied so far. I also want to look into the StringScanner
approach (I'll reply separately with some questions about that), and I can't
believe I couldn't find the ways to delete the first character of a string.
Guess I am a newbie!

I'd bet that this approach is slower than a pure regexp based approach.

So far, you're very right--my approach took about 30 times as long as the pure
regexp approach, although my Ruby code might not be very efficient. (In case
nobody noticed, I'm very much a newbie to Ruby.)

If
you cannot stuck all exact regexps into one (see below) then maybe some
form of stripped regexps might help. For example:

This sounds like its worth a try, but:
   1) I haven't created all the necessary REs yet
   2) Question below (for clarification)

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/

Question: IIUC, the [ab] above should be [ac]?

# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps
rx_all is the most efficient one I'm sure.

What does "fairly large" mean? I would try to start with stucking *all*
these regexps into one - if the rx engine does not choke on that regexp I'd
assume that this is the most efficient way to do it, as then you have the
best ratio of machine code to ruby interpretation. Maybe you just show us
all these regexps so we can better understand the problem.

It's hard even to guess, I intended to combine several REs into one anyway
when they had a lot of commonality. For example, the TWiki markup for
headings (which I'm planning to use) is like this:

---* Level 1
---** Level 2
---*** Level 3
---**** Level 4
---***** Level 5
---****** Level 6

I've planned to use one RE for all the above, then determine the level from
the length of the match (like level = len - 3).

Likewise, "inline" markup is *for bold*, _for italic,_ __for bold italic__,
and so forth. I'd try to have one RE looking for words preceded by _, *, or
__, and another with words ending with the same. (And might combine words
marked with % for %TWikiVariables% as well.

With "optimizations" like this, I'd guess on the order of 15 or so regexps.

Now I'm getting really curios. Care to post some more details?

I presume you mean on the 1 to 10% savings? I planned to do that, I'll try to
put something on WikiLearn this weekend then post something here.

Randy Kramer

···

On Saturday 19 March 2005 08:04 am, Robert Klemme wrote:

StringScanner looks interesting, and I'm starting to dig into it. Care to
provide any more information? Do you have a working wiki?

Randy Kramer

···

On Friday 18 March 2005 03:42 pm, ES wrote:

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.

str.slice!(0)

Thanks! (Can't believe I didn't recognize that approach.)

I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?

Sounds like a possibility, although my first set of tests show the REs to be
much faster.

regards,
Randy Kramer

···

On Friday 18 March 2005 04:24 pm, Florian Gross wrote:

"Christian Neukirchen" <chneukirchen@gmail.com> schrieb im Newsbeitrag news:m2vf7n28y0.fsf@lilith.local...

Florian Gross <flgr@ccan.de> writes:

Randy Kramer wrote:

   * ?? (Is there any method which chops the first character from
the string? How efficient is that (especially compared to the
method described in the next bullet).

str.slice!(0)

I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?

Better use String#delete! for that, no?

For removing \0 chars? Yepp, that seems like an efficient way to do it. For direct removing slice! and String#= are most efficient:

Benchmark.bm do

?> end
      user system total real
=> true

Benchmark.bm do |r|

?> r.report("slice!") do
?> 10000.times { "aaaaaaaaaaaaaa".slice!(0,5) }

end
r.report("=") do

?> 10000.times { "aaaaaaaaaaaaaa"[0,5]="" }

end

      user system total real
slice! 0.140000 0.000000 0.140000 ( 0.130000)
= 0.063000 0.000000 0.063000 ( 0.065000)
=> true

Kind regards

    robert

In data 3/19/2005, "Randy Kramer" <rhkramer@gmail.com> ha scritto:

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.

StringScanner looks interesting, and I'm starting to dig into it. Care to
provide any more information? Do you have a working wiki?

I do. I was waiting to complete the actual application it is
to feature in before releasing it but if you want to see the
parsing code, I'll do a round or two of refactoring and post
it on my website, work permitting.

Randy Kramer

E

···

On Friday 18 March 2005 03:42 pm, ES wrote:

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag news:200503190900.45805.rhkramer@gmail.com...

Thanks to all who replied so far. I also want to look into the StringScanner
approach (I'll reply separately with some questions about that), and I can't
believe I couldn't find the ways to delete the first character of a string.
Guess I am a newbie!

I'd bet that this approach is slower than a pure regexp based approach.

So far, you're very right--my approach took about 30 times as long as the pure
regexp approach, although my Ruby code might not be very efficient. (In case
nobody noticed, I'm very much a newbie to Ruby.)

If
you cannot stuck all exact regexps into one (see below) then maybe some
form of stripped regexps might help. For example:

This sounds like its worth a try, but:
  1) I haven't created all the necessary REs yet
  2) Question below (for clarification)

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/

Question: IIUC, the [ab] above should be [ac]?

Exactly. My mistake.

# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps
rx_all is the most efficient one I'm sure.

What does "fairly large" mean? I would try to start with stucking *all*
these regexps into one - if the rx engine does not choke on that regexp I'd
assume that this is the most efficient way to do it, as then you have the
best ratio of machine code to ruby interpretation. Maybe you just show us
all these regexps so we can better understand the problem.

It's hard even to guess, I intended to combine several REs into one anyway
when they had a lot of commonality. For example, the TWiki markup for
headings (which I'm planning to use) is like this:

---* Level 1
---** Level 2
---*** Level 3
---**** Level 4
---***** Level 5
---****** Level 6

I've planned to use one RE for all the above, then determine the level from
the length of the match (like level = len - 3).

Sounds very reasonable.

Likewise, "inline" markup is *for bold*, _for italic,_ __for bold italic__,
and so forth. I'd try to have one RE looking for words preceded by _, *, or
__, and another with words ending with the same. (And might combine words
marked with % for %TWikiVariables% as well.

Seeing this lets me think of another approach: split the string into tokens with a simple regexp and do the rest of the calculation on the tokens. Drawback is of course if the input is large this will not perform very well. Hm...

With "optimizations" like this, I'd guess on the order of 15 or so regexps.

Now I'm getting really curios. Care to post some more details?

I presume you mean on the 1 to 10% savings?

Yeah, and on the rx's as well (you did show some of theme alread).

I planned to do that, I'll try to
put something on WikiLearn this weekend then post something here.

Some experiments:

str = "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|(%\w+%)/m do |m| p m end

["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "%var%"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|%(\w+)%/m do |m| p m end

["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "var"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

Kind regards

    robert

···

On Saturday 19 March 2005 08:04 am, Robert Klemme wrote:

Wow, cool idea!

I can contribute a slight improvement to the idea. You make a Huffman tree
using the number of times that a regexp has been matched yet, or any
better prediction of how often it will come up in the future, and you
tweak the associativity of your disjunction correspondingly, such that it
matches the structure of the tree.

If, in the sequence of actual matches, there is a finite upper bound on
the number of positive-recurrent states (that is, matches that do occur a
non-zero percentage of times in practice), then the order of my algorithm
is O(1), am I right?

Anyhow, it will accelerate the parsing, especially for a very redundant
text in a complex syntax, e.g. parsing something that could be any Ruby
code but which just happens to be a huuuuge array literal of hex fixnums.

Btw, anyone knows other algorithms that use Huffman to compress the *time*
instead of the *space* ?

···

On Sat, 19 Mar 2005, Christian Neukirchen wrote:

This reminds me of my old idea for matching against multiple regexp by
automatically combining them into larger, alternating regexps thereby
reducing the number of matches from n to log2(n)... Anyone willing to
code that? To match t against A, B, C, D, E, F, G and H, you first
match t against A|B|C|D|E|F|G|H, then against A|B|C|D, then against
A>B (or E|F), then against A, C, E or G. Would be of great use for
strscan, too.

_____________________________________________________________________
Mathieu Bouchard -=- Montréal QC Canada -=- http://artengine.ca/matju

"Christian Neukirchen" <chneukirchen@gmail.com> schrieb im Newsbeitrag
news:m2vf7n28y0.fsf@lilith.local...
> Better use String#delete! for that, no?

For direct removing slice! and String#= are most efficient:

What is the significance to the # in String#delete! and String#= above? (I
can't find anything with the # in it like that in "Ruby In a Nutshell"--is it
something new, or just some shorthand/jargon? (Or did I look in the wrong
places :wink:

>> Benchmark.bm do

Ahh, that's useful, I was using Time::now to time my functions.

Randy Kramer

···

On Saturday 19 March 2005 07:59 am, Robert Klemme wrote:

It would be interesting to see, if it's not too much trouble.

regards,
Randy Kramer

···

On Saturday 19 March 2005 01:33 pm, ES wrote:

if you want to see the
parsing code, I'll do a round or two of refactoring and post
it on my website, work permitting.

See RWP_RE_Tests < Wikilearn < TWiki -- it's rather ugly
but the information is there.

Randy Kramer

···

On Sunday 20 March 2005 03:19 am, Robert Klemme wrote:

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
> I planned to do that, I'll try to
> put something on WikiLearn this weekend then post something here.

Mathieu Bouchard <matju@sympatico.ca> writes:

This reminds me of my old idea for matching against multiple regexp by
automatically combining them into larger, alternating regexps thereby
reducing the number of matches from n to log2(n)... Anyone willing to
code that? To match t against A, B, C, D, E, F, G and H, you first
match t against A|B|C|D|E|F|G|H, then against A|B|C|D, then against
A>B (or E|F), then against A, C, E or G. Would be of great use for
strscan, too.

Wow, cool idea!

I can contribute a slight improvement to the idea. You make a Huffman tree
using the number of times that a regexp has been matched yet, or any
better prediction of how often it will come up in the future, and you
tweak the associativity of your disjunction correspondingly, such that it
matches the structure of the tree.

A kind of self-optimizing regular expression. Will rock if done
properly. :slight_smile:

If, in the sequence of actual matches, there is a finite upper bound on
the number of positive-recurrent states (that is, matches that do occur a
non-zero percentage of times in practice), then the order of my algorithm
is O(1), am I right?

But for a possibly large constant runtime, no?

Anyhow, it will accelerate the parsing, especially for a very redundant
text in a complex syntax, e.g. parsing something that could be any Ruby
code but which just happens to be a huuuuge array literal of hex fixnums.

Btw, anyone knows other algorithms that use Huffman to compress the *time*
instead of the *space* ?

I wonder if maybe a PATRICIA (crit-bit) tree could help too, but then,
Ruby regexp cannot match on bit-level.

···

On Sat, 19 Mar 2005, Christian Neukirchen wrote:

Mathieu Bouchard -=- Montréal QC Canada -=- http://artengine.ca/matju

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Randy Kramer wrote:

What is the significance to the # in String#delete! and String#= above? (I can't find anything with the # in it like that in "Ruby In a Nutshell"--is it something new, or just some shorthand/jargon? (Or did I look in the wrong places :wink:

It is simply a notational convention for identifying instance methods. Since it is not real Ruby syntax it can be misleading, but its usage is so widespread it's probably here to stay.

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
news:200503201839.47741.rhkramer@gmail.com...

> "Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
> > I planned to do that, I'll try to
> > put something on WikiLearn this weekend then post something here.

See RWP_RE_Tests < Wikilearn < TWiki -- it's rather

ugly

but the information is there.

Randy Kramer

Some remarks:

- The comparison between 5 and 6 does not seem fair, as you iterate in 6
but not in 5.

- You don't use String#scan or #split which you are likely to need in
practice, because you want to sift through complete documents and want to
treat all occurrences.

- The differences between the check-first-char approach and the pure RE
approach are so insignificant that I'd not bother using the more complex
code. I'd stick with pure RE based approaches and only try to optimize if
performance is bad. (You mentioned premature optimization already... :-))

Kind regards

    robert

···

On Sunday 20 March 2005 03:19 am, Robert Klemme wrote:

Mathieu Bouchard <matju@sympatico.ca> writes:
> If, in the sequence of actual matches, there is a finite upper bound on
> the number of positive-recurrent states (that is, matches that do occur a
> non-zero percentage of times in practice), then the order of my algorithm
> is O(1), am I right?
But for a possibly large constant runtime, no?

Depends on how seldom the Huffman tree is updated. I see it like this: you
update the frequency counts at every iteration (every use of the
regexp). This is essentially a kind of profiler. Then later on you
reparent Huffman subtrees all in one shot. Suppose that this reparenting
task is O(n)-time. Then to get an amortised O(1)-time you will need to
perform it O(1/n)-often, because O(n)*O(1/n)=O(1). Then the constants of
the O(1/n) can be lowered to inversely match the constants of the O(n) so
that the constants of the O(1) are kept as low as desired.

> Btw, anyone knows other algorithms that use Huffman to compress the *time*
> instead of the *space* ?
I wonder if maybe a PATRICIA (crit-bit) tree could help too, but then,
Ruby regexp cannot match on bit-level.

I don't really know that one.

However if I were to solve the problem of finding which sub-regexp has
been matched in (A|B|C|...), I'd edit re.c and add a (?)-feature for
filling a $-slot with a value that doesn't come from the string, e.g.

/((?"Aah"(A))|(?"Bay"(B))|(?"Say"(C))|(?"Day"(D)))/

Would put one of "Aah", "Bay", "Say", "Day" strings in $2...

But this doesn't make sense yet, as one would expect it to instead be put
in one of $2, $4, $6, $8, ... to be consistent with current regexp
semantics; and looking up possibly all of those looking for a nonnil
$-slot is a O(n)-time thing. There ought to be a better way, that is,
something both fast and consistent with current semantics, but I can't
think of any as of now. Do you have any ideas? If you have something good
then I think it should be a RCR.

···

On Mon, 21 Mar 2005, Christian Neukirchen wrote:

_____________________________________________________________________
Mathieu Bouchard -=- Montréal QC Canada -=- http://artengine.ca/matju