Iterating through a string and removing leading characters

Randy_Kramer · 20 March 2005 00:58

Thanks!

···

On Saturday 19 March 2005 05:29 pm, Timothy Hunter wrote:

Randy Kramer wrote:
> What is the significance of the # in String#delete! and String#[]= above?
> (I can't find anything with the # in it like that in "Ruby In a
> Nutshell"--is it something new, or just some shorthand/jargon? (Or did I
> look in the wrong places?))

It is simply a notational convention for identifying instance methods.
Since it is not real Ruby syntax it can be misleading, but its usage is
so widespread it's probably here to stay.
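
For readers new to the convention, here is a minimal illustration (my own example, not taken from the earlier posts). `Class#method` names an instance method in documentation only; actual calls always use a dot on an instance.

```ruby
# "String#delete!" refers to the instance method delete! defined on String.
# The '#' appears only in documentation; real calls use '.' on an instance.
s = "hello"
s.delete!("l")   # removes every "l" in place; s is now "heo"

# Reflection mirrors the notation: look up the instance method by name.
m = String.instance_method(:delete!)
puts "#{m.owner}##{m.name}"   # prints the documentation-style name
```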

Randy_Kramer · 21 March 2005 14:20

Some remarks:

- The comparison between 5 and 6 does not seem fair, as you iterate in 6
but not in 5.

Oops, for a minute I thought I had really screwed up (like by not doing 5 the
10000 times). AFAIK, I don't need to iterate (through the string) in 5 as
the RE is not anchored to the beginning of the string--it still checks the
entire string (line) for that pattern, which is what I try to achieve in 6 by
the iteration through the string and check a character then invoke RE
approach. I am sure that my Ruby code to do that is not the best, and I may
learn something by making it better, but I agree with your conclusion /
recommendation (below) at least for the time being (although I do plan to
play with str::scan and StringScanner at least a little bit (I presume they
do similar things, but perhaps the 2nd is optimized somehow, particularly if
I "require C" or whatever)).

- You don't use String#scan or #split which you are likely to need in
practice, because you want to sift through complete documents and want to
treat all occurrences.

I need to let that sink in a bit. In general I do want to treat all
occurrences, but I plan to scan a line (actually a paragraph) at a time, and
some things can only occur at the beginning of a line, so those would only be
checked at the beginning of a line.

- The differences between the check-first-char approach and the pure RE
approach are so insignificant that I'd not bother using the more complex
code. I'd stick with pure RE based approaches and only try to optimize if
performance is bad. (You mentioned premature optimization already... :-))

Thanks! I pretty much agree at this time, although I want to play a little
bit with StringScanner.

regards,
Randy Kramer

···

On Monday 21 March 2005 04:44 am, Robert Klemme wrote:

Anonymous_Coward · 3 April 2005 13:36

Mathieu Bouchard <matju@sympatico.ca> writes:
> If, in the sequence of actual matches, there is a finite upper bound on
> the number of positive-recurrent states (that is, matches that do occur a
> non-zero percentage of times in practice), then the order of my algorithm
> is O(1), am I right?
But for a possibly large constant runtime, no?

Depends on how seldom the Huffman tree is updated. I see it like this: you
update the frequency counts at every iteration (every use of the
regexp). This is essentially a kind of profiler. Then later on you
reparent Huffman subtrees all in one shot. Suppose that this reparenting
task is O(n)-time. Then to get an amortised O(1)-time you will need to
perform it O(1/n)-often, because O(n)*O(1/n)=O(1). Then the constants of
the O(1/n) can be lowered to inversely match the constants of the O(n) so
that the constants of the O(1) are kept as low as desired.
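
The amortisation argument can be sketched in code. This is not the Huffman scheme itself but a simpler stand-in: alternatives are counted on every use and re-sorted by frequency only once every `rebuild_every` uses. The class name `AdaptiveMatcher` and all of its methods are hypothetical, invented for illustration.

```ruby
# Hypothetical sketch: frequency-ordered alternatives with periodic re-sorting.
# A plain sort stands in for the Huffman reparenting discussed above.
class AdaptiveMatcher
  def initialize(patterns, rebuild_every)
    @patterns = patterns.map { |name, re| [name, re] }
    @counts = Hash.new(0)
    @rebuild_every = rebuild_every
    @uses = 0
  end

  # Returns the name of the first pattern that matches, or nil.
  def match(str)
    hit = @patterns.find { |_name, re| re.match?(str) }
    name = hit && hit[0]
    @counts[name] += 1 if name          # the "profiler" step, done on every use
    @uses += 1
    reorder if (@uses % @rebuild_every).zero?
    name
  end

  # Current order of alternatives, most frequently matched first after a rebuild.
  def order
    @patterns.map(&:first)
  end

  private

  # The expensive step, run only once per rebuild_every uses.
  # Ties keep their previous relative order.
  def reorder
    @patterns = @patterns.sort_by.with_index { |(name, _re), i| [-@counts[name], i] }
  end
end

m = AdaptiveMatcher.new([[:a, /\Aa/], [:b, /\Ab/], [:c, /\Ac/]], 4)
%w[c c b c].each { |s| m.match(s) }
p m.order
```

If the rebuild interval grows in proportion to the cost of `reorder`, the extra work per match stays bounded by a constant, which is the point of the argument above.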

> Btw, anyone knows other algorithms that use Huffman to compress the *time*
> instead of the *space* ?
I wonder if maybe a PATRICIA (crit-bit) tree could help too, but then,
Ruby regexp cannot match on bit-level.

I don't really know that one.

However if I were to solve the problem of finding which sub-regexp has
been matched in (A|B|C|...), I'd edit re.c and add a (?)-feature for
filling a $-slot with a value that doesn't come from the string, e.g.

/((?"Aah"(A))|(?"Bay"(B))|(?"Say"(C))|(?"Day"(D)))/

Would put one of "Aah", "Bay", "Say", "Day" strings in $2...

But this doesn't make sense yet, as one would expect it to instead be put
in one of $2, $4, $6, $8, ... to be consistent with current regexp
semantics; and looking up possibly all of those looking for a nonnil
$-slot is a O(n)-time thing. There ought to be a better way, that is,
something both fast and consistent with current semantics, but I can't
think of any as of now. Do you have any ideas? If you have something good
then I think it should be a RCR.
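
For comparison, the O(n) lookup described above can be written with what Ruby already offers. This is only a sketch: the constant names `LABELS` and `ALTERNATIVES` and the helper `which_alternative` are made up for this example, and the labels live in a separate array rather than inside the regexp.

```ruby
LABELS = ["Aah", "Bay", "Say", "Day"].freeze
ALTERNATIVES = /(A)|(B)|(C)|(D)/

# Returns the zero-based index of the alternative that matched, or nil.
# The search over the captures is the O(n) step mentioned above.
def which_alternative(str)
  m = ALTERNATIVES.match(str)
  return nil unless m
  m.captures.index { |c| !c.nil? }
end

i = which_alternative("xxCxx")
puts LABELS[i] if i
```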

This is a worthy idea, certainly! I should not expect it to cause
any confusion so long as the notation is standardised, particularly
through the standard ? extension switch. Perhaps the inner braces
would not be allowed for clarity? Unfortunately the rubyish ?!
(in-place method) and ?# (string interpolation) are already taken

/(?-> 'match' 'replacement')/

_____________________________________________________________________
Mathieu Bouchard -=- Montréal QC Canada -=- http://artengine.ca/matju

E

No-one expects the Solaris POSIX implementation!

···

Le 3/4/2005, "Mathieu Bouchard" <matju@sympatico.ca> a écrit:

On Mon, 21 Mar 2005, Christian Neukirchen wrote:

Robert · 21 March 2005 14:54

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
news:200503210919.00242.rhkramer@gmail.com...

> Some remarks:
>
> - The comparison between 5 and 6 does not seem fair, as you iterate in 6
> but not in 5.

Oops, for a minute I thought I had really screwed up (like by not doing 5 the
10000 times). AFAIK, I don't need to iterate (through the string) in 5 as
the RE is not anchored to the beginning of the string--it still checks the
entire string (line) for that pattern, which is what I try to achieve in 6 by
the iteration through the string and check a character then invoke RE
approach. I am sure that my Ruby code to do that is not the best, and I may
learn something by making it better, but I agree with your conclusion /
recommendation (below) at least for the time being (although I do plan to
play with str::scan and StringScanner at least a little bit (I presume they
do similar things, but perhaps the 2nd is optimized somehow, particularly if
I "require C" or whatever)).

A particular performance show stopper in test 6 is String#[], i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

Another general remark: it should be faster to use a range, Fixnum#upto or
Fixnum#times for iterating because then you have iteration in C and you
don't need to recalculate the limit on each iteration:

# old
i = 0
until i==s1.length-6 do
  if s1[i] == 91
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
  i += 1
end

# with range
(0...(s1.length-6)).each do |i|
  if s1[i] == ?[
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
end

# with upto
0.upto(s1.length-7) do |i|
  if s1[i] == ?[
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
end

# with times
(s1.length-6).times do |i|
  if s1[i] == ?[
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
end

And you can use "?[" instead of "91", which is far more readable.
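
A quick check that the three iteration forms visit the same indices. This is my own sketch with a made-up sample string. One caveat: in Ruby 1.9 and later, `s1[i]` returns a one-character String and `?[` is the String "[", so the `?[` comparison still works there, while a comparison against 91 no longer matches.

```ruby
s1 = "see [[Main.WebHome][home]] here"   # sample input, invented for this sketch
limit = s1.length - 6

via_range = (0...limit).to_a
via_upto  = 0.upto(limit - 1).to_a
via_times = limit.times.to_a

# Positions of "[" within the scanned range, using the character literal.
hits = (0...limit).select { |i| s1[i] == ?[ }
p hits
```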

> - You don't use String#scan or #split which you are likely to need in
> practice, because you want to sift through complete documents and want to
> treat all occurrences.

I need to let that sink in a bit. In general I do want to treat all
occurrences, but I plan to scan a line (actually a paragraph) at a time, and
some things can only occur at the beginning of a line, so those would only be
checked at the beginning of a line.

Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequence up to the first non-whitespace character
# just an example
contents.scan(/^\s+\S/) do |m|
  p m   # without groups, scan yields the whole match as a String
end
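
A self-contained variant of the same idea, as a sketch only. It writes a small temporary file first so that it can run anywhere, and it uses `[ \t]` instead of `\s` so that a match cannot run across a newline; both choices are mine, not part of the suggestion above.

```ruby
require "tempfile"

leading = nil
Tempfile.create("scan_demo") do |f|
  f.write("plain\n  indented\n\tTabbed\n")
  f.flush                                  # make sure the bytes are on disk
  contents = File.read(f.path)             # slurp the whole (small) file
  leading = contents.scan(/^[ \t]+\S/)     # ^ anchors at every line start
end
p leading
```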

> - The differences between the check-first-char approach and the pure RE
> approach are so insignificant that I'd not bother using the more complex
> code. I'd stick with pure RE based approaches and only try to optimize if
> performance is bad. (You mentioned premature optimization already... :-))

Thanks! I pretty much agree at this time, although I want to play a little
bit with StringScanner.

Of course, new toys have to be played with!

Kind regards

robert

···

On Monday 21 March 2005 04:44 am, Robert Klemme wrote:

Randy_Kramer · 3 April 2005 14:10

>However if I were to solve the problem of finding which sub-regexp has
>been matched in (A|B|C|...), I'd edit re.c and add a (?)-feature for
>filling a $-slot with a value that doesn't come from the string, e.g.
>
>/((?"Aah"(A))|(?"Bay"(B))|(?"Say"(C))|(?"Day"(D)))/
>
>Would put one of "Aah", "Bay", "Say", "Day" strings in $2...
>
>But this doesn't make sense yet, as one would expect it to instead be put
>in one of $2, $4, $6, $8, ... to be consistent with current regexp
>semantics; and looking up possibly all of those looking for a nonnil
>$-slot is a O(n)-time thing. There ought to be a better way, that is,
>something both fast and consistent with current semantics, but I can't
>think of any as of now. Do you have any ideas? If you have something good
>then I think it should be a RCR.

I like this idea (I think it would be helpful with my problem of fast parsing
of TWiki markup). (And, if I decide to do a character by character thing
myself in c, it sounds like re.c would be something to study.)

But, instead of (or maybe in addition to) returning the "Aah", "Bay" in $2, $4
or whatever, how about returning an integer (somehow) that indicates which
one matched?

As a Ruby newbie, I'm not sure that's all I'd be looking for--if Ruby gives me
a way to access those strings directly by the integer, that would be great.
For example, say the integer is returned as $a (to avoid collision with $1,
$2 ...). I'd like to be able to access the "Aah", "Bay", ... by something
like $($a)--barring that, I can simply maintain a separate array of the
"Aah"... and access that array using $a.

Randy Kramer

···

On Sunday 03 April 2005 09:36 am, Saynatkari wrote:

Le 3/4/2005, "Mathieu Bouchard" <matju@sympatico.ca> a écrit:

This is a worthy idea, certainly! I should not expect it to cause
any confusion so long as the notation is standardised, particularly
through the standard ? extension switch. Perhaps the inner braces
would not be allowed for clarity? Unfortunately the rubyish ?!
(in-place method) and ?# (string interpolation) are already taken

/(?-> 'match' 'replacement')/

Randy_Kramer · 22 March 2005 16:58

Robert,

I want to thank you for all your help, it's like having a personal tutor!

Some feedback / observations below that don't really require any response.

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
news:200503210919.00242.rhkramer@gmail.com...
> > Some remarks:
> > - The comparison between 5 and 6 does not seem fair, as you iterate
in 6
> > but not in 5.

After some more testing, your remark seems more on target than I originally
thought--I can account for almost all the 30x increase in required time (for
6) by the additional iterations (repeated invocations of the RE engine).
It's like the invocation is the expensive part, and whether it looks for a
pattern at one point or scans the remainder of the(se short) strings is
negligible. (See the results of tests 6d and 6e below.)

A particular performance show stopper in test 6 is String#[], i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

I'm surprised that Ruby creates a new String object for each test--I would
have hoped/thought that it was simply letting me "peek" at a portion of the
existing string (especially since it's only a test). I presume that the
StringScanner behaves more sanely in that respect, but I guess I'll find
out.

Thanks for all of the following! I did substitute them in test 6 to see what
they would do.

# old (6): ~18 seconds
# with range (6a): didn't work, see below
# with upto (6b): ~14 seconds
# with times (6c): ~12 seconds
#6d: ~0.75 seconds (This is the test that convinced me the iterations are the
problem, I revised the (with times) program to only call the RE once,
although it still scans only from the start of the string--I guess I should
try test 6e with the RE not anchored.)
#6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm
puzzled, how is this faster than the anchored version?? Anyway, at this time
I don't care, I'll just "file it away" as a little anomaly to perhaps
understand some day (and, as I haven't run the test multiple times or similar
in an attempt to discount garbage collection, maybe that is the problem.)

I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the
TWiki--if you are really interested I can do that, but, as I say below, I'm
not going to lose sleep over the problem with range.

For some reason that I haven't figured out (yet?), the "with range" option
didn't work. I'm not going to lose sleep over it--I did try some
troubleshooting, but it may be a rather subtle bug (or I have a very dense
head).

When I run it as part of a program (re_test_6a.rb), I get the following error
messages:

bash-2.05b$ re_test6a.rb
./re_test6a.rb:40: Invalid char `\240' in expression
./re_test6a.rb:41: Invalid char `\240' in expression
...
./re_test6a.rb:66: Invalid char `\240' in expression
bash-2.05b$

When I simply copy the "individual loop" part of the code (i.e., the portion
you show below under # with range) into IRB and run it (after defining
the appropriate strings), I get the following (and get kicked out of IRB).
BTW, this is the result of attempting to paste the five lines into IRB as a
group:

irb(main):021:0> (0...(s1.length-6)).each do |i|
irb(main):022:1* if s1[i] == ?[
SyntaxError: compile error
(irb):21: syntax error
from (irb):21
from (null):0
bash-2.05b$

As I try to troubleshoot (by removing pieces from the loop), everything seems
to work OK (and I'm learning what some of those pieces do).

Anyway, since I went this far, I have uploaded programs 6a thru 6e to the
TWiki, but I am not requesting / suggesting that anyone try to spend time
debugging 6a.

http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests

# old
i = 0
until i==s1.length-6 do
  if s1[i] == 91
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
  i += 1
end

# with range
(0...(s1.length-6)).each do |i|
  if s1[i] == ?[
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
end

# with upto
0.upto(s1.length-7) do |i|
  if s1[i] == ?[
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
end

# with times
(s1.length-6).times do |i|
  if s1[i] == ?[
    s1[i,s1.length] =~
      /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
  end
end

For anyone following along my next efforts are going to be focused on
StringScanner and then making the necessary substitutions. In parallel I
will probably try to refine the REs.
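
For anyone curious what the StringScanner approach might look like, here is a rough sketch using the standard `strscan` library. The sample text and the simplified link pattern are invented for illustration and are not the RE from the tests. The scanner keeps a position inside one string, so there is no need to slice off the remainder of the string at every step.

```ruby
require "strscan"

ss = StringScanner.new("text [[Main.WebHome][home]] more [[Other]]")
links = []

until ss.eos?
  if ss.scan(/\[\[(.*?)\]\]/)   # scan only matches at the current position
    links << ss[1]              # first capture group of the last match
  else
    ss.getch                    # step over one character so the loop always advances
    ss.skip(/[^\[]*/)           # then jump ahead to the next "[" (or to the end)
  end
end

p links
```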

The remainder of this looks useful as well!

regards,
Randy Kramer

···

On Monday 21 March 2005 09:54 am, Robert Klemme wrote:

> On Monday 21 March 2005 04:44 am, Robert Klemme wrote:

Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequence until the first non whitespace
# just an example
contents.scan /^\s+\S/ do |m|
p m[0]
end

Anonymous_Coward · 3 April 2005 14:16

>However if I were to solve the problem of finding which sub-regexp has
>been matched in (A|B|C|...), I'd edit re.c and add a (?)-feature for
>filling a $-slot with a value that doesn't come from the string, e.g.
>
>/((?"Aah"(A))|(?"Bay"(B))|(?"Say"(C))|(?"Day"(D)))/
>
>Would put one of "Aah", "Bay", "Say", "Day" strings in $2...
>
>But this doesn't make sense yet, as one would expect it to instead be put
>in one of $2, $4, $6, $8, ... to be consistent with current regexp
>semantics; and looking up possibly all of those looking for a nonnil
>$-slot is a O(n)-time thing. There ought to be a better way, that is,
>something both fast and consistent with current semantics, but I can't
>think of any as of now. Do you have any ideas? If you have something good
>then I think it should be a RCR.

I like this idea (I think it would be helpful with my problem of fast parsing
of TWiki markup). (And, if I decide to do a character by character thing
myself in c, it sounds like re.c would be something to study.)

But, instead of (or maybe in addition to) returning the "Aah", "Bay" in $2, $4
or whatever, how about returning an integer (somehow) that indicates which
one matched?

I mentioned this a while ago; I edited my strscan.c to provide
methods #matched_groups and #first_matched_group for accessing
this information (I used it to dispatch a block to process a
particular type of match). I can clean it up and post a patch
somewhere if it seems a useful feature for other people.

As a Ruby newbie, I'm not sure that's all I'd be looking for--if Ruby gives me
a way to access those strings directly by the integer, that would be great.
For example, say the integer is returned as $a (to avoid collision with $1,
$2 ...). I'd like to be able to access the "Aah", "Bay", ... by something
like $($a)--barring that, I can simply maintain a separate array of the
"Aah"... and access that array using $a.

Randy Kramer

This is a worthy idea, certainly! I should not expect it to cause
any confusion so long as the notation is standardised, particularly
through the standard ? extension switch. Perhaps the inner braces
would not be allowed for clarity? Unfortunately the rubyish ?!
(in-place method) and ?# (string interpolation) are already taken

/(?-> 'match' 'replacement')/

E

No-one expects the Solaris POSIX implementation!

···

Le 3/4/2005, "Randy Kramer" <rhkramer@gmail.com> a écrit:

On Sunday 03 April 2005 09:36 am, Saynatkari wrote:

Le 3/4/2005, "Mathieu Bouchard" <matju@sympatico.ca> a écrit:

Robert · 23 March 2005 09:04

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
news:200503221157.21361.rhkramer@gmail.com...

Robert,

I want to thank you for all your help, it's like having a personal tutor!

You're welcome! I'm glad I could help by sharing my experiences.

Some feedback / observations below that don't really require any response.

Well, some comments below nevertheless...

> "Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag
> news:200503210919.00242.rhkramer@gmail.com...
> > > Some remarks:
> > > - The comparison between 5 and 6 does not seem fair, as you iterate in 6
> > > but not in 5.

After some more testing, your remark seems more on target than I originally
thought--I can account for almost all the 30x increase in required time (for
6) by the additional iterations (repeated invocations of the RE engine).
It's like the invocation is the expensive part, and whether it looks for a
pattern at one point or scans the remainder of the(se short) strings is
negligible. (See the results of tests 6d and 6e below.)

Well, that clearly shows that simple scanning with a RE is superior to
iterating and then scanning.

> A particular performance show stopper in test 6 is String#[], i.e. you
> create a new String object for each test; object creation is comparatively
> expensive even though Strings share their internal buffer. But the GC has
> to be informed etc. and this is quite some overhead. If you want fast
> code, create as few instances as possible. The same holds for Java in 99%
> of all cases.

I'm surprised that Ruby creates a new String object for each test--I would
have hoped/thought that it was simply letting me "peek" at a portion of the
existing string (especially since it's only a test).

The internal buffer (the characters) is shared but there is a new Ruby
instance each time you invoke String#[]:

>> 10.times { puts s1[2,4].id }

134979736
134979676
134979652
134979592
134979496
134979472
134979436
134979364
134979268
134979196
=> 10
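
The same experiment in a form that runs on current Ruby versions, as a sketch: `object_id` replaces the old `id`, and the slices are kept in an array so that they all stay alive while their ids are compared.

```ruby
s1 = "This is a test"

# Keep the slices alive in an array so their ids cannot be recycled.
slices = Array.new(10) { s1[2, 4] }
ids = slices.map(&:object_id)

puts ids.uniq.size      # number of distinct String objects created
puts slices.uniq.size   # number of distinct contents among them
```

Every slice has the same content but is a separate object, which is the overhead being discussed.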

I presume that the
StringScanner behaves more sanely in that respect, but I guess I'll find
out.

Never used that myself but it's sure worth a try.

Thanks for all of the following! I did substitute them in test 6 to see what
they would do.

# old (6): ~18 seconds
# with range (6a): didn't work, see below
# with upto (6b): ~14 seconds
# with times (6c): ~12 seconds
#6d: ~0.75 seconds (This is the test that convinced me the iterations are the
problem, I revised the (with times) program to only call the RE once,
although it still scans only from the start of the string--I guess I should
try test 6e with the RE not anchored.)
#6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm
puzzled, how is this faster than the anchored version?? Anyway, at this time
I don't care, I'll just "file it away" as a little anomaly to perhaps
understand some day (and, as I haven't run the test multiple times or similar
in an attempt to discount garbage collection, maybe that is the problem.)

I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the
TWiki--if you are really interested I can do that, but, as I say below, I'm
not going to lose sleep over the problem with range.

For some reason that I haven't figured out (yet?), the "with range" option
didn't work. I'm not going to lose sleep over it--I did try some
troubleshooting, but it may be a rather subtle bug (or I have a very dense
head).

When I run it as part of a program (re_test_6a.rb), I get the following error
messages:

bash-2.05b$ re_test6a.rb
./re_test6a.rb:40: Invalid char `\240' in expression
./re_test6a.rb:41: Invalid char `\240' in expression
...
./re_test6a.rb:66: Invalid char `\240' in expression
bash-2.05b$

See comment below.

When I simply copy the "individual loop" part of the code (i.e., the portion
you show below under # with range) into IRB and running it (after defining
the appropriate strings), I get (and get kicked out of IRB) BTW, this is the
result of attempting to paste the five lines into IRB as a group:

irb(main):021:0> (0...(s1.length-6)).each do |i|
irb(main):022:1* if s1[i] == ?[
SyntaxError: compile error
(irb):21: syntax error
from (irb):21
from (null):0
bash-2.05b$

>> s1="a"*10
=> "aaaaaaaaaa"
>> (0...(s1.length-6)).each do |i|
?> if s1[i] == ?[
>> puts "yes"
>> end
>> end
=> 0...4

I guess this and the other syntax error above are caused by copying and
pasting some characters outside the ASCII range. I have experienced
similar errors in the past. Sometimes they look like whitespace
characters so you don't recognize them on first sight.
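
A small sketch for tracking down such characters. I am assuming the culprit is a non-breaking space: octal 240 is byte 0xA0, which is the non-breaking space in Latin-1. The sample source text and the variable names are invented for illustration.

```ruby
NBSP = "\u00A0"   # non-breaking space; byte 0xA0 (octal 240) in Latin-1

# Pretend this was pasted from a mail or a web page.
src = "(0...4).each do |i|\n#{NBSP}#{NBSP}puts i\nend\n"

# Report 1-based line numbers that contain any non-ASCII character.
bad_lines = src.each_line.with_index(1)
               .reject { |line, _n| line.ascii_only? }
               .map { |_line, n| n }
p bad_lines

# Replace every non-breaking space with an ordinary space.
fixed = src.tr(NBSP, " ")
```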

As I try to troubleshoot (by removing pieces from the loop), everything seems
to work OK (and I'm learning what some of those pieces do).

Anyway, since I went this far, I have uploaded programs 6a thru 6e to the
TWiki, but I am not requesting / suggesting that anyone try to spend time
debugging 6a.

http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests

> # old
> i = 0
> until i==s1.length-6 do
>   if s1[i] == 91
>     s1[i,s1.length] =~
>       /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
>   i += 1
> end
>
> # with range
> (0...(s1.length-6)).each do |i|
>   if s1[i] == ?[
>     s1[i,s1.length] =~
>       /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
> end
>
> # with upto
> 0.upto(s1.length-7) do |i|
>   if s1[i] == ?[
>     s1[i,s1.length] =~
>       /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
> end
>
> # with times
> (s1.length-6).times do |i|
>   if s1[i] == ?[
>     s1[i,s1.length] =~
>       /\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
>   end
> end

For anyone following along my next efforts are going to be focused on
StringScanner and then making the necessary substitutions. In parallel I
will probably try to refine the REs.

Please let me/us know how that works out.

The remainder of this looks useful as well!

regards,
Randy Kramer

> Hm, if you know that the size of files is limited (i.e. something like
> just a few KB) then it's usually worth slurping in the whole file with
> something like this
>
> contents = File.open(f){|io| io.read}
>
> and then iterate through the whole thing with #scan. You can still use ^
> to anchor at line beginnings.
>
> # get the initial sequence until the first non whitespace
> # just an example
> contents.scan /^\s+\S/ do |m|
> p m[0]
> end

Kind regards

robert

···

On Monday 21 March 2005 09:54 am, Robert Klemme wrote:
> > On Monday 21 March 2005 04:44 am, Robert Klemme wrote:

Randy_Kramer · 3 April 2005 15:53

Thanks, I might take you up on that offer, but I'll be busy for 2 to 4 weeks
with some other things. After I get back into parsing, I'll get back to you.

(I'm not much at patching, and I'd hate to ask you to do something if I wasn't
fairly certain I'd be able to use it. Still, if others see use for it ...)

regards,
Randy Kramer

···

On Sunday 03 April 2005 10:16 am, Saynatkari wrote:

Le 3/4/2005, "Randy Kramer" <rhkramer@gmail.com> a écrit:
>But, instead of (or maybe in addition to) returning the "Aah", "Bay" in
> $2, $4 or whatever, how about returning an integer (somehow) that
> indicates which one matched?

I mentioned this a while ago; I edited my strscan.c to provide
methods #matched_groups and #first_matched_group for accessing
this information (I used it to dispatch a block to process a
particular type of match). I can clean it up and post a patch
somewhere if it seems a useful feature for other people.

Randy_Kramer · 23 March 2005 15:16

The internal buffer (the characters) is shared but there is a new Ruby
instance each time you invoke String#[]:

>> 10.times { puts s1[2,4].id }

134979736
134979676
134979652
134979592
134979496
134979472
134979436
134979364
134979268
134979196
=> 10

Oh, yeah, thanks--I could/should have tried that.

Aside: but:

irb(main):005:0> 10.times { puts s1[2].id }
211
211
211
211
211
211
211
211
211
211
=> 10
irb(main):006:0>

Which (maybe) looks nicer/more reasonable, except I don't know why the 3 digit
ID in this case.

> bash-2.05b$ re_test6a.rb
> ./re_test6a.rb:40: Invalid char `\240' in expression
> ./re_test6a.rb:41: Invalid char `\240' in expression
> ...
> ./re_test6a.rb:66: Invalid char `\240' in expression
> bash-2.05b$

I guess this and the other syntax error above are caused by copying and
pasting some characters outside the ASCII range. I have experienced
similar errors in the past. Sometimes they look like whitespace
characters so you don't recognize them on first sight.

That was apparently the problem--I removed all the whitespace within the loop,
then replaced each with a space, and now it runs. I'll upload the revised
program to TWiki later today (not getting through at the moment).

http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests

Please let me/us know how that works out.

Sure, although things may slow down for a while with 3 Ruby library books in
my hands and until tax season is over.

regards,
Randy Kramer

···

On Wednesday 23 March 2005 04:04 am, Robert Klemme wrote:

Florian_Gross · 23 March 2005 15:29

Randy Kramer wrote:

The internal buffer (the characters) is shared but there is a new Ruby
instance each time you invoke String#[]:

Oh, yeah, thanks--I could/should have tried that.

Aside: but:

irb(main):005:0> 10.times { puts s1[2].id }
211 [times ten]
=> 10
irb(main):006:0>

Which (maybe) looks nicer/more reasonable, except I don't know why the 3 digit ID in this case.

string[number] returns either nil or the ASCII number of the character at that place.

So "foo"[0] == ?f. And the object_id of low Fixnums is usually very low as their value is directly stored in it.

You can work around this by instead doing "foo"[0, 1] which will return "f". I think this behavior is subject to change in Rite, but it's not yet clear what it will change to.

···

On Wednesday 23 March 2005 04:04 am, Robert Klemme wrote:

Randy_Kramer · 23 March 2005 17:35

Randy Kramer wrote:
> Aside: but:
>
> irb(main):005:0> 10.times { puts s1[2].id }
> 211 [times ten]
> => 10
> irb(main):006:0>
>
> Which (maybe) looks nicer/more reasonable, except I don't know why the 3
> digit ID in this case.

string[number] returns either nil or the ASCII number of the character
at that place.

Thanks, but I'm still confused--s1 is "This is a test", so 211 is neither nil
nor ASCII for "i", and besides, I asked for the (object_)id.

211 does happen to be 2*?i+1--maybe there's a clue there? (and the same thing
holds for the previous character (h) which shows up as 209)

My original concern: I was hoping that s1[0] and similar did not create new
objects, and I suspect they don't, but I'm not sure.

Randy Kramer

···

On Wednesday 23 March 2005 10:29 am, Florian Gross wrote:

So "foo"[0] == ?f. And the object_id of low Fixnums is usually very low
as their value is directly stored in it.

You can work around this by instead doing "foo"[0, 1] which will return
"f". I think this behavior is subject to change in Rite, but it's not
yet clear what it will change to.

Florian_Gross · 23 March 2005 20:34

Randy Kramer wrote:

Thanks, but I'm still confused--s1 is "This is a test", so 211 is neither nil nor ASCII for "i", and besides, I asked for the (object_)id.

211 does happen to be 2*?i+1--maybe there's a clue there? (and the same thing holds for the previous character (h) which shows up as 209)

irb(main):001:0> ?i.id
(irb):1: warning: Object#id will be deprecated; use Object#object_id
=> 211

This is consistent with the explanation from my earlier posting:

'So "foo"[0] == ?f. And the object_id of low Fixnums is usually very low as their value is directly stored in it.'

My original concern: I was hoping that s1[0] and similar did not create new objects, and I suspect they don't, but I'm not sure.

Fixnums are never created (they are only referred to) so this does not cause any object creation overhead.

Robert · 23 March 2005 20:39

"Randy Kramer" <rhkramer@gmail.com> schrieb im Newsbeitrag news:200503231224.46149.rhkramer@gmail.com...

Randy Kramer wrote:
> Aside: but:
>
> irb(main):005:0> 10.times { puts s1[2].id }
> 211 [times ten]
> => 10
> irb(main):006:0>
>
> Which (maybe) looks nicer/more reasonable, except I don't know why the 3
> digit ID in this case.

string[number] returns either nil or the ASCII number of the character
at that place.

Thanks, but I'm still confused--s1 is "This is a test", so 211 is neither nil
nor ASCII for "i", and besides, I asked for the (object_)id.

211 does happen to be 2*?i+1--maybe there's a clue there? (and the same thing
holds for the previous character (h) which shows up as 209)

I think there is a relation between object ids and values for Fixnums but I'm not sure. It's also quite unimportant IMHO. But it seems to be exactly the relationship you assumed:

20.times {|ch| printf "%02x %02x %02x\n", ch, ch.id, (ch<<1)+1}

00 01 01
01 03 03
02 05 05
03 07 07
04 09 09
05 0b 0b
06 0d 0d
07 0f 0f
08 11 11
09 13 13
0a 15 15
0b 17 17
0c 19 19
0d 1b 1b
0e 1d 1d
0f 1f 1f
10 21 21
11 23 23
12 25 25
13 27 27
=> 20

My original concern: I was hoping that s1[0] and similar did not create new
objects, and I suspect they don't, but I'm not sure.

They don't because Fixnums are treated specially for performance reasons.

s = "This is a test"

=> "This is a test"

10.times {c=s[2]; p [c, c.chr, c.id]}

[105, "i", 211]
=> 10

Regards

robert

On Wednesday 23 March 2005 10:29 am, Florian Gross wrote:

Randy_Kramer · 24 March 2005 19:05

Thanks, Florian and Robert! (It took a little while for this to sink in, but
I think I've now got it. Hope I can remember it.)

Well, maybe I'd better recap for myself (correct me if I'm still off base):

* (from earlier emails) Something like s1[start, length] (which I
misunderstood earlier to be s1[start, end]) doesn't just let you look at a
portion (substring) of the original string, but actually creates a new
(sub)string (with all the overhead of creating a new string).

* on the other hand, s1[index] does not create a new (one byte) string, but
simply returns a Fixnum (which happens to be an object) representing the
ASCII value of the character. In other words, this has none of the overhead
of creating a new string. (I guess this is where I'm still a little
uncertain--does it go through some of the overhead of creating a new string,
but the resultant new (one byte) strings all have the same value (the ASCII
code for "i" which is Fixnum 105 of which (see next item) there is only one,
which has object_id 211? Does it really matter to me? Probably not, maybe
it's just idle curiosity.)

* when I do an object_id on that Fixnum, I always get the same value (211
for 105 (the Fixnum which represents "i")) because there is only one Fixnum
object in the system with the value 105.

I don't need a response to my "idle curiosity" question under the 2nd bullet
if the rest of my understanding is basicly (sp?) correct.

regards,
Randy Kramer

On Wednesday 23 March 2005 03:34 pm, Florian Gross wrote:

Randy Kramer wrote:
> Thanks, but I'm still confused--s1 is "This is a test", so 211 is neither
> nil nor ASCII for "i", and besides, I asked for the (object_)id.
>
> 211 does happen to be 2*?i+1--maybe there's a clue there? (and the same
> thing holds for the previous character (h) which shows up as 209)

irb(main):001:0> ?i.id
(irb):1: warning: Object#id will be deprecated; use Object#object_id
=> 211

This is consistent with the explanation from my earlier posting:

'So "foo"[0] == ?f. And the object_id of low Fixnums is usually very low
as their value is directly stored in it.'

> My original concern: I was hoping that s1[0] and similar did not create
> new objects, and I suspect they don't, but I'm not sure.

Fixnums are never created (they are only referred to) so this does not
cause any object creation overhead.

Robert · 25 March 2005 12:04

"Randy Kramer" <rhkramer@gmail.com> wrote in message news:200503241403.52395.rhkramer@gmail.com...

Well, maybe I'd better recap for myself (correct me if I'm still off base):

* (from earlier emails) Something like s1[start, length] (which I
misunderstood earlier to be s1[start, end]) doesn't just let you look at a
portion (substring) of the original string, but actually creates a new
(sub)string (with all the overhead of creating a new string).

Correct.

* on the other hand, s1[index] does not create a new (one byte) string, but
simply returns a Fixnum (which happens to be an object) representing the
ASCII value of the character. In other words, this has none of the overhead
of creating a new string.

Correct.

(I guess this is where I'm still a little
uncertain--does it go through some of the overhead of creating a new string,

No.

but the resultant new (one byte) strings all have the same value (the ASCII

It's not a new string but simply a Fixnum; in other languages it would be a Character instance - if there was a specialized class / type for this.

code for "i" which is Fixnum 105 of which (see next item) there is only one,
which has object_id 211?

Right.

Does it really matter to me?

Not really, although understanding the performance effects and other peculiarities of Fixnum can be useful at times.

Probably not, maybe
it's just idle curiosity.)

.... which aids us in learning new things - so it's not too bad to have. (Only cats suffer more often than is good for them from curiosity induced negative effects...)

* when I do an object_id on that Fixnum, I always get the same value (211
for 105 (the Fixnum which represents "i")) because there is only one Fixnum
object in the system with the value 105.

Correct.
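
The points confirmed above can be checked with `equal?`, which compares object identity. This is only a sketch under some assumptions: it targets Ruby 1.9 or later, where `s[index]` returns a one-character String instead of a Fixnum, so `getbyte` stands in for the 1.8 behaviour discussed in this thread; the variable names are illustrative.

```ruby
s = "This is a test"

# s[start, length] builds a new String object on every call.
a = s[0, 4]
b = s[0, 4]
puts a == b        # true: same contents
puts a.equal?(b)   # false: two distinct objects

# getbyte returns the byte as an Integer, like s[index] did on Ruby 1.8.
x = s.getbyte(2)
y = s.getbyte(2)
puts x             # 105
puts x.chr         # i
puts x.equal?(y)   # true: small integers are immediate values, nothing is allocated
```

For a check-one-character loop, comparing the result of `getbyte` against a precomputed integer therefore avoids creating a substring per position.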

I don't need a response to my "idle curiosity" question under the 2nd bullet
if the rest of my understanding is basicly (sp?) correct.

"sp?"?

Happy Easter

robert

Glenn_Parker1 · 25 March 2005 14:38

Robert Klemme wrote:

"Randy Kramer" <rhkramer@gmail.com> wrote in message news:200503241403.52395.rhkramer@gmail.com...

I don't need a response to my "idle curiosity" question under the 2nd bullet
if the rest of my understanding is basicly (sp?) correct.

"sp?"?

(sp?) indicates questionable spelling.

"basicly" -> "basically"

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/>

Randy_Kramer · 25 March 2005 14:54

Thanks!

---<good stuff snipped>----

.... which aids us in learning new things - so it's not too bad to have.
(Only cats suffer more often than is good for them from curiosity induced
negative effects...)

"sp?"?

Sorry, couldn't remember how to spell basically. Looks like it is basically,
but, interestingly, Google found 63,000 plus instances of basicly.

Happy Easter to you!

regards,
Randy Kramer

On Friday 25 March 2005 07:04 am, Robert Klemme wrote:

Lasse_Koskela · 25 March 2005 15:11

Hi all,

I just tried to install Rails with gem and got this nasty error:

C:\ruby\bin>gem install rails --remote

C:\ruby\bin>"c:\ruby\bin\ruby.exe" "c:\ruby\bin\gem" install rails --remote
Attempting remote installation of 'rails'
Updating Gem source index for: http://gems.rubyforge.org
Install required dependency rake? [Yn] y
Install required dependency activesupport? [Yn] y
Install required dependency activerecord? [Yn] y
Install required dependency actionpack? [Yn] y
Install required dependency actionmailer? [Yn] y
Install required dependency actionwebservice? [Yn] y
c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require__': No such file to load -- iconv (LoadError)
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/activesupport-1.0.2/lib/active_support/dependencies.rb:197:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/actionmailer-0.8.0/lib/action_mailer/vendor/tmail/quoting.rb:1
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require__'
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/activesupport-1.0.2/lib/active_support/dependencies.rb:197:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/actionmailer-0.8.0/lib/action_mailer/vendor/tmail/mail.rb:18
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require__'
         ... 20 levels...
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/cmd_manager.rb:90:in
`process_args'
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/cmd_manager.rb:63:in `run'
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/gem_runner.rb:9:in `run'
        from c:/ruby/bin/gem:11

C:\ruby\bin>gem install rails --remote

C:\ruby\bin>"c:\ruby\bin\ruby.exe" "c:\ruby\bin\gem" install rails --remote
Attempting remote installation of 'rails'
c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require__': No such file to load -- iconv (LoadError)
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/activesupport-1.0.2/lib/active_support/dependencies.rb:197:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/actionmailer-0.8.0/lib/action_mailer/vendor/tmail/quoting.rb:1
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require__'
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/activesupport-1.0.2/lib/active_support/dependencies.rb:197:in
`require'
        from c:/ruby/lib/ruby/gems/1.8/gems/actionmailer-0.8.0/lib/action_mailer/vendor/tmail/mail.rb:18
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/loadpath_manager.rb:5:in
`require__'
         ... 20 levels...
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/cmd_manager.rb:90:in
`process_args'
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/cmd_manager.rb:63:in `run'
        from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/gem_runner.rb:9:in `run'
        from c:/ruby/bin/gem:11

Also, running "ruby -version" produces an error, which I guess could be related:

C:\ruby\bin>ruby -version
ruby 1.8.2 (2004-12-25) [i386-mswin32]
-e:1: undefined local variable or method `rsion' for main:Object (NameError)

I had just downloaded and installed the Ruby 1.8.2-14 Windows installer.

My experience with Ruby has so far been playing around with small
hello world'ish scripts so I'd obviously be thankful for any hints
towards what's wrong in my setup.

Thanks.

-Lasse-
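
On the `ruby -version` output quoted above: the interpreter appears to read the single-dash form as `-v` (print the version) followed by `-e rsion` (run `rsion` as a one-line script), which would explain both the version banner and the NameError mentioning `rsion`. `ruby --version` or `ruby -v` avoids it, so this is probably unrelated to the missing `iconv` library. Below is a small sketch of the second half, evaluating a bare unknown identifier; the method name `run_one_liner` is hypothetical.

```ruby
# Hypothetical helper: evaluate a one-line script, roughly as `ruby -e` would,
# and return the NameError it raises (or nil if it runs cleanly).
def run_one_liner(source)
  eval(source)
  nil
rescue NameError => e
  e
end

err = run_one_liner("rsion")
puts err.class     # NameError
puts err.message   # exact wording varies by Ruby version, but it names rsion
```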

Robert · 26 March 2005 10:04

"Randy Kramer" <rhkramer@gmail.com> wrote in message news:200503250952.50603.rhkramer@gmail.com...

Sorry, couldn't remember how to spell basically. Looks like it is basically,
but, interestingly, Google found 63,000 plus instances of basicly.

That might just show that people are apt to make this spelling error.

It's funny how Google introduced a new way of spell checking - a kind of fuzzy logic check: the version with the most Google hits is viewed as the one with the greatest likelihood of correctness.

One could also say that it's a way of making spelling more democratic: correct spelling is not defined by some abstract authority (Duden for German) but by all people using that language. But then again, this might be how it has been for ages - it is just that Google makes spellings spread more rapidly.

Languages - natural and artificial - are really fascinating...

Kind regards

robert
