Possible bug with StringScanner class

John_Halderman · 22 July 2005 19:30

I'm not sure if this is a bug or intentional behavior, so I thought I would
post it here to see what the community thinks. If you set up a StringScanner
object to perform iterative matching on a string, \A and ^ seem to always
match at the current scan position. It seems to me that \A should only match
if it is the first match performed, and ^ should only match if bol? returns
true, which should be after a \n or if it is the first match performed. Here
is some code I ran in irb to illustrate the problem:

require 'strscan'
sc = StringScanner.new("the white elephant eats grass")
sc.scan(/the\s+/)
sc.bol?
sc.scan(/^white\s+/)
sc.scan(/\Aelephant\s+/)

This code produced the following result:
irb(main):001:0> require 'strscan'
=> true
irb(main):002:0> sc = StringScanner.new("the white elephant eats grass")
=> #<StringScanner 0/29 @ "the w...">
irb(main):003:0> sc.scan(/the\s+/)
=> "the "
irb(main):004:0> sc.bol?
=> false
irb(main):005:0> sc.scan(/^white\s+/)
=> "white "
irb(main):006:0> sc.scan(/\Aelephant\s+/)
=> "elephant "

Any thoughts or advice on this matter are greatly appreciated.

-John Halderman

Eric_Mahurin1 · 22 July 2005 19:53

You should think of the current position as the beginning of
the string for matching. In addition, the regexp that scan
gets is implicitly anchored to that spot. So specifying \A or
^ at the beginning of a regexp for scan is redundant.
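
A minimal sketch of the point above, assuming a stock Ruby with the scanner in its default mode. The method name anchoring_demo is made up for illustration. It shows that scan only matches at the scan pointer, with or without an explicit anchor:

```ruby
require 'strscan'

# Collects the return value of each scan call, in order.
def anchoring_demo
  sc = StringScanner.new("the white elephant")
  results = []
  results << sc.scan(/the\s+/)      # matches at position 0
  results << sc.scan(/elephant/)    # nil: "elephant" is ahead of the pointer, not at it
  results << sc.scan(/white\s+/)    # matches at the pointer with no anchor written
  results << sc.scan(/\Aelephant/)  # \A is satisfied at the pointer too
  results
end

p anchoring_demo
```

A failed scan returns nil and leaves the pointer where it was, which is why the third call still finds "white ".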

John_Halderman · 22 July 2005 20:13

> You should think of the current position as the beginning of
> the string for matching. In addition, the regexp that scan
> gets is implicitly anchored to that spot. So specifying \A or
> ^ at the beginning of a regexp for scan is redundant.

There is nothing in the documentation to suggest that the current position
should be considered the beginning of a string for matching purposes, only
that any match must start at that position. That would mean a regexp
beginning with ^ would need the current position to be preceded by \n or be
at the beginning of the string in order for it to match. Furthermore,
the existence of bol? suggests that the current position is not to be
considered the beginning of the line. As for whether it should be considered
the beginning of the string, that remains ambiguous, although I believe it
makes more sense for it not to be so.
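
For readers on a current Ruby: newer strscan releases accept a fixed_anchor: true option to StringScanner.new which, as far as I know, gives \A and ^ the meaning argued for here. It did not exist when this thread was written, so the sketch below assumes your installed strscan supports it. The method name fixed_anchor_demo is only for illustration.

```ruby
require 'strscan'

# Assumes a strscan version that supports the fixed_anchor keyword.
def fixed_anchor_demo
  sc = StringScanner.new("the white\nelephant", fixed_anchor: true)
  results = []
  results << sc.scan(/the\s+/)      # "the "
  results << sc.bol?                # false: pointer is mid-line
  results << sc.scan(/^white/)      # nil: ^ now needs a real line start
  results << sc.scan(/white\n/)     # "white\n"
  results << sc.bol?                # true: previous character is \n
  results << sc.scan(/\Aelephant/)  # nil: \A means start of the whole string
  results << sc.scan(/^elephant/)   # "elephant"
  results
end

p fixed_anchor_demo
```

Without the option the scanner keeps the behavior described in this thread, so existing code is unaffected.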

Eric_Mahurin1 · 22 July 2005 20:26

I think it makes perfect sense. scan/scan_until/etc only can
look at what is after the current position. They have no
visibility to what is before the current position. So, you
should consider it to be the beginning of the string for
matching purposes. Whether you like it or not, that is the way
it works and I think it is intentional.
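
One way to see the "no visibility" point, assuming the default scanner mode: a lookbehind cannot reach text the scanner has already consumed. The method name lookbehind_demo is hypothetical.

```ruby
require 'strscan'

def lookbehind_demo
  sc = StringScanner.new("the white elephant")
  sc.scan(/the\s+/)                       # consume "the "
  blind = sc.scan(/(?<=the )white/)       # lookbehind needs the consumed "the "
  plain = sc.scan(/white/)                # same text, no lookbehind
  [blind, plain]
end

p lookbehind_demo
```

On the Ruby versions I am aware of, the first element is nil and the second is "white", which matches the description that matching starts from the pointer as if it were the start of the string.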

John_Halderman · 22 July 2005 20:49

> I think it makes perfect sense. scan/scan_until/etc only can
> look at what is after the current position. They have no
> visibility to what is before the current position. So, you
> should consider it to be the beginning of the string for
> matching purposes. Whether you like it or not, that is the way
> it works and I think it is intentional.

I understand the way it works, I am not debating how it works. However I do
not think this is intentional behavior. From the documentation provided via
ri:

Scanning a string means remembering the position of a _scan
pointer_, which is just an index. The point of scanning is to move
forward a bit at a time, so matches are sought after the scan
pointer; usually immediately after it.

Given the string "test string", here are the pertinent scan pointer
positions:

  t e s t   s t r i n g
0 1 2 ...             1
                      0

When you #scan for a pattern (a regular expression), the match must
occur at the character after the scan pointer. If you use
#scan_until, then the match can occur anywhere after the scan
pointer. In both cases, the scan pointer moves _just beyond_ the
last character of the match, ready to scan again from the next
character onwards. This is demonstrated by the example above.

This says nothing about what scan has available to it when it matches, only
where the match must occur. When you match a ^, the match always happens at
the first character after a \n or at the beginning of a string. Therefore
the position of the match would still be valid for the purposes of scan even
though the \n was before the current scan position. This can be demonstrated
with the following code:

r = /^abc/
s = "efg\nabc"
m = r.match(s)
s[m.begin(0)..m.end(0)]

which produces the following output:

irb(main):001:0> r = /^abc/
=> /^abc/
irb(main):002:0> s = "efg\nabc"
=> "efg\nabc"
irb(main):003:0> m = r.match(s)
=> #<MatchData:0xb7eaf45c>
irb(main):004:0> s[m.begin(0)..m.end(0)]
=> "abc"

As you can see, the \n is not included in the match but is required for the
match to occur. Therefore I believe it only makes sense that scan should be
using the bol? to determine if a regexp beginning with a ^ matches, not
always matching that. That is why it seems to me that this is an oversight
in the implementation of StringScanner.

-John Halderman
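
The same argument can be made with an explicit start offset. This is a small sketch using the optional position argument of Regexp#match; the method name caret_with_offset_demo is made up for illustration. Starting the search after the \n still lets ^ succeed, while starting mid-line does not:

```ruby
def caret_with_offset_demo
  s = "efg\nabc"
  at_line_start = /^abc/.match(s, 4)  # search starts at "a", right after the \n
  mid_line      = /^bc/.match(s, 5)   # search starts at "b", preceded by "a"
  matched = s[at_line_start.begin(0)...at_line_start.end(0)]  # exclusive range
  [matched, at_line_start.begin(0), mid_line]
end

p caret_with_offset_demo
```

The \n at index 3 lies before the start offset and is not part of the match, yet the engine consults it when deciding whether ^ holds.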

Eric_Mahurin1 · 22 July 2005 21:25

I think quoting the documentation only hurt your argument. It
says "matches are sought after the scan pointer" right there.
I believe most of the methods do exactly that, but there are
some exceptions that look/go backwards. You mentioned one:
bol? - looks back one character to see if it is a newline or at
pos=0. I think the reason they put this method in is for the
purpose you are wanting. What's wrong with using:

scanner.bol? and scanner.scan(/.../)

instead of trying to get this to do what you want:

scanner.scan(/^.../)

BTW, if you want to try a more general
iterator(external)/cursor/stream/scanner, try my cursor
package:

http://rubyforge.org/projects/cursor/

I have some regexp stuff in there that acts like StringScanner,
but it was an afterthought and I will probably redo its
interface.
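
The bol? guard suggested above could be wrapped up roughly like this. It is a sketch only: scan_at_bol and bol_guard_demo are hypothetical names, not part of strscan.

```ruby
require 'strscan'

# Hypothetical helper: scan only when the pointer sits at a line start.
# Returns nil (rather than false) when the guard fails.
def scan_at_bol(scanner, pattern)
  scanner.bol? ? scanner.scan(pattern) : nil
end

def bol_guard_demo
  sc = StringScanner.new("one\ntwo")
  results = []
  results << scan_at_bol(sc, /one/)  # pointer at 0, so bol? is true
  results << scan_at_bol(sc, /\n/)   # pointer after "one", bol? is false
  results << sc.scan(/\n/)           # consume the newline without the guard
  results << scan_at_bol(sc, /two/)  # pointer right after \n, bol? is true
  results
end

p bol_guard_demo
```

This only helps when the caller knows in advance which patterns are meant to be line-anchored.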

John_Halderman · 22 July 2005 21:47

Like I was saying though, ^ doesn't match at the newline, but only requires
that a newline exist before the character it is considering for matching.
Technically, since StringScanner simply uses a pointer to store the current
match location, the entire string is still available to you to perform tests
on. In order to implement this you would only need one character of lookback,
which wouldn't be a huge deal from what I can tell, especially in light of
the fact that bol? is already implemented.

The reason I can't use bol? and a regexp is that I am not writing the
regexp and I will not know what to expect. It isn't even a matter of what I
am implementing but a question of whether StringScanner is
implemented correctly and the documentation is incorrect, or the
documentation is correct and StringScanner is implemented incorrectly. I
believe it to be the latter because it would provide more useful
functionality, and a more correct interpretation of regular expressions. For
my own purposes I will have to implement my own StringScanner-type class
that meets my requirements, but I would like to see the discrepancies
between the documentation and the StringScanner class resolved.

Thanks for your input.
-j
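
A rough sketch of what such a replacement could look like when the regexps come from elsewhere. It assumes \G plus the position argument of Regexp#match is an acceptable way to pin the match to the scan pointer; strict_scan and strict_scan_demo are hypothetical names, not strscan API.

```ruby
require 'strscan'

# Hypothetical helper: match +pattern+ exactly at the scan pointer while the
# regexp engine sees the whole string, so ^ and \A keep their usual meaning.
# Note: it moves the pointer but does not update the scanner's own match
# state (matched, pre_match, and so on).
def strict_scan(scanner, pattern)
  anchored = Regexp.new("\\G(?:#{pattern.source})", pattern.options)
  m = anchored.match(scanner.string, scanner.charpos)
  return nil unless m
  scanner.pos += m[0].bytesize  # pos counts bytes, so advance by byte length
  m[0]
end

def strict_scan_demo
  sc = StringScanner.new("the white\nelephant")
  [
    strict_scan(sc, /the\s+/),      # matches at the start of the string
    strict_scan(sc, /^white/),      # mid-line, so ^ fails
    strict_scan(sc, /white\n/),     # plain match at the pointer
    strict_scan(sc, /\Aelephant/),  # not the start of the string, so \A fails
    strict_scan(sc, /^elephant/)    # right after \n, so ^ succeeds
  ]
end

p strict_scan_demo
```

Because the caller's pattern is wrapped rather than inspected, this works without knowing whether the pattern begins with an anchor.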
