Regexp: why does (re)* return only last repetition?

Robert · 9 May 2003 08:58

“Clifford Heath” cjh_nospam@managesoft.com wrote in message
news:1052444464.885588@excalibur.osa.com.au…

While trying to build an RE to parse a shell-style regexp into an
array of non-wild, wild, non-wild, wild, etc I found (again) that
the grouping operator (), when followed by *, returns only the last
match into the MatchData:

str = 'foo*bar?baz'
regex = Regexp.new('([*?]|(?:[^*?]+))*', Regexp::EXTENDED)

matches = regex.match(str)
p matches[1..(matches.length-1)]

yields:

["baz"]

That’s standard behavior on all platforms. A possible reason is that
substitutions such as gsub would otherwise need an additional argument
telling them which element of the array to use. If you want the complete
match, just place another set of brackets around the repeated group.
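
A minimal sketch of both points, assuming the string and pattern from the original post. The \A and \z anchors and the variable names are additions for illustration only. A repeated group keeps just its last iteration, while an extra pair of brackets around the repetition captures the whole run:

```ruby
str = 'foo*bar?baz'

# The group is repeated five times, but MatchData keeps only the final iteration.
inner = /\A([*?]|[^*?]+)*\z/.match(str)
last_piece = inner[1]

# An outer group around the repetition captures everything the loop consumed.
outer = /\A(([*?]|[^*?]+)*)\z/.match(str)
whole = outer[1]
still_last = outer[2]

p last_piece, whole, still_last
```

The outer group returns the whole matched text, not the individual pieces, so it does not by itself split the string into tokens.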

To achieve what you want, this is better:

str = 'foo*bar?baz'
matches = str.scan /[*?]|(?:[^*?]+)/

which yields exactly what you were looking for:

["foo", "*", "bar", "?", "baz"]
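
For anyone who wants to try it, here is a small sketch of the scan approach. The labelling step and the names `tokens` and `labelled` are hypothetical additions, not part of the original post:

```ruby
str = 'foo*bar?baz'

# With no capture groups in the pattern, scan returns each whole match in order.
tokens = str.scan(/[*?]|[^*?]+/)

# Hypothetical follow-up: label each token as a wildcard or a literal run.
labelled = tokens.map { |t| [t.match?(/\A[*?]\z/) ? :wild : :literal, t] }

p tokens
p labelled
```

The non-capturing (?:...) from the original is dropped here because nothing follows the group. If the pattern contained a capturing group, scan would return arrays of group contents instead of whole matches.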

Annoying. I wanted ["foo", "*", "bar", "?", "baz"].
How to do this most simply?

Also, why is this invalid:

re = Regexp.new('[a-z\\]')

Why not simply do

re = /[a-z\\]/

robert

Austin_Ziegler2 · 9 May 2003 15:59

It’s invalid because, for some reason, it’s seeing it as a \]. It
seems that interpolation is being done twice; once by the implicit
string constructor and once by Regexp#new. Thus, your string
'[a-z\\]' is seen by Regexp#new as '[a-z\]'. If you double the
backslashes, you’ll get the desired result (e.g., '[a-z\\\\]').
Alternatively, you can explicitly escape the right bracket and it
will work as well: '[a-z\\\]'.

As to why not simply use a literal, Robert: one might not actually be
able to do this, as the regexp could be specified through user input.

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.05.09 at 11:48:08

···

On Fri, 9 May 2003 17:58:11 +0900, Robert Klemme wrote:


Clifford_Heath1 · 11 May 2003 23:17

Robert Klemme wrote:

To achieve what you want, this is better:
matches = str.scan /[*?]|(?:[^*?]+)/

Yup, thanks, much better.

re = Regexp.new('[a-z\\]')
Why not simply do
re = /[a-z\\]/

Because it was part of a larger extended re.

Thoughts on Dir.glob anyone?

Clifford.

Robert · 12 May 2003 08:39

“Clifford Heath” cjh_nospam@managesoft.com wrote in message
news:1052694877.603426@excalibur.osa.com.au…

Robert Klemme wrote:

To achieve what you want, this is better:
matches = str.scan /[*?]|(?:[^*?]+)/

Yup, thanks, much better.

re = Regexp.new('[a-z\\]')
Why not simply do
re = /[a-z\\]/

Because it was part of a larger extended re.

But you can still use re = /…/ instead of re = Regexp.new ‘…’

robert

Austin_Ziegler2 · 12 May 2003 13:18

Again, if and only if one is specifying the regular expression
statically.

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.05.12 at 09:15:33

···

On Mon, 12 May 2003 17:39:19 +0900, Robert Klemme wrote:


Brian_Candler · 12 May 2003 14:51

It doesn’t have to be static:

pat = "ana"
str = "banana"
index = /#{pat}/ =~ str
p index # >> 1
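
A short sketch of the same idea for the user-input case raised earlier in the thread. `user_input` and the sample strings are made up for illustration. `Regexp.escape` is only the right tool when the input should match literally, which may not be what a given program wants:

```ruby
pat = "ana"
index = /#{pat}/ =~ "banana"

# Hypothetical user input containing a regexp metacharacter.
user_input = "a.b"

# Interpolated as-is, the dot acts as a wildcard.
loose = /#{user_input}/ =~ "axb"

# Regexp.escape backslash-quotes metacharacters, so the dot matches literally.
strict_miss = /#{Regexp.escape(user_input)}/ =~ "axb"
strict_hit  = /#{Regexp.escape(user_input)}/ =~ "xa.b"

p index, loose, strict_miss, strict_hit
```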

Regards,

Brian.

···

On Mon, May 12, 2003 at 10:18:00PM +0900, Austin Ziegler wrote:


Kent_Dahl2 · 12 May 2003 15:02

Austin Ziegler wrote:

Clifford Heath:

Robert Klemme wrote:

Clifford Heath:

re = Regexp.new('[a-z\\]')
Why not simply do
re = /[a-z\\]/
Because it was part of a larger extended re.
But you can still use re = /…/ instead of re = Regexp.new ‘…’

Again, if and only if one is specifying the regular expression
statically.

Am I missing something or just misunderstanding what you mean by
specifying it statically?

def check( name, string )
  puts(/Mr. (#{name.capitalize})/.match(string)[1])
end
check "nelson", "Hello Mr. Nelson, how are you?" #=> Nelson

···

On Mon, 12 May 2003 17:39:19 +0900, Robert Klemme wrote:

–
([ Kent Dahl ]/)_ ~ [ http://www.stud.ntnu.no/~kentda/ ]/~
))_student/(( _d L b_/ NTNU - graduate engineering - 5. year )
( __õ|õ// ) )Industrial economics and technological management(
_/ö____/ (_engineering.discipline=Computer::Technology)

Austin_Ziegler2 · 12 May 2003 22:29

You changed the pattern.

1 | pat = "[a-z\\]"
2 | str = "abcd\\efgh"
3 | index = /#{pat}/ =~ str
4 | p index

line 3 results in:
RegexpError: premature end of regular expression: /[a-z\]/

Change the pattern to "[\\a-z]" and the error goes away.

When building a pattern that includes a character class specifier,
it needs to be specified carefully or Regexp.new(pat) and /#{pat}/
won’t work.

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.05.12 at 18:23:38

···

On Mon, 12 May 2003 23:51:44 +0900, Brian Candler wrote:


Austin_Ziegler2 · 12 May 2003 22:32

It’s that specific pattern: "[a-z\\]". So far as I can
tell, the backslash is being interpolated twice (once by String#new
and once by Regexp#new). Is this a bug? I’m not sure, but I think
so; it doesn’t do it twice with "[\\a-z]", "[a-z\\\]", or
"[a-z\\\\]".

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.05.12 at 18:29:34

···

On Tue, 13 May 2003 00:02:34 +0900, Kent Dahl wrote:


Brian_Candler · 12 May 2003 22:54

You changed the pattern.

1 | pat = "[a-z\\]"
2 | str = "abcd\\efgh"
3 | index = /#{pat}/ =~ str
4 | p index

line 3 results in:
RegexpError: premature end of regular expression: /[a-z\]/

Double-quoted strings interpolate backslashes: so in your example pat
contains

  [ a - z \ ]

which is not a valid RE, because it has no terminating close-bracket.
\] means “match a literal ]”, so you have

  [        start of character class
  a        first character in range
  -        range separator
  \]       last character in range

but no ‘end of character class’

It’s the same with /…/ delimiters:

irb(main):001:0> /[a-z\]/
SyntaxError: compile error
(irb):1: premature end of regular expression: /[a-z\]/
from (irb):1
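
A sketch of the same diagnosis that can be run directly. The variable names are illustrative, and the exact error message varies between Ruby versions, so only the exception class is shown:

```ruby
pat = "[a-z\\]"          # six characters: [ a - z \ ]
chars = pat.chars

# The regexp engine reads \] as a literal ], so the class is never closed.
error = begin
  Regexp.new(pat)
  nil
rescue RegexpError => e
  e
end

# Two backslashes in the pattern text mean "a literal backslash" inside the class.
fixed = Regexp.new("[a-z\\\\]")   # seven characters: [ a - z \ \ ]
hit = fixed =~ "abcd\\efgh"
backslash_hit = fixed =~ "\\"

p chars, error.class, hit, backslash_hit
```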

Change the pattern to "[\\a-z]" and the error goes away.

Yes, that’s

  [ \ a - z ]

I guess ‘\a’ is treated as just ‘a’ in a character class.

It’s not ‘match start of string’ anyway, as that doesn’t make sense in a
character class…

Regards,

Brian.

···

On Tue, May 13, 2003 at 07:29:24AM +0900, Austin Ziegler wrote:

Michael_Campbell1 · 12 May 2003 22:56

It’s that specific pattern: "[a-z\\]". So far as I can
tell, the backslash is being interpolated twice (once by String#new
and once by Regexp#new). Is this a bug? I’m not sure, but I think
so; it doesn’t do it twice with "[\\a-z]", "[a-z\\\]", or
"[a-z\\\\]".

Is this with 1.8? I thought I remembered a thread about some new character
class “warnings” code that does all sorts of weird, unusual things that the
pre-1.8 versions didn’t do.

Austin_Ziegler2 · 13 May 2003 03:02

You changed the pattern.

1 | pat = "[a-z\\]"
2 | str = "abcd\\efgh"
3 | index = /#{pat}/ =~ str
4 | p index

line 3 results in:
RegexpError: premature end of regular expression: /[a-z\]/
Double-quoted strings interpolate backslashes: so in your example
pat contains:
[a-z\]

No:
irb(main):001:0> "[a-z\\]"
=> "[a-z\\]"

It’s rather irrelevant anyway, as the exact same behaviour happens
with single-quoted strings (change line 1 to ‘’ instead of “”).

Change the pattern to "[\\a-z]" and the error goes away.
Yes, that’s
[ \ a - z ]

No:
irb(main):008:0> "[\\a-z]"
=> "[\\a-z]"

I guess ‘\a’ is treated as just ‘a’ in a character class.

No:
irb(main):009:0> /[\a]/ =~ "abc\a"
=> 3

This behaviour is in both 1.6.8 and 1.8.0 (2003-05-09). I really
think that it’s a bug because [\\a-z] and [a-z\\] should be
semantically equivalent. It works when done as a literal; it doesn’t
work when substitution is done.

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.05.12 at 22:54:26

···

On Tue, 13 May 2003 07:54:02 +0900, Brian Candler wrote:

On Tue, May 13, 2003 at 07:29:24AM +0900, Austin Ziegler wrote:

Austin_Ziegler2 · 13 May 2003 03:02

As I noted to Brian Candler, both 1.6.8 and 1.8.0/2003-05-09.

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.05.12 at 23:02:05

···

On Tue, 13 May 2003 07:56:07 +0900, Mike Campbell wrote:


Brian_Candler · 13 May 2003 07:10

You changed the pattern.

1 | pat = "[a-z\\]"
2 | str = "abcd\\efgh"
3 | index = /#{pat}/ =~ str
4 | p index

line 3 results in:
RegexpError: premature end of regular expression: /[a-z\]/
Double-quoted strings interpolate backslashes: so in your example
pat contains:
[a-z\]

No:
irb(main):001:0> "[a-z\\]"
=> "[a-z\\]"

Now try:
a = "[a-z\\]"
p a.length # => 6
a.each_byte { |c| puts "%c" % c } # => [ a - z \ ]

irb is deceiving you, because ‘inspect’ outputs strings in a way which can
be re-input as strings into Ruby. A single backslash is printed as two
backslashes.
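
A quick way to see the difference between what a string holds and what inspect shows. This is a sketch with illustrative variable names:

```ruby
a = "[a-z\\]"

shown  = a.inspect   # what irb prints: a re-readable literal, quotes included
actual = a.length    # number of characters really stored

# inspect doubles the backslash; the string itself holds just one.
in_string  = a.chars.count("\\")
in_inspect = shown.chars.count("\\")

puts a       # prints the raw characters
puts shown   # prints the escaped literal form
```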

It’s rather irrelevant anyway, as the exact same behaviour happens
with single-quoted strings (change line 1 to ‘’ instead of “”).

And single-quoted strings have the same issue it turns out:

      a = '\\'
      p a.length   # => 1

I am not sure why that should be, since ‘\n’.length is 2. I think you may
have uncovered a bug here.

No:
irb(main):009:0> /[\a]/ =~ "abc\a"
=> 3

“\a” in a double-quoted string is a ‘BEL’ (a for Audible), ASCII code 7

irb(main):001:0> "\a"[0]
=> 7

Brian.

···

On Tue, May 13, 2003 at 12:02:06PM +0900, Austin Ziegler wrote:

On Tue, 13 May 2003 07:54:02 +0900, Brian Candler wrote:

On Tue, May 13, 2003 at 07:29:24AM +0900, Austin Ziegler wrote:

Brian_Candler · 13 May 2003 07:51

I just realised, it’s not a bug: it’s necessary to give a mechanism for
inserting a single quote within a single-quoted string. This is done by
escaping it with backslash:

      '\''.length     #>> 1   (just a single quote)

But that in turn means that to get a literal backslash it also needs to be
escaped:

      '\\'.length     #>> 1   (just a backslash)

All other backslash-X sequences are inserted as the backslash and the X.
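
The quoting rules just described, as a small runnable sketch (variable names are illustrative):

```ruby
single_quote = '\''   # escaped quote: one character
backslash    = '\\'   # escaped backslash: one character
not_newline  = '\n'   # no escape here: backslash followed by n
newline      = "\n"   # double quotes do turn \n into a newline

p single_quote.length, backslash.length, not_newline.length, newline.length
```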

Regexps are subject to certain quoting rules too. Try:

z = /ab\c/
puts z.inspect

This program crashes under both ruby 1.6 (“unterminated regexp meets end of
file”) and 1.8 (“unterminated string meets end of file”).

But z = /ab\d/ works. Somebody care to explain that one?

Regards,

Brian.

···

On Tue, May 13, 2003 at 04:10:36PM +0900, Brian Candler wrote:


Austin_Ziegler2 · 13 May 2003 17:05

Okay, I think the problem here is:

a = "[a-z\\]"
b = /#{a}/

When the RE is being built, a is being interpolated again. If I’ve
built my regular expression source string properly, it should NOT be
interpolated again. Is there any way to build a RE from a string
which does not reinterpolate the string? Is there a way to add such
a functionality if it does not exist?

Frankly, /#{a}/ for the above string should be no different than
/[a-z\\]/.

-austin
– Austin Ziegler, austin@halostatue.ca on 2003.05.13 at 13:03:02

ts1 · 13 May 2003 09:34

But z = /ab\d/ works. Somebody care to explain that one?

\d digit
\c or \C- control character

/\cM/ control-M

pigeon% ruby -e 'p (/\cM/ =~ 13.chr); p (/\C-M/ =~ 13.chr)'
0
0
pigeon%

Guy Decoux
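
A sketch of the two escapes in question, written without the shell one-liner. The sample strings are made up for illustration:

```ruby
cr = 13.chr

# \cM is a control-character escape: control-M, i.e. carriage return.
control_m = /\cM/ =~ cr

# \d is the digit class, which is why /ab\d/ parses without complaint.
digit = /ab\d/ =~ "xab7"

p control_m, digit
```

In /ab\c/ the \c has no character left to apply to except the closing slash, which would explain why the literal never terminates.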

David_A_Black2 · 13 May 2003 18:21

Hi –

Okay, I think the problem here is:

a = "[a-z\\]"
b = /#{a}/

When the RE is being built, a is being interpolated again. If I’ve
built my regular expression source string properly, it should NOT be
interpolated again. Is there any way to build a RE from a string
which does not reinterpolate the string? Is there a way to add such
a functionality if it does not exist?

I’m not sure what you mean by “interpolate again” (or reinterpolate).
I think everything is happening just once: you’ve created a string
([a-z\]), and you’re interpolating it into a regex.

Frankly, /#{a}/ for the above string should be no different than
/[a-z\\]/.

Except… a is a 6-char string ([a-z\]), and the contents of the regex
there are 7 characters. There’s no way for Ruby to backtrack and
know that, when you created the string, you typed \ twice. You might
have produced the string this way:

a = [91, 97, 45, 122, 92, 93].map {|c| c.chr}.join # [a-z\]
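
The six-versus-seven count can be checked with a short sketch. Variable names are illustrative, and `Regexp#source` is used here only to read back the text between the slashes:

```ruby
# Built from character codes, so no string-literal escaping is involved.
a = [91, 97, 45, 122, 92, 93].map { |c| c.chr }.join

same_as_literal = (a == "[a-z\\]")
string_length   = a.length

# A regexp literal's source is the text between the slashes, taken verbatim.
literal_source = /[a-z\\]/.source
source_length  = literal_source.length

# Passing that seven-character text to Regexp.new gives an equivalent regexp.
rebuilt = Regexp.new(literal_source)
rebuilt_hit = rebuilt =~ "\\"

p a, string_length, literal_source, source_length, rebuilt
```

So a string meant for Regexp.new or /#{...}/ has to contain the same characters one would type between the slashes, here two backslashes rather than one.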

David

···

On Wed, 14 May 2003, Austin Ziegler wrote:

–
David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

HAL_9000 · 13 May 2003 18:32

I think what he wants is reasonable in a way…
it would be nice if there were a way to specify
a string (or a regex) that did not expand
backslashes. If there were, that would solve his
minor dilemma.

“Interpolate” is not the right word here… but
there are definitely two levels of processing
going on. For example: The sequence of characters
"\\\\n" gets mapped internally to "\\n" and if that
in turn were used in a regex, it would be collapsed
again into \n. Correct?

What about a naive solution like this?

class String
  def raw
    # inspect wraps the string in double quotes; strip them
    self.inspect[1..-2]
  end
end

And then /#{myvar.raw}/ or some such. Am I way
off base here? This is untested.
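
A sketch for trying the idea out, written as a standalone helper rather than by reopening String (the helper and variable names are illustrative). One caveat: inspect also rewrites double quotes, control characters and some other characters, so this should not be read as a general-purpose solution.

```ruby
# Standalone version of the idea (a helper method instead of reopening String).
def raw(str)
  str.inspect[1..-2]   # drop the surrounding double quotes
end

pat = '[a-z\\]'          # six characters, single backslash
doubled = raw(pat)       # inspect writes the backslash twice

re = Regexp.new(doubled)
matches_backslash = re =~ "\\"
matches_bracket   = re =~ "]"

p doubled, matches_backslash, matches_bracket
```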

Hal

···

----- Original Message -----
From: dblack@superlink.net
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Tuesday, May 13, 2003 1:21 PM
Subject: Re: Regexp: why does (re)* return only last repetition?


Brian_Candler · 13 May 2003 22:55

Ugh. But thank you for explaining!

Brian.

···

On Tue, May 13, 2003 at 06:34:09PM +0900, ts wrote:

But z = /ab\d/ works. Somebody care to explain that one?

\d digit
\c or \C- control character

/\cM/ control-M

pigeon% ruby -e 'p (/\cM/ =~ 13.chr); p (/\C-M/ =~ 13.chr)'
0
0
pigeon%
