Novice Q: What's the difference between /\s*/ and /(\s)*/?

(Mike Meng) #1

Hi,
  I'm new to Ruby and am reading 'Programming Ruby 2/e' now. I encountered
a tricky problem while reading the 'String' section of chapter 5. Here is the
problem:

# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/\s*\|\s*/)
# code end

Running the code, we get:
file=='/jazz/j00319.mp3'
duration=='2:58'
artist=='Louis Armstrong'
title=='Wonderful World'

But if I change the regex pattern in 'split' to /(\s)*\|(\s)*/,
that is,
# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/(\s)*\|(\s)*/)
# code end

We get:
file=='/jazz/j00319.mp3'
duration==' '
artist==' '
title=='2:58'

What makes the difference? Any comments are appreciated.

(W. James) #2

Without the captures, the substrings on which the string is split
are discarded. When you include captures, they are included in
the resulting array. Which makes sense: why would you include
captures if you didn't want to do something with them?
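
A minimal sketch of that point, using Mike's line (the helper name `split_fields` is made up for illustration). It also shows a non-capturing group, `(?:...)`, which groups without saving anything, so String#split adds nothing extra to the result:

```ruby
# Hypothetical helper, not from the book: split one record on a pattern.
def split_fields(line, pattern)
  line.chomp.split(pattern)
end

line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'

# No groups: the delimiters are discarded, leaving 4 fields.
p split_fields(line, /\s*\|\s*/)

# Capturing groups: each captured space is added to the result,
# which is why duration and artist came out as ' '.
p split_fields(line, /(\s)*\|(\s)*/)

# Non-capturing groups: grouping only, again 4 fields.
p split_fields(line, /(?:\s)*\|(?:\s)*/)
```

If the parentheses are only there for grouping, the `(?:...)` form is probably what you want.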

(daz) #3

Mike Meng wrote:

# code
line = '/jazz/j00319.mp3 | 2:58 | Louis Armstrong | Wonderful World'
file, duration, artist, title = line.chomp.split(/\s*\|\s*/)
# code end

[...]
While if I change the regex pattern in 'split' to /(\s)*\|(\s)*/,
[...]
What makes the difference? Any comments are appreciated.

Hi Mike,

You're not seeing the difference because of your assignments.
Try playing with this:

#-----------------------------------------------------------
def splt(patt)
  res = LINE.split(patt)
  print "#-> (%s)\n#-> %2d: " % [patt.inspect, res.size]
  res.to_a.each {|col| print ' (%s)' % [col]}
  puts; puts
end

LINE = 'ABC_:_KLM_:_NOP_:_XYZ'

splt(/_:_/)
splt(/(_:_)/)
splt(/_(:)_/)
splt(/(_):(_)/)
splt(/((_):(_))/)
splt(/((_)(:)(_))/)
splt(/_:_K/)
splt(/(_:_)K/)
splt(/(_:_K)/)
splt(/((_:_K))/)
splt(/(((_:_K)))/)
#-----------------------------------------------------------

#-> (/_:_/)
#-> 4: (ABC) (KLM) (NOP) (XYZ)

#-> (/(_:_)/)
#-> 7: (ABC) (_:_) (KLM) (_:_) (NOP) (_:_) (XYZ)

#-> (/_(:)_/)
#-> 7: (ABC) (:) (KLM) (:) (NOP) (:) (XYZ)

#-> (/(_):(_)/)
#-> 10: (ABC) (_) (_) (KLM) (_) (_) (NOP) (_) (_) (XYZ)

#-> (/((_):(_))/)
#-> 13: (ABC) (_:_) (_) (_) (KLM) (_:_) (_) (_) (NOP) (_:_) (_) (_) (XYZ)

#-> (/((_)(:)(_))/)
#-> 16: (ABC) (_:_) (_) (:) (_) (KLM) (_:_) (_) (:) (_) (NOP) (_:_) (_) (:) (_) (XYZ)

#-> (/_:_K/)
#-> 2: (ABC) (LM_:_NOP_:_XYZ)

#-> (/(_:_)K/)
#-> 3: (ABC) (_:_) (LM_:_NOP_:_XYZ)

#-> (/(_:_K)/)
#-> 3: (ABC) (_:_K) (LM_:_NOP_:_XYZ)

#-> (/((_:_K))/)
#-> 4: (ABC) (_:_K) (_:_K) (LM_:_NOP_:_XYZ)

#-> (/(((_:_K)))/)
#-> 5: (ABC) (_:_K) (_:_K) (_:_K) (LM_:_NOP_:_XYZ)

daz

(Julian Leviston) #4

I'm not sure if someone's already answered this, but...

Putting parentheses around things groups them, and the group is treated as though it's a single regexp.

So:
/\s*/ means "match a space, zero or more times, to the extent of the contiguous spaces."

But
/(\s)*/ means "match a space, zero or more times, to the extent of THIS CONTIGUOUS MATCH." It first matches zero spaces, then the limit of the zero spaces is ... (funnily enough) zero spaces, so it doesn't go any further. You don't want to use parentheses.

There have been whole books written on regular expressions. If you're going to use them well, they're worth reading, I'd suggest.

Julian.

(Mike Meng) #5

Thank you, William.

Is this behavior defined by the regex spec or by String#split? Where can I
find a detailed explanation?

mike

(Gavin Kistner) #6

Actually, using parentheses here will not affect what is matched, only what is saved. Even with the parens, each time the quantifier repeats it re-matches the character class. Either that, or I'm misinterpreting the results below:

" \t\nHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
" \t\nHello".match( /^(\s)*(\w+)/ ) #=> "\n" , "Hello"
" \t\nHello".match( /^(\s*)(\w+)/ ) #=> " \t\n" , "Hello"
"\t Hello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\t Hello".match( /^(\s)*(\w+)/ ) #=> " " , "Hello"
"\t Hello".match( /^(\s*)(\w+)/ ) #=> "\t " , "Hello"
"\n \tHello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"\n \tHello".match( /^(\s)*(\w+)/ ) #=> "\t" , "Hello"
"\n \tHello".match( /^(\s*)(\w+)/ ) #=> "\n \t" , "Hello"

(Results are the first two saved subexpressions of the match.)

strings = [
   " \t\nHello",
   "\t Hello",
   "\n \tHello"
]

patterns = [
   /^\s*(\w+)/,
   /^(\s)*(\w+)/,
   /^(\s*)(\w+)/
]

strings.each_with_index{ |str, str_num|
   patterns.each_with_index{ |re, re_num|
     if match = str.match( re )
       info = [ str.inspect, re.inspect, match[1].inspect, match[2].inspect ]
       puts "%s.match( %-14s ) #=> %-8s, %-5s" % info
     end
   }
}
puts "\n(Results are the first two saved subexpressions of the match.)"
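
A short follow-up sketch (the helper name `whole_and_first` is hypothetical) that checks the same claim from the other side: the overall match, match[0], should be identical for all three patterns, and only the saved group differs.

```ruby
# Hypothetical helper: return [whole match, group 1] for a pattern.
def whole_and_first(str, re)
  m = str.match(re)
  [m[0], m[1]]
end

str = " \t\nHello"

p whole_and_first(str, /^\s*(\w+)/)    # group 1 is the word
p whole_and_first(str, /^(\s)*(\w+)/)  # group 1 is the last whitespace char matched
p whole_and_first(str, /^(\s*)(\w+)/)  # group 1 is the whole run of whitespace
```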

(Mike Meng) #7

Thank you, Julian.

I checked O'Reilly's Mastering Regular Expressions by Jeff Friedl. On
page 326, it says:

"Capture parentheses change the whole face of split. When they are
used, the return list has additional, independent elements interjected
for the items captured by the parentheses."
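
In Ruby this is String#split behavior rather than something the regex does by itself, and it can be useful on purpose. A small sketch, with a made-up helper name `tokens`, that keeps the operators when splitting an expression:

```ruby
# Hypothetical helper: split an arithmetic string, keeping the operators.
# The capture group around [+-] makes split include each operator
# as its own element; the surrounding \s* is matched but not captured.
def tokens(expr)
  expr.split(/\s*([+-])\s*/)
end

p tokens("10 + 2 - 3")
```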

(Gavin Kistner) #8

Three more pertinent data points:

"Hello".match( /^\s*(\w+)/ ) #=> "Hello" , nil
"Hello".match( /^(\s)*(\w+)/ ) #=> nil , "Hello"
"Hello".match( /^(\s*)(\w+)/ ) #=> "" , "Hello"

(Jeff Wood) #9

Ok, now for a clean and simple answer...

Parentheses create "groups". Groups give you the ability to save
away parts of the matched pattern for easy access after a match has
been made. And yes, the O'Reilly book "Mastering Regular Expressions"
would be a good read; it explains this concept quite completely.

When you use Regexp#match and a match is found, you get back a
MatchData object, which provides a [] method for array-style
access.

Element 0 ( [0] ) provides the COMPLETE match. If you defined any
groups in the regular expression, they will then appear as additional
elements in the MatchData [] array. Each group is then accessible
using ordinals ( [1] .. [9] I don't know what happens after 9, I've
never needed that many ) ... Remember that if you define a group to be
inside an optional region of the regular expression, that group will
return nil.

So, if you were parsing phone numbers in the format of:

###-###-####

You could save yourself a bit of code and define your regex to be:

a = "My phone number is : 800-555-1212"
b = /(\d{3})\-(\d{3})\-(\d{4})/

c = b.match( a )

if c
  puts c[0] # returns the complete match : 800-555-1212
  puts c[1] # returns group 1 : 800
  puts c[2] # returns group 2 : 555
  puts c[3] # returns group 3 : 1212
else
  puts "Not a match"
end

I hope that helps. Remember that [0] always exists, but the other
items only exist if you define groups within your regular expression.
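
A quick sketch of that last point (the helper name `summarize` is made up). It uses MatchData#captures, which returns just the groups without element 0:

```ruby
# Hypothetical helper: summarize a match as [whole match, captured groups],
# or nil when the pattern does not match at all.
def summarize(re, str)
  m = re.match(str)
  m && [m[0], m.captures]
end

p summarize(/\d{3}-\d{4}/, "call 555-1212")      # no groups: captures is empty
p summarize(/(\d{3})-(\d{4})/, "call 555-1212")  # two groups
p summarize(/\d{3}-\d{4}/, "no number here")     # no match: nil
```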

j.

--
"So long, and thanks for all the fish"

Jeff Wood

(David A. Black) #10

Hi --

On Thu, 25 Aug 2005, Jeff Wood wrote:

> Ok, now for a clean and simple answer...
>
> Parentheses create "groups". Groups give you the ability to save
> away parts of the matched pattern for easy access after a match has
> been made. And yes, the O'Reilly book "Mastering Regular Expressions"
> would be a good read; it explains this concept quite completely.
>
> When you use Regexp#match, if you find a match, you are returned a
> MatchData object which provides an [] API for standard array style
> access.
>
> Element 0 ( [0] ) provides the COMPLETE match. If you defined any
> groups in the regular expression, they will then appear as additional
> elements in the MatchData [] array. Each group is then accessible
> using ordinals ( [1] .. [9] I don't know what happens after 9, I've
> never needed that many ) ...

I believe that would be 10 :-)

   irb(main):003:0> m = /(((((((((((a)))))))))))/.match("a")
   => #<MatchData:0xbf4bbd84>
   irb(main):004:0> $10
   => "a"
   irb(main):005:0> m[10]
   => "a"

> Remember that if you define a group to be
> inside an optional region of the regular expression, that group will
> return nil.

I'm not sure what you mean there.

   irb(main):009:0> /((a)?)?/.match("a").to_a
   => ["a", "a", "a"]

David

--
David A. Black
dblack@wobblini.net

(Jeff Wood) #11

Regarding nil values for groups:

If you define your regular expression like this:

a = "My phone number is : 555-1212"
b = /((\d{3})\-)?(\d{3})\-(\d{4})/
c = b.match( a )
p c.to_a

c.to_a should be [ "555-1212", nil, nil, "555", "1212" ]

[0] holds a copy of the complete match
[1] matches the parens from char 0 through char 10 of the pattern
      - The following question mark states that either 0 or 1 instance of the
      previous group should be accepted in the pattern match.
[2] matches the parens from char 1 through char 7
[3] matches the parens from char 12 through char 18
[4] matches the parens from char 21 through char 27
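
As a rough sketch of how you might deal with those nils in practice (the helper name `parse_phone` and the hash keys are made up for illustration; the pattern is the same one as above):

```ruby
# Hypothetical helper: pull the parts out of a phone number whose
# area code is optional. Group 2 is nil when the area code is absent.
def parse_phone(text)
  m = /((\d{3})\-)?(\d{3})\-(\d{4})/.match(text)
  return nil unless m
  { area: m[2], exchange: m[3], line: m[4] }
end

p parse_phone("My phone number is : 555-1212")      # area is nil
p parse_phone("My phone number is : 800-555-1212")  # area is "800"
p parse_phone("No number here")                     # no match: nil
```

Checking a group for nil before using it is enough to tell "the optional part was absent" from "it matched".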

I hope that made sense too.

j.

(ts) #12

On Thu, 25 Aug 2005, Jeff Wood wrote:

> Remember that if you define a group to be
> inside an optional region of the regular expression, that group will
> return nil.

David A. Black replied:

> I'm not sure what you mean there.

moulon% ruby -e '"The gateway is broken ? yes / no ?" =~ /(yes)|(no)/; p $1,$2'
"yes"
nil
moulon%

Guy Decoux

(David A. Black) #13

Hi --

On Thu, 25 Aug 2005, Jeff Wood wrote:

> Regarding nil values for groups
>
> if you define your regular expressions like this :
>
> a = "My phone number is : 555-1212"
> b = /((\d{3})\-)?(\d{3})\-(\d{4})/
> c = b.match( a )
> puts c.to_a
>
> c should be [ "555-1212", nil, nil, "555", "1212" ]

Oh, well, yes -- if there's no match for the group. I was taking you
very literally:

> Remember that if you define a group to be
> inside an optional region of the regular expression, that group will
> return nil.

You didn't include the "if there's no match" bit :-)

David
