Regular expressions help

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " ".

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

I tried to do this with collect, but I'm not sure whether there is a way
to retain a variable between two invocations, concatenate the pieces, and
return them as one string.
Of course, if there is a smart way to do it in one shot using a regex,
then I can just do a scan on the string.

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

  'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

Sorry I couldn't find a more verbose way. Maybe there is one!

require 'shellwords'
include Shellwords   # makes shellwords() callable at the top level

str = 'hi hello "hello world" hey yo'

p shellwords(str)    # => ["hi", "hello", "hello world", "hey", "yo"]

Harry

On Sun, Jul 13, 2008 at 3:20 AM, Vivek <krishna.vivek@gmail.com> wrote:

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " ".

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

--
A Look into Japanese Ruby List in English

Axel wrote:

str = 'hi hello "hello world" hey yo'
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

     str = 'hi hello "hello world" hey yo'
     p str.scan(/(".*")|(\w+)/).flatten.compact

  => ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

phlip wrote:

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

  'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

But this returns "hello world" as two entries, not one as required.

    str = 'hi hello "hello world" hey yo'
    p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

Also, non-capturing groups help us remove the .flatten.compact nonsense:

     p str.scan(/(?:".*")|(?:\w+)/)

  => ["hi", "hello", "\"hello world\"", "hey", "yo"]

I'm not sure why one version captured the "" marks and the other did not...
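
A small sketch (not from the thread; the helper names are made up for illustration) of what decides whether the quote marks show up. String#scan returns whole matches when the pattern has no capture groups, and arrays of group contents when it has them, so the quotes survive unless they sit outside the capturing parentheses:

```ruby
# Hypothetical helper names, for illustration only.

def whole_matches(str)
  # No capture groups: scan returns each whole match, quotes included.
  str.scan(/(?:"[^"]*")|(?:\w+)/)
end

def quotes_inside_group(str)
  # The quotes are inside group 1, so they are part of what is captured.
  str.scan(/("[^"]*")|(\w+)/).flatten.compact
end

def quotes_outside_group(str)
  # The quotes are matched but sit outside group 1, so they are dropped.
  str.scan(/"([^"]*)"|(\w+)/).flatten.compact
end

sample = 'hi "hello world"'
p whole_matches(sample)         # => ["hi", "\"hello world\""]
p quotes_inside_group(sample)   # => ["hi", "\"hello world\""]
p quotes_outside_group(sample)  # => ["hi", "hello world"]
```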

Hi --

Axel wrote:

str = 'hi hello "hello world" hey yo'
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

   str = 'hi hello "hello world" hey yo'
   p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:

   >> str = 'hi hello "hello world" hey yo'
   => "hi hello \"hello world\" hey yo"
   >> str.scan(/(".*")|(\w+)/).flatten.compact
   => ["hi", "hello", "\"hello world\"", "hey", "yo"]

The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

str = 'one "two" "three" four'

=> "one \"two\" \"three\" four"

str.scan(/(".*")|(\w+)/).flatten.compact

=> ["one", "\"two\" \"three\"", "four"] # only three strings

Try this:

   str.scan(/"([^"]+)"|(\w+)/).flatten.compact

Of course this assumes no embedded/escaped/nested "'s, etc.
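
To make the comparison easy to rerun, here is a minimal sketch that puts the greedy pattern and the negated-character-class pattern side by side on the two-quoted-strings input. The method names are hypothetical and exist only for this illustration:

```ruby
# Hypothetical helper names, for illustration only.

def split_greedy(str)
  # ".*" is greedy: it runs from the first quote to the LAST quote on the line.
  str.scan(/(".*")|(\w+)/).flatten.compact
end

def split_negated(str)
  # [^"]+ cannot cross a quote, so each quoted string is matched on its own,
  # and the quotes stay outside the capture group.
  str.scan(/"([^"]+)"|(\w+)/).flatten.compact
end

str = 'one "two" "three" four'
p split_greedy(str)   # => ["one", "\"two\" \"three\"", "four"]
p split_negated(str)  # => ["one", "two", "three", "four"]
```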

David

On Sun, 13 Jul 2008, phlip wrote:

--
Rails training from David A. Black and Ruby Power and Light:
   Intro to Ruby on Rails July 21-24 Edison, NJ
   Advancing With Rails August 18-21 Edison, NJ
See http://www.rubypal.com for details and updates!

should return
[hi, hello, hello world,hey,yo]

But this returns "hello world" as two entries, not one as required.

The "should return" clause is not well-formed anyway...

Hi --

On Sun, 13 Jul 2008, phlip wrote:

    str = 'hi hello "hello world" hey yo'
    p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

Also, non-capturing groups help us remove the .flatten.compact nonsense:

   p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

I'm not sure why one version captured the "" marks and the other did not...

They both did :-) (See my previous post.)

David

    p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

Probably want:

str.scan(/(?:"[^"]*")|(?:\w+)/)

...else the greediness will extend over multiple quoted
strings...

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vs.

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^

I'm not sure why one version captured the "" marks and the
other did not...

Strange... They both did, on my system...(?)

BTW, in ruby 1.9, we have lookbehind, so we can avoid picking
up the quotes, with:

str.scan(/(?:(?<=")[^"]*(?="))|(?:\w+)/)
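
One caveat worth checking before relying on the lookaround form: the lookbehind only asks "is there a quote just before this point?", and a closing quote satisfies that as well as an opening one. The sketch below (hypothetical method names, not from the thread) compares it with the capture-group form on an input with two quoted strings; the text between a closing quote and the next opening quote is picked up as if it were quoted:

```ruby
# Hypothetical helper names, for illustration only.

def split_lookaround(str)
  # The pattern from the post above: lookbehind/lookahead keep the quotes
  # out of the match, but cannot tell an opening quote from a closing one.
  str.scan(/(?:(?<=")[^"]*(?="))|(?:\w+)/)
end

def split_capture(str)
  # Consumes both quotes, so the next match starts after the closing quote.
  str.scan(/"([^"]*)"|(\w+)/).flatten.compact
end

p split_lookaround('hi hello "hello world" hey yo')
# => ["hi", "hello", "hello world", "hey", "yo"]

p split_lookaround('one "two" "three" four')
# => ["one", "two", " ", "three", "four"]   (note the stray " ")

p split_capture('one "two" "three" four')
# => ["one", "two", "three", "four"]
```

With a single quoted string the two forms agree; with more than one, the capture-group form looks like the safer choice.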

Regards,

Bill

From: "phlip" <phlip2005@gmail.com>

David A. Black wrote:

=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:

I suspect I copied the wrong line from my transcript!

But...

The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

     str = 'hi hello "hello world" "hey yo"'
     p str.scan(/(?:".*")|(?:\w+)/)

  => ["hi", "hello", "\"hello world\" \"hey yo\""] # bad

     p str.scan(/(?:".*?")|(?:\w+)/)

  => ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

(-:

   str.scan(/"([^"]+)"|(\w+)/).flatten.compact

The non-greedy matcher .*? looks cuter.

Of course this assumes no embedded/escaped/nested "'s, etc.

Using regexps as real language parsers makes certain baby deities cry...

--
   Phlip

Hi --

should return
[hi, hello, hello world,hey,yo]

But this returns "hello world" as two entries, not one as required.

The "should return" clause is not well-formed anyway...

On the (usually misappropriated, but hopefully not here) Occam's Razor
principle[1], I would refrain from positing that there's actually
supposed to be a comma between the second "hello" and "world", or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he's now got just about every permutation
to choose from :-) (Including shellwords, thanks to Harry, and that of
course is the best. Or at least, if Occam is right, then Harry is
right :-)

David

[1] http://pespmc1.vub.ac.be/occamraz.html (yes, it's still "Link To
Something Other Than Wikipedia!" Week [barely])

On Sun, 13 Jul 2008, phlip wrote:

Hi --

On Sun, 13 Jul 2008, phlip wrote:

David A. Black wrote:

=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:

I suspect I copied the wrong line from my transcript!

But...

The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

   str = 'hi hello "hello world" "hey yo"'
   p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\" \"hey yo\""] # bad

   p str.scan(/(?:".*?")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

I don't think the OP wanted the literal quotation marks as part of the
results, though. In other words you'd want the third string to be:

    hello world

rather than

    "hello world"

David
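
If the non-greedy matcher is preferred, it can be combined with a capture group so that the quotation marks are matched but not returned. This is only a sketch under the same assumption as the rest of the thread (no escaped or nested quotes), and the method name is made up for the example:

```ruby
# Hypothetical helper name, for illustration only.

def split_lazy(str)
  # .*? stops at the first closing quote; the quotes sit outside group 1,
  # so only the text between them is returned.
  str.scan(/"(.*?)"|(\w+)/).flatten.compact
end

p split_lazy('hi hello "hello world" "hey yo"')
# => ["hi", "hello", "hello world", "hey yo"]
```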

Hi David and others,

On the (usually misappropriated, but hopefully not here) Occam's Razor
principle[1], I would refrain from positing that there's actually
supposed to be a comma between the second "hello" and "world", or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he's now got just about every permutation
to choose from :-) (Including shellwords, thanks to Harry, and that of
course is the best. Or at least, if Occam is right, then Harry is
right :-)

Thanks for the replies. Indeed, I don't want the quotes to be part
of the string.
The one suggested above works for me:

irb(main):028:0> s
=> "hi there \"hello world\" namaste \"yo man\" \"gutten morgen\" ola \"what's up\" world"
irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]

I presume that should capture pretty much any kind of combination,
and I don't have the case where there are nested " characters, so that looks good
(unless someone can think of a case that breaks it).
Thanks so much. I had hit a dead end trying to do this!
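
On the question of inputs that break it, two come to mind: unquoted words containing non-word characters (apostrophes, hyphens) are split by \w+, and an empty "" is skipped because [^"]+ needs at least one character. The sketch below contrasts the accepted pattern with one possible variation that follows the original "non-whitespace" wording more closely. The variation and both method names are assumptions for illustration, not something proposed in the thread:

```ruby
# Hypothetical helper names, for illustration only.

def split_word_chars(str)
  # The pattern from the post above: unquoted words are runs of \w only.
  str.scan(/"([^"]+)"|(\w+)/).flatten.compact
end

def split_non_space(str)
  # Variation (an assumption): unquoted words are runs of characters that
  # are neither whitespace nor quotes, and "" yields an empty string.
  str.scan(/"([^"]*)"|([^\s"]+)/).flatten.compact
end

input = %q{don't stop "hello world" x-ray ""}
p split_word_chars(input)
# => ["don", "t", "stop", "hello world", "x", "ray"]
p split_non_space(input)
# => ["don't", "stop", "hello world", "x-ray", ""]
```

Neither version handles escaped or unbalanced quotes.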

Vivek Krishna

Hi --

On Sun, 13 Jul 2008, Vivek wrote:

Hi David and others,

On the (usually misappropriated, but hopefully not here) Occam's Razor
principle[1], I would refrain from positing that there's actually
supposed to be a comma between the second "hello" and "world", or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he's now got just about every permutation
to choose from :-) (Including shellwords, thanks to Harry, and that of
course is the best. Or at least, if Occam is right, then Harry is
right :-)

Thanks for the replies. Indeed, I don't want the quotes to be part
of the string.
The one suggested above works for me:

irb(main):028:0> s
=> "hi there \"hello world\" namaste \"yo man\" \"gutten morgen\" ola \"what's up\" world"
irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]

I presume that should capture pretty much any kind of combination,
and I don't have the case where there are nested " characters, so that looks good
(unless someone can think of a case that breaks it).
Thanks so much. I had hit a dead end trying to do this!

Don't forget the shellwords library though -- a very convenient way to
do this.
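
A short sketch of what Shellwords adds over the regexes, and one thing to watch for. It follows shell-style quoting rules, so backslash-escaped quotes inside a quoted string are handled, but an unbalanced quote (including a stray apostrophe) raises ArgumentError. The wrapper method below is hypothetical and only there to keep the example from aborting:

```ruby
require 'shellwords'

# Hypothetical wrapper: returns nil instead of raising on unbalanced quotes.
def shell_words_or_nil(str)
  Shellwords.split(str)
rescue ArgumentError
  nil
end

p shell_words_or_nil('hi hello "hello world" hey yo')
# => ["hi", "hello", "hello world", "hey", "yo"]

# Backslash-escaped quotes inside a double-quoted string are unescaped.
p shell_words_or_nil('say "a \"quoted\" word" ok')
# => ["say", "a \"quoted\" word", "ok"]

# A lone apostrophe counts as an unmatched single quote.
p shell_words_or_nil("what's up")
# => nil
```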

David

Don't forget the shellwords library though -- a very convenient way
to do this.

Is there a link for these listed on the web?

require 'shellwords'

... should work in 1.8 and 1.9 ruby :-)

From: "humax" <billgate@microsoft.nl>

On Sun 13 Jul 2008 11:06:23, David A. Black wrote: