Ruby Regular expression

Hi,
I am reading a file and want to extract all quotations enclosed in "" or
''.
I was using this regular expression:

/\w+('|")(\w+)('|")/

For example, given

Mr. Ayush said "we need to change ourselves to change the world"

the result should be

we need to change ourselves to change the world

But there is a loophole: this pattern also accepts mismatched quotes such as "'.
Can anyone help, so that I extract only text enclosed in "" or ''?

--
Posted via http://www.ruby-forum.com/.

How about this?
http://www.rubular.com/r/CRRsiTHMkG
/("[\w ']+"|'[\w ]+')/

You can remove the quotes at either end post-match if that's a
requirement as well.
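
A minimal sketch of that post-match step, assuming the text is already in a string (the names QUOTED, line and quotes are made up for illustration):

```ruby
# Pattern from the post: double-quoted text may contain apostrophes,
# single-quoted text may contain only word characters and spaces.
QUOTED = /("[\w ']+"|'[\w ]+')/

line = %q{Mr. Ayush said "we need to change ourselves to change the world"}

# scan yields one-element arrays because the pattern has a single group;
# q[1..-2] drops the first and last character, i.e. the surrounding quotes.
quotes = line.scan(QUOTED).flatten.map { |q| q[1..-2] }

puts quotes
```

Note that this pattern skips quotations containing punctuation such as commas or periods, since only \w, spaces and (inside double quotes) apostrophes are allowed.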


Doing this is tricky: the robustness of a regexp approach depends on what
you can assume about the input. For example, in a programming language an
escaped quote \" would be valid, but it is not supported here, and in English
apostrophes could be taken as single quotes.

A regexp solution that is broken in those scenarios but works for the easy
cases is:

    ("|')((?:(?!\1).)*)\1

The regexp says: if you match either " or ', then continue matching as long
as you do not find the matched quote, and until you find the closing quote
(needed because you could reach the end of the file with an unbalanced quote).

The second group has the string without quotes.

Whether this is going to work well for your input is something you have to
evaluate.
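
To make that concrete, here is a small sketch using String#scan. The helper name extract_quotes and the sample strings are assumptions for illustration, not part of the original question:

```ruby
# Group 1 is the opening quote, group 2 the text between the quotes.
QUOTE_RE = /("|')((?:(?!\1).)*)\1/

# Hypothetical helper: returns the quoted strings without their quotes.
def extract_quotes(text)
  text.scan(QUOTE_RE).map(&:last)
end

# Easy case from the original question.
easy = extract_quotes(%q{Mr. Ayush said "we need to change ourselves to change the world"})

# Caveat mentioned above: the apostrophe in "It's" is taken as an opening
# single quote, and the one in "don't" as its closing quote.
tricky = extract_quotes(%q{It's what he said: "don't panic"})

p easy
p tricky
```

Since . does not match a newline unless the /m flag is given, a quotation that spans several lines of the file would not be found by this sketch.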

Interesting solution! I also tried

("|')([^\1]*)\1

which looked fine initially

irb(main):025:0> "foo 'bar' \"baz\" buz".scan(/("|')([^\1]*)\1/).map(&:last)
=> ["bar", "baz"]

but broke later:

irb(main):030:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')([^\1]*)\1/)
=> [["'", "bar' \"baz\" buz \"bongo"]]

where your solution still works:

irb(main):031:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')((?:(?!\1).)*)\1/)
=> [["'", "bar"], ["\"", "baz"], ["\"", "bongo's kongo"]]
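
A likely reason the [^\1] variant breaks: inside a character class \1 does not act as a backreference (as far as I know Ruby reads it as an octal escape there), so the class excludes almost nothing and the greedy * runs on to the last matching quote. A small sketch with a made-up sample string, comparing the two patterns:

```ruby
# Inside [...] the \1 is not a backreference, so this class does not
# exclude the quote captured by group 1.
class_version     = /("|')([^\1]*)\1/
# The negative lookahead checks the captured quote before every character.
lookahead_version = /("|')((?:(?!\1).)*)\1/

text = %q{'bar' and 'baz'}

greedy  = text.scan(class_version).map(&:last)
correct = text.scan(lookahead_version).map(&:last)

p greedy
p correct
```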

However, we can also use a non-greedy quantifier to achieve the same:

irb(main):032:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')(.*?)\1/)
=> [["'", "bar"], ["\"", "baz"], ["\"", "bongo's kongo"]]
irb(main):033:0> "foo 'bar' \"baz\" buz \"bongo's kongo\"".scan(/("|')(.*?)\1/).map(&:last)
=> ["bar", "baz", "bongo's kongo"]

Adding support for backslash escapes we get ("|')((?:\\.|(?!\1).)*)\1

irb(main):038:0> "foo 'bar' \"baz\" buz \"bongo's kongo\" gingo said \"foo \\\" bar\" yes".scan(/("|')((?:\\.|(?!\1).)*)\1/).map(&:last)
=> ["bar", "baz", "bongo's kongo", "foo \\\" bar"]
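
The captured text still contains the backslashes. If they should be dropped, one possible follow-up (an assumption on my side, nobody asked for it in the thread) is a gsub that replaces each backslash-plus-character pair with the character alone:

```ruby
ESCAPED_QUOTE_RE = /("|')((?:\\.|(?!\1).)*)\1/

# Single-quoted Ruby literal, so \" stays as a backslash followed by a quote.
text = 'gingo said "foo \" bar" yes'

# Contents of each quotation, escapes still in place.
raw = text.scan(ESCAPED_QUOTE_RE).map(&:last)

# Replace every backslash + character pair by the character itself.
unescaped = raw.map { |s| s.gsub(/\\(.)/, '\1') }

p raw
p unescaped
```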

;-)

Kind regards

robert

(In reply to Xavier Noria's message of Wed, Dec 11, 2013 at 10:58 AM.)
