Newlines included in bracket negation

Chris_Morris2 · 26 October 2007 21:12

(... that subject probably makes no sense ...)

Anyway, I have some unexpected (to me) behavior in the following regexp.
This example is contrived, but based on a real need. Can anyone explain why
the result is multi-line, even though the re is not?

require 'test/unit'

class TestRE < Test::Unit::TestCase
  def test_newlines
    src = "happy\n\nbirthday"
    # Array#join concatenates the matches; on Ruby 1.8, Array#to_s did the same.
    assert_equal("hday", src.scan(/h[^x]*?day/).join)
  end
end

produces

Finished in 0.031 seconds.

1) Failure:
test_newlines_consumed_in_not_section(TestRE) ...
<"hday"> expected but was
<"happy\n\nbirthday">.

--
Chris
http://clabs.org

Chris_Morris2 · 26 October 2007 21:30

Adding \n inside the brackets fixes it, I just wouldn't expect to have to do
this since I didn't add the multiline mode option.

require 'test/unit'

class TestRE < Test::Unit::TestCase
  def test_newlines
    src = "happy\n\nbirthday"
    # Array#join concatenates the matches; on Ruby 1.8, Array#to_s did the same.
    assert_equal("hday", src.scan(/h[^x\n]*?day/).join)
  end
end

--
Chris
http://clabs.org

7stud · 26 October 2007 21:38

Chris Morris wrote:

Can anyone explain why
the result is multi-line, even though the re is not?

It's not a question of the re being multi-line or not, it's a question
of the re being greedy v. non-greedy. But because there is only one
match for your regex, the issue of greedy v. non-greedy is irrelevant.

If you think about it, there is really no concept of 'lines' with
regard to text. There really is only one line: one long, continuous
line of characters. Some of those characters might be '\n' characters,
and we may choose to interpret a '\n' as a new line, but that doesn't
change the fact that there is still just one continuous string of
characters. A regex has nothing inherently programmed into it that will
cause it to stop looking for matches when a '\n' is encountered in the
sequence of characters. The regex character '.' will not match a
newline unless the /m option is given, but that is not true of regex
constructs generally. In any case, you do not use the '.' character in
your regex, so that behavior is irrelevant.
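
A quick sketch of the greedy v. non-greedy point, assuming the same source string as the test. The helper name first_match and the second sample string are made up for illustration; they are not from the original post.

```ruby
# Hypothetical helper: return the first substring of str matched by re, or nil.
def first_match(str, re)
  str[re]
end

src = "happy\n\nbirthday"

# Only one "day" exists in src, so lazy and greedy end at the same place.
p first_match(src, /h[^x]*?day/)   # => "happy\n\nbirthday"
p first_match(src, /h[^x]*day/)    # => "happy\n\nbirthday"

# A made-up string with two "day"s, where the two quantifiers differ.
two = "happy day\nbirthday"
p first_match(two, /h[^x]*?day/)   # => "happy day"
p first_match(two, /h[^x]*day/)    # => "happy day\nbirthday"
```

In both cases the match starts at the first "h", and [^x] walks over "\n" like any other character that is not "x".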

--
Posted via http://www.ruby-forum.com/.

Jesus_Gabriel_y_Gala · 26 October 2007 21:43

There's also something I don't understand, similar to the above.
I always thought that in a non-multiline regexp, the dot didn't match
newlines (\n), so I don't understand this:

irb(main):036:0> re = /(h)(.*)(day)/
=> /(h)(.*)(day)/
irb(main):037:0> "happy\n\nbirthday".match(re).captures
=> ["h", "", "day"]
irb(main):038:0> re = /(h)(.*)(day)/m
=> /(h)(.*)(day)/m
irb(main):039:0> "happy\n\nbirthday".match(re).captures
=> ["h", "appy\n\nbirth", "day"]

I thought the first case wouldn't match.
Can anyone shed some light?

Jesus.

Jesus_Gabriel_y_Gala · 26 October 2007 21:46

Can you check my example above? I'm using a greedy match of .* which I
thought would match up to a \n in a non-multiline regexp, and would
include everything in a multiline one. I must be confused at some
point.

Jesus.

a11 · 27 October 2007 02:33

from memory, 'multiline' affects *only* the behavior of '.' in the re

[^x] => 'not x'

simply matches any char that is not 'x' - including newline

it's the same in perl and python iirc
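
A small sketch of that distinction in Ruby, using the string from the original test. The helper name matches_char? is hypothetical and only there to keep the comparison readable.

```ruby
# Hypothetical helper: does re match the single character char?
def matches_char?(re, char)
  !re.match(char).nil?
end

# The /m option changes what '.' accepts ...
p matches_char?(/./,  "\n")     # => false
p matches_char?(/./m, "\n")     # => true

# ... but a negated bracket accepts "\n" either way.
p matches_char?(/[^x]/,  "\n")  # => true
p matches_char?(/[^x]/m, "\n")  # => true

src = "happy\n\nbirthday"
p src[/h.*?day/]        # => "hday"
p src[/h.*?day/m]       # => "happy\n\nbirthday"
p src[/h[^x]*?day/]     # => "happy\n\nbirthday"
p src[/h[^x\n]*?day/]   # => "hday"
```

So to keep a negated class on one line, the "\n" has to be listed inside the brackets explicitly.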

cheers.

a @ http://codeforpeople.com/

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

Gavin_Kistner3 · 26 October 2007 22:20

The last four characters of the word "birthday" match the regexp
/h.*day/, without crossing any newlines. Perhaps you were thinking of
/h.+day/, which does not match.
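
To see where the match actually lands, here is a minimal sketch. The helper name match_info is an assumption for illustration; it returns the matched text and its start offset.

```ruby
# Hypothetical helper: [matched text, start offset] of the first match, or nil.
def match_info(str, re)
  m = str.match(re)
  m && [m[0], m.begin(0)]
end

src = "happy\n\nbirthday"

# The match starts at offset 11, the "h" inside "birthday", not at offset 0.
p match_info(src, /h.*day/)   # => ["hday", 11]

# With + the dot must consume at least one character, so nothing matches.
p match_info(src, /h.+day/)   # => nil
```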

Chris_Morris2 · 29 October 2007 01:51

Yeah, it behaves that way. I guess I need to adjust my expectations.

--
Chris
http://clabs.org

Jesus_Gabriel_y_Gala · 26 October 2007 22:34

I need more sleep, for sure. I was of course thinking of the first "h"
and the last "day". That explains it:

irb(main):043:0> "happy\n\nday".match(re).captures
NoMethodError: undefined method `captures' for nil:NilClass
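
The NoMethodError comes from calling captures on the nil that match returns when nothing matches. A minimal sketch of guarding against that, assuming an empty array is an acceptable "no match" result (the helper name safe_captures is made up):

```ruby
# Hypothetical helper: captures of the first match, or [] when nothing matches.
def safe_captures(str, re)
  m = str.match(re)
  m ? m.captures : []
end

re = /(h)(.*)(day)/
p safe_captures("happy\n\nday", re)        # => []
p safe_captures("happy\n\nbirthday", re)   # => ["h", "", "day"]
```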

Thanks,

Jesus.

7stud · 26 October 2007 23:23

Jesús Gabriel y Galán wrote:

I need more sleep, for sure. I was of course thinking on the first "h"
and the last "day". That explains it

A clue was in the capture results:

irb(main):036:0> re = /(h)(.*)(day)/
=> /(h)(.*)(day)/
irb(main):037:0> "happy\n\nbirthday".match(re).captures
=> ["h", "", "day"]

The fact that the (.*) matched nothing was an indication that something
was amiss.

--
Posted via http://www.ruby-forum.com/.

Rob_Biedenharn1 · 26 October 2007 23:47

Nothing amiss there at all. The * means "match zero or more times", so it is perfectly fine to match zero occurrences of any character (except newline) between the 'h' and the 'day'.

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

7stud · 27 October 2007 00:36

Jesús Gabriel y Galán wrote:

I was of course thinking on the first "h"
and the last "day".

Rob Biedenharn wrote:

Nothing amiss there at all.

Ok.

--
Posted via http://www.ruby-forum.com/.
