Odd regexp behavior

I'm running 1.9.2-p180

I have the following regexp: /\s(".*?")(\s|$)/

For some reason it isn't matching the end of the following line or any line
with a similar format. By end I mean the entire user-agent string
msnbot...htm)\"\n

"207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] \"GET
/dualcredit/courses/general.php HTTP/1.1\" 200 27731 \"-\" \"msnbot/2.0b (+
http://search.msn.com/msnbot.htm)\"\n"

However it matches against the following:

" \"msnbot/2.0b (+http://search.msn.com/msnbot.htm)\"\n"

I am at a total loss as to why. I'm not too sure how to go about debugging
it either.

--
"Hey brother Christian with your high and mighty errand, Your actions speak
so loud, I can’t hear a word you’re saying."

-Greg Graffin (Bad Religion)

Don't match. You're not trying to match, you're trying to extract. I see two easy ways to do this:

1) split on /\"/ and get the field you want (the last one, not counting the newline).
2) scan for everything within quotes and get the last one.

I'm sure there are others, but those are my two favorite go-to string methods.

examples:

s = "207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] \"GET /dualcredit/courses/general.php HTTP/1.1\" 200 27731 \"-\" \"msnbot/2.0b (+http://search.msn.com/msnbot.htm)\"\n"

p s.strip.split(/\"/).last
# => "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

p s.scan(/\".+?\"/).last[1..-2]
# => "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

Glen Holcomb wrote in post #1016060:

> I'm running 1.9.2-p180
>
> I have the following regexp: /\s(".*?")(\s|$)/
>
> For some reason it isn't matching the end of the following line or any
> line with a similar format. By end I mean the entire user-agent string

It's pretty simple: $ matches before a newline--not the end of a string.
I

--
Posted via http://www.ruby-forum.com/.

str = "207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] \"GET
/dualcredit/courses/general.php HTTP/1.1\" 200 27731 \"-\" \"msnbot/2.0b
(+http://search.msn.com/msnbot.htm)\"\n"

str.scan(/^ .* $/x) do |match|
  puts match
  puts '-' * 20
end

--output:--
207.46.13.53 - - [21/Apr/2010:04:05:29 -0600] "GET
--------------------
/dualcredit/courses/general.php HTTP/1.1" 200 27731 "-" "msnbot/2.0b
--------------------
(+http://search.msn.com/msnbot.htm)"
--------------------

Also,

str = "hello\nhello"

str.scan(/^hell/) {|match| p match}
puts "-" * 20
str.scan(/\Ahell/) {|match| p match}

--output:--
"hell"
"hell"
--------------------
"hell"

7stud -- wrote in post #1016113:

> It's pretty simple: $ matches before a newline--not the end of a string.

Actually, $ matches after a newline.

Hmmm... maybe I should have posted from the beginning rather than where I
had gotten to in my attempt to solve my problem.

I am trying to parse log files and am pulling out encapsulated fields so
that I can split the line in a sane way. I have the following regex, which I
am using to do that:

/\s(".*?")(\s|$)/

Now as to why I'm looking for the \s before and the \s or $ after. It turns
out that some of the user agent strings are in a format like "\"Custom
Agent\"=\"Mozilla ...\""\n

I had been using the regex Ryan suggested earlier until I discovered the
nested quotes.

The expression above works for quoted strings surrounded by spaces but not
the last one on the line. I've tried changing $ to \n and that didn't make
any difference.

Here is the exact code I'm using:

x = "encapsulatorhere"
stash = {}

gen_encap_matches.each do |encapex|
  line.gsub!(encapex) do |match|
    x.next!
    stash[x] = $1  # remember the original text under its placeholder
    @job.log_format.separator + x + @job.log_format.separator
  end
end

gen_encap_matches just creates the regexes from a list of encapsulation
characters.

I'm at a complete loss as to why it won't grab the last quoted string in the
line.

7stud -- wrote in post #1016116:

> 7stud -- wrote in post #1016113:
>
> > It's pretty simple: $ matches before a newline--not the end of a string.
>
> Actually, $ matches after a newline.

Whoops.

puts "hello\nworld"
p "hello\nworld"

--output:--
hello
world
"hello\nworld"

def show_match(str,re)
  if str =~ re
    "#{$`}<<#{$&}>>#{$'}"
  else
    "no match"
  end
end

p show_match("hello\nhello", /llo$/)
p show_match("hello\nhello", /llo\z/x)

--output:--
"he<<llo>>\nhello"
"hello\nhe<<llo>>"

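For reference, a small sketch of how Ruby's line anchors (^ and $) differ from its string anchors (\A, \Z and \z). The sample strings and variable names are made up for illustration; only core Regexp behavior is used.

```ruby
# Assumed sample strings, not taken from the log file.
two_lines = "hello\nworld"
trailing  = "hello\n"

# `$` matches at the end of any line, so it succeeds before the embedded "\n".
end_of_line = two_lines =~ /o$/          # => 4
# `\z` matches only at the very end of the string, which ends in "d".
end_of_string = two_lines =~ /o\z/       # => nil

# `^` matches at the start of any line; `\A` only at the start of the string.
start_of_line   = two_lines =~ /^world/  # => 6
start_of_string = two_lines =~ /\Aworld/ # => nil

# `\Z` tolerates one trailing newline; `\z` does not.
before_final_newline = trailing =~ /o\Z/ # => 4
strict_end           = trailing =~ /o\z/ # => nil

p [end_of_line, end_of_string, start_of_line, start_of_string,
   before_final_newline, strict_end]
```

So for a log line that ends in "\n", $ and \Z both match just before that final newline, while \z does not.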
So, after a little digging on Stack Overflow I decided to try an explicit
lookahead. For whatever reason it works.

/\s(".*?")(?=\s|$)/ matches where /\s(".*?")(\s|$)/ won't.

[...]

> Now as to why I'm looking for the \s before and the \s or $ after. It
> turns out that some of the user agent strings are in a format like "\"Custom
> Agent\"=\"Mozilla ...\""\n
>
> So, after a little digging on Stackoverflow I decided to try an explicit
> lookahead. For what ever reason it works.
>
> /\s(".*?")(?=\s|$)/ matches where /\s(".*?")(\s|$)/ won't.

It sounds like you have a solution, but don't understand it. I'd like to help you understand it, but I don't understand what you're trying to match. The sample string you provide above does not match your regex (and obviously so, as there is never whitespace before a quote).

Could you please provide a single string that you're matching against, and describe what you are trying to match?

On Aug 12, 2011, at 07:28 AM, Glen Holcomb <damnbigman@gmail.com> wrote:

Sure,

What I'm trying to do is parse our Apache log files. A fairly standard
sample line is as follows:

10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET
/images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648;
InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

I'm pulling out encapsulated data, splitting the line on the separator then
putting the encapsulated data back. I was using /".*?"/ to grab the quoted
strings but I discovered lines with the following format in the log file:

12.172.30.9 - - [21/Apr/2010:13:21:04 -0600] "GET
/clickheat/click.php?s=&g=index&x=130&y=432&w=1009&b=safari&c=1&random=Wed%20Apr%2021%202010%2013:21:04%20GMT-0600%20(MDT)
HTTP/1.1" 200 100 "http://cnm.edu/" "\"CustomUserAgent\"=\"Mozilla/5.0
(Macintosh; U; Intel Mac OS X 10_6; en-us) AppleWebKit/531.21.8 (KHTML, like
Gecko) Version/4.0.4 Safari/531.21.10 FOH:R177\";"

This broke my simple /".*?"/ expression. So I decided to include the
separator in the regex and tried the following expression:

/\s(".*?")(\s|$)/

I am using gsub to perform the replacement action.

In my gsub block this would get all the quoted strings except for the user
agent string which ends the entry. If I tried matching that regexp against
a quoted string with a preceding space and followed by a \n it would work.
It just didn't work inside my gsub block.

For example:

10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET
/images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648;
InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

would come out as

10.132.18.15 - - encapsulatorherf encapsulatorherg 304 -
encapsulatorherh "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR
3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

what I wanted and was expecting is

10.132.18.15 - - encapsulatorherf encapsulatorherg 304 - encapsulatorherh
encapsulatori

As soon as I changed my regexp to /\s(".*?")(?=\s|$)/ it worked.

I'm not sure why /\s(".*?")(\s|$)/ and /\s(".*?")(?=\s|$)/ are significantly
different.
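The behavior described here can be reproduced with a minimal sketch. The encapsulate helper, the shortened log line and the fixed " " separator are assumptions for illustration, not the original job code; only the two regexes come from the post.

```ruby
# Hypothetical helper: swap each quoted field for a placeholder and remember
# the original text, in the spirit of the gsub block described in the post.
def encapsulate(line, regex, separator = " ")
  placeholder = "encapsulatorhere"
  stash = {}
  replaced = line.gsub(regex) do
    placeholder = placeholder.next  # "encapsulatorherf", "encapsulatorherg", ...
    stash[placeholder] = $1         # the quoted field, quotes included
    # Put back the leading separator plus whatever the trailing group consumed
    # ($2 is nil for the lookahead version and interpolates as "").
    "#{separator}#{placeholder}#{$2}"
  end
  [replaced, stash]
end

# Shortened, made-up log line with two adjacent quoted fields at the end.
line = %q{10.132.18.15 - - "GET /a.jpg HTTP/1.1" 304 - "http://cnm.edu/" "Mozilla/4.0 (compatible; MSIE 6.0)"} + "\n"

consumed, consumed_stash = encapsulate(line, /\s(".*?")(\s|$)/)
peeked, peeked_stash     = encapsulate(line, /\s(".*?")(?=\s|$)/)

puts consumed  # the user agent is left in place
puts peeked    # all three quoted fields are replaced
```

With the consuming group, the space after "http://cnm.edu/" becomes part of the second match, so the user agent no longer has a \s in front of it for the next match to start on.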

Because, when the (\s|$) at the end matches \s (a space), this space
is no longer included in subsequent matches - as if that part of
string "disappeared" - and thus the \s at the beginning can't match
it. You should use a regex tester for complex regexes (by complex, I
mean almost all), for example http://regexpal.com/. (Try inputting
your data and both of your regexes there.)

I think there's a similar tool that explicitly uses Ruby's flavor of
regexp (regexpal uses browser-side JavaScript), but I can't remember
the URL and AFAIR it sucked.
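The same effect can be checked in Ruby itself by looking at where each match ends. This is a rough sketch on a made-up string; Regexp#match with a start position is used to mimic where gsub or scan would resume searching.

```ruby
line = %q{x "one" "two"}   # assumed sample, two adjacent quoted fields

consuming = /\s(".*?")(\s|$)/
lookahead = /\s(".*?")(?=\s|$)/

first_consuming = consuming.match(line)
first_lookahead = lookahead.match(line)

p first_consuming[0]      # => " \"one\" "  (trailing space is part of the match)
p first_consuming.end(0)  # => 8
p first_lookahead[0]      # => " \"one\""   (trailing space only peeked at)
p first_lookahead.end(0)  # => 7

# Resume the search where each first match ended.
second_consuming = consuming.match(line, first_consuming.end(0))
second_lookahead = lookahead.match(line, first_lookahead.end(0))

p second_consuming        # => nil, no whitespace is left in front of "two"
p second_lookahead[1]     # => "\"two\""
```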

-- Matma Rex

Actually I don't think that was the case. Here is my gsub block

encapexs.each do |encapex|
  line.gsub!(encapex) do |match|
    x.next!
    stash[x] = $1  # remember the original text under its placeholder
    separator + x + separator
  end
end

So, it should have been, and appeared to be, putting the space back into the
line. I really need to modify my regex and grab the preceding space in its
own group so I can clean the last line up some.
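One possible alternative to capturing the preceding space in its own group, offered only as a sketch: a lookbehind, (?<=\s), checks for the leading whitespace without consuming it, so the replacement never has to put a separator back. The regex, the FIELD marker and the sample line below are assumptions, not code from this thread.

```ruby
# Assumption: every quoted field is preceded by whitespace and followed by
# whitespace or the end of the line.
field = /(?<=\s)(".*?")(?=\s|$)/

# A shortened user agent with nested, backslash-escaped quotes (made up).
nested = %q{"\"CustomUserAgent\"=\"Mozilla/5.0 (Macintosh)\";"}
line   = %q{200 100 "http://cnm.edu/" } + nested + "\n"

fields   = line.scan(field).flatten
replaced = line.gsub(field, "FIELD")

p fields    # both quoted fields, the nested one in one piece
p replaced  # => "200 100 FIELD FIELD\n"
```

Because neither separator is part of the match, the whitespace in the line is left untouched.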

I don't think I understand; "putting it back in" doesn't matter here,
nor does using gsub instead of, say, scan.

irb(main):006:0> s = 'ababababa'
=> "ababababa"
irb(main):007:0> s.gsub(/aba/, 'aca')
=> "acabacaba"
irb(main):008:0> s.scan /aba/
=> ["aba", "aba"]
irb(main):011:0> s.scan /(?=aba)/
=> ["", "", "", ""]

Note how the first scan returned two results, even though "aba" appears
in the string 4 times if you count overlapping occurrences. Note how the
second returned 4 matches (even if they're all empty). Once a character is
matched, the regex engine moves forward, discarding everything up to and
including the end of the match.
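As a small extension of the irb session, here is a sketch that puts a capture group inside the lookahead so the overlapping occurrences come back as text, and records where each zero-width match starts. The variable names are illustrative.

```ruby
s = "ababababa"

# Consuming matches cannot overlap: each one starts after the previous ended.
plain = s.scan(/aba/)                      # => ["aba", "aba"]

# A capture inside a lookahead returns the text without consuming it.
overlapping = s.scan(/(?=(aba))/).flatten  # => ["aba", "aba", "aba", "aba"]

# Start offsets of the zero-width matches.
starts = []
s.scan(/(?=aba)/) { starts << Regexp.last_match.begin(0) }

p plain, overlapping, starts               # starts is [0, 2, 4, 6]
```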

-- Matma Rex

If that's the case then I have absolutely no idea why either of my
expressions works.

In my initial testing I wasn't returning the separator + match_group +
separator pattern from the gsub block and it was skipping the second
encapsulated string when there were two in a row.

At any rate, why does (\s|$) match differently from (?=\s|$)?

The first one, (\s|$), is simply a group that matches either a whitespace
character or a position (end of line). When it matches whitespace, that
character becomes part of the match. The other one is a lookahead group,
which matches only a *position* - a *zero-width string*, if you like -
where either the next character is whitespace or the position is the end
of a line (the following characters themselves are *not* matched).

You need to understand that once a character is matched, it's *gone* -
this gsub/match/scan/whatever will not match it again in this run.

Have you tried inputting the data, and both regexes, in regexpal and
comparing the results? I think it really clearly shows graphically
what I mean.

-- Matma Rex