Sure,
What I'm trying to do is parse our Apache log files. A fairly standard
sample line is as follows:
10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET /images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
I'm pulling out the encapsulated (quoted) data, splitting the line on the
separator, and then putting the encapsulated data back. I was using /".*?"/
to grab the quoted strings, but I discovered lines with the following format
in the log file:
12.172.30.9 - - [21/Apr/2010:13:21:04 -0600] "GET /clickheat/click.php?s=&g=index&x=130&y=432&w=1009&b=safari&c=1&random=Wed%20Apr%2021%202010%2013:21:04%20GMT-0600%20(MDT) HTTP/1.1" 200 100 "http://cnm.edu/" "\"CustomUserAgent\"=\"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6; en-us) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10 FOH:R177\";"
This broke my simple /".*?"/ expression. So I decided to include the
separator in the regex and tried the following expression:
/\s(".*?")(\s|$)/
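For reference, the way the simple /".*?"/ pattern falls apart can be
sketched in a few lines of Ruby. The sample string is a shortened, made-up
stand-in for the referer and user agent fields, not a real log line. The
lazy pattern treats every quote as a delimiter, including the
backslash-escaped ones, so the user agent gets cut into fragments:

```ruby
# Shortened, hypothetical stand-in for the referer and user agent fields.
# Single quotes keep each backslash literal, as it appears in the log.
sample = '"http://cnm.edu/" "\"CustomUserAgent\"=\"Mozilla/5.0\";"'

# The lazy pattern stops at the next quote, whether it is escaped or not.
fragments = sample.scan(/".*?"/)

fragments.each { |fragment| puts fragment }
puts "#{fragments.length} matches for 2 quoted fields"
```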
I am using gsub to perform the replacement.
Inside my gsub block this would catch all the quoted strings except for the
user agent string that ends the entry. If I matched that regexp on its own
against a quoted string preceded by a space and followed by a \n, it would
work. It just didn't work inside my gsub block.
For example:
10.132.18.15 - - [21/Apr/2010:12:22:36 -0600] "GET /images/2010_front_sprite.jpg HTTP/1.1" 304 - "http://cnm.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
would come out as
10.132.18.15 - - encapsulatorherf encapsulatorherg 304 - encapsulatorherh "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
What I wanted and was expecting is
10.132.18.15 - - encapsulatorherf encapsulatorherg 304 - encapsulatorherh encapsulatori
As soon as I changed my regexp to /\s(".*?")(?=\s|$)/ it worked.
I'm not sure why /\s(".*?")(\s|$)/ and /\s(".*?")(?=\s|$)/ behave so
differently.
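My best guess, sketched below, is that gsub scans left to right and never
reuses a character that an earlier match has already consumed. With (\s|$)
the space after the referer becomes part of that match, so it is no longer
available as the leading \s for the user agent that follows. The lookahead
(?=\s|$) only checks for the space without consuming it. The line string
and the X placeholder here are made up for illustration; they are not my
real data or placeholders.

```ruby
# Hypothetical, shortened tail of a log line: status, size, referer, user agent.
line = '304 - "http://cnm.edu/" "Mozilla/4.0 (compatible; MSIE 6.0)"'

consuming = /\s(".*?")(\s|$)/    # trailing separator is part of the match
lookahead = /\s(".*?")(?=\s|$)/  # trailing separator is only peeked at

# Put the consumed separator ($2) back so the spacing stays comparable.
with_consuming = line.gsub(consuming) { " X#{$2}" }
with_lookahead = line.gsub(lookahead) { " X" }

puts with_consuming
puts with_lookahead
```

In the first result the user agent should survive untouched, because the
space in front of it was swallowed by the referer match. In the second, both
quoted fields are replaced.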
On Fri, Aug 12, 2011 at 9:27 AM, Gavin Kistner <phrogz@me.com> wrote:
On Aug 12, 2011, at 07:28 AM, Glen Holcomb <damnbigman@gmail.com> wrote:
[...]
> Now as to why I'm looking for the \s before and the \s or $ after. It
> turns out that some of the user agent strings are in a format like
> "\"Custom Agent\"=\"Mozilla ...\""\n
> So, after a little digging on Stack Overflow I decided to try an explicit
> lookahead. For whatever reason it works.
> /\s(".*?")(?=\s|$)/ matches where /\s(".*?")(\s|$)/ won't.
It sounds like you have a solution, but don't understand it. I'd like to
help you understand it, but I don't understand what you're trying to match.
The sample string you provide above does not match your regex (and obviously
so, as there is never whitespace before a quote).
Could you please provide a single string that you're matching against, and
describe what you are trying to match?