Regular expression contradiction

Folks,

I’m having trouble extracting all the data I want from a string. Please
consider the following results:

irb(main):014:0* xyz.class                       -> String
irb(main):015:0> xyz.length                      -> 500
irb(main):016:0> xyz =~ /TRANDATE/               -> 48
irb(main):017:0> xyz =~ /CHECKSUM/               -> 377
irb(main):018:0> xyz =~ /TRANDATE.*CHECKSUM/     -> nil
irb(main):019:0> xyz =~ /TRANDATE.*?CHECKSUM/    -> nil
irb(main):020:0> VERSION                         -> "1.6.5"

(I want to be using 1.8.0, but I have not solved my recently-described
problem in building it on Cygwin.)

I’m sure you can all see the problem: “TRANDATE” occurs at position 48,
and “CHECKSUM” at position 377. Why on earth does (xyz =~
/TRANDATE.*CHECKSUM/) not return 48?

(xyz.inspect) appears below. Use an editor to collapse into one line to
recreate it.

Thanks,
Gavin

“TRANID\000\000\000\000\000\000\000\000\000ENGYAUS
USR-TNS-210036784\001\002\000\023\000\023\000\000
010TRANDATE\000\000\000\000\000\000\0002003-03-06,
11:01:26\001\003\000$\000\000\000\000\005INITRANID
\000\000\000\000\000\000\000<)\351\005\000\000#\a
000\000\005\000\002\000\002\000\000\000\000\000\00
0\000\000\000\000\000\003\000\000\0006\000\000\000
\000\003\000\000\000INITCOU_FST\000\000\003\001
\000\000\000\002\000\000\000\000\000\000\000\000\0
00\000\000\001\001\000\010\000\005\000\000\005VERS
ION\000\000\000\000\000\000\000\000r5_a3\001\002\0
00\001\000\000\000\000\005DUMMY\000\000\000\000\00
0\000\000\000\000\000\000<)\245\006\000\000\300\01
0\000\000\006\000\003\000\f\000\000\000\000\000\f
000\000\000\000\000\004\000\000\000U\001\000\000
\000\004\000\000\000G003\000\000\000\000\000\000
000\000\000\003\001\000\000\000\003\000\000\000\00
0\000\000\000\000\000\000\000\001\001\000\010\000
005\000\000\005VERSION\000\000\000\000\000\000\000
\000r5_a3\001\002\000\n\000\n\000\000\aDPID\000\00
0\000\000\000\000\000\000\000\000\0005240132312\00
1\003\000\001\000\001\000\000\aCHECKSUM\000\000\00
0\000\000\000\0001\001\004\000\n\000\006\000\000\0
05ROLR\000\000\000\000\000\000\000\000\000\000\000
AGLUSR\001\005\000\003\000\000\000\001\005ROLRCONT
EXT\000\000\000\000\001\006\000F\000\000\000\001\0
05ROLRDESC\000\000\000\000\000\000\000\001\a\000\f
\000\f\000\000\aNETRCPTPT\000\000\000\000\000\000D
UKEH”

I don’t know, but I can tell you that it works correctly in 1.8.0.
Therefore, it must have been a bug.

On Fri, Jul 11, 2003 at 10:37:57AM +0900, Gavin Sinclair wrote:

[snip: original message quoted in full]

Daniel Carrera | OpenPGP fingerprint:
Graduate TA, Math Dept | 6643 8C8B 3522 66CB D16C D779 2FDD 7DAC 9AF7 7A88
UMD (301) 405-5137 | http://www.math.umd.edu/~dcarrera/pgp.html

Gavin,

irb(main):016:0> xyz =~ /TRANDATE/               -> 48
irb(main):017:0> xyz =~ /CHECKSUM/               -> 377
irb(main):018:0> xyz =~ /TRANDATE.*CHECKSUM/     -> nil
irb(main):019:0> xyz =~ /TRANDATE.*?CHECKSUM/    -> nil
irb(main):020:0> VERSION                         -> "1.6.5"

I’m sure you can all see the problem: “TRANDATE” occurs at
position 48, and “CHECKSUM” at position 377. Why on earth
does (xyz =~ /TRANDATE.*CHECKSUM/) not return 48?

Simple: the definition of /./ is “any character except a newline”,
except in multiline mode, where it is simply “any character”. If you look
carefully at your string, you will find three newlines (“\n”) between
“TRANDATE” and “CHECKSUM”. The easy solution would be to change the regular
expression to be multiline:

xyz =~ /TRANDATE.*CHECKSUM/m → 48
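
To see the effect in isolation, here is a minimal sketch on a much shorter
string. The variable "sample" is a made-up stand-in for xyz, not the real
data; it only assumes NUL padding and a couple of "\n" bytes between the
two markers. The [\s\S] form is an alternative spelling that matches any
character without needing the flag.

```ruby
# Hypothetical stand-in for xyz: NUL-padded fields with "\n" bytes
# sitting between the TRANDATE and CHECKSUM markers.
sample = "TRANID\000\000TRANDATE\0002003-03-06\001\002\000\n\000\nDPID\000CHECKSUM\0001"

# Without /m, "." refuses to cross a "\n", so the match fails.
without_m = sample =~ /TRANDATE.*CHECKSUM/

# With /m, "." matches "\n" as well, so the match starts at TRANDATE.
with_m = sample =~ /TRANDATE.*CHECKSUM/m

# Alternative without the flag: a class that matches any character.
char_class = sample =~ /TRANDATE[\s\S]*CHECKSUM/

p without_m    # nil
p with_m       # 8
p char_class   # 8
```

Note that in Ruby the “m” flag only changes what “.” matches; in some
other regex flavours the same behaviour is spelled “s” instead.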

I hope this helps!

- Warren Brown

::slap cheek with wet fish:: D’oh!!

Thanks a lot for that, Warren. My furious attempts to build 1.8.0 can
again be put on hold. The magical “m” flag worked straight away, and
fast, too.

Cheers,
Gavin