"Weirich, James" James.Weirich@FMR.COM wrote in message
news:1C8557C418C561429998C1F8FBB283A728BA93@MSGDALCLB2WIN.DMN1.FMR.COM…
It’s the unexpected that I’m thinking about. How do you make
it so that it will match anything other than your token? This
doesn’t seem to work:
  regex = Regexp.new(
    "(?m:" << tokens.map { |t|
      Regexp::escape(t)
    }.join("|") << "|.*)"
  )
  scan(regex)
Try /./ instead of /.*/. Unexpected stuff will come at you one character
at a time, which may or may not be ok.
Looks like "|.*" is already present above. Or did you want to point to
something else? The problem with this is just that, as soon as none of
the other tokens match, this one will happily eat up the rest of the
sequence:
irb(main):002:0> "foo bar baz".scan /foo|baz|.*/
=> ["foo", " bar baz", ""]
Note the interesting additional match of "" at the end of the sequence,
which is due to the "*", IMHO. Here are some alternatives that aren’t
better (but may come to mind):
irb(main):003:0> "foo bar baz".scan /foo|baz|.+/
=> ["foo", " bar baz"]
irb(main):004:0> "foo bar baz".scan /foo|baz|.*?/
=> ["foo", "", "", "", "", "", "baz", ""]
irb(main):005:0> "foo bar baz".scan /foo|baz|.+?/
=> ["foo", " ", "b", "a", "r", " ", "baz"]
In short, using ".*" or variants for "unexpected stuff" will fail in this
situation.
Otherwise you can be more creative in your Regexps. For example, if all
you are interested in are patterns like "some_var = some_thing_else",
then end your token list with something like /[^a-zA-Z0-9_=]*/. If you
want white space, then include white space in your list of tokens.
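To make the suggestion above concrete, here is a small sketch. The token set (identifiers and "=") is only an assumption for illustration, since the actual token list is not shown in this thread. It runs the catch-all character class once with "*" and once with "+", so the difference in the scan results is visible:

```ruby
# Hypothetical token set: identifiers and "=", with a catch-all
# alternative placed last so that it only fires when nothing else does.
input = "some_var = some_thing_else"

with_star = input.scan(/[a-zA-Z0-9_]+|=|[^a-zA-Z0-9_=]*/)
with_plus = input.scan(/[a-zA-Z0-9_]+|=|[^a-zA-Z0-9_=]+/)

p with_star  # => ["some_var", " ", "=", " ", "some_thing_else", ""]
p with_plus  # => ["some_var", " ", "=", " ", "some_thing_else"]
```

The "*" form produces a trailing empty match at the end of the input, just like the /foo|baz|.*/ example earlier; the "+" form does not.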
Again, I’d suggest considering the "+", since the empty sequence is not
very often interesting as a token. I’m afraid the best solution for
dealing with the unexpected is still the pre_match variant.
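The pre_match variant itself is not quoted in this message, so the following is only a rough sketch of how such an approach might look, not the code from earlier in the thread. The method name tokenize and the :token / :unexpected labels are made up for illustration. The idea is to match only the known tokens and treat whatever MatchData#pre_match returns as the unexpected part:

```ruby
# Sketch (assumed helper, not from the original thread): split input into
# known tokens and the "unexpected" text found between them.
def tokenize(input, tokens)
  token_re = Regexp.union(tokens)  # escapes each string and joins with "|"
  result = []
  rest = input
  until rest.empty?
    m = token_re.match(rest)
    if m.nil?
      # No further token: everything left over is unexpected.
      result << [:unexpected, rest]
      break
    end
    # Text skipped before the token is the unexpected part.
    result << [:unexpected, m.pre_match] unless m.pre_match.empty?
    result << [:token, m[0]]
    rest = m.post_match
  end
  result
end

p tokenize("foo bar baz", %w[foo baz])
# => [[:token, "foo"], [:unexpected, " bar "], [:token, "baz"]]
```

This way the unexpected text arrives in one piece rather than one character at a time, and no empty matches show up.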
Regards
robert