Iki,
I thought I could parse a dsv file with say a ‘;’
delimiter (escaped with a backslash where
appropriate) just by using a regexp such as
/(?=[^\\]);/x and String#split
"eggs\\;bread;butter".split(/(?=[^\\]);/x)
=> ["eggs\\", "bread", "butter"]
but …
"eggs;\\bread;butter".split(/;(?=[^\\])/x)
=> ["eggs;\\bread", "butter"]
Am I missing something?
I think you may be a little confused about what exactly "(?=re)"
does. This construct looks ahead without consuming characters from
the search string, so any pattern that follows it is matched against
the same characters the look-ahead just tested. For that reason a
look-ahead is rarely followed by a pattern that competes with it for
those characters. For example, /(?=a)b/ can never match anything,
since it would require a single character to match both /a/ and /b/.
So your regular expression /(?=[^\\]);/ is basically saying "match
any character that is not a backslash and is a semicolon". Obviously,
this is the same as /;/.
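A quick way to convince yourself of this is to run both patterns side by side. This is only a small sketch; the variable names are mine and are not part of the original question:

```ruby
# Look-ahead checks a position without consuming any characters.
never = ("ab" =~ /(?=a)b/)       # nil: no single character is both "a" and "b"

line = "eggs\\;bread;butter"     # the characters eggs\;bread;butter
with_lookahead = line.split(/(?=[^\\]);/)
plain          = line.split(/;/)

p never                          # prints nil
p with_lookahead                 # prints ["eggs\\", "bread", "butter"]
p with_lookahead == plain        # prints true
```

The look-ahead and the literal semicolon are both tested at the same position, so the look-ahead adds nothing.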
Your second example is a more typical use of look-ahead. You are
matching a semicolon that is followed by a character that is not a
backslash, without consuming that additional character. The pattern
behaves exactly as written, which is why the split happens only at
the second semicolon.
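To see why this pattern still does not give you escaped delimiters, it may help to run it against both of your strings. Again, just a sketch with variable names of my own choosing:

```ruby
# The look-ahead inspects the character AFTER the semicolon.
pattern = /;(?=[^\\])/

backslash_after  = "eggs;\\bread;butter".split(pattern)  # eggs;\bread;butter
backslash_before = "eggs\\;bread;butter".split(pattern)  # eggs\;bread;butter

p backslash_after    # prints ["eggs;\\bread", "butter"]
p backslash_before   # prints ["eggs\\", "bread", "butter"]
```

In the second string the backslash comes before the semicolon, so the look-ahead never sees it and the escaped semicolon is still treated as a delimiter.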
Now to your original intent: it sounds like what you want is to
split on any semicolon that is not preceded by a backslash. This is a
little more difficult, since there is no "look-behind" construct in
Ruby's regular expressions (as of Ruby 1.8). To accomplish what you
want, it will probably be easier to use String#scan and specify what
you want to match, instead of using String#split and specifying what
you don't want to match.
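For readers on a newer interpreter: as far as I know, Ruby 1.9 and later do accept a negative look-behind, (?<!...). The sketch below assumes such an interpreter and is not something covered above. It handles the simple case, but it still gets an escaped backslash wrong, so the scan approach remains worth knowing:

```ruby
# Assumes Ruby 1.9 or later, where negative look-behind is available.
not_after_backslash = /(?<!\\);/

simple  = "eggs\\;bread;butter".split(not_after_backslash)  # eggs\;bread;butter
doubled = "eggs\\\\;bread".split(not_after_backslash)       # eggs\\;bread

p simple    # prints ["eggs\\;bread", "butter"]
p doubled   # prints ["eggs\\\\;bread"]
```

In the second string the two backslashes escape each other, so the semicolon ought to be a real delimiter, yet the look-behind refuses to split there.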
Of course, using scan is a little more involved. If we naively just
search for non-semicolons followed by a semicolon:
"eggs\\;bread;butter".scan(/([^;]*);/)
=> [["eggs\\"], ["bread"]]
We quickly realize that we need a terminating semicolon to match the
last word. Also, we need to flatten the array:
"eggs\\;bread;butter;".scan(/([^;]*);/).flatten
=> ["eggs\\", "bread", "butter"]
Now, to ignore semicolons preceded by backslashes, we need to
include a backslash followed by a semicolon in the definition of the
word:
"eggs\\;bread;butter;".scan(/( (?: [^;\\] | (?:\\;) )* );/x).flatten
=> ["eggs\\;bread", "butter"]
Unfortunately, there is a problem with this - a backslash can be
followed by other characters as well:
"eggs\\;br\\ead;butter;".scan(/( (?: [^;\\] | (?:\\;) )* );/x).flatten
=> ["", "ead", "butter"]
Also, to be precise, a semicolon preceded by two backslashes, or by
any even number of backslashes, is not escaped at all: the backslashes
escape each other and the semicolon is a real delimiter. What you
really want is to have a backslash followed by any single character as
part of the word. Restating things this way, we get:
"eggs\\;br\\ead\\\\;butter;".scan(/( (?: [^;\\] | (?:\\.) )* );/x).flatten
=> ["eggs\\;br\\ead\\\\", "butter"]
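If you want to reuse this, the pattern can be wrapped in a small method that appends the terminating delimiter for you. The names split_escaped and unescape below are hypothetical helpers of my own, and the unescaping step goes a little beyond what I described above, so treat this as a sketch under those assumptions. It only supports a single-character delimiter and a single-character escape:

```ruby
# Hypothetical helper: split a line on an unescaped delimiter using scan.
# A field is any run of "ordinary" characters or escape-plus-any-character pairs.
def split_escaped(line, delim = ";", esc = "\\")
  d = Regexp.escape(delim)
  e = Regexp.escape(esc)
  field = /((?:[^#{d}#{e}]|#{e}.)*)#{d}/
  # Append a delimiter so the last field is terminated like the others.
  (line + delim).scan(field).flatten
end

# Hypothetical helper: drop each escaping backslash, keeping the escaped character.
def unescape(field)
  field.gsub(/\\(.)/, '\1')
end

fields = split_escaped("eggs\\;br\\ead\\\\;butter")   # eggs\;br\ead\\;butter
p fields                           # prints ["eggs\\;br\\ead\\\\", "butter"]
p fields.map { |f| unescape(f) }   # prints ["eggs;bread\\", "butter"]
```

A line that ends in a lone backslash is not handled specially here, since the appended delimiter would then be read as escaped.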
I hope this helps.
- Warren Brown