Regular Expression for D(elimiter) Separated Values File

Hi all,

I thought I could parse a dsv file with, say, a ';' delimiter
(escaped with a backslash where appropriate) just by using
a regexp such as /(?=[^\\]);/x and String#split

"eggs\\;bread;butter".split(/(?=[^\\]);/x)
=> ["eggs\\", "bread", "butter"]

but …

"eggs;\\bread;butter".split(/;(?=[^\\])/x)
=> ["eggs;\\bread", "butter"]

Am I missing something?

I am using Ruby 1.8.1 on mingw

Thanks

I didn't read this carefully, but why not just use csv.rb from the
1.8.1 standard library?

On 11 Mar 2004 02:24:21 -0800, npoly_iki@yahoo.com (Iki) wrote:

Hi all,

"Iki" npoly_iki@yahoo.com wrote in message
news:ad334b15.0403110224.4d3d5b2@posting.google.com

Am I missing something?

Yes, Ruby regular expressions can’t do lookbehind, which is what you need
for the escaping. But you can use scan:

irb(main):005:0> "eggs\\;bread;butter".scan /(?:[^\\;]+|\\.)+/
=> ["eggs\\;bread", "butter"]
irb(main):006:0> "eggs\\;bread;butter".scan /(?:[^\\;]|\\.)+/
=> ["eggs\\;bread", "butter"]

Regards

robert

Iki,

Am I missing something?

I think perhaps you are a little confused about what exactly "(?=re)"
does. This construct allows you to look ahead without consuming
characters from the search string. This construct is virtually never
followed by additional patterns, since the additional patterns will be
trying to match the same characters. For example, /(?=a)b/ would never
match anything since it would require a character to match both /a/
and /b/.
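
A small sketch of the zero-width behaviour described above. The constant names are made up for illustration only.

```ruby
# Lookahead checks the next character but consumes nothing, so a
# pattern placed after it has to match that very same character.
IMPOSSIBLE = /(?=a)b/   # next char would have to be "a" and "b" at once
A_BEFORE_B = /a(?=b)/   # the usual form: "a", but only when "b" follows

p("ab" =~ IMPOSSIBLE)            # nil
p "ab ac".scan(A_BEFORE_B)       # ["a"], only the "a" in "ab" qualifies
p "ab".sub(A_BEFORE_B, "X")      # "Xb", the "b" was looked at, not consumed
```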

So your regular expression /(?=[^\\]);/ is basically saying "match
any character that is not a backslash and is a semicolon". Obviously,
this is the same as /;/.

Your second example is a more normal use of the look-ahead.
You are trying to match a semicolon which is followed by a character
that is not a backslash, without consuming that additional character.
This produces the correct results for that pattern.
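
To see both patterns from the first post side by side, here is a short sketch using the same two strings (the variable names are only illustrative):

```ruby
# First pattern: the lookahead tests the semicolon itself, so it
# behaves exactly like a plain /;/.
first = "eggs\\;bread;butter".split(/(?=[^\\]);/)
plain = "eggs\\;bread;butter".split(/;/)

# Second pattern: split on a semicolon unless a backslash FOLLOWS it.
second = "eggs;\\bread;butter".split(/;(?=[^\\])/)

p first           # ["eggs\\", "bread", "butter"]
p first == plain  # true
p second          # ["eggs;\\bread", "butter"]
```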

Now to your original intent, it sounds like what you want is to
split on any semicolon that is not preceded by a backslash. This is a
little more difficult, since there is no “look-behind” construct in
regular expressions. To accomplish what you want to do, it will
probably be easier to use String#scan and specify what you want to
match, instead of using String#split and specifying what you don’t want
to match.

Of course, using scan is a little more involved. If we naively just
search for non-semicolons followed by a semicolon:

"eggs\\;bread;butter".scan(/([^;]*);/)
=> [["eggs\\"], ["bread"]]

We quickly realize that we need a terminating semicolon to match the
last word. Also, we need to flatten the array:

"eggs\\;bread;butter;".scan(/([^;]*);/).flatten
=> ["eggs\\", "bread", "butter"]

Now, to ignore semicolons preceded by backslashes, we need to
include a backslash followed by a semicolon in the definition of the
word:

"eggs\\;bread;butter;".scan(/( (?: [^;\\] | (?:\\;) )* );/x).flatten
=> ["eggs\\;bread", "butter"]

Unfortunately, there is a problem with this - a backslash can be
followed by other characters as well:

"eggs\\;br\\ead;butter;".scan(/( (?: [^;\\] | (?:\\;) )* );/x).flatten
=> ["", "ead", "butter"]

Also, to be precise, you really don't want to match a semicolon preceded
by two backslashes, or any even number of backslashes. What you
really want is to have a backslash followed by any single character as
part of the word. Restating things this way, we get:

"eggs\\;br\\ead\\\\;butter;".scan(/( (?: [^;\\] | (?:\\.) )* );/x).flatten
=> ["eggs\\;br\\ead\\\\", "butter"]
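
The final pattern could be wrapped in a small helper so callers don't have to append the terminating delimiter themselves. This is only a sketch: `split_dsv` is a hypothetical name, it assumes a single-character delimiter and a line without its trailing newline, and it does not try to handle a lone backslash at the very end of the line.

```ruby
# Hypothetical helper: split one line of delimiter-separated text,
# treating "backslash + any single character" as part of the field.
def split_dsv(line, delimiter = ";")
  d = Regexp.escape(delimiter)
  # A field is any run of: a character that is neither a backslash nor
  # the delimiter, or a backslash followed by any single character.
  field = /((?:[^\\#{d}]|\\.)*)#{d}/
  # Append one delimiter so the last field is terminated as well.
  (line + delimiter).scan(field).flatten
end

p split_dsv("eggs\\;br\\ead\\\\;butter")  # ["eggs\\;br\\ead\\\\", "butter"]
p split_dsv("a;;b")                       # ["a", "", "b"]
p split_dsv("x|y\\|z", "|")               # ["x", "y\\|z"]
```

The escapes are left in the fields as-is; removing the backslashes afterwards would be a separate step.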

I hope this helps.

- Warren Brown

To be pedantic - not in /Ruby’s current implementation of/ regular
expressions. This should all change in 2.0 with Oniguruma, right?
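
For anyone reading this on a newer interpreter: as far as I know, negative look-behind, written (?<!re), arrived with the Oniguruma engine in Ruby 1.9, so the sketch below assumes Ruby 1.9 or later and will not run on 1.8.1. It splits on a semicolon that is not directly preceded by a backslash, but it still gets the "even number of backslashes" case wrong.

```ruby
# Assumes Ruby 1.9+ (negative lookbehind is not available in 1.8).
simple = "eggs\\;bread;butter".split(/(?<!\\);/)
p simple    # ["eggs\\;bread", "butter"]

# Caveat: in a\\;b the backslash is itself escaped, so the semicolon
# is a real delimiter, yet this pattern refuses to split there.
escaped_backslash = "a\\\\;b".split(/(?<!\\);/)
p escaped_backslash   # ["a\\\\;b"]
```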

Anyhow, if the above is an accurate assessment of the needs, just for
fun you can do it another way than the (effective) solution already
provided.

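str = "aaa;bbb;ccc\\;ccc;ddd\\\\;eee;fff\\\\\\;fff;ggg\\\\\\\\;hhh"
# Group 2 wraps the whole run of backslash pairs; a bare (\\\\)* would
# capture only the last pair and drop the others from the output.
chunks = str.gsub(/([^\\])((?:\\\\)*);/, '\1\2•').split('•').inspect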

The premise here is to replace the desire for a negative lookbehind
with a consumed character, which is then stuck back into the string
along with a new ‘magic’ character which is guaranteed not to be in the
source string. This character is then used to split the output.

The regular expression above says “find a character which isn’t a
backslash, followed by an even number of backslashes, followed by a
semi-colon (which we now know must be a field delimiter, since it can’t
have been escaped)”.

Seems to work from the naive sample I posted above…but requires a
magic character to be available.
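
A sketch of the same idea with the "magic character" requirement made explicit. The names `SENTINEL` and `split_with_sentinel` are hypothetical, and using a NUL byte as the marker is just one assumption about what will never appear in the data.

```ruby
SENTINEL = "\0"  # assumed never to occur in the input

def split_with_sentinel(str)
  raise ArgumentError, "input contains the sentinel" if str.include?(SENTINEL)
  # Group 1: one non-backslash character.
  # Group 2: zero or more PAIRS of backslashes (an even count).
  # The semicolon after them is unescaped, so mark it with the sentinel.
  marked = str.gsub(/([^\\])((?:\\\\)*);/) { "#{$1}#{$2}#{SENTINEL}" }
  marked.split(SENTINEL)
end

sample = "aaa;bbb;ccc\\;ccc;ddd\\\\;eee;fff\\\\\\;fff;ggg\\\\\\\\;hhh"
p split_with_sentinel(sample)
# ["aaa", "bbb", "ccc\\;ccc", "ddd\\\\", "eee", "fff\\\\\\;fff", "ggg\\\\\\\\", "hhh"]
```

Because the pattern needs a character in front of the backslashes, a delimiter at the very start of the string or directly after another delimiter (an empty field) is not split; the scan-based approach does not have that limitation.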


On Mar 11, 2004, at 3:40 PM, Warren Brown wrote:

Now to your original intent, it sounds like what you want is to split
on any semicolon that is not preceded by a backslash. This is a
little more difficult, since there is no “look-behind” construct in
regular expressions.

