Regexp help: Parsing a CSV file

I’ve dumped a CSV (comma separated values) file from Excel, and I want to
parse it into cels within my Ruby script. Easy enough, you say:
line.split(’,’)
But it’s a little more complicated than that - see, if the cel has a comma in
it, it gets surrounded in quotes. If a cel has a quote in it, it’s surrounded
by quotes and doubled, eg:
Test Te,st Te"st
becomes
Test,“Te,st”,“Te”"st"
Now to use String#split, I would have to write a regexp that will match a
comma, provided that comma is preceeded by an even number of quotes. BUT, I
don’t want the regexp to match the quotes themselves, just the comma. I can’t
figure this one out…

Tim Bates

···


tim@bates.id.au

line.split(’,’).collect do |e|
if e[0] == ‘"’ && e[e.length - 1] == ‘"‘
e[1…e.length].gsub(’""’, ‘"’)
else
e
end
end

  • Christian
···

On Fri, 21 Feb 2003, Tim Bates wrote:

I’ve dumped a CSV (comma separated values) file from Excel, and I want to
parse it into cels within my Ruby script. Easy enough, you say:
line.split(’,’)
But it’s a little more complicated than that - see, if the cel has a comma in
it, it gets surrounded in quotes. If a cel has a quote in it, it’s surrounded
by quotes and doubled, eg:
Test Te,st Te"st
becomes
Test,“Te,st”,“Te”"st"
Now to use String#split, I would have to write a regexp that will match a
comma, provided that comma is preceeded by an even number of quotes. BUT, I
don’t want the regexp to match the quotes themselves, just the comma. I can’t
figure this one out…

What would happen if you split on “,” still?

irb(main):001:0> s = “Test,“Te,st”,“Te”“st””
“Test,“Te,st”,“Te”“st”“
irb(main):002:0> a = s.split(”,”)
[“Test”, ““Te”, “st””, ““Te”“st””]

Now, can we walk all the elements, and if it starts with a " but doesn’t
end with a ", then we know the split broke on a comma within a field,
and thus sew it back together with subsequent array elements until we
find the closing element (that ends in ")?

b = []
y = ""
a.each { |x|
    if y.length == 0 && !(/^"/.match(x) || /"$/.match(x))
        b << x
        next
    end

    z = /^"(.*)"$/.match(y)
    if z
        b << z[1]
        y = ""
    end

    if y.length > 0
        y += ","
    end
    y += x
}

z = /^"(.*)"$/.match(y)
if z
    b << z[1]
else
    b << y
end

irb(main):391:0> p b
[“Test”, “Te,st”, “Te”“st”]

Clearly,t here’s obvious duplication to refactor out … but the general
idea is there.

– Dossy

···

On 2003.02.21, Tim Bates tim@bates.id.au wrote:

Test Te,st Te"st
becomes
Test,“Te,st”,“Te”“st”


Dossy Shiobara mail: dossy@panoptic.com
Panoptic Computer Network web: http://www.panoptic.com/
“He realized the fastest way to change is to laugh at your own
folly – then you can let go and quickly move on.” (p. 70)

irb(main):087:0* line=‘first, “se cond”, third , four th’
“first, “se cond”, third , four th"
irb(main):088:0> rx = /”[^"]"|’[^’]’|[^,]+/
/"[^"]"|’[^’]’|[^,]+/
irb(main):089:0> line.scan rx do |match| p match.strip end
"first"
"“se cond”"
“third”
“four th”
"first, “se cond”, third , four th"
irb(main):090:0>

robert

“Tim Bates” tim@bates.id.au schrieb im Newsbeitrag
news:200302212243.48991.tim@bates.id.au

I’ve dumped a CSV (comma separated values) file from Excel, and I want to
parse it into cels within my Ruby script. Easy enough, you say:
line.split(’,’)
But it’s a little more complicated than that - see, if the cel has a
comma in
it, it gets surrounded in quotes. If a cel has a quote in it, it’s
surrounded
by quotes and doubled, eg:
Test Te,st Te"st
becomes
Test,“Te,st”,“Te”"st"
Now to use String#split, I would have to write a regexp that will match a
comma, provided that comma is preceeded by an even number of quotes. BUT,
I
don’t want the regexp to match the quotes themselves, just the comma. I
can’t

···

figure this one out…

Tim Bates

tim@bates.id.au

“Tim Bates” tim@bates.id.au wrote in message
news:200302212243.48991.tim@bates.id.au

I’ve dumped a CSV (comma separated values) file from Excel, and I want to
parse it into cels within my Ruby script. Easy enough, you say:
line.split(’,’)
But it’s a little more complicated than that - see, if the cel has a comma
in
it, it gets surrounded in quotes. If a cel has a quote in it, it’s
surrounded
by quotes and doubled, eg:
Test Te,st Te"st
becomes
Test,“Te,st”,“Te”"st"
Now to use String#split, I would have to write a regexp that will match a
comma, provided that comma is preceeded by an even number of quotes. BUT,
I
don’t want the regexp to match the quotes themselves, just the comma. I
can’t
figure this one out…

The following regular expression is used to break up properties separated by
semicolon in a database connection string. It doesn handle xml style single
and double quotes which allows embedded semicolon and embedded quotes of
opposite type.

I’m sure you can modify it to your purpose. Note the use of "?" which
prevents an expression from eating the remaining input.
The syntax parsed is: ‘=’ (’;’ ‘=’ )

where value can optionally use single or double quote delimiters.

The expression is actually used html client side Javascript, but I reckon
that won’t make a big impact.

re = /[ \t\r\n](.?)[ \t\r\n]=[ \t\r\n](’[^’]’|"[^"]"|[^;]?)[
\t\r\n]
(;|$)/g

Mikkel

i would not use regexps to parse this… they can be very slow,not to mention
difficult. how about something like :

#!/usr/bin/env ruby

input data and expected output

line = %q(Test,“Te,st”,“Te”“st”,""“Test”,“Test”"",",Test",“Test,”)
expected = %w(Test Te,st Te"st “Test Test” ,Test Test,)

states

INITIAL = 0
QUOTED = 1

stacks

cells = [ %q() ]
states = [ INITIAL ]

handy byte values

q = %q(’)[0]
qq = %q(")[0]
comma = %q(,)[0]

process string

top = nil
state = nil
idx = 0

while (b = line[idx])

idx += 1

top = cells.last
state = states.last

if state == QUOTED
if b == qq
lookahead = line[idx]
if lookahead == qq
top << (b)
idx += 1
next
else
states.pop and state = states.last and next
end
end

top << (b) and next

end

else state is NOT QUOTED!

states.push (QUOTED) and next if (b == qq)

cells.push (%q()) and next if (b == comma)

top << (b) and next
end

raise ‘Parse error!’ unless
cells == expected and
state == INITIAL

p cells

>> [“Test”, “Te,st”, “Te"st”, ““Test”, “Test””, “,Test”, “Test,”]

-a

···

On Fri, 21 Feb 2003, Tim Bates wrote:

I’ve dumped a CSV (comma separated values) file from Excel, and I want to
parse it into cels within my Ruby script. Easy enough, you say:
line.split(’,’)
But it’s a little more complicated than that - see, if the cel has a comma in
it, it gets surrounded in quotes. If a cel has a quote in it, it’s surrounded
by quotes and doubled, eg:
Test Te,st Te"st
becomes
Test,“Te,st”,“Te”"st"
Now to use String#split, I would have to write a regexp that will match a
comma, provided that comma is preceeded by an even number of quotes. BUT, I
don’t want the regexp to match the quotes themselves, just the comma. I can’t
figure this one out…

Ara Howard
NOAA Forecast Systems Laboratory
Information and Technology Services
Data Systems Group
R/FST 325 Broadway
Boulder, CO 80305-3328
Email: ahoward@fsl.noaa.gov
Phone: 303-497-7238
Fax: 303-497-7259
====================================

I’ve dumped a CSV (comma separated values) file from Excel, and I want to
parse it into cels within my Ruby script. Easy enough, you say:
line.split(’,’)
But it’s a little more complicated than that - see, if the cel has a comma in
[…]
http://raa.ruby-lang.org/list.rhtml?name=csv

would seem to be relevant. I’ve not used it myself, though, so
can’t comment further.

Tim Bates

tim@bates.id.au

    Hugh
···

On Fri, 21 Feb 2003, Tim Bates wrote:

require 'csv' # see RAA and http://rrr.jin.gr.jp/doc/csv/

  CSV::Reader.parse( File.open( "excel.csv", "rb" )) do | row |
    p row
  end

It works fine, although the API does seem rather over-complex for what
should be a trivial job. I'd rather just have a function which takes a line
of CSV and converts it into an array, and vice versa; perhaps even

   String#csv_to_a
   Array#to_csv

The only limitation of this is when a field contains a newline, e.g.

"this","is
an example"

is two lines of the file but one row of CSV. What does Excel do if a cell
contains a newline?

This particular case certainly breaks grep and wc -l (as I discovered when
exporting data from a Mysql database into CSV, only to find that some of the
columns contained trailing newlines). I'd prefer

"this","is\nan example"

but then that's not CSV. How about saving as XML instead? :slight_smile:

Regards,

Brian.

···

On Fri, Feb 21, 2003 at 09:13:54PM +0900, Tim Bates wrote:

I've dumped a CSV (comma separated values) file from Excel, and I want to
parse it into cels within my Ruby script. Easy enough, you say:
  line.split(',')
But it's a little more complicated than that - see, if the cel has a comma in
it, it gets surrounded in quotes. If a cel has a quote in it, it's surrounded
by quotes and doubled, eg:
  Test Te,st Te"st
becomes
  Test,"Te,st","Te""st"

Regexps aren’t good for everything. CSVs are a good example of something
a nice programmatic iterative scan is good for.

Here is my solution to the problem:
http://www.io.com/~egabriel/csv_array.rb

It’s pretty slow, but it handles most variants fairly well.

···


Gabriel Emerson

Tim,

I’ve dumped a CSV (comma separated values) file
from Excel, and I want to parse it into cels
within my Ruby script. Easy enough, you say:
line.split(’,’)
But it’s a little more complicated than that -
see, if the cel has a comma in it, it gets
surrounded in quotes. If a cel has a quote in
it, it’s surrounded by quotes and doubled, eg:
Test Te,st Te"st
becomes
Test,“Te,st”,“Te”"st"
Now to use String#split, I would have to write
a regexp that will match a comma, provided that
comma is preceeded by an even number of quotes.
BUT, I don’t want the regexp to match the
quotes themselves, just the comma. I can’t
figure this one out…

Your main problem is in trying to use String#split instead of

String#scan. String#split is great for easily defined delimiters, but for
this problem, you really want to be describing the fields themselves, not
the delimiters.

If we append a ',' to the end of the line, a field can be described as:

    0 or more characters excluding double-quotes and commas (/[^",]*/)

followed by
0 or more strings consisting of:
a double quote (/"/)
0 or more characters excluding a double-quote (/[^"]/)
a double quote (/"/)
(/("[^"]
")/) followed by
0 or more characters excluding double-quotes and commas (/[^",]
/)
followed by
a comma (/,/)

Putting this all together, we get:

/[^",]*("[^"]*")*[^",]*,/

Now for the tricky bit.  We really want String#scan to return the whole

field (without the comma), but those parenthesis are going to cause problems
(String#scan returns an array with an entry for each matching
backreference). So, we use the /(?:re)/ form of parenthesis to avoid the
backreference:

/[^",]*(?:"[^"]*")*[^",]*,/

Then we add parenthesis around the field portion to have scan return

just that:

/([^",]*(?:"[^"]*")*[^",]*),/

We can now use this regular expression in String#scan to return the

individual fields:

irb(main):001:0> RUBY_VERSION
=> “1.6.8"
irb(main):002:0> line =
'a,b,“c”,“d,d”,”,e",“f,”,",",“g”“g”,""“h”,“i”"","""",j,k,’
=>
“a,b,“c”,“d,d”,”,e",“f,”,",",“g”“g”,""“h”,“i”"","""
",j,k,“
irb(main):003:0> line.scan(/([^”,](?:"[^"]")[^",]),/)
=> [[“a”], [“b”], ["“c”"], ["“d,d”"], ["",e""], ["“f,”"], ["",""],
["“g”“g”"], ["""“h”"], ["“i”"""], [""""""], [“j”], [""],
[“k”]]

Of course, we really just want an array of strings, so we add

Array#flatten to the end:

irb(main):004:0> line.scan(/([^",](?:"[^"]")[^",]),/).flatten
=> [“a”, “b”, ““c””, ““d,d””, “”,e"", ““f,””, “”,"", ““g”“g””,
"""“h”", ““i””"", “”"""", “j”, “”, “k”]

And, if we want to clean up all of the double-quotes, we can add

Array#collect and use the block to delete leading and trailing double-quotes
(/(^")|("$)/), and change pairs of double-quotes (/""/) back into a single
double-quote:

irb(main):005:0> line.scan(/([^",](?:"[^"]")[^",]),/).flatten.collect

fld> fld.gsub(/(^")|("$)/,’’).gsub(/""/,’"’) }
=> [“a”, “b”, “c”, “d,d”, “,e”, “f,”, “,”, “g"g”, ““h”, “i””, “”", “j”,
"", “k”]

Of course, this may not be the best way to parse a CSV file...

Hope this helps.

- Warren Brown

…which, at a second thought, doesn’t quite do the job (ignores commas
enclosed in double quotes). Sorry : )

  • Christian
···

On Fri, 21 Feb 2003, Christian Rishoj wrote:

line.split(’,’).collect do |e|
if e[0] == ‘"’ && e[e.length - 1] == ‘"‘
e[1…e.length].gsub(’""’, ‘"’)
else
e
end
end

[…]

“this”,“is\nan example”

but then that’s not CSV. How about saving as XML instead? :slight_smile:

Well, definitions of CSV are somewhat flexible. Some programs happily
accept C escape sequences within the field values. Some don’t. Some will
regard a raw new line while in quotes. Some won’t. Some will allow
single quotes as well as double quotes. Some won’t. Some do “” to
indicate a " inside “…”. Some do ". Some do both.

All in all, for a “simple” file format, it’s not really that standard =)

I’m sure I saw a parser somewhere that had most of those as options, for
both import and export.

cheers,

···


Iain.
who just spent the past half hour or so converting perl’s Text::CSV to
ruby. A pointless exercise beyond getting some ruby practice in.

Just to be a little contrary … consider the following function

def re_parse(line)
result = line.scan(/([^",]|"([^"]|"")+")(,|$)/).collect do |a, rest|
a = a[1…-2] if a =~ /^".
"$/
a.gsub(/""/,’"’)
end
result[0…-1]
end

I ran the above regular expression based parser against the test input
that Ara supplied. The times for the state machine based parser verses
the regular expression based parser comes to …

State Machine: 100000 iterations => 20.949496 seconds
Regular Expression: 100000 iterations => 14.183326 seconds

So, the regular expression version isn’t always slow. For one thing,
the RE engine is written in C and the state machine was hand crafted in
Ruby. A hand crafted C parser would beat both of these.

However, the regular expression version is much more fragile … in both
time and correctness. I added some extra test cases to Ara’s example
(namely parsing double commas (,), empty strings ("") and a string that
should resolve to a single quote ("""")). In all cases Ara’s version
passed with flying colors with no changes. But the Regexp version had
to be tweaked a bit to handle the new cases. And once the Regexp
version was tweaked, its runtime increased to be comparable to the state
machine version. More tweaking was required to bring the RE version
runtime down again.

For those interested, here is a break down of the Regular expression
used …

The basic Regexp has the form … /(something)(,|$)/. This means that
it will match something followed by either the end of line or a trailing
comma. Since “something” is wrapped in parenthesis, scan will pass an
array of partial matches to the do…end block. The first submatch
(matching the “something”) will be in the “a” parameter, everything else
will be in “rest” (which we will ignore).

Now we look at something. It consists of two parts joined by “|”. The
first part … [^",]* … matches anything not containing a double quote
or a comma. This handles the non-quoted cases, including null strings.

The second part handles quoted strings. It looks like …
"(quoted_string_char)+" … and matches anything that begins with a
double quote and ends with a double quote. The characters that are
allowed in a quoted string are:
[^"] – Anything that is not a double quote.
and “” – Two double quotes in a row.

That’s it. That’s the regular expression.

A fellow programmer who sat at the desk just over my partition wall was
fond of telling me …

If you have a problem and decide to solve it with regular
expressions, then you have two problems.

Probably very wise words.

···

On Fri, 21 Feb 2003, Tim Bates wrote:

I want to parse it [CSV file] into cels within my Ruby script.

On Fri, 2003-02-21 at 09:01, ahoward wrote:

i would not use regexps to parse this… they can be very slow,not to mention
difficult. how about something like :


– Jim Weirich jweirich@one.net http://w3.one.net/~jweirich

“Beware of bugs in the above code; I have only proved it correct,
not tried it.” – Donald Knuth (in a memo to Peter van Emde Boas)

Hi, Brian,

From: “Brian Candler” B.Candler@pobox.com
Sent: Friday, February 21, 2003 11:41 PM

require ‘csv’ # see RAA and http://rrr.jin.gr.jp/doc/csv/

CSV::Reader.parse( File.open( “excel.csv”, “rb” )) do | row |
p row
end

Thanks.

It works fine, although the API does seem rather over-complex for what
should be a trivial job.

Might be. Few people will care about the difference between
Null and empty string.

I’d rather just have a function which takes a line
of CSV and converts it into an array, and vice versa; perhaps even

String#csv_to_a
Array#to_csv

class String
def csv_to_a
CSV.parse_line(self)
end
end

class Array
def to_csv
CSV.generate_line(self)
end
end

The only limitation of this is when a field contains a newline, e.g.

“this”,“is
an example”

is two lines of the file but one row of CSV. What does Excel do if a cell
contains a newline?

Quotes with “” as;

$ ruby -rthis_article -e ‘p [">\n<"].to_csv.csv_to_ary’
[">\n<"]

This particular case certainly breaks grep and wc -l (as I discovered when
exporting data from a Mysql database into CSV, only to find that some of the
columns contained trailing newlines). I’d prefer

“this”,“is\nan example”

but then that’s not CSV. How about saving as XML instead? :slight_smile:

$ ruby -rsoap/marshal -e ‘puts SOAPMarshal.dump(">\n<")’

<?xml version="1.0" encoding="utf-8" ?>

<env:Envelope xmlns:xsd=“http://www.w3.org/2001/XMLSchema” xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi=“http://www.w3.org/2001/XMLSchema-instance”>
env:Body
>
<
</env:Body>
</env:Envelope>

Regards,
// NaHi

“Iain ‘Spoon’ Truskett” spoon@dellah.org wrote in message
news:20030221145503.GA25711@ouroboros.anu.edu.au

All in all, for a “simple” file format, it’s not really that standard =)

You’ve got that right - the Excel export can profoundly trash date formats
depending on the machine locale.

Mikkel

I ran the above regular expression based parser against the test
input
that Ara supplied. The times for the state machine based parser
verses
the regular expression based parser comes to …

State Machine: 100000 iterations => 20.949496 seconds
Regular Expression: 100000 iterations => 14.183326 seconds

Aren’t regex’s state machines anyway? Seems to me that if you can
get a regex to work at all (modulo pathologically contrived Kleene
star craziness, of course), it will almost always be quicker.

···

=====

Yahoo IM: michael_s_campbell


Do you Yahoo!?
Yahoo! Tax Center - forms, calculators, tips, more

Just to be a little contrary … consider the following function

Probably very wise words.

a very interesting analysis! i am in complete agreement with you as well. i
write a lot of decoders and parsers here at work and have found that state
machines end up being more maintainable. this is not espcially suprising
considering that a regex is an entire state machine crammed into a single
line! when writing a state machine you are essentially writing your own
regex, which accepts (matches), or does not accept (does not match), a given
input.

a combination of the two, state machines which also use regexs, probably
offers the most powerfull combination of ease of creation, ease of
maintainance, and speed of execution. this is the approach some of our c++
decoder classes are taking. i think something like ruby-lex
(http://raa.ruby-lang.org/list.rhtml?name=ruby-lex) is attractive for this
reason : it combines a state machine with ruby’s builtin regex power. with a
C lexer you only get half of this (flex generated lexer) unless you find C’s
regex ‘easy’ to use.

just some thoughts for you parsers out there…

-a

···

On Sat, 22 Feb 2003, Jim Weirich wrote:

Ara Howard
NOAA Forecast Systems Laboratory
Information and Technology Services
Data Systems Group
R/FST 325 Broadway
Boulder, CO 80305-3328
Email: ahoward@fsl.noaa.gov
Phone: 303-497-7238
Fax: 303-497-7259
====================================

Hi, all,

From: “NAKAMURA, Hiroshi” nahi@keynauts.com
Sent: Tuesday, February 25, 2003 7:57 PM

I’d rather just have a function which takes a line
of CSV and converts it into an array, and vice versa; perhaps even

String#csv_to_a
Array#to_csv

class String
def csv_to_a
CSV.parse_line(self)
end
end

class Array
def to_csv
CSV.generate_line(self)
end
end

parse_line and generate_line are methods of CSV2.
CSV2 has not yet been released. Sorry.

You can get it from CVS.
http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/lib/csv/
HEAD is for CSV2 now.

CSV-1 branch is for CSV(1) on RAA.
http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/lib/csv/?only_with_tag=CSV-1

Regards,
// NaHi

Have you tried racc?

http://raa.ruby-lang.org/list.rhtml?name=racc

I think it compiles to a C or Ruby parser.

···

On Saturday, 22 February 2003 at 1:22:26 +0900, ahoward wrote:

On Sat, 22 Feb 2003, Jim Weirich wrote:

a combination of the two, state machines which also use regexs, probably
offers the most powerfull combination of ease of creation, ease of
maintainance, and speed of execution. this is the approach some of our c++
decoder classes are taking. i think something like ruby-lex
(http://raa.ruby-lang.org/list.rhtml?name=ruby-lex) is attractive for this
reason : it combines a state machine with ruby’s builtin regex power. with a
C lexer you only get half of this (flex generated lexer) unless you find C’s
regex ‘easy’ to use.


Jim Freeze

Next Friday will not be your lucky day. As a matter of fact, you don’t
have a lucky day this year.

They can be parsed by state machines. In fact, a true regexp can be
mechanically converted into a 'deterministic finite state automaton' - that
is, one which only has to read each input symbol once and never has to
backtrack.

However the regexp libraries I'm aware of are implemented as
nondeterministic FSA's, i.e. when presented with a choice, they take one
option and backtrack later if they reach a blind alley.

I think some things which programmers consider part of "regular expressions"
are not formally part of regexps anyway. In particular, a true regular
expression cannot parse a grammar like matching open and close brackets
(e.g. "match a number of A's followed by an equal number of B's")

Regards,

Brian.

···

On Sat, Feb 22, 2003 at 12:44:55AM +0900, Michael Campbell wrote:

Aren't regex's state machines anyway?