Tim,
I’ve dumped a CSV (comma separated values) file
from Excel, and I want to parse it into cels
within my Ruby script. Easy enough, you say:
line.split(‘,’)
But it’s a little more complicated than that -
see, if the cel has a comma in it, it gets
surrounded in quotes. If a cel has a quote in
it, it’s surrounded by quotes and doubled, eg:
Test Te,st Te"st
becomes
Test,“Te,st”,“Te”“st”
Now to use String#split, I would have to write
a regexp that will match a comma, provided that
comma is preceeded by an even number of quotes.
BUT, I don’t want the regexp to match the
quotes themselves, just the comma. I can’t
figure this one out…
Your main problem is in trying to use String#split instead of
String#scan. String#split is great for easily defined delimiters, but for
this problem, you really want to be describing the fields themselves, not
the delimiters.
If we append a ',' to the end of the line, a field can be described as:
0 or more characters excluding double-quotes and commas (/[^",]*/)
followed by
0 or more strings consisting of:
a double quote (/“/)
0 or more characters excluding a double-quote (/[^”]/)
a double quote (/“/)
(/(”[^"]")/) followed by
0 or more characters excluding double-quotes and commas (/[^",]/)
followed by
a comma (/,/)
Putting this all together, we get:
/[^",]*("[^"]*")*[^",]*,/
Now for the tricky bit. We really want String#scan to return the whole
field (without the comma), but those parenthesis are going to cause problems
(String#scan returns an array with an entry for each matching
backreference). So, we use the /(?:re)/ form of parenthesis to avoid the
backreference:
/[^",]*(?:"[^"]*")*[^",]*,/
Then we add parenthesis around the field portion to have scan return
just that:
/([^",]*(?:"[^"]*")*[^",]*),/
We can now use this regular expression in String#scan to return the
individual fields:
irb(main):001:0> RUBY_VERSION
=> “1.6.8”
irb(main):002:0> line =
‘a,b,“c”,“d,d”,“,e”,“f,”,“,”,“g”“g”,“”“h”,“i”“”,“”“”,j,k,’
=>
“a,b,"c","d,d",",e","f,",",","g""g","""h","i""","""
",j,k,”
irb(main):003:0> line.scan(/([^“,](?:“[^”]”)[^",]),/)
=> [[“a”], [“b”], [“"c"”], [“"d,d"”], [“",e"”], [“"f,"”], [“","”],
[“"g""g"”], [“"""h"”], [“"i"""”], [“""""”], [“j”], [“”],
[“k”]]
Of course, we really just want an array of strings, so we add
Array#flatten to the end:
irb(main):004:0> line.scan(/([^“,](?:“[^”]”)[^",]),/).flatten
=> [“a”, “b”, “"c"”, “"d,d"”, “",e"”, “"f,"”, “","”, “"g""g"”,
“"""h"”, “"i"""”, “""""”, “j”, “”, “k”]
And, if we want to clean up all of the double-quotes, we can add
Array#collect and use the block to delete leading and trailing double-quotes
(/(^“)|(”$)/), and change pairs of double-quotes (/“”/) back into a single
double-quote:
irb(main):005:0> line.scan(/([^“,](?:“[^”]”)[^",]),/).flatten.collect
fld> fld.gsub(/(^“)|(”$)/,‘’).gsub(/“”/,‘"’) }
=> [“a”, “b”, “c”, “d,d”, “,e”, “f,”, “,”, “g"g”, “"h”, “i"”, “"”, “j”,
“”, “k”]
Of course, this may not be the best way to parse a CSV file...
Hope this helps.
- Warren Brown