FasterCSV parsing issues

I'm using FasterCSV to do an import into my DB, and the CSV file
contains European words. I have French, Italian, and German words which
contain accents and such. When I try the import it throws a
FasterCSV::MalformedCSV error, but if I remove just the letters with
accents on them, it will upload just fine.

Here is a sample row:

Universal,ID,Kir,"Commonly, white wine with Cassis. Traditionally, the
cocktail kir (also known as vin blanc cassis in French) is made with
Aligoté. Kir Royal is made with Champagne instead of Aligoté."

Notice the 2 "e" with accents on them. I can remove these and it's fine.
I'm assuming this is an encoding issue. The CSV file could be created by
any number of people in any number of different locations using any
number of programs. Do I need to do something like use Iconv to convert
to a standard encoding first, then upload?

Thanks

~Jeremy

···

--
Posted via http://www.ruby-forum.com/.

I've had similar issues recently, and they were due to character
encodings. Something like Iconv will probably be necessary to convert
the files to a standard encoding.

···

Yes, that's exactly the strategy you need to adopt.

James Edward Gray II
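
A minimal sketch of that strategy, assuming the incoming bytes really are ISO-8859-1. On Ruby 1.8 the tool for this was Iconv; the sketch below uses String#encode instead, which is available from Ruby 1.9 on. The helper name to_utf8 is made up for illustration and is not part of FasterCSV or CSV.

```ruby
# Hypothetical helper: label the raw bytes as Latin-1, then transcode to UTF-8.
# Assumption: the source encoding is already known; nothing is detected here.
def to_utf8(raw, source_encoding = 'ISO-8859-1')
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

latin1 = "Aligot\xE9".b   # "Aligoté" as Latin-1 bytes (0xE9 is e-acute)
utf8   = to_utf8(latin1)

puts utf8.encoding        # UTF-8
puts utf8.bytesize        # 8, because the e-acute now takes two bytes
```

The converted string can then be handed to the CSV parser like any other UTF-8 data.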

···

On Dec 1, 2010, at 10:52 AM, Jeremy Woertink wrote:

I'm using FasterCSV to do an import into my DB, and the CSV file
contains European words. I have French, Italian, and German words which
contain accents and such. When I try the import it throws a
FasterCSV::MalformedCSV error, but if I remove just the letters with
accents on them, it will upload just fine.

The CSV file could be created by
any number of people in any number of different locations using any
number of programs. Do I need to do something like use Iconv to convert
to a standard encoding first, then upload?

Thanks for the info, James.

I've upgraded to Ruby 1.9.2 now, but I'm still running into weird
issues. How come I can only parse a file once?

ruby-1.9.2-p0 > file = File.open(File.join(Rails.root, 'public',
'sample.csv'))
=> #<File:/Users/jeremywoertink/Sites/winovations/public/sample.csv>
ruby-1.9.2-p0 > csv = CSV.new(file)
=> <#CSV io_type:File
io_path:"/Users/jeremywoertink/Sites/winovations/public/sample.csv"
encoding:ISO-8859-1 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"">
ruby-1.9.2-p0 > csv.each { |row| puts row[1] }
...
...
ruby-1.9.2-p0 > csv.each { |row| puts row[1] }
=> nil

Thanks,
~Jeremy

···

OK, actually... I think I get that last one. It's reporting 1409
rows, which aren't the same thing as line numbers, because some fields
seem to contain line breaks.

Duh. OK, now if I can just figure out this "Unclosed quoted field"
error and how to avoid it, I'll be good!

Thanks!

···

Nathaniel Smith wrote in post #965441:

I've had similar issues recently, and they are due to character
encodings. Something like Iconv will probably be necessary to convert
the files to a standard encoding

On Wed, Dec 1, 2010 at 11:52 AM, Jeremy Woertink

I've never actually used Iconv before, but I was just reading

and I did a test. I converted from ISO8859-1 to UTF8, and that actually
changes the characters, so it changes the meaning of the words. Now,
this is assuming that the CSV files I'm getting are all ISO8859-1
encoded (which I think they are).

I tried a test to just tell FasterCSV to read it as 'ISO8859-1' using the
first 3 lines of this CSV file:

Universal,ID,Kir,"Commonly, white wine with Cassis. Traditionally, the
cocktail kir (also known as vin blanc cassis in French) is made with
AligotÈ. Kir Royal is made with Champagne instead of AligotÈ."
Universal,GRAPE,MourvËdre / Monastrell / Mataro,"Grape: MourvËdre,
MatarÛ, or Monastrell is variety of grape used to make both strong, dark
red wines and rosÈs. It is grown in many regions around the world.
Universal,Tasting,Leafy,Specific aroma/taste descriptor: Having the
smell or taste sensation of Leaves.

ruby-1.8.7-p302 > file = File.open(File.join(Rails.root, 'public',
'sample.csv'))
=> #<File:/Users/jeremywoertink/Sites/winovations/public/sample.csv>
ruby-1.8.7-p302 > csv = FasterCSV.new(file, :encoding => 'ISO8859-1')
=> <#FasterCSV io_type:File
io_path:"/Users/jeremywoertink/Sites/winovations/public/sample.csv"
lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" encoding:"ISO8859-1">
ruby-1.8.7-p302 > csv.each { |row| puts row }
Universal
ID
Kir
Commonly, white wine with Cassis. Traditionally, the cocktail kir (also
known as vin blanc cassis in French) is made with AligotÈ. Kir Royal is
made with Champagne instead of AligotÈ.
FasterCSV::MalformedCSVError: Unclosed quoted field on line 2.
  from
/Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1663:in
`shift'
  from
/Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1581:in
`loop'
  from
/Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1581:in
`shift'
  from
/Users/jeremywoertink/.rvm/gems/ruby-1.8.7-p302/gems/fastercsv-1.5.3/lib/faster_csv.rb:1526:in
`each'
  from (irb):28

I'm not seeing any unclosed quotes... Also, I thought that when you
iterate through the returned csv file, it gives you rows, but this one
seems to be giving me columns on the first row, then dies when it hits
the second row.
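
A small sketch of why the output above looks like columns rather than rows: puts, when given an Array, prints each element on its own line, so a single parsed row shows up as several lines. The row below is a stand-in for one parsed CSV row.

```ruby
# A stand-in for one row as the CSV parser would yield it (an Array of fields).
row = ['Universal', 'ID', 'Kir']

puts row              # three lines of output, one per field
p row                 # one line: ["Universal", "ID", "Kir"]
puts row.join(' | ')  # one line: Universal | ID | Kir
```

So the iteration is yielding rows as expected; only the printing makes it look otherwise.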

···

For the same reason you can only read from an IO object once: it's tracking your position. You're now at the end. However, you can "rewind" it:

csv = CSV.open(File.join(Rails.root, 'public', 'sample.csv'))
csv.each { |row| … }
csv.rewind
csv.each { |row| … }

Hope that helps.

James Edward Gray II
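
A self-contained sketch of the same behaviour, using a StringIO in place of the file so it can be run anywhere. The sample data is made up.

```ruby
require 'csv'
require 'stringio'

io  = StringIO.new("a,b\nc,d\n")
csv = CSV.new(io)

first_pass  = csv.read   # consumes the IO up to the end
second_pass = csv.read   # position is still at the end, so nothing is left
csv.rewind               # move back to the start of the underlying IO
third_pass  = csv.read

puts first_pass.size     # 2
puts second_pass.size    # 0
puts third_pass.size     # 2
```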

···

On Dec 2, 2010, at 11:48 AM, Jeremy Woertink wrote:

I've upgraded to Ruby 1.9.2 now, but I'm still running into weird
issues. How come I can only parse a file once?

That most likely stems from some invalid CSV data.

James Edward Gray II

···

On Dec 2, 2010, at 12:15 PM, Jeremy Woertink wrote:

Ok, now if I can just figure out this "Unclosed quoted field"
error and how to avoid it, I'll be good!

scratch that... I found the missing quote (-_-) my bad.
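
For anyone hitting the same error, a minimal sketch of how a missing closing quote shows up, written against the standard CSV library in Ruby 1.9 and later (under FasterCSV on 1.8 the error class is FasterCSV::MalformedCSVError). The data is a shortened, made-up version of the sample rows.

```ruby
require 'csv'

# The quoted field on the first line is never closed, so the parser keeps
# reading into the following line while looking for the closing quote.
bad = "Universal,GRAPE,\"Grape: dark red wines.\nUniversal,Tasting,Leafy\n"

message = nil
begin
  CSV.parse(bad)
rescue CSV::MalformedCSVError => e
  message = e.message   # the message names the line where parsing failed
end

puts message
```

The line number in the message points at where the parser gave up, which is a reasonable place to start looking for the stray or missing quote.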

···

I've never actually used Iconv before, but I was just reading
and I did a test. I converted from ISO8859-1 to UTF8, and that actually
changes the characters, so it changes the meaning of the words. Now,
this is assuming that the CSV files I'm getting are all ISO8859-1
encoded (which I think they are).

You probably want to hit the files with some encoding guessing script to be sure.
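
A very rough sketch of what such a guessing script could look like. It is only a heuristic: bytes that form valid UTF-8 are taken as UTF-8, and everything else is assumed to be Latin-1, which is an assumption about these particular files and not a general rule. The helper name guess_encoding is hypothetical.

```ruby
# Hypothetical helper: returns a guess, not a certainty.
def guess_encoding(raw)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  # Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
  candidate.valid_encoding? ? Encoding::UTF_8 : Encoding::ISO_8859_1
end

latin1_bytes = "Aligot\xE9".b   # e-acute as the single Latin-1 byte 0xE9
utf8_bytes   = "Aligot\u00E9"   # e-acute as a two-byte UTF-8 sequence

puts guess_encoding(latin1_bytes)   # ISO-8859-1
puts guess_encoding(utf8_bytes)     # UTF-8
```

Pure ASCII input is reported as UTF-8 here, which is harmless since ASCII is a subset of both encodings.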

I tried a test to just tell FasterCSV to read it as 'ISO8859-1' using the
first 3 lines of this CSV file:

ruby-1.8.7-p302 >

On Ruby 1.8.7, FasterCSV supports only four encodings (the same four Ruby does) and Latin-1 (ISO-8859-1) isn't one of them. You need to transcode the data to UTF-8 on the way in or use the standard CSV library in Ruby 1.9 (which can parse Latin-1 directly).

James Edward Gray II
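
A sketch of the Ruby 1.9 route, assuming the file is Latin-1: the encoding option 'ISO-8859-1:UTF-8' tells Ruby to read the file as Latin-1 and hand the fields back as UTF-8. The temporary file and its contents are made up so the example runs on its own.

```ruby
require 'csv'
require 'tempfile'

# One sample row written out as raw Latin-1 bytes (0xE9 is e-acute).
latin1_row = "Universal,ID,Kir,\"made with Aligot\xE9.\"\n".b

rows = Tempfile.create('sample') do |f|
  f.binmode
  f.write(latin1_row)
  f.flush
  # External encoding ISO-8859-1, internal encoding UTF-8.
  CSV.foreach(f.path, encoding: 'ISO-8859-1:UTF-8').to_a
end

puts rows.first.last            # the field text with a proper e-acute
puts rows.first.last.encoding   # UTF-8
```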

···

On Dec 1, 2010, at 12:16 PM, Jeremy Woertink wrote:

Oh. I guess I don't spend enough time with IO stuff :P I wasn't aware of
that. Makes sense though!

Ok, sorry to throw all these out here, but I'm trying to understand this
whole thing :P

Ok, so in my sample.csv, I have 1481 lines (according to TextMate). When
I print out the rows and line numbers in the console, it gets to line
1409 then stops and returns nil. There's no error or anything. Is there
a limitation, or would this be caused by a malformed CSV file?

···

Oh. I guess I don't spend enough time with IO stuff :P I wasn't aware of
that. Makes sense though!

Ok, sorry to throw all these out here, but I'm trying to understand this
whole thing :P

No worries.

Ok, so in my sample.csv, I have 1481 lines (according to TextMate). When
I print out the rows and line numbers in the console, it gets to line
1409 then stops and returns nil. There's no error or anything. Is there
a limitation, or would this be caused by a malformed CSV file?

It would probably be due to CSV content like:

  one,"multi-line
  two",three

TextMate would count that as two lines (it is) but it's only one row of CSV data.

James Edward Gray II
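
A quick sketch of that difference, using the same kind of data: a quoted field with an embedded newline spans two physical lines but parses as a single row. The second row is added only to make the counts easier to compare.

```ruby
require 'csv'

data = "one,\"multi-line\ntwo\",three\nfour,five,six\n"

physical_lines = data.lines.count   # what a text editor would count
rows           = CSV.parse(data)    # what the CSV parser yields

puts physical_lines   # 3
puts rows.size        # 2
p rows.first          # ["one", "multi-line\ntwo", "three"]
```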

···

On Dec 2, 2010, at 12:09 PM, Jeremy Woertink wrote:

James Edward Gray II wrote in post #965490:

ruby-1.8.7-p302 >

On Ruby 1.8.7, FasterCSV supports only four encodings (the same four
Ruby does) and Latin-1 (ISO-8859-1) isn't one of them.

But binary (-Kn) is one of them, and that should be fine for ISO-8859-1,
shouldn't it?

OP, are you running on a Mac by any chance? Apple built Ruby for OSX
with a non-standard configuration so that $KCODE="UTF8" by default. Try
using:

ruby -e 'puts $KCODE'

If it says UTF8, then try running your script again with ruby -Kn

···

Ah, yes. Excellent point.

James Edward Gray II

···

On Dec 2, 2010, at 5:02 PM, Brian Candler wrote:

James Edward Gray II wrote in post #965490:

ruby-1.8.7-p302 >

On Ruby 1.8.7, FasterCSV supports only four encodings (the same four
Ruby does) and Latin-1 (ISO-8859-1) isn't one of them.

But binary (-Kn) is one of them, and that should be fine for ISO-8859-1,
shouldn't it?