Splitting a CSV file into 40,000 line chunks

James Edward Gray II wrote:

/ ...

A plain-text CSV file uses linefeeds as record delimiters. A program that
uses "readline" or "gets" splits the records just as a sane CSV parser
would. And IMHO a CSV file should never, ever have linefeeds embedded in
fields.

Your opinion doesn't make you right on this one.

I didn't make a statement of fact, so the term and concept of "right" are
not appropriate. I used the term "should". And I acknowledged that I was
wrong later in the same post.

/ ...

I know you're a don't-use-a-library guy and you know I disagree.

First, that characterization ("don't-use-a-library guy") is false. Second,
this isn't about libraries versus hand-written code, it is about reliable,
robust file protocols. IMHO allowing linefeeds into CSV fields is plain
stupid, and this is not about right and wrong, it is about robustness.

In any case, the majority of the responders in this thread tried to use
simple linefeed parsing in their replies, as I did, with the same risk.

This is the reason why. The edge cases will get you every time.

The implication is that a failure to use a library will get you every time.
This violates common sense in the most fundamental way, for a reason you
should certainly be able to detect.

···

--
Paul Lutus
http://www.arachnoid.com

You could import it into MS Access, sort there and export again. Maybe
you could even do all the splitting up in Access directly, but I don't
know enough about Access to help with that.
And I guess using Access would make it harder to automate this whole
task, in case that's required.

That's not very "rubyish", of course :slight_smile:

Thomas

···

2006/11/30, Drew Olson <olsonas@gmail.com>:

I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

Drew Olson wrote:

James -

I'll give FasterCSV a try when I get home from work and out from behind this proxy. Here's another question: in some cases I need to sort the file before splitting it (in this case sorting by the 4th cell in each row). However, the current file I'm trying to sort and split is around 76 MB and ruby fails when trying to store the CSV as an array. The code and output are below. How else can I go about this?
  

On Linux (or Cygwin, since you're on Windows):

$ sort -k 4 -t , <inputfile> > <outputfile>

That will probably put the header line at the bottom of the file or perhaps the middle, but that's pretty easy to fix.

$ cp <a-file-with-just-the-header-line> <outputfile>
$ grep -v <some-field-in-the-header-and-not-in-the-data> <inputfile> | sort -k 4 -t , >> <outputfile>

**However**:

If the file is truly CSV (numeric values unquoted, date and time stamps in official format for your locale, and all text fields quoted) there is actually a way you can treat it like a table in a database with ODBC. Open your little ODBC data sources gizmo and create a "File DSN". Use the "text" driver. It will let you configure it so that the "database" is the directory where your CSV file lives and your CSV file is a table in that database. Then anything that can connect to an ODBC Data Source Name will be able to query your CSV file.
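If you would rather drive that File DSN from Ruby than from the gizmo, something like this may work. It is only a sketch: it assumes the ruby-dbi gem with its ODBC driver installed, a File DSN named "CsvDir", a file data.csv in the DSN's directory, and a 4th column named col4 in the header; the exact table-name syntax ([data.csv], data#csv, ...) varies by text driver.

# Sketch only: query a CSV file through an ODBC File DSN from Ruby.
# The DSN name, file name, and column name below are assumptions.
require 'dbi'

DBI.connect('DBI:ODBC:CsvDir') do |dbh|
  # select_all returns an array of DBI::Row objects
  rows = dbh.select_all('SELECT * FROM [data.csv] ORDER BY col4')
  rows.each { |row| p row }
end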

If the sort is numeric, add a "-n" to the sort command above. If it's descending, add "-r" and if it's both numeric and descending add "-nr". "man sort" is your friend.

···

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

rthompso@jhereg:~$ cat splitandhead
#!/bin/bash

fname=$1
split -l 4000 -a 6 -d $fname $fname

for file in `ls ${fname}0*`
do
    sed -e '1i\Col1 Col2 Col3 Col4\' $file > ${file}.tmp
    mv ${file}.tmp $file
done

rthompso@jhereg:~$ head testsed
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
rthompso@jhereg:~$ wc testsed
40140 160560 602100 testsed
rthompso@jhereg:~$ time ./splitandhead testsed

real 0m0.499s
user 0m0.140s
sys 0m0.092s
rthompso@jhereg:~$ head -2 testsed0000*
==> testsed000000 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000001 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000002 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000003 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000004 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000005 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000006 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000007 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000008 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000009 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000010 <==
Col1 Col2 Col3 Col4
this is a test
rthompso@jhereg:~$

···

On Thu, 2006-11-30 at 04:44 +0900, Drew Olson wrote:

Reid Thompson wrote:
> On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:
>> Also since he's feeding Excel this is most likely on Windows rather
>> than *NIX
> not a show stopper..
> error
>
> http://unxutils.sourceforge.net/

Reid -

I also have cygwin installed on my box, so if there is a simple way to
do this in unix I'd love to know. My unix is rusty (that's being
generous).

sort -n +4 out
or
sort +4 out

rthompso@jhereg:~$ head -50 out
this is a test 10202
this is a test 23990
this is a test 11056
this is a test 9606
this is a test 28590
this is a test 18264
this is a test 12902
this is a test 12856
this is a test 27571
this is a test 5495
this is a test 15965
this is a test 22229
this is a test 18865
this is a test 31339
this is a test 21913
this is a test 406
this is a test 8602
this is a test 5329
this is a test 10048
this is a test 6458
this is a test 20069
this is a test 19771
this is a test 21844
this is a test 24719
this is a test 30894
this is a test 25239
this is a test 9900
this is a test 1727
this is a test 12042
this is a test 20832
this is a test 23735
this is a test 28768
this is a test 10283
this is a test 390
this is a test 10480
this is a test 1337
this is a test 2745
this is a test 26398
this is a test 32288
this is a test 3797
this is a test 22251
this is a test 458
this is a test 14679
this is a test 29642
this is a test 19943
this is a test 26342
this is a test 24232
rthompso@jhereg:~$ sort +4 out
this is a test 0
this is a test 10
this is a test 10
this is a test 100
this is a test 100
this is a test 100
this is a test 1000
this is a test 1000
this is a test 10001
this is a test 10001
this is a test 10005
this is a test 10006
this is a test 10006
this is a test 10007
this is a test 10007
this is a test 10008
this is a test 10008
this is a test 10009
this is a test 10009
this is a test 1001
this is a test 10010
this is a test 10011
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10013
this is a test 10014
this is a test 10014
this is a test 10015
this is a test 10017
this is a test 10017
this is a test 10017
this is a test 10018
this is a test 10019
this is a test 1002
this is a test 1002
this is a test 10021
this is a test 10026
this is a test 10026
-------------------- OR
rthompso@jhereg:~$ sort -n +4 out
this is a test 0
this is a test 2
this is a test 3
this is a test 3
this is a test 4
this is a test 7
this is a test 10
this is a test 10
this is a test 12
this is a test 13
this is a test 14
this is a test 14
this is a test 15
this is a test 16
this is a test 16
this is a test 20
this is a test 21
this is a test 24
this is a test 25
this is a test 28
this is a test 29
this is a test 29
this is a test 29
this is a test 30
this is a test 30
this is a test 31
this is a test 32
this is a test 32
this is a test 34
this is a test 35
this is a test 35
this is a test 35
this is a test 36
this is a test 36
this is a test 37
this is a test 37
this is a test 38
this is a test 38
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 42
this is a test 42
this is a test 42
this is a test 43
this is a test 45
this is a test 46
this is a test 46
this is a test 47
this is a test 47

···

On Thu, 2006-11-30 at 04:04 +0900, James Edward Gray II wrote:

On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:

> Here's another question: in some cases I need to sort the
> file before splitting it (in this case sorting by the 4th cell in each
> row). However, the current file I'm trying to sort and split is around
> 76 MB and ruby fails when trying to store the CSV as an array. The
> code
> and output are below. How else can I go about this?

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time). This makes it very hard to write code that can take advantage of
duck typing, e.g.

def get_batch(f, count)
  lines = nil
  count.times do
     break unless line = f.gets
     (lines ||= []) << line
  end
  lines
end

get_batch(File.open("foo"), 40000) # No problem
get_batch(CSV.open("foo", "r"), 40000) # Error

I wonder why it was not patterned more closely on IO? Any thoughts?

···

--
Posted via http://www.ruby-forum.com/.

Paul Lutus wrote:

James Edward Gray II wrote:
  

[snip]

Hmmmm ... seems like a bunch of folks are going to a bunch of trouble to baby a tool that can only handle a CSV file with fewer than 65536 rows. Excel is broken ... Excel sucks at math ... Excel can't handle lots of rows and lots of columns ... Excel costs money. Excel was a great wonder in 1986 -- 20 years later it's lame beyond words.

I'm not sure about the column count limitations, but I know almost any database will deal with a CSV file with more than 65536 rows. There are some awfully good free databases. Did I mention how bad Excel is at math? Sometimes you *don't* get what you pay for. :slight_smile:

···

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

Indeed. That problem is difficult in general because you need to have
the whole set of elements in memory before you can begin sorting them.
As James pointed out, you might be able to use some sort of
memoization technique to track only the bits relevant to sorting. The
problem is you'll also need some way to get back to the original
record.

Depending on how you end up parsing the records, you might be able
to store the file position of the start of the record and the record
length.

Records -> [sort_key, file.pos, record.length]

Then sort those arrays if you can fit them all in memory. Finally,
you can use the offsets for random access to grab the records and
stick them into the new files as you've been doing.

Basically, you're looking at a complicated Schwartzian transformation.
Whether it will work depends on how big your records are. If they are
fairly large, you might be able to pull it off; however, if they're
small and the problem is only that there are too many records, you'll
still have a problem.

In that case, you might want to just shove them in an RDBMS and let it
sort it for you.
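For what it's worth, here is a rough, untested sketch of that offset-index idea. The file names are hypothetical, and it assumes one record per physical line, i.e. no embedded linefeeds:

# Pass 1: build [key, offset, length] tuples, one per record.
index  = []
header = nil
File.open("input.csv") do |f|
  header = f.gets
  pos = f.pos
  while line = f.gets
    index << [line.split(",")[3], pos, line.bytesize]
    pos = f.pos
  end
end
index.sort!   # tuples compare by key first

# Pass 2: random access by offset, 40,000 records per output file.
File.open("input.csv") do |f|
  index.each_slice(40_000).with_index do |chunk, i|
    File.open("part_%03d.csv" % i, "w") do |out|
      out << header
      chunk.each do |_key, pos, len|
        f.seek(pos)
        out << f.read(len)
      end
    end
  end
end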

···

On 11/29/06, James Edward Gray II <james@grayproductions.net> wrote:

On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:

> Here's another question: in some cases I need to sort the
> file before splitting it (in this case sorting by the 4th cell in each
> row). However, the current file I'm trying to sort and split is around
> 76 MB and ruby fails when trying to store the CSV as an array. The
> code
> and output are below. How else can I go about this?

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II

--
Lou.

Thomas Mueller wrote:

···

2006/11/30, Drew Olson <olsonas@gmail.com>:

I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

You could import it into MS Access, sort there and export again. Maybe
you could even do all the splitting up in Access directly, but I don't
know enough about Access to help with that.
And I guess using Access would make it harder to automate this whole
task, in case that's required.

That's not very "rubyish", of course :slight_smile:

Thomas

Well ... *databases* are highly Ruby-ish! Every major database, including, I'm guessing, MS Abscess*, has a simple Ruby interface. Reading a huge file into memory is the wrong approach.

* Abscess -- An abscess is a localized collection of pus in any part of the body, caused by an infection.

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

Thomas Mueller wrote:

I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

I'm coming to this party really late, so I hope I don't come across as shamelessly plugging KirbyBase, but, you might want to try it for this.

If you are simply trying to take a large csv file, sort it by one of its fields, and split it up into smaller files that each contain 40,000 records, I think it might work.

Here's some code (not tested, could be incorrect) off the top of my head:

require 'kirbybase'

db = KirbyBase.new

tbl = db.create_table(:foo, :field1, :String, :field2, :Integer, :field3, :String............................

tbl.import_csv(name_of_csv_file)

rec_count = tbl.total_recs
last_recno_written_out = 0

while rec_count > 0
  recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno < last_recno_written_out + 40000 }.sort(:field4)
   ........ here is where you put the code to write these 40,000 recs to a csv output file .............

  last_recno_written_out = recs.last.recno

  rec_count = rec_count - 40000
end

KirbyBase will even use FasterCSV for its csv stuff if you have it installed. :slight_smile:

Anyway, hope this helps. If I have totally misunderstood the request, feel free to ignore!

Jamey Cribbs

···

2006/11/30, Drew Olson <olsonas@gmail.com>:

I am perplexed by CSV.open.

Me too. :wink:

I wonder why it was not patterned more closely on IO? Any thoughts?

The author does not believe CSV objects should be IO like. If you search the archives for my messages about including FasterCSV in the standard library you will run into our debates on this subject.

FasterCSV's open() behaves the way you expect.

James Edward Gray II
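For instance, a minimal sketch, assuming the fastercsv gem is installed (FasterCSV#gets, an alias for #shift, returns one parsed row per call):

require 'rubygems'
require 'faster_csv'

# Edwin's duck-typed get_batch from earlier in the thread works
# unchanged, because FasterCSV.open returns an IO-like reader.
csv   = FasterCSV.open("foo.csv", "r")
batch = get_batch(csv, 40_000)
csv.close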

···

On Nov 29, 2006, at 5:25 PM, Edwin Fine wrote:

M. Edward (Ed) Borasky wrote:

Paul Lutus wrote:

James Edward Gray II wrote:
  

[snip]

Hmmmm ... seems like a bunch of folks are going to a bunch of trouble
to baby a tool that can only handle a CSV file with fewer than 65536
rows. Excel is broken ... Excel sucks at math ... Excel can't handle
lots of rows and lots of columns ... Excel costs money. Excel was a
great wonder in 1986 -- 20 years later it's lame beyond words.

I concur in all respects, based on direct experience.

I'm not sure about the column count limitations, but I know almost any
database will deal with a CSV file with more than 65536 rows. There are
some awfully good free databases. Did I mention how bad Excel is at
math?

It certainly bears repeating. I've found a number of math errors in Excel
over the years.

Sometimes you *don't* get what you pay for. :slight_smile:

IMHO Microsoft represents a negative correlation across the board.

http://arachnoid.com/boycott/index.html

···

--
Paul Lutus
http://www.arachnoid.com

Edwin Fine wrote:

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time).

/ ...

I wonder why it was not patterned more closely on IO? Any thoughts?

This is an experience with which I am becoming familiar. Someone requests a
solution to a problem. Someone else offers the option of a library to solve
the problem. Then the original problem fades into the background, replaced
by discussion of the library's problems.

This same pattern has repeated itself about four times in the past
fortnight, in just this one newsgroup.

I can be relied on to suggest a terse code solution. Then someone else can
be relied on to point out, correctly, that a terse code solution may miss
edge cases and, if exposed to enough data, will surely fail. Absolutely
correct.

So, just for variety, I will _not_ say "Have you considered writing your own
code?" just because that's what people expect me to say. I won't do this
because I now realize "code" is a trigger word, just like saying "abortion"
among fundamentalists -- that is something you just don't want to do.

So I will say "Have you considered writing your own library?" It amounts to
the same thing, since libraries are written using code, and all code is
written by mortals, but this way of saying it avoids the trigger word
"code".

Your own code ... er, excuse me, your own library ... will meet your
requirements exactly, it won't cover cases that are not relevant to the
problem at hand, it will be much faster overall than existing solutions,
and you will learn things about Ruby that you would not if you used someone
else's library.

In this specific case, as has been pointed out to me, a CSV field can
contain linefeeds, which means -- if your data exploits this trait -- you
need to parse the entire database using a state machine that knows about
this possibility.

On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds), you can scan the data much more
quickly using a simpler solution, but a solution that will certainly fail
if the above assumption turns out to be false. Code like this:

···

------------------------------------------

#!/usr/bin/ruby -w

max_output_lines = 40000

input_file = "test.txt"

output_base = "output"

n = 0

ifile = File.open(input_file,"r")

header = ifile.gets

until ifile.eof?
  ofn = output_base + sprintf("%03d", n) + ".txt"
  ofile = File.open(ofn, "w")
  ofile.write(header)   # repeat the original header atop each chunk
  line = 2
  until ifile.eof? || line > max_output_lines
    ofile.write(ifile.gets)
    line += 1
  end
  ofile.close
  n += 1
end

ifile.close

------------------------------------------

Note that I meet your requirement to place the original header line at the
top of each database section.

If you will call this a "library", it will pass muster with those who prefer
the word "library" to the word "code". Outside the box, it's all the same.
Inside the box, not at all.

--
Paul Lutus
http://www.arachnoid.com

Jamey Cribbs wrote:

Thomas Mueller wrote:

I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

I'm coming to this party really late, so I hope I don't come across as shamelessly plugging KirbyBase, but, you might want to try it for this.

If you are simply trying to take a large csv file, sort it by one of its fields, and split it up into smaller files that each contain 40,000 records, I think it might work.

Here's some code (not tested, could be incorrect) off the top of my head:

require 'kirbybase'

db = KirbyBase.new

tbl = db.create_table(:foo, :field1, :String, :field2, :Integer, :field3, :String............................

tbl.import_csv(name_of_csv_file)

rec_count = tbl.total_recs
last_recno_written_out = 0

while rec_count > 0
recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno < last_recno_written_out + 40000 }.sort(:field4)

........ here is where you put the code to write these 40,000 recs to a csv output file .............

last_recno_written_out = recs.last.recno

rec_count = rec_count - 40000
end

I realized this morning that the solution I posted last night won't work because you need the whole dataset sorted *before* you start splitting it up into 40,000 record files. Oops!

Anyway, in an attempt to recover gracefully from my mistake and also to give me the opportunity to shamelessly plug another one of my libraries, I present the following proposed solution that is totally untested and probably full of holes:

  require 'mongoose'
  db = Mongoose::Database.new

  db.create_table(:foo) do |tbl|
    tbl.add_column(:field1, :string)
    tbl.add_column(:field2, :string)
    tbl.add_column(:field3, :integer)
    tbl.add_indexed_column(:field4, :string)
    .
  end

  Foo.import(csv_filename)

  total_recs_written = 0

  while total_recs_written < Foo.last_id_used
    recs = Foo.find(:order => :field4, :offset => total_recs_written, :limit => 40000)
     ........ here is where you put the code to write these 40,000 recs to a csv output file .............

    total_recs_written += recs.size
  end

Jamey


···

2006/11/30, Drew Olson <olsonas@gmail.com>:

Louis J Scoras wrote:

> Here's another question: in some cases I need to sort the
> file before splitting it (in this case sorting by the 4th cell in each
> row). However, the current file I'm trying to sort and split is around
> 76 MB and ruby fails when trying to store the CSV as an array. The
> code
> and output are below. How else can I go about this?

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II

Indeed. That problem is difficult in general because you need to have
the whole set of elements in memory before you can begin sorting them.
As James pointed out, you might be able to use some sort of
memoization technique to track only the bits relevant to sorting. The
problem is you'll also need some way to get back to the original
record.

Depending on how you end up parsing the records, you might be able
to store the file position of the start of the record and the record
length.

Records -> [sort_key, file.pos, record.length]

Then sort those arrays if you can fit them all in memory. Finally,
you can use the offsets for random access to grab the records and
stick them into the new files as you've been doing.

Basically, you're looking at a complicated Schwartzian transformation.
Whether it will work depends on how big your records are. If they are
fairly large, you might be able to pull it off; however, if they're
small and the problem is only that there are too many records, you'll
still have a problem.

In that case, you might want to just shove them in an RDBMS and let it
sort it for you.

Let's say you want to sort by the foo column:

1. Read in all the foo values and sort them.
2. Get every 40,000th value from the list. Now, upon reading any row,
   you can determine what page it should go on.
3. Read the file, get the rows for the first N pages, ignoring the rest
   of the rows, where N is a number that won't run you out of memory.
   Create the files for those rows.
4. Remove references to the rows you read in.
5. Repeat with the next N pages until finished.
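A rough sketch of that paging idea follows. The file and column choices are hypothetical: it keys on the 4th comma-separated cell, compares keys as strings (convert them for a numeric sort), assumes no embedded newlines, and keeps every page file open at once rather than windowing; note that ties on the key can push a page past 40,000 rows.

PAGE = 40_000

# Pass 1: collect and sort just the keys; every PAGE-th key becomes
# the lower boundary of a page.
keys = File.readlines("input.csv")[1..-1].map { |l| l.split(",")[3] }.sort
boundaries = (0...keys.size).step(PAGE).map { |i| keys[i] }

# Pass 2: stream the file and route each row to its page.
outs = {}
File.open("input.csv") do |f|
  header = f.gets
  f.each do |line|
    key  = line.split(",")[3]
    page = boundaries.count { |b| b <= key } - 1
    out  = outs[page] ||= File.open("page_%03d.csv" % page, "w").tap { |o| o << header }
    out << line
  end
end
outs.each_value(&:close)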

···

On 11/29/06, James Edward Gray II <james@grayproductions.net> wrote:

On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:

Your own code ... er, excuse me, your own library ... will meet your
requirements exactly, it won't cover cases that are not relevant to the
problem at hand, it will be much faster overall than existing solutions,
and you will learn things about Ruby that you would not if you used someone
else's library.

Now you're guilty of a new sin: encouraging people to reinvent the wheel. You just can't win, can you? :wink:

Different problems require different levels of paranoia. Sometimes a little code will get you over the hump, but you may be making some trade-offs when you don't use a robust library. Sometimes those are even good trade-offs, like sacrificing edge case handling to gain some speed. Sometimes it's even part of the goal to avoid the library, like when I built FasterCSV to address some needs CSV wasn't meeting. As soon as things start getting serious though, *I* usually feel safer reaching for the library.

The people reading this list have now seen us debate the issue and can make well-informed decisions about what they think is best.

On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds)...

Really? How do they handle data with newlines in it?

Which "CSV databases" are you referring to here?

James Edward Gray II

···

On Nov 30, 2006, at 12:36 PM, Paul Lutus wrote:

Paul Lutus wrote:

Edwin Fine wrote:

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time).

/ ...

I wonder why it was not patterned more closely on IO? Any thoughts?

This is an experience with which I am becoming familiar. Someone
requests a solution to a problem. Someone else offers the option of a
library to solve the problem. Then the original problem fades into the
background, replaced by discussion of the library's problems.

This same pattern has repeated itself about four times in the past
fortnight, in just this one newsgroup.
/snip/

Ummm, I am not sure exactly how to interpret the above post, but I see
my name quoted there, so I feel compelled to clarify what I was thinking
in making my original post. I had just written a small Ruby program that
would satisfy the OP's stated problem, but using IO/File. While I was
doing this, more posts appeared, which alerted me to the possibility
that I would have to cater for newlines in the input. "Oh well," I
thought, "I'll just replace every use of IO with CSV, and that will be
that." BZZZ! Wrong! Thank you for playing. I couldn't just drop in CSV
instead of IO? WTF???

This is where my perplexity came in. Matz himself has regularly and
clearly stated that he designed Ruby along the Principle Of Least
Surprise (or LOLA, Law of Least Astonishment). Well, I was grievously
surprised and astonished when CSV#open behaved differently to every open
I have used in any language. All the other opens that I know of return
the concept of a handle/object, or some *thing* that can then be
beseeched to bring forth the contents of the actual I/O "device", one
element at a time, or all at once. The CSV#open skips this step and goes
straight from open to bringing forth entities, and thereby breaks
possible compatibility with IO/File. IMHO, this is inelegant design.

I have written many, many libraries (not in Ruby) and know how important
it is to present to your users a consistent, clean, orthogonal,
robust,reliable set of interfaces and implementations. That's why it is
inadvisable to release a very early version of a library to a large
audience of end users until it has proven itself in battle, as it were.
Otherwise, you face the prospect of having to change the interface to be
less surprising (*and* keep a backward-compatible, deprecated one) and
re-releasing it to possibly hundreds of users.

The bottom line is, although I am perfectly capable of doing so, I don't
WANT to reinvent the wheel. I really like reusing good, dependable,
predictable code. I haven't had time to look at FasterCSV yet, but I
predict that I will like what I see, because to my mind, from the works
of his I have seen, the author does his best to embody the "Tao" of Ruby
(LOLA). (Although you can never accurately describe the Tao of anything,
I know...)

Well, that's my 2c worth :slight_smile:

···

--
Posted via http://www.ruby-forum.com/.

Thanks everyone for ALL the replies. Lots of interesting things to think
about. I'll take a look at using a database approach for this and I'm
looking at FasterCSV now. Also, some very good insight related to
building code from scratch and using libraries.

Another great example of the ruby community at work IMO.

···

--
Posted via http://www.ruby-forum.com/.

James Edward Gray II wrote:

Your own code ... er, excuse me, your own library ... will meet your
requirements exactly, it won't cover cases that are not relevant to the
problem at hand, it will be much faster overall than existing solutions,
and you will learn things about Ruby that you would not if you used
someone else's library.

Now you're guilty of a new sin: encouraging people to reinvent the
wheel. You just can't win, can you? :wink:

If the OP has a problem not easily solved with a library, then he isn't
reinventing the wheel. And I don't care about winning.

Different problems require different levels of paranoia.

Yes, absolutely. The larger the job and the larger the data set, the more
likely one will encounter border conditions, and the more appropriate to
use a state machine that understands the full specification. All at the
cost of speed.

Sometimes a
little code will get you over the hump, but you may be making some
trade-offs when you don't use a robust library.

Yes, but a robust library is not appropriate if it cannot solve the problem,
or if the learning curve is so steep that it would be easier to write one's
own scanner.

Also, there is a hidden assumption in your position -- that libraries, ipso
facto, represent robust methods.

Sometimes those are
even good trade-offs, like sacrificing edge case handling to gain
some speed. Sometimes it's even part of the goal to avoid the
library, like when I built FasterCSV to address some needs CSV wasn't
meeting.

That borders on the heretical. :slight_smile:

As soon as things start getting serious though, *I* usually
feel safer reaching for the library.

I've noticed that. I want to emphasize once again that my style is a
personal preference, not an appeal to authority or untestable precepts.

The people reading this list have now seen us debate the issue and can
make well-informed decisions about what they think is best.

I think 90% of the readers of this newsgroup won't pay any attention to
either of our opinions on this topic. They will realize that inside every
library is code written by a mortal human, therefore this sort of debate is
primarily tilting at windmills or describing angel occupancy requirements
for heads of pins.

For the newbies, however, it might matter. They might think library contents
differ from ordinary code. And that is true only if the writers of
libraries differ from ordinary coders. Ultimately, they don't, as Microsoft
keeps finding out.

On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds)...

Really? How do they handle data with newlines in it?

Linefeeds are escaped as though in a normal quoted string. This is how I
have always dealt with embedded linefeeds, which is why I was ignorant of
the specification's language on this (an explanation, not an excuse).

Which "CSV databases" are you referring to here?

MySQL, the database I am most familiar with, uses this method for import or
export of comma- or tab-separated plain-text data. Within MySQL's own
database protocol, linefeeds really are linefeeds, but an imported or
exported plain-text table has them escaped within fields.

I create a lot of plain-text databases, and I am constantly presenting them
to MySQL for parsing (or getting plain-text back from MySQL), and this only
confirmed my mistaken impression that linefeeds are always escaped in
fields of this class of database.
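To make the two conventions concrete, here is a small illustration using Ruby's standard CSV parser (the data is made up):

require 'csv'

rfc_style   = %Q{"line one\nline two",42\n}    # raw LF inside a quoted field
mysql_style = %Q{"line one\\nline two",42\n}   # LF escaped; one physical line

p CSV.parse(rfc_style).size   # => 1  a real CSV parser sees one record
p rfc_style.lines.count       # => 2  naive line splitting sees two
p mysql_style.lines.count     # => 1  the escaped form survives line splitting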

It's obvious why the specification reads as it does, and I should have known
about this long ago. It reads as it does because it just isn't that
difficult to parse a quoted field, and it is no big deal to allow
absolutely anything in the field. It just takes longer if all the database
handling (not just record parsing) must use the same state machine that
field parsing must use.

<OT><RANT>

It's very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.
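To see what that costs, here is a minimal sketch of the quote-aware, character-by-character scan it forces. It only splits records, it doesn't parse fields, and it is simplified: doubled quotes toggle the state twice and so cancel out.

# Yield one raw record at a time, treating a linefeed as a record
# separator only when we are outside double quotes.
def each_csv_record(io)
  record, in_quotes = "", false
  while (ch = io.getc)
    in_quotes = !in_quotes if ch == '"'
    if ch == "\n" && !in_quotes
      yield record
      record = ""
    else
      record << ch
    end
  end
  yield record unless record.empty?
end

File.open("test.csv") { |f| each_csv_record(f) { |rec| p rec } }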

When a group of people sit down to create a specification, the highest
priority is ... utility, common sense? ... no, it's immunity to criticism.
The easiest way to avoid criticism is to allow absolutely anything, even if
this hurts performance in real-world embodiments that obey the
specification.

Someone might say, "Okay, but can you drop an entire, internally consistent
CSV database into the boundaries of a single field of another CSV database,
without any harm or lost data?" Using the present specification, the
committee can say "yes, absolutely."

But parsing will necessarily be slow, character by character, the entire
database scan must use an intelligent parser (no splitting records on
linefeeds as I have been doing), and the state machine needs a few extra
states.

I cannot tell you how many times I have foolishly said, "surely the
specification doesn't allow that!", and I cannot remember actually ever
being right after taking such a position. When I make assumptions about
committees, I am always wrong.

</RANT></OT>

···

On Nov 30, 2006, at 12:36 PM, Paul Lutus wrote:

--
Paul Lutus
http://www.arachnoid.com

Edwin Fine wrote:

/ ...

The bottom line is, although I am perfectly capable of doing so, I don't
WANT to reinvent the wheel. I really like reusing good, dependable,
predictable code. I haven't had time to look at FasterCSV yet, but I
predict that I will like what I see, because to my mind, from the works
of his I have seen, the author does his best to embody the "Tao" of Ruby
(LOLA). (Although you can never accurately describe the Tao of anything,
I know...)

Nice, informative post. There are a lot of issues here, primarily the fact
that the database under discussion is too big to hold in memory, and it is
also too big to fit into Excel in one chunk, which appears to be its
destination.

Most people have begun to drift toward suggesting a database approach,
rather than anything that involves direct manipulation of the database in
Ruby. Because of the size of the database and because sorting the records
is one goal, I have to agree.

···

--
Paul Lutus
http://www.arachnoid.com