Splitting a CSV file into 40,000 line chunks

All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

infile -> xyz.csv

output -> xyz_part_1.csv
          xyz_part_2.csv
          etc...

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

CSV.open(infile, 'r') do |row|
  if(index != 0 && index%40000 == 0)
    writer.close
    counter+=1
    writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
    writer << header
  end
  if (index == 0)
    header = row
  end
  writer << row
  index += 1
end

writer.close()

···

--
Posted via http://www.ruby-forum.com/.

File.open("test", "w+") do |f|
  20.times do |i|
    f.write "abc|def|ghi#{i}\n"
  end
end

File.open("test", "r") do |fin|
  part = 0
  while row = fin.gets

    File.open("test_part#{part}", "w+") do |f|
      5.times do
        f.write(row)
        break unless row = fin.gets
      end
    end

    part += 1
  end
end

···

On 11/29/06, Drew Olson <olsonas@gmail.com> wrote:

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

--
Simon Strandgaard
http://opcoders.com/

Drew Olson wrote:

All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

infile -> xyz.csv

output -> xyz_part_1.csv
          xyz_part_2.csv
          etc...

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters

Consider that the problem is one of counting input lines. In a case like
this, it is not possible to avoid using a counter. It's in the nature of
the problem to be solved.

and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

CSV.open(infile, 'r') do |row|

Why are you using CSV for this? You aren't parsing the lines into fields, so
the fact that they contain CSV content has no bearing on the present task.
Your goal is to split the input file into groups of lines delimited by
linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series of
output files, until the input file is exhausted?

···

------------------------------------------

#!/usr/bin/ruby -w

max_output_lines = 1000

input_file = "test.txt"

output_base = "output"

n = 0

ifile = File.open(input_file,"r")

header = ifile.gets

until(ifile.eof?)
   ofn = output_base + sprintf("%03d",n) + ".txt"
   ofile = File.open(ofn,"w")
   ofile.write(header)
   line = 2
   until(ifile.eof? || line > max_output_lines)
      ofile.write(ifile.gets)
      line += 1
   end
   ofile.close
   n += 1
end

ifile.close

------------------------------------------

Just change the number for "max_output_lines" to suit your requirement.

--
Paul Lutus
http://www.arachnoid.com

Not ruby, but why not use "split -l 40000"?

Cheers,
/Nick

···

On 11/29/06, Drew Olson <olsonas@gmail.com> wrote:

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

I'll let you decide if this is any better:

Firefly:~/Desktop$ ls
gen_csv.rb split_csv.rb
Firefly:~/Desktop$ ruby gen_csv.rb
Firefly:~/Desktop$ ls
data.csv gen_csv.rb split_csv.rb
Firefly:~/Desktop$ cat data.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
4,"blah, blah, blah, ..."
5,"blah, blah, blah, ..."
6,"blah, blah, blah, ..."
7,"blah, blah, blah, ..."
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
12,"blah, blah, blah, ..."
13,"blah, blah, blah, ..."
14,"blah, blah, blah, ..."
15,"blah, blah, blah, ..."
16,"blah, blah, blah, ..."
17,"blah, blah, blah, ..."
18,"blah, blah, blah, ..."
19,"blah, blah, blah, ..."
Firefly:~/Desktop$ ruby split_csv.rb data.csv
Firefly:~/Desktop$ ls
data.csv data_part_2.csv data_part_4.csv gen_csv.rb
data_part_1.csv data_part_3.csv data_part_5.csv split_csv.rb
Firefly:~/Desktop$ cat data_part_1.csv
rec_no,whatever
0,"blah, blah, blah, ..."
1,"blah, blah, blah, ..."
2,"blah, blah, blah, ..."
3,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat data_part_3.csv
rec_no,whatever
8,"blah, blah, blah, ..."
9,"blah, blah, blah, ..."
10,"blah, blah, blah, ..."
11,"blah, blah, blah, ..."
Firefly:~/Desktop$ cat split_csv.rb
#!/usr/bin/env ruby -w

require "rubygems"
require "faster_csv"

original_file = ARGV.shift
counter = 1

FCSV.open(original_file.sub(".", "_part_#{counter}."), "w") do |out|
   FCSV.open(original_file, :headers => true).each_with_index do |row, index|
     if (index % 4).zero?
       if index.nonzero?
         counter += 1
         out.reopen(original_file.sub(".", "_part_#{counter}."), "w")
       end
       out << row.headers
     end
     out << row
   end
end

__END__

James Edward Gray II

···

On Nov 29, 2006, at 9:32 AM, Drew Olson wrote:

Any advice on making the code nicer is appreciated.

Drew Olson wrote:

All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

infile -> xyz.csv

output -> xyz_part_1.csv
         xyz_part_2.csv
         etc...

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

CSV.open(infile, 'r') do |row|
if(index != 0 && index%40000 == 0)
   writer.close
   counter+=1
   writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
   writer << header
end
if (index == 0)
   header = row
end
writer << row
index += 1
end

writer.close()

I will ignore the CSV issue, not because it isn't important, but simply because I'm not familar with the csv parser, and this example sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the helper methods (i.e. not in the stdlib) File#write_fresh, File#to_a and Enumerable#chunks, all of which I've written at one time or another. Mentally sub in the appropriate code as desired.

File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk,i|
  File.write_fresh("xyz_part_#{i+1}",chunk.join("\n"))
end

File.to_a returns an array of lines
Enumerable#chunks divides an Enumerable into groups of 40k. A 100k array would yield 2 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn't exist, truncates any existing file, and writes the 2nd argument to the file.

This version is much prettier than the corresponding version without the helper methods, but it is also clearer. It is obvious at a glance what it does. The same can't be said for the version without helper methods.

Drew Olson wrote:

All -

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

infile -> xyz.csv

output -> xyz_part_1.csv
         xyz_part_2.csv
         etc...

My code is working but I don't find it very "rubyish". In particular, I
hate having my index and counter counters and I don't like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like "rows = CSV.open(infile)" because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

require 'csv'

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

CSV.open(infile, 'r') do |row|
if(index != 0 && index%40000 == 0)
   writer.close
   counter+=1
   writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
   writer << header
end
if (index == 0)
   header = row
end
writer << row
index += 1
end

writer.close()

I will ignore the CSV issue, not because it isn't important, but simply because I'm not familar with the csv parser, and this example sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the helper methods (i.e. not in the stdlib) File#write_fresh, File#to_a and Enumerable#chunks, all of which I've written at one time or another. Mentally sub in the appropriate code as desired.

File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk,i|
File.write_fresh("xyz_part_#{i+1}",chunk.join("\n"))
end

File.to_a returns an array of lines
Enumerable#chunks divides an Enumerable into groups of 40k. A 100k array would yield 2 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn't exist, truncates any existing file, and writes the 2nd argument to the file.

This version is much prettier than the corresponding version without the helper methods, but it is also clearer. It is obvious at a glance what it does. The same can't be said for the version without helper methods.

Paul Lutus wrote:

CSV.open(infile, 'r') do |row|

Why are you using CSV for this? You aren't parsing the lines into fields, so
the fact that they contain CSV content has no bearing on the present task.
Your goal is to split the input file into groups of lines delimited by
linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series of
output files, until the input file is exhausted?

Because CSV understands csv data with embedded newlines:

[will@localhost ~]$ cat csvfile.csv
some, field,"new
line"
other,field,here
[will@localhost ~]$ cat test.rb
require 'csv'

CSV.open('csvfile.csv', 'r') do |row|
   p row
end
[will@localhost ~]$ ruby test.rb
["some", " field", "new\nline"]
["other", "field", "here"]

will.

Because the sample code is also moving the headers to the new files.

James Edward Gray II

···

On Nov 29, 2006, at 10:53 AM, Nick Sieger wrote:

On 11/29/06, Drew Olson <olsonas@gmail.com> wrote:

I've written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

Not ruby, but why not use "split -l 40000"?

Nice Simon

But the 'while row = fin.gets' is skipping a row each time around and
you don't handle the header. My "fix":
Simon Strandgaard wrote:...

File.open("test", "w+") do |f|
  20.times do |i|
    f.write "abc|def|ghi#{i}\n"
  end
end

File.open("test", "r") do |fin|
  part = 0

    header = fin.gets
    row = fin.gets
    while row

    File.open("test_part#{part}", "w+") do |f|

         f.write(header)

      5.times do
        f.write(row)
        break unless row = fin.gets
      end
    end

    part += 1
  end
end

...

Cheers
Chris

Thanks for all the responses. As noted in a post above, I am trying to
place the headers at the beginning of each split file. Should have
mentioned that earlier. Also, regarding the above responses, I am
working on a Windows system, hence not being able to use nice shell
scripts.

Anyway, I seems that my logic was on the right track, the use of ugly
counters just made my inner-rubyist upset. Thanks for all the feedback,
I will take a look and reading/writing the files using the File object
as opposed to the CSV object. In fact, I didn't even think about that.
In previous scripts I had been accessing the CSV fields individually
but, as it is unnecessary here, I will toy around with performance gains
using File object.

Speaking of performance, using the code I've described seemed to be
quite slow. This is understandable based on the fact that the input file
has 1 million+ records, however I was wondering if I would see
performance gain by using File objects as opposed to CSV objects. Also,
any other tips on increasing the performance of this code would be
helpful.

Again, thanks for all the responses. I am continually impressed by the
attitude and intelligence of this community.

···

--
Posted via http://www.ruby-forum.com/.

I usually mess up in my loops like this :frowning:
Thanks :slight_smile:

···

On 11/29/06, ChrisH <chris.hulan@gmail.com> wrote:

Nice Simon

But the 'while row = fin.gets' is skipping a row each time around and
you don't handle the header. My "fix":

--
Simon Strandgaard

James Edward Gray II wrote:

···

On Nov 29, 2006, at 10:53 AM, Nick Sieger wrote:

> Not ruby, but why not use "split -l 40000"?

Because the sample code is also moving the headers to the new files.

James Edward Gray II

Also since he's feeding Excel this is most likely on Windows rather
than *NIX

Cheers
Chris

Will Jessop wrote:

Paul Lutus wrote:

CSV.open(infile, 'r') do |row|

Why are you using CSV for this? You aren't parsing the lines into fields,
so the fact that they contain CSV content has no bearing on the present
task. Your goal is to split the input file into groups of lines delimited
by linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series
of output files, until the input file is exhausted?

Because CSV understands csv data with embedded newlines:

A plain-text CSV file uses linefeeds as record delimiters. A program that
uses "readline" or "gets" splits the records just as a sane CSV parser
would. And IMHO a CSV file should never, ever have linefeeds embedded in
fields.

[will@localhost ~]$ cat csvfile.csv
some, field,"new
line"
other,field,here

Ah, yes, another terrific "improvement" in database protocols. Sometimes I
wonder what these standards committees are smoking.

···

--
Paul Lutus
http://www.arachnoid.com

CSV is known to be pretty slow, thus FasterCSV. :wink:

James Edward Gray II

···

On Nov 29, 2006, at 12:10 PM, Drew Olson wrote:

Speaking of performance, using the code I've described seemed to be
quite slow. This is understandable based on the fact that the input file
has 1 million+ records, however I was wondering if I would see
performance gain by using File objects as opposed to CSV objects. Also,
any other tips on increasing the performance of this code would be
helpful.

Your opinion doesn't make you right on this one. The CSV RFC clearly defines handling for carriage-returns and linefeeds. They certainly are allowed in fields. Here is a link to the document, in case you want to read up:

http://www.ietf.org/rfc/rfc4180.txt

Not to use a CSV parser on this task would be shooting yourself in the foot. The result using a simple File object would be broken and, much worse, it might look OK for a while. You just can't be sure you are never going to split a CSV file that has an embedded linefeed in it (especially since that's perfectly legal), and when you do you will be responsible for destroying data. There's just no reason for that.

I know you're a don't-use-a-library guy and you know I disagree. This is the reason why. The edge cases will get you every time.

James Edward Gray II

···

On Nov 29, 2006, at 11:29 AM, Paul Lutus wrote:

Will Jessop wrote:

Paul Lutus wrote:

CSV.open(infile, 'r') do |row|

Why are you using CSV for this? You aren't parsing the lines into fields,
so the fact that they contain CSV content has no bearing on the present
task. Your goal is to split the input file into groups of lines delimited
by linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series
of output files, until the input file is exhausted?

Because CSV understands csv data with embedded newlines:

A plain-text CSV file uses linefeeds as record delimiters. A program that
uses "readline" or "gets" splits the records just as a sane CSV parser
would. And IMHO a CSV file should never, ever have linefeeds embedded in
fields.

James -

I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

Code:

require 'csv'

infile = ""

if ARGV[0] != nil
  infile = ARGV[0]
else
  puts "Please enter a file to split"
  exit
end

puts "loading file"

rows = CSV.read(infile)

puts "file loaded"

rows.sort!{|x,y| x[3] <=> y[3]}

counter = 1
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

rows.each_index do |i|
    if(i != 0 && i%40000 == 0)
    writer.close
    counter+=1
    writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
    writer << header
  end
  if (i == 0)
    header = row
  end
  writer << row
end

writer.close()

Output:

loading file
The exception unknown software exception (0xc00000fd) occured in the
application at location.....etc

···

--
Posted via http://www.ruby-forum.com/.

not a show stopper..

http://unxutils.sourceforge.net/

···

On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:

Also since he's feeding Excel this is most likely on Windows rather
than *NIX

Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use those to reorder the lines bit by bit. That's not going to be fast with any library helping you, but I don't have a better idea.

James Edward Gray II

···

On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:

Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

Reid Thompson wrote:

···

On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:

Also since he's feeding Excel this is most likely on Windows rather
than *NIX

not a show stopper..
error

http://unxutils.sourceforge.net/

Reid -

I also have cygwin installed on my box, so if there is a simply way to
do this in unix I'd love to know. My unix is rusty (that's being
generous).

--
Posted via http://www.ruby-forum.com/\.