Grep a CSV?

If I had a huge CSV file and I wanted to pull out, say, all the lines
that contain the word "Blah1" and throw them into a temporary file, what
would be the best approach? My thoughts were to use:

require 'rubygems'
require 'ruport'
require 'ruport/util'

t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

### this sadly only returned an output of => []

any ideas?

Thanks!

···

--
Posted via http://www.ruby-forum.com/.

If I had a huge CSV file and I wanted to pull out, say, all the lines
that contain the word "Blah1" and throw them into a temporary file, what
would be the best approach?

<snip />

any ideas?

My first thought was FasterCSV, and to go from there. Any reason not to use a dedicated CSV lib?

Michael Glaesemann
grzm seespotcode net

···

On Aug 15, 2007, at 22:30, Michael Linfield wrote:

File.readlines('filename.csv').grep(/Blah1/)
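That loads the whole file into memory, though. For a genuinely huge file, a streaming variant keeps memory flat and writes the matches out as it goes. A sketch, where 'temp.csv' is just a made-up name for the output:

File.open('temp.csv', 'w') do |out|
  File.foreach('filename.csv') do |line|
    out.write(line) if line =~ /Blah1/  # copy only the matching lines
  end
end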

···

On 8/15/07, Michael Linfield <globyy3000@hotmail.com> wrote:

If I had a huge CSV file and I wanted to pull out, say, all the lines
that contain the word "Blah1" and throw them into a temporary file, what
would be the best approach? My thoughts were to use:

require 'rubygems'
require 'ruport'
require 'ruport/util'

t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

### this sadly only returned an output of => []

any ideas?

Thanks!
--
Posted via http://www.ruby-forum.com/.

--
Chris Carter
concentrationstudios.com
brynmawrcs.com

Michael Linfield wrote:

If I had a huge CSV file and I wanted to pull out, say, all the lines
that contain the word "Blah1" and throw them into a temporary file, what
would be the best approach? My thoughts were to use:

require 'rubygems'
require 'ruport'
require 'ruport/util'

t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

### this sadly only returned an output of => []

any ideas?

Thanks!

OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)
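As an aside, Ruby's command-line mode gets you most of the way to that one-liner; this sketch has the same effect as the grep:

ruby -ne 'print if /Blah1/' huge-file.csv > temp-file.csv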

<snip />

any ideas?

My first thought was FasterCSV, and to go from there. Any reason not to
use a dedicated CSV lib?

Michael Glaesemann
grzm seespotcode net

I'm planning to use Ruport to graph the data, but if you can integrate the
output from FasterCSV into a Ruport graph, I'm all ears :)

···

--
Posted via http://www.ruby-forum.com/.

M. Edward (Ed) Borasky wrote

Michael Linfield wrote:
> If I had a huge CSV file and I wanted to pull out, say, all the lines
> that contain the word "Blah1" and throw them into a temporary file, what
> would be the best approach? My thoughts were to use:

> require 'rubygems'
> require 'ruport'
> require 'ruport/util'

> t = Ruport::Data::Table.load("filename.csv")
> t.grep(/Blah1/)

> ### this sadly only returned an output of => []

> any ideas?

> Thanks!

OK ... first of all, define "huge" and what are your
restrictions? Let me assume the worst case just to get
started -- more than 256 columns and more than 65536 rows
and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather
than a one-shot "somebody gave me this *$&%^# file and wants
an answer by 5 PM tonight!" use case, I'd load it into a
database (assuming your database doesn't have a column count
limitation larger than the column count in your file, that
is) and then hook up to it with DBI. But if it's a one-shot
deal and you've got a command line handy (Linux, MacOS, BSD
or Cygwin)

Windoze has a command-line.

just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before
someone who's been doing command-line for years types that
one-liner in. :)

Chris Carter has already done it. Have you ever posted
Ruby code here?

M. Edward (Ed) Borasky wrote:

Michael Linfield wrote:

### this sadly only returned an output of => []

any ideas?

Thanks!

OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)

lol, alright, let's say the scenario will be in the range of 20k - 70k
lines of data, no more than 20 columns.
And I want to avoid using the command line to do this, because yes, in fact,
this will be used to process more than one data file, which I hope to
set up in optparse with a command-line arg that directs the prog to
the file. Also I wanted, for the meantime, to not have to throw it on any
database... avoiding DBI for the meanwhile. But an idea flew through my
head a few minutes ago... what if I did this --

res = []
res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

Ran into a small demeaning problem: this shoves all that grep'd data
into one element, lol... res[1] => nil ... it's all shoved into res[0]. I'd
hope to fix this with a simple do statement, but a little confusion hit
me while doing that with a readline command. And by shoving this into an
array, will I still be able to single out columns of data? If not, then
how would I shove the grep data into a second CSV file, doing this all
inside Ruby of course, no command-line program > output.csv :)

···

--
Posted via http://www.ruby-forum.com/.

Ruport uses FasterCSV for its CSV parsing.

···

On 8/15/07, Michael Linfield <globyy3000@hotmail.com> wrote:

> <snip />
>
>> any ideas?
>
> My first thought was FasterCSV, and to go from there. Any reason not to
> use a dedicated CSV lib?
>
> Michael Glaesemann
> grzm seespotcode net

I'm planning to use Ruport to graph the data, but if you can integrate the
output from FasterCSV into a Ruport graph, I'm all ears :)
--
Posted via http://www.ruby-forum.com/.

--
Chris Carter
concentrationstudios.com
brynmawrcs.com

Array#<< appends the object onto your Array; you want to combine the two arrays using Array#+:

irb(main):001:0> a = []
=> []
irb(main):002:0> a << [1,2,3]
=> [[1, 2, 3]]
irb(main):003:0> a = []
=> []
irb(main):004:0> a += [1,2,3]
=> [1, 2, 3]
irb(main):005:0>

Though why don't you just use:

res = File.readlines('filename.csv').grep(/Blah1/)

Alex Gutteridge

Bioinformatics Center
Kyoto University

···

On 16 Aug 2007, at 13:08, Michael Linfield wrote:

M. Edward (Ed) Borasky wrote:

Michael Linfield wrote:

### this sadly only returned an output of => []

any ideas?

Thanks!

OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)

lol, alright, let's say the scenario will be in the range of 20k - 70k
lines of data, no more than 20 columns.
And I want to avoid using the command line to do this, because yes, in fact,
this will be used to process more than one data file, which I hope to
set up in optparse with a command-line arg that directs the prog to
the file. Also I wanted, for the meantime, to not have to throw it on any
database... avoiding DBI for the meanwhile. But an idea flew through my
head a few minutes ago... what if I did this --

res = []
res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

# inside Ruby of course, no command-line program > output.csv :)

This is just a simple example:

irb(main):018:0> puts File.readlines('filename.csv')
this, is , a , test, foo
this, is , a , test, bar
this, is , a , test, Blah1
this, is , a , test, bar
this, Blah, is , a , test
this, is , a , Blah, test
=> nil
irb(main):019:0> puts File.readlines('filename.csv').grep(/Blah/)
this, is , a , test, Blah1
this, Blah, is , a , test
this, is , a , Blah, test
=> nil
irb(main):020:0> puts Ruport::Data::Table.new(:data => File.readlines('filename.csv').grep(/Blah/).map{|r| r.split(/,/)})
+---------------------------------------+
| this | is   | a  | test | Blah1 |
| this | Blah | is | a    | test  |
| this | is   | a  | Blah | test  |
+---------------------------------------+
=> nil
irb(main):021:0>

kind regards -botp

···

From: Michael Linfield [mailto:globyy3000@hotmail.com]

Michael Linfield wrote:

M. Edward (Ed) Borasky wrote:

Michael Linfield wrote:

### this sadly only returned an output of => []

any ideas?

Thanks!

OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)

lol, alright, let's say the scenario will be in the range of 20k - 70k lines of data, no more than 20 columns.
And I want to avoid using the command line to do this, because yes, in fact, this will be used to process more than one data file, which I hope to set up in optparse with a command-line arg that directs the prog to the file. Also I wanted, for the meantime, to not have to throw it on any database... avoiding DBI for the meanwhile. But an idea flew through my head a few minutes ago... what if I did this --

res = []
res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

There's a problem with using File.readlines that I don't think anyone's mentioned yet. I don't know if it's relevant to your dataset, but CSV fields are allowed to contain newlines if the field is quoted. For example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you *know* that's not going to be a problem, I'd still suggest FasterCSV...
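To make that concrete, here's a minimal FasterCSV sketch (the output filename is invented); because it parses records rather than raw lines, fields with embedded newlines or commas survive:

require 'rubygems'
require 'fastercsv'

FasterCSV.open('blah1.csv', 'w') do |out|
  FasterCSV.foreach('filename.csv') do |row|
    # row is an array of parsed fields, so quoting is already handled
    out << row if row.any? { |field| field.to_s.include?('Blah1') }
  end
end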

···

--
Alex

Alex Gutteridge wrote:

···

On 16 Aug 2007, at 13:08, Michael Linfield wrote:

Though why don't you just use:

res = File.readlines('filename.csv').grep(/Blah1/)

Alex Gutteridge

Can I push that into a file to temporarily use to pull all the Blah1
data from, then at the end of the program delete Blah1.csv?

--
Posted via http://www.ruby-forum.com/.

I guess the following is slightly OT, since the OP is talking about
grabbing whole lines, but it's a CSV-relevant question:

Another danger with not using CSV packages (as I've learned from my
present non-use of said packages) is that quoted elements in a row can
contain commas. I'm a fairly inexperienced programmer, and not just in
Ruby, and I haven't yet figured out an elegant way to break this down.

For example:

foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return ["foo", "bar", "\"foo", "bar\""],
an array of size four. What's an efficient, elegant way of
gathering quoted columns? I'm sure FasterCSV is capable of dealing
with such things, but it's something I'd like to actually learn,
rather than finding a way around it.

···

On Aug 16, 3:04 am, Alex Young <a...@blackkettle.org> wrote:

There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you
*know* that's not going to be a problem, I'd still suggest FasterCSV...

I think that this can be handled easily by this approach:
to extract a record from the csv file, continue reading lines
until the number of double quotes in the record is even.
Something like

record = ""
begin
  record << gets.chomp
end until record.count( '"' ) % 2 == 0

···

On Aug 16, 2:04 am, Alex Young <a...@blackkettle.org> wrote:

Michael Linfield wrote:
> M. Edward (Ed) Borasky wrote:
>> Michael Linfield wrote:
>>> ### this sadly only returned an output of => []

>>> any ideas?

>>> Thanks!
>> OK ... first of all, define "huge" and what are your restrictions? Let
>> me assume the worst case just to get started -- more than 256 columns
>> and more than 65536 rows and you're on Windows. :)

>> Seriously, though, if this is a *recurring* use case rather than a
>> one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
>> tonight!" use case, I'd load it into a database (assuming your database
>> doesn't have a column count limitation larger than the column count in
>> your file, that is) and then hook up to it with DBI. But if it's a
>> one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
>> Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
>> for being able to write that in Ruby and get it debugged before someone
>> who's been doing command-line for years types that one-liner in. :)

> lol, alright, let's say the scenario will be in the range of 20k - 70k
> lines of data, no more than 20 columns.
> And I want to avoid using the command line to do this, because yes, in fact,
> this will be used to process more than one data file, which I hope to
> set up in optparse with a command-line arg that directs the prog to
> the file. Also I wanted, for the meantime, to not have to throw it on any
> database... avoiding DBI for the meanwhile. But an idea flew through my
> head a few minutes ago... what if I did this --

> res = []
> res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

Sure, use tempfile, but I think botp has shown why you don't really need the temporary file (unless there's part of this problem I'm not understanding):

irb(main):001:0> puts File.readlines('filename.csv')
this, is , a , test, foo
this, is , a , test, bar
this, is , a , test, Blah1
this, is , a , test, bar
this, Blah, is , a , test
this, is , a , Blah, test
=> nil
irb(main):002:0> puts File.readlines('filename.csv').grep(/Blah1/)
this, is , a , test, Blah1
=> nil
irb(main):003:0> require 'tempfile'
=> true
irb(main):004:0> tf = Tempfile.new('csv')
=> #<File:/tmp/csv.1339.0>
irb(main):005:0> tf.puts File.readlines('filename.csv').grep(/Blah1/)
=> nil
irb(main):006:0> tf.close
=> nil
irb(main):007:0> tf.open
=> #<File:/tmp/csv.1339.0>
irb(main):008:0> puts tf.gets
this, is , a , test, Blah1
=> nil

Alex Gutteridge

Bioinformatics Center
Kyoto University

···

On 16 Aug 2007, at 13:29, Michael Linfield wrote:

Alex Gutteridge wrote:

On 16 Aug 2007, at 13:08, Michael Linfield wrote:

Though why don't you just use:

res = File.readlines('filename.csv').grep(/Blah1/)

Alex Gutteridge

Can I push that into a file to temporarily use to pull all the Blah1
data from, then at the end of the program delete Blah1.csv?

Kaldrenon wrote:

There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you
*know* that's not going to be a problem, I'd still suggest FasterCSV...

I guess the following is slightly OT, since the OP is talking about
grabbing whole lines, but it's a CSV-relevant question:

Another danger with not using CSV packages (as I've learned from my
present non-use of said packages) is that quoted elements in a row can
contain commas. I'm a fairly inexperienced programmer, and not just in
Ruby, and I haven't yet figured out an elegant way to break this down.

For example:

foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return ["foo", "bar", "\"foo", "bar\""],
an array of size four. What's an efficient, elegant way of
gathering quoted columns?

If you ignore that the quote character can also appear inside column data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That breaks at least for empty fields, fields with newlines, and fields with '"' in them.
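For instance, empty fields silently disappear:

'a,,b'.scan(/("[^"]+")|([^,]+)/).flatten.compact
# => ["a", "b"]   three fields in, two out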

I'm sure FasterCSV is capable of dealing
with such things, but it's something I'd like to actually learn,
rather than finding a way around it.

That's just it - the cleanest way to handle this problem is to just take advantage of someone else's work. CSV looks just trivial enough to the human eye that you can convince yourself "Oh, it can't be *that* hard"... but it's still complicated enough that you get tired of coding around the fiddly little details very, very quickly. Besides, when a library is as easy to use as either CSV *or* FasterCSV (my preference is for the latter, but it's not in stdlib), what's the point in stressing?

···

On Aug 16, 3:04 am, Alex Young <a...@blackkettle.org> wrote:

--
Alex

The "chomp" is a mistake.

record = ""
begin
  record << gets
end until record.count( '"' ) % 2 == 0
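Wrapped around a File object, that looks something like the sketch below; it assumes the file ends with a complete record (otherwise gets returns nil mid-record):

File.open('filename.csv') do |f|
  until f.eof?
    record = ""
    begin
      record << f.gets
    end until record.count( '"' ) % 2 == 0
    print record if record =~ /Blah1/  # record keeps its embedded newlines
  end
end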

···

On Aug 16, 2:12 pm, William James <w_a_x_...@yahoo.com> wrote:

On Aug 16, 2:04 am, Alex Young <a...@blackkettle.org> wrote:

> Michael Linfield wrote:
> > M. Edward (Ed) Borasky wrote:
> >> Michael Linfield wrote:
> >>> ### this sadly only returned an output of => []

> >>> any ideas?

> >>> Thanks!
> >> OK ... first of all, define "huge" and what are your restrictions? Let
> >> me assume the worst case just to get started -- more than 256 columns
> >> and more than 65536 rows and you're on Windows. :)

> >> Seriously, though, if this is a *recurring* use case rather than a
> >> one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
> >> tonight!" use case, I'd load it into a database (assuming your database
> >> doesn't have a column count limitation larger than the column count in
> >> your file, that is) and then hook up to it with DBI. But if it's a
> >> one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
> >> Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
> >> for being able to write that in Ruby and get it debugged before someone
> >> who's been doing command-line for years types that one-liner in. :)

> > lol, alright, let's say the scenario will be in the range of 20k - 70k
> > lines of data, no more than 20 columns.
> > And I want to avoid using the command line to do this, because yes, in fact,
> > this will be used to process more than one data file, which I hope to
> > set up in optparse with a command-line arg that directs the prog to
> > the file. Also I wanted, for the meantime, to not have to throw it on any
> > database... avoiding DBI for the meanwhile. But an idea flew through my
> > head a few minutes ago... what if I did this --

> > res = []
> > res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

> There's a problem with using File.readlines that I don't think anyone's
> mentioned yet. I don't know if it's relevant to your dataset, but CSV
> fields are allowed to contain newlines if the field is quoted. For
> example, this single CSV row will break your process:

> 1,2,"foo
> Blah1",bar

I think that this can be handled easily by this approach:
to extract a record from the csv file, continue reading lines
until the number of double quotes in the record is even.
Something like

record = ""
begin
  record << gets.chomp
end until record.count( '"' ) % 2 == 0

Alex Gutteridge wrote:

data from, then at the end of the program delete Blah1.csv?

Sure, use tempfile, but I think botp has shown why you don't really
need the temporary file (unless there's part of this problem I'm not
understanding):

The reason I'm going to use a tempfile is because there's going to be,
per se, a Blah2, Blah3, etc. So I'm storing the grep from Blah1, Blah2, and
Blah3 in different files so I can later pull the data from those files
to graph each line with Ruport; each line will represent Blah1, Blah2,
Blah3, etc. So unless I can shove the grep output into an array, with each
line being an element, I've gotta use a tempfile. Maybe there's a better way?

If there's a better approach, I'm all ears :D

Thanks a ton btw, you all have been a tremendous help.
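If the tempfiles exist only to keep the Blah1/Blah2/Blah3 matches apart, one pass into a hash of arrays would do it. A sketch (the pattern list is made up):

patterns = ['Blah1', 'Blah2', 'Blah3']
matches = Hash.new { |h, k| h[k] = [] }

File.foreach('filename.csv') do |line|
  patterns.each { |p| matches[p] << line if line.include?(p) }
end

# matches['Blah1'] is an array with one matching line per element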

···

On 16 Aug 2007, at 13:29, Michael Linfield wrote:

--
Posted via http://www.ruby-forum.com/.

If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That's very cool, granting that there are still a number of conditions
for which it'll go bust. But I'm a little confused about how a pattern
like [^,]+ gets an element, given that (unless I'm mistaken) in a
standard regexp, it would only match on a string that contained a
series of commas at the beginning of a line, like "," or "abc\n,".
What's my mistake/confusion here?

Besides, when a
library is as easy to use as either CSV *or* FasterCSV (my preference is
for the latter, but it's not in stdlib), what's the point in stressing?

The only point is in learning - I like to be able to use packages
without the effort that went into making them, but I don't really like
jumping from DIY to "Here's a way, and don't worry about how it
works...it's magic!"

Thanks for the info,
Andrew
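The mix-up above is between ^ outside and inside brackets: [^,] is a negated character class (one character that is not a comma), so [^,]+ grabs a run of non-comma characters; /^,/ is what would anchor a comma to the start of a line.

'foo,bar'.scan(/[^,]+/)  # => ["foo", "bar"]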

···

On Aug 16, 11:56 am, Alex Young <a...@blackkettle.org> wrote:

I think that this will work correctly with any complete CSV record.

class String
  def csv
    if include? '"'
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not , just as
      ## ",".csv is ["",""].
      if == ary
        [""]
      else
        ary
      end
    end
  end
end
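A quick check of the parser above against quoted commas and doubled quotes:

p 'foo,bar,"foo,bar"'.csv  # => ["foo", "bar", "foo,bar"]
p 'a,"say ""hi""",b'.csv   # => ["a", "say \"hi\"", "b"]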

···

On Aug 16, 10:56 am, Alex Young <a...@blackkettle.org> wrote:

Kaldrenon wrote:
> On Aug 16, 3:04 am, Alex Young <a...@blackkettle.org> wrote:
>> There's a problem with using File.readlines that I don't think anyone's
>> mentioned yet. I don't know if it's relevant to your dataset, but CSV
>> fields are allowed to contain newlines if the field is quoted. For
>> example, this single CSV row will break your process:

>> 1,2,"foo
>> Blah1",bar

>> The only way around that is to actually parse the file, so unless you
>> *know* that's not going to be a problem, I'd still suggest FasterCSV...

> I guess the following is slightly OT, since the OP is talking about
> grabbing whole lines, but it's a CSV-relevant question:

> Another danger with not using CSV packages (as I've learned from my
> present non-use of said packages) is that quoted elements in a row can
> contain commas. I'm a fairly inexperienced programmer, and not just in
> Ruby, and I haven't yet figured out an elegant way to break this down.

> For example:

> foo,bar,"foo,bar" is a three-column row in a CSV file, but using
> split(/,/) on it will, of course, return ["foo", "bar", "\"foo", "bar\""],
> an array of size four. What's an efficient, elegant way of
> gathering quoted columns?

If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That breaks at least for empty fields, fields with newlines, and fields
with '"' in them.