Parsing CSV

Hi guys, I'm a newbie in Ruby and I have to parse two CSV files to compare
two columns of the given files. My problem is that I have tried a lot of
different methods to handle this. I tried putting one entire column in
an array and the other column in a second array, then testing for the
bigger array to loop through it and compare both files that way. It did
not work. I was thinking of using CSV, but it's limited, and then I came
across FasterCSV, which is the module where I'm stuck right now. If
somebody can make a suggestion, I would really appreciate it.

Thanks in advance.

PS: I was told to make this tool in Java, but AFAIK Ruby is better for
handling text files.

···

--
Grimoire Guru
SourceMage GNU/Linux

> Hi guys, I'm a newbie in Ruby and I have to parse two CSV files to compare
> two columns of the given files. My problem is that I have tried a lot of
> different methods to handle this. I tried putting one entire column in
> an array and the other column in a second array, then testing for the
> bigger array to loop through it and compare both files that way. It did
> not work.

Well, posting your code might allow someone to help you spot what's wrong.

I'd suggest first you check that the two arrays are being read in properly -
if they are called a1 and a2, then "puts a1.inspect" and "puts a2.inspect"
will print them to the screen. Then you know whether the problem is in
reading them, or in comparing them.
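
As a small illustration of why inspect helps here, the sketch below uses a made-up array (the values are not from the original poster's files). Plain puts would hide the leading space and the nil, while inspect shows them:

```ruby
# Hypothetical column values, including a stray space and an empty field.
a1 = ["42", " 42", nil]

# inspect shows quotes, whitespace and nil explicitly.
puts a1.inspect
```

Here `p a1` is a shorter way to write the same thing.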

Posting a more precise description of what you're trying to do, along with
some sample data and what output you expect, would also make it easier for
someone to help you.

> PS: I was told to make this tool in Java, but AFAIK Ruby is better for
> handling text files.

The better language is the one which you can actually use to get the job
done :)

How you do this in Ruby depends on what exactly you mean by 'compare', since
you didn't define exactly what you're trying to do. I'm guessing you mean
check for values which are in the first file but not in the second, or vice
versa. For a simple solution, have a look at Array#include?
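
A minimal sketch of that simple approach, assuming the two columns have already been read into arrays. The names a1 and a2 follow the earlier paragraph; the sample values are invented for illustration:

```ruby
# Hypothetical sample data standing in for one column from each file.
a1 = ["alice", "bob", "carol"]
a2 = ["bob", "carol", "dave"]

# Values in the first column that the second column lacks, and vice versa.
only_in_a1 = a1.reject { |value| a2.include?(value) }
only_in_a2 = a2.reject { |value| a1.include?(value) }

puts "only in file 1: #{only_in_a1.inspect}"
puts "only in file 2: #{only_in_a2.inspect}"
```

Array#include? scans the whole array on every call, so this is fine for small files but slows down as both columns grow. For plain strings, `a1 - a2` gives the same result as the first reject.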

For a more efficient solution, you could first sort the two arrays and then
walk down them with two pointers i and j. When a1[i] == a2[j] then you
increment both i and j. When a1[i] < a2[j] then you know an item is missing
in a2, and just increment i. When a1[i] > a2[j] then you know an item is
missing in a1, and just increment j.
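
The walk described above could be sketched like this. The method name diff_sorted and the sample values are assumptions made for the example, and the code assumes every value is comparable with the others (for instance, all strings and no nil fields):

```ruby
# Walk two sorted arrays in step and collect the values unique to each.
# Returns [values missing in a2, values missing in a1].
def diff_sorted(a1, a2)
  a1 = a1.sort
  a2 = a2.sort
  missing_in_a2 = []
  missing_in_a1 = []
  i = 0
  j = 0

  while i < a1.size && j < a2.size
    case a1[i] <=> a2[j]
    when 0    # same value in both: advance both pointers
      i += 1
      j += 1
    when -1   # a1[i] is smaller, so a2 has no match for it
      missing_in_a2 << a1[i]
      i += 1
    else      # a2[j] is smaller, so a1 has no match for it
      missing_in_a1 << a2[j]
      j += 1
    end
  end

  # Whatever is left over in either array has no partner in the other.
  missing_in_a2.concat(a1[i..-1])
  missing_in_a1.concat(a2[j..-1])

  [missing_in_a2, missing_in_a1]
end

missing_in_a2, missing_in_a1 = diff_sorted(%w[carol alice bob], %w[dave bob carol])
puts "missing in second file: #{missing_in_a2.inspect}"
puts "missing in first file:  #{missing_in_a1.inspect}"
```

Note that this version treats duplicates as separate items: if a value appears twice in one array and once in the other, the extra copy is reported as missing.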

Incidentally, you don't even need Ruby to do this; the shell command 'join'
can do this for you (as long as you use 'sort' to pre-sort your input).

HTH,

Brian.

···

On Mon, Feb 26, 2007 at 10:50:22PM +0900, Rafael George wrote:

This code might get you started:

require 'rubygems'
require 'faster_csv'  # the gem is loaded under this lowercase file name

def read_csv(filename)
    return FasterCSV::Table.new( FasterCSV.read(filename) ).by_col
end

data1 = read_csv("data1.csv")
data2 = read_csv("data2.csv")

compare_column_idx = 1
unless data1[compare_column_idx] == data2[compare_column_idx]
    puts "column #{compare_column_idx} is different"
end

Regards,
Stephane

···

--
Posted via http://www.ruby-forum.com/.

require 'csv'

# Assumes fsource and tdest are file paths, and scomp_args and tcomp_args
# each hold the indexes of the two columns to compare.
passvalues = []
i = 0
IO.foreach(fsource) do |line|
  cols = []
  cols = CSV::parse_line line.chomp
  sourceval = cols[scomp_args[0]] + " " + cols[scomp_args[1]]
  IO.foreach(tdest) do |line|
    tcols = []
    tcols = CSV::parse_line line.chomp
    testval = tcols[tcomp_args[0]] + " " + tcols[tcomp_args[1]]
    if sourceval == testval
      passvalues[i] = sourceval
      i += 1
    end
  end
end

Here is what I've got so far.

···

On 2/26/07, Stephane Elie <stephane.elie@gmail.com> wrote:

The direct translation of this code to FasterCSV is:

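passvalues = Array.new
FCSV.foreach(fsource) do |s_row|
   # The range joins every column from the first index through the second,
   # which matches the original only when the two columns are adjacent.
   source = s_row[scomp_args[0]..scomp_args[1]].join(" ")
   FCSV.foreach(tdest) do |t_row|
     if source == t_row[tcomp_args[0]..tcomp_args[1]].join(" ")
       passvalues << source
     end
   end
end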

If you can afford to read one of the files into memory because it's not too large, you can probably speed that up quite a bit:

require "set"

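# Collect every comparison key from the second file.
allowed = Set.new
FCSV.foreach(tdest) do |row|
   allowed.add(row[tcomp_args[0]..tcomp_args[1]].join(" "))
end

# Keep the rows of the first file whose key also appears in the second.
passvalues = FCSV.open(fsource) do |source|
   source.select do |row|
     allowed.include? row[scomp_args[0]..scomp_args[1]].join(" ")
   end
end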
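
To see the set-based filter in isolation, here is a small sketch that uses in-memory arrays in place of the two CSV files. The rows and the key_for helper are invented for illustration, and the fixed column range 0..1 plays the role of the column arguments in the code above:

```ruby
require "set"

# Hypothetical rows, shaped like the arrays of fields a CSV parser returns.
source_rows = [
  ["Smith", "John", "NY"],
  ["Doe",   "Jane", "CA"],
  ["Brown", "Bob",  "TX"]
]
dest_rows = [
  ["Doe",   "Jane", "WA"],
  ["Smith", "John", "NY"]
]

# Build the comparison key from the first two fields of a row.
def key_for(row)
  row[0..1].join(" ")
end

# One pass over the second data set to collect every key it contains.
allowed = Set.new(dest_rows.map { |row| key_for(row) })

# One pass over the first data set, keeping rows whose key was seen.
passvalues = source_rows.select { |row| allowed.include?(key_for(row)) }

passvalues.each { |row| puts row.join(",") }
```

Each data set is traversed once and every lookup is a hash probe, whereas the nested-loop version rereads the second file for every row of the first.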

Hope that gives you some fresh ideas.

James Edward Gray II

···

On Feb 26, 2007, at 8:45 AM, Rafael George wrote:

The above destroys the field order. If you need to keep the order, use an Array instead:

allowed = Array.new
FCSV.foreach(tdest) do |row|
   allowed << row[tcomp_args[0]..tcomp_args[1]].join(" ")
end

# ...

James Edward Gray II

···

On Feb 26, 2007, at 11:48 AM, James Edward Gray II wrote:

The direct translation of this code to FasterCSV is:

passvalues = Array.new
FCSV.foreach(fsource) do |s_row|
  source = s_row[scomp_args[0]..scomp_args[1]].join(" ")
  FCSV.foreach(tdest) do |t_row|
    if source == t_row[tcomp_args[0]..tcomp_args[1]].join(" ")
      passvalues << source
      break # performance enhancement: stop scanning after the first match
    end
  end
end

James Edward Gray II

···

On Feb 26, 2007, at 11:48 AM, James Edward Gray II wrote:

On Feb 26, 2007, at 8:45 AM, Rafael George wrote:

Sorry, I meant row order.

James Edward Gray II

···

On Feb 26, 2007, at 12:54 PM, James Edward Gray II wrote:

Thanks, James and everyone else. I think I found the solution to my problem :)

···

On 2/26/07, James Edward Gray II <james@grayproductions.net> wrote:
