Array diff

James_Dechiaro · 12 May 2008 18:52

Hello

I am new to ruby and trying to do a diff on two csv files.
I am putting each row into an array and then subtracting the arrays into
a new array, then taking that array and printing out the records.

The problem I am running into is I would like the badRecords method to
return the actual record lines that are not present in csv2.csv but
instead it is returning all records in csv1.csv. The other problem I see
is the code is running rather slowly; CPU usage spikes up to 99% when
running. Any insight on improvements would be appreciated.

Thanks!

#!/usr/bin/env ruby -wKU

require 'rubygems'
require 'faster_csv'

def Array1

getNumber = FCSV.open("csv1.csv")
      getNumber.collect do |row|
       return row[1]
  end
end

def Array2

  getNumber = FCSV.open("csv2.csv")
      getNumber.collect do |row| if (row[5].include?("Originating")) &&
(row[41].include?("y"))
         return row[20]
  end
end
end

def SumArray

    SumArray = Array1 - Array2
        if SumArray.empty?
           puts "records have been validated"
           Process.exit!(0)
        else
         return SumArray
  end
end

def badRecords

my_file = File.open('badRecords.csv','w')
records = FCSV.open("csv1.csv")
      records.collect do |row| row[1].eql?(SumArray)
            my_file.puts row.inspect.gsub(/\[|[a-z]*\]$/, "")
        end
   my_file.close
  end
end
badRecords

--
Posted via http://www.ruby-forum.com/.

Axel_Etzold · 12 May 2008 19:26

-------- Original Message --------

Date: Tue, 13 May 2008 03:52:21 +0900
From: James Dechiaro <jdechiaro@coherecomm.com>
To: ruby-talk@ruby-lang.org
Subject: array diff

James,

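Welcome to Ruby! You'll like it.
You could do something like this:

csv_1_array = IO.readlines("csv1.csv")
csv_2_array = IO.readlines("csv2.csv")
# Lines of csv1 that do not appear in csv2. Every occurrence of a line
# found in csv2 is dropped, so doublets, triplets etc. go too.
result_array = csv_1_array - csv_2_array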

Best regards,

Axel


Jon_Hawkins · 12 May 2008 19:38

Basically the same thing, but nonetheless this would also work:

csv1 = []
csv2 = []

IO.foreach("csv1.csv") {|lines| csv1 << lines}
IO.foreach("csv2.csv") {|lines| csv2 << lines}

OR

csv1 = []
csv2 = []

FCSV.foreach("csv1.csv", :headers => true) do |row|
  csv1 << row["headerName"]
end

FCSV.foreach("csv2.csv", :headers => true) do |row|
  csv2 << row["headerName"]
end

# this will give you the elements of csv1 that are not in csv2
differenceArray = csv1 - csv2

# this will give you the elements they DO share in common
commonArray = csv1 & csv2
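
As a small illustration of how the two operators treat duplicates, here is a sketch with made-up values (the array contents are assumptions, not data from the files above):

```ruby
# Sample data, invented for illustration only.
csv1 = ["a", "a", "b", "b", "c"]
csv2 = ["b", "d"]

# Array#- drops every element of csv1 that also occurs in csv2.
# Duplicates that survive ("a" here) are kept as they are.
difference = csv1 - csv2

# Array#& keeps only elements found in both arrays, each one once.
common = csv1 & csv2

puts difference.inspect
puts common.inspect
```

Note that the difference is one-directional: "d" appears only in csv2 and so shows up in neither result.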

Regards,

- Mac

James_Dechiaro · 12 May 2008 21:29

Thanks for your responses.

The files are formatted completely differently, so I need to break up each
field into an array and then call that element. This part is working
correctly (it just takes a while).

The problem I'm running into is when I try to reopen csv1 and
compare the SumArray against it in order to get the entire line (not
just the element). It instead prints all the lines, even the ones not
contained in the array.

So it looks as though the .eql? method is not working correctly.

csv1.csv format:

"3105551212","01133555615771","BEVERLYHLS","CA","INTL","ON","Apr 28 2008 1:10PM","300","256","0.0250","0.0000","0.0013",

csv2.csv format:

3067483e7538520080325105439.8971-040000,ABCCompany,Normal,438,+13105551212,Originating,438,Anonymous,01133555615771,20080325105439.897,1-040000,Yes,20080325105500.333,20080325105716.252,016,,,01133555615771,internat,in,10020101155542615771,,local,,,,,,ABCCompany,,,,,,,,,,y,public,,16519912:0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,user32@domain.com,153

Adam_Shelly · 12 May 2008 22:24

There are a few potential problems with this line:
records.collect do |row| row[1].eql?(SumArray)

First, I think you need an 'if'. Otherwise you are calling eql? but
not using the result. Second, it looks like you are comparing a
single field with the whole SumArray. They will never be equal.
You probably want something like
if SumArray.include?(row[1])
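
A minimal sketch of that include? check, assuming the rows have already been parsed into arrays of fields; the helper name bad_records and the sample values are hypothetical:

```ruby
# Hypothetical helper: keep the full rows whose second field (row[1])
# appears in the list of numbers that were not found in the other file.
def bad_records(rows, missing_numbers)
  rows.select { |row| missing_numbers.include?(row[1]) }
end

# Sample rows, loosely modelled on the csv1.csv format.
rows = [
  ["3105551212", "01133555615771", "BEVERLYHLS"],
  ["3105551213", "01133555615772", "BEVERLYHLS"]
]
missing = ["01133555615772"]

bad = bad_records(rows, missing)
bad.each { |row| puts row.join(",") }
```

Using select with a condition returns only the matching rows, whereas a bare comparison inside collect has no effect on which rows get written.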

-Adam

James_Dechiaro · 13 May 2008 13:33

if SumArray.include?(row[1]) did the trick... thanks, Adam!

Robert_K1 · 14 May 2008 11:06

You might also want to consider using Set or Hash for more efficient
lookups. The way I would probably do it is this: define a record type
that covers all the relevant information (easily done with Struct).
Then read the file to test against (csv2 I believe) and convert
records to the record type which you then add to the Set / Hash. Then
read the other file (csv1) line by line, convert it and print it if it is
not contained in the Set / Hash. That way you do not need to keep
both files in memory and the Set / Hash lookups are much faster than
Array based lookups.
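
A rough sketch of this idea, assuming rows are already split into fields. The CallRecord type, the helper names, the column index and the sample values are all assumptions made for illustration:

```ruby
require 'set'

# Hypothetical record type holding the fields used for matching.
# Comparing only the dialed number is an assumption.
CallRecord = Struct.new(:number)

# Build the lookup set from the rows of the file to test against.
def build_lookup(rows, column)
  rows.each_with_object(Set.new) do |row, seen|
    seen << CallRecord.new(row[column])
  end
end

# Return each row whose record is missing from the lookup set.
def missing_rows(rows, column, lookup)
  rows.reject { |row| lookup.include?(CallRecord.new(row[column])) }
end

# Invented sample data.
csv2_rows = [["x", "01133555615771"], ["y", "01133555615999"]]
csv1_rows = [
  ["3105551212", "01133555615771"],
  ["3105551213", "01133555610000"]
]

lookup = build_lookup(csv2_rows, 1)
missing = missing_rows(csv1_rows, 1, lookup)
missing.each { |row| puts row.join(",") }
```

Struct instances with equal members compare and hash alike, so they work as Set members. With real files, the rows could be streamed one at a time (for example with FCSV.foreach) rather than collected into arrays first.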

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

James_Dechiaro · 14 May 2008 11:29

Thanks for the tip, Robert. I will give it a go, as the script has now
been running for over 14 hours after finding 13k matches =(

Robert_K1 · 14 May 2008 19:50

Yet another approach would be to use a relational database for this. If the volume is large it may pay off to import your CSV data into two tables, create appropriate indexes and get your result via a SELECT.

Kind regards

robert

Todd_Benson · 14 May 2008 21:32

I agree that a large dataset like this probably doesn't belong in the
high-level programming domain. Put the burden where it belongs.

For this type of data, the model would be simple in a database, as
would be the queries.

2c,
Todd
