File Merge help request from Newbie

Snoopy_Dog · 15 September 2006 11:08

First let me say that I am an absolute Newbie to Ruby. So please be
tolerant of my newbie question.

My situation is this. I am gathering financial data, and am about to
change data suppliers. I want to "merge" the files from both suppliers
to have as much data history as possible. I have the data in ASCII
format in a comma delimited file.

I have the data in the following structure:
c:\data\1Original\abc.csv - a new data file
c:\data\2Processed\abc.csv - the historical file and my processing
reference
Each file has the same file structure of:
Symbol, Date, Open, High, Low, Close, Volume

I already have a process that references the files in the
c:\data\processed directory structure.

Currently I have figured out how to walk the directory tree and copy any
NEW files into the Processed directory. I am hung up on the merging of
the files into the processed directory.

Sample files to demonstrate:
  c:\data\1Original\abc.csv (new data)
     abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456
     abc, 20060902. 1.9, 2.3, 1.8, 2.3, 147454

  c:\data\2Processed\abc.csv (historical)
     abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677
     abc, 20010102. 2.4, 2.6, 2.4, 2.5, 333444
     .......
     abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456

I need to create
  c:\data\2Processed\abc.csv (historical)
     abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677
     abc, 20010102. 2.4, 2.6, 2.4, 2.5, 333444
     .......
     abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456
     abc, 20060902. 1.9, 2.3, 1.8, 2.3, 147454

So, I am stuck on how to read the files in and merge.
Here is my thought process:
1. Read the files into arrays (of rows)
2. Check the dates of the rows
3. Output the early dates from the historical file
4. Output the common data from either file (probably historical as
already in it)
5. Output new data from new file

So, the code I have so far is this...

puts 'start'
require 'find'
require 'ftools'

dir1original = 'c:/Data/1Original/'
dir2processed = 'c:/Data/2Processed/'

puts 'Here'
Find.find(dir1original) { |path| puts path}

Find.find(dir1original) do |path|
   puts 'The current item is ' + path
   if File.file? path
     puts path + ' is a file'
    end
  end

puts 'create log files'
# Set up Log files and Specific output files
runlogfile = 'c:/Data/runlog.txt'
open(runlogfile, "w") { |f| f << "Runlog of StepOneIncrement\n"}
puts 'Created runlog file'
open('c:/Data/Exist1not2.txt', "w") {|f| f << "List of files from Original not in Processed\n"}
puts 'Created Exist1not2'
open('c:/Data/Exist2not1.txt', "w") {|f| f << "List of files from Processed not in Original\n"}
puts 'Created Exist2not1'

# Walk the Original Directory Tree and check for files and matches
Find.find(dir1original) do |path|
   if File.file? path
      second = path.gsub(dir1original,dir2processed)
      if File.file? second
         puts 'Found'
         if File.size(path) != File.size(second)
           puts 'Not same size'
           #Now we will have to look at the data
           puts open(path) { |f| f.read(20)}
           puts open(second) { |f| f.read(20)}
           #search out parsdate for possibly parsing the date data

           #need help here on
           # read files into an array
           # date based calculations
           # merging the files

          else
           puts 'Complete Match'
           # if file.cmp(path, second)
          end
        else
         filename = path.gsub(dir1original, '')
         puts filename + ' Not Found'
         # an alternate method to get the file name
         puts File.basename(path) + ' Not Found'
         puts File.basename(path, ".csv") + ' Not Found'
         open('c:/Data/Exist1not2.txt', "a") {|f| f << filename +"\n"}
         File.copy(path,second)
      end
    end
  end

So, some help on the arrays would be GREATLY Appreciated.

Snoopy

···

--
Posted via http://www.ruby-forum.com/.

Paul_Lutus · 15 September 2006 15:25

Snoopy Dog wrote:

First let me say that I am an absolute Newbie to Ruby. So please be
tolerant of my newbie question.

/ ... snip code list

So, some help on the arrays would be GREATLY Appreciated.

Now I know why you haven't gotten any replies. You didn't say what you
wanted, what you got instead, and how they differed.

"I wanted A to happen ..."

"Instead, B happened ..."

"Here is exactly how A and B differ ..."

Without this information, people might be offering cures for which there are
no diseases.

Please do this. Just show the example data, and say what result you want,
and in what form. Be specific. Someone will then solve the problem on their
own, which Ruby allows us to do faster than by analyzing your code.

--
Paul Lutus
http://www.arachnoid.com

Pit · 15 September 2006 15:59

Snoopy Dog schrieb:

(... merging data of two files ...)

Snoopy, if your data is as well structured as you've shown (minus a few typos), merging the data of two files should simply be:

data = File.readlines(path)
additional_data = File.readlines(second)

data.concat(additional_data)
data.uniq!
data.sort!

# write data to destination file
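A minimal sketch of the full round trip, including the write step that the comment above leaves open. The helper name merge_lines and the temporary sample files are assumptions made for illustration only; lines are chomped first so that a missing final newline does not defeat uniq!.

```ruby
require 'tmpdir'

# Hypothetical helper: union of the lines of both files, sorted,
# written to dest_path. Returns the merged lines.
def merge_lines(path, second, dest_path)
  data = File.readlines(path).map(&:chomp)
  data.concat(File.readlines(second).map(&:chomp))
  data.uniq!
  data.sort!
  # Both inputs are fully read before the destination is opened,
  # so dest_path may be the same file as one of the inputs.
  File.open(dest_path, 'w') { |f| f.puts data }
  data
end

# Usage with throwaway sample files (made-up names inside a temp dir).
Dir.mktmpdir do |dir|
  path   = File.join(dir, 'new.csv')
  second = File.join(dir, 'historical.csv')
  File.write(path,   "abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456\n" \
                     "abc, 20060902, 1.9, 2.3, 1.8, 2.3, 147454\n")
  File.write(second, "abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677\n" \
                     "abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456\n")
  merge_lines(path, second, second)
  puts File.read(second)
end
```

Sorting whole lines works here only because every row starts with the same symbol followed by a YYYYMMDD date, so text order matches date order.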

Regards,
Pit

W_James · 15 September 2006 16:25

Snoopy Dog wrote:

First let me say that I am an absolute Newbie to Ruby. So please be
tolerant of my newbie question.

My situation is this. I am gathering financial data, and am about to
change data suppliers. I want to "merge" the files from both suppliers
to have as much data history as possible. I have the data in ASCII
format in a comma delimited file.

I have the data in the following structure:
c:\data\1Original\abc.csv - a new data file
c:\data\2Processed\abc.csv - the historical file and my processing
reference
Each file has the same file structure of:
  Symbol, Date, Open, High, Low, Close, Volume

I already have a process that references the files in the
c:\data\processed directory structure.

Currently I have figured out how to walk the directory tree and copy any
NEW files into the Processed directory. I am hung up on the merging of
the files into the processed directory.

Sample files to demonstrate:
  c:\data\1Original\abc.csv (new data)
     abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456
     abc, 20060902. 1.9, 2.3, 1.8, 2.3, 147454

  c:\data\2Processed\abc.csv (historical)
     abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677
     abc, 20010102. 2.4, 2.6, 2.4, 2.5, 333444
     .......
     abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456

I need to create
  c:\data\2Processed\abc.csv (historical)
     abc, 20010101, 2.1, 2.5, 2.0, 2.45, 254677
     abc, 20010102. 2.4, 2.6, 2.4, 2.5, 333444
     .......
     abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456
     abc, 20060902. 1.9, 2.3, 1.8, 2.3, 147454

dir1 = '1-Original/'
dir2 = '2-Processed/'
def file_to_a s
  # readlines returns an array of lines; a missing file yields no rows
  File.exist?(s) ?
    File.readlines(s).map{|x| x.chomp.split(/\s*,\s*/)} : []
end
Dir[dir1+"*.csv"].each{|full_name|
  bare_name = full_name[ %r{[^/]*$} ]
  ary = file_to_a( full_name ) | file_to_a( dir2 + bare_name )
  File.open( dir2+bare_name, 'w'){|f|
    f.puts ary.sort.map{|x| x.join(", ")} }
}

Snoopy_Dog · 15 September 2006 22:44

Pit Capitain wrote:

Snoopy Dog schrieb:

(... merging data of two files ...)

Snoopy, if your data is as well structured as you've shown (minus a few
typos), merging the data of two files should simply be:

   data = File.readlines(path)
   additional_data = File.readlines(second)

   data.concat(additional_data)
   data.uniq!
   data.sort!

   # write data to destination file

Regards,
Pit

Thank You Pit.

I implemented this in minutes following your example.

The data is very well structured, unfortunately the data is not
identical between suppliers. Sometimes there are differences in prices
or volumes fields.

So now I get unique records, but some have the same date. Since I use
the data as a time/price series, I can't have duplicate dates.
Unfortunately I don't have time to work on this tonight, but will keep
at it.

Thanks.
Snoopy

Snoopy_Dog · 15 September 2006 22:50

William James wrote:

Snoopy Dog wrote:

snip

}

Thanks William James.

As I am new to Ruby, and regular expressions, I implemented Pit's
method. Since I have some issues with the data (values, not structure)
both yours and Pit's methods have the same data issues (DATE not
unique).

I think I will be able to use your regular expressions to split up my
incoming data and find a way to use the unique feature on the date
value.

Will post my results, when I get a chance to work on it.

Thanks again.
Snoopy

Snoopy_Dog · 15 September 2006 23:01

Paul Lutus wrote:

Snoopy Dog wrote:

snip snip

Please do this. Just show the example data, and say what result you
want,
and in what form. Be specific. Someone will then solve the problem on
their
own, which Ruby allows us to do faster than by analyzing your code.

Paul,

I didn't really want someone else to "solve" it for me. I want to
learn. I thought by laying out the problem, sample input data, and
showing the desired result, and my thought process on how to get there,
the good folks (like yourself) on the forum could point me in the right
direction (which they have).

As I am still learning Ruby, I need to learn the constructs of the
language, and how to use them. Both forum suggestions have greatly
aided my project and understanding of Ruby. Now I am thinking about
some new ways to approach the DATE issue.

So a new question here: How do I find out what methods exist for an
object?

eg: For the File object, how can I find out about the open, file?,
close, and other methods (as well as their syntax).

Also what references should I be looking at other than this forum?

Thanks in advance.

Paul_Lutus · 15 September 2006 23:26

Snoopy Dog wrote:

/ ...

So a new question here: How do I find out what methods exist for an
object?

puts objectname.methods

eg: For the File object, how can I find out about the open, file?,
close, and other methods (as well as their syntax).

You can get the method names as above, but to learn how to use them, you
will have to consult the documentation.
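A short sketch of how that looks in practice for File. These are standard reflection calls; the choice of which ones to show, and trimming the output to ten names, is just for illustration. If the ri documentation is installed, running "ri File.open" from the command line shows the syntax of a single method.

```ruby
# Class-level methods that File responds to, such as File.file?
puts File.singleton_methods.sort.first(10)

# Instance methods defined in File itself, not those inherited from IO
puts File.instance_methods(false).sort.first(10)

# Narrow a long list down with grep
puts File.singleton_methods.grep(/exist/)

# Ask about one specific method
puts File.respond_to?(:file?)                # true
puts File.instance_methods.include?(:close)  # true, inherited from IO
```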

Also what references should I be looking at other than this forum?

http://www.ruby-lang.org/en/documentation/

--
Paul Lutus
http://www.arachnoid.com

Sam_Gentle · 16 September 2006 07:26

I'd do something like this:

require 'enumerator'

data = File.readlines(path)
additional_data = File.readlines(second)
data.concat(additional_data)

data.sort!
mergeddata = [data[0]]
data.each_cons(2) do |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
end

Sam

On 9/16/06, Snoopy Dog <snoopy.pa30@gmail.com> wrote:

As I am new to Ruby, and regular expressions, I implemented Pit's
method. Since I have some issues with the data (values, not structure)
both yours and Pit's methods have the same data issues (DATE not
unique).

I think I will be able to use your regular expressions to split up my
incoming data and find a way to use the unique feature on the date
value.

Snoopy_Dog · 16 September 2006 03:58

Paul Lutus wrote:

Snoopy Dog wrote:

snip snip

Also what references should I be looking at other than this forum?

http://www.ruby-lang.org/en/documentation/

Thanks Paul

It is always so simple when someone shows you how/where.

Snoopy_Dog · 17 September 2006 15:10

Sam Gentle wrote:
snip

I'd do something like this:

require 'enumerator'

data = File.readlines(path)
additional_data = File.readlines(second)
data.concat(additional_data)

data.sort!
mergeddata = [data[0]]
data.each_cons(2) do |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
end

Sam

Sam, that works great. Unfortunately I still don't understand it all
yet.

mergedata = [data[0]] - creates a new array from data, but why do we
need the subscript??

data.each_cons(2)... - what does the each_cons(2) do... I understand
the split, but don't know the .push(x2)

I will go do some more reading to see if I can figure them out.

Thanks
Snoopy

leandro_nascimento_c · 18 September 2006 10:36

For Ruby reference documentation, I always use this:
http://www.ruby-doc.org/core/classes/File.html

(The File class, in your particular case)

Paul Lutus wrote:

Snoopy Dog wrote:

/ ...

> So a new question here: How do I find out what methods exist for an
> object?

puts objectname.methods

>
> eg: For the File object, how can I find out about the open, file?,
> close, and other methods (as well as their syntax).

You can get the method names as above, but to learn how to use them, you
will have to consult the documentation.

>
> Also what references should I be looking at other than this forum?

http://www.ruby-lang.org/en/documentation/

--
Paul Lutus
http://www.arachnoid.com

Snoopy_Dog · 17 September 2006 15:47

Snoopy Dog wrote:

Sam Gentle wrote:
snip

I'd do something like this:

... snip snip

I will go do some more reading to see if I can figure them out.

Thanks
Snoopy

OK, I think I have figured out most of it... Amazing how the
documentation can help out when you see the code in action.

mergedata = [data[0]] - creates a new array with just one element in it!

the mergedata.push - pushes elements on to the array (appends).

and the data.each_cons(2) - I ASSUME that this takes two elements at a
time from the data array.

How it stays in sync I don't know.

This looks like it will do everything that I need. I will do a bit more
testing and then throw it on to the live data.

THANKS
Snoopy

Pit · 17 September 2006 21:45

Snoopy Dog schrieb:

Sam Gentle wrote:

require 'enumerator'

data = File.readlines(path)
additional_data = File.readlines(second)
data.concat(additional_data)

data.sort!
mergeddata = [data[0]]
data.each_cons(2) do |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
end

Sam, that works great.

Snoopy, you mentioned in one of your posts that the suppliers might deliver inconsistent data, for example different volumes for the same date, and Sam's solution guarantees that you get only one row for each date. You should be aware, though, that it randomly chooses this one row. For some data, it could be the row of the first file, for other data it could be the row of the second file. If you want to prefer one of the suppliers over the other, you have to implement a slightly different algorithm. The problem is that Ruby's sort isn't a stable sort.
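One possible shape for such an algorithm, as a sketch only: tag every row with its original position, sort on date plus position so that ties are broken deterministically, and keep the first row seen for each date. The helper name merge_preferring is hypothetical, and the code assumes the date is always the second comma-separated field.

```ruby
# Hypothetical helper: for each date, the row from preferred_rows wins
# over a row with the same date from other_rows.
def merge_preferring(preferred_rows, other_rows)
  tagged = (preferred_rows + other_rows).each_with_index.map do |row, i|
    [row.split(',')[1].strip, i, row]
  end
  # Sorting on [date, original position] does not rely on sort stability.
  tagged.sort_by { |date, i, _row| [date, i] }
        .uniq { |date, _i, _row| date }   # uniq keeps the first occurrence
        .map { |_date, _i, row| row }
end

historical = ["abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456"]
incoming   = ["abc, 20060901, 1.5, 2.1, 1.4, 1.9, 999999",
              "abc, 20060902, 1.9, 2.3, 1.8, 2.3, 147454"]
puts merge_preferring(historical, incoming)
```

Swapping the two arguments prefers the other supplier instead.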

Regards,
Pit

Jordan_Callicoat · 17 September 2006 22:41

Snoopy Dog wrote:

OK, I think I have figured out most of it... Amazing how the
documentation can help out when you see the code in action.

mergedata = [data[0]] - creates a new array with just one element in it!

the mergedata.push - pushes elements on to the array (appends).

You got it. It could also be written:

mergedata = Array.new(1, data.at(0))
...
mergedata << x2

(All of these are synonyms for the way it was written, which is why I
mention them.)

and the data.each_cons(2) - I ASSUME that this takes two elements at a
time from the data array.

Kind of, sort of...see the docs:
http://ruby-doc.org/core/classes/Enumerable.html#M002115
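A small illustration of the difference, using a made-up four-element array: each_cons(2) slides a window over the array one step at a time, so every element except the first shows up as x2 exactly once. That is why the merged array has to start out holding data[0].

```ruby
rows = %w[a b c d]

# Overlapping pairs: the window moves forward one element at a time.
rows.each_cons(2) { |x1, x2| puts "#{x1} #{x2}" }
# prints:
#   a b
#   b c
#   c d

p rows.each_cons(2).to_a   # => [["a", "b"], ["b", "c"], ["c", "d"]]

# For comparison, each_slice(2) takes non-overlapping pairs.
p rows.each_slice(2).to_a  # => [["a", "b"], ["c", "d"]]
```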

As to the problem, you could also do something like this, which is a
little more verbose than the other solutions, but is (I think) easier
to understand:

data1 = File.readlines(file1) # historical
data2 = File.readlines(file2) # new

# dates are from index 5-12 in the row string
# like your example data, change as needed
dates1 = data1.collect { |row| row[5..12] }
dates2 = data2.collect { |row| row[5..12] }

i = 0
while i < dates2.size
  if dates1.include?(dates2[i])
    dates2.delete_at(i)
    data2.delete_at(i)
  else
    i += 1
  end
end

out_data = (data1 + data2).sort

Also, you could use a little trick with Hash; just index the rows in a
hash by their date, then when you hit a duplicate date, you'll just
overwrite the previous value indexed by that date (change the order of
reading in file1 and file2 to keep historical rows rather than new
ones, the current order keeps new rows):

hash = {}
data = File.readlines(file1) +
       File.readlines(file2)
data.each { |row|
  date = row[5..12]
  hash[date] = row
}
data = hash.values.sort

Regards,
Jordan

Snoopy_Dog · 18 September 2006 00:21

Pit Capitain wrote:

Snoopy Dog schrieb:

data.each_cons(2) do |x1, x2|
  mergeddata.push(x2) unless x1.split(/,/)[1] == x2.split(/,/)[1]
end

Sam, that works great.

Snoopy, you mentioned in one of your posts that the suppliers might
deliver inconsistent data, for example different volumes for the same
date, and Sam's solution guarantees that you get only one row for each
date. You should be aware, though, that it randomly chooses this one
row. For some data, it could be the row of the first file, for other
data it could be the row of the second file. If you want to prefer one
of the suppliers over the others, you have to implement a slightly
different algorithm. The problem is that Ruby's sort isn't a stable
sort.

Regards,
Pit

Pit,

Thanks for mentioning that. I assumed that the sort kept them in order,
and I used the push(x1) instead of push(x2) after a few of my tests.
That way I kept the historical data. Since my sample data tests are
small, I was just lucky not to have them out of the order I expected
them.

Now looking at Jordan's code, I think I will use (a variant) of it to
control what I keep for historical data.

Thanks again Pit.

Snoopy

Snoopy_Dog · 18 September 2006 01:08

Jordan Callicoat wrote:
..snip snip

Also, you could use a little trick with Hash; just index the rows in a
hash by their date, then when you hit a duplicate date, you'll just
overwrite the previous value indexed by that date (change the order of
reading in file1 and file2 to keep historical rows rather than new
ones, the current order keeps new rows):

hash = {}
data = File.readlines(file1) +
       File.readlines(file2)
data.each { |row|
  date = row[5..12]
  hash[date] = row
}
data = hash.values.sort

Regards,
Jordan

Jordan,

Thanks for the suggestion. I am implementing the hash idea you
provided. That way I can keep my historical data (for common dates) and
just grab new data for new dates.

Now just a little tweak. The symbol data is not always just 3
characters.
I am currently using a regular expression to split out the values, so I
get the date from the split.

           #Using Jordan's methodology
           hash = {}
           data = File.readlines(path) + File.readlines(second)
           data.each { |row|
              (sym, date, open, high, low, close, vol) = row.split(/,/)
              hash[date] = row
            }
           data = hash.values.sort
           open(second, 'w') { |f| f.puts data}

Works fine, but I really don't care about the information past the date
field.
Additionally, I have another set of files that have a column after the
vol, I am not sure how to handle it in the regular Expression.

I just want to do:
(symbol, date, ignore_the_rest) = row.split(/,/) for just the first
two columns. I am off to read more on regular expressions.

Thanks
Snoopy

Jordan_Callicoat · 18 September 2006 01:36

Snoopy Dog wrote:

Works fine, but I really don't care about the information past the date
field.
Additionally, I have another set of files that have a column after the
vol, I am not sure how to handle it in the regular Expression.

I just want to do:
(symbol, date, ignore_the_rest) = row.split(/,/) for just the first
two columns. I am off to read more on regular expressions.

Hi there,

The split method just returns an array of every item on either side of
the delimiter...

p row.split(/,/)
# => ["abc", " 20060901", " 1.5", " 2.1", " 1.4", " 1.9", " 123456\n"]

You can assign it to a variable:

a = row.split(/,/)
a[0]

You can index it anonymously:

row.split(/,/)[0]

Or unpack some (or all) members of it:

symbol, date = row.split(/,/)[0..1] # .. is a range operator

And so on.

I think you want something like the last example, but unless you need
the symbol too, you can just use: date = row.split(',')[1] An extra
column at the end of the rows won't affect anything.

Regards,
Jordan

Jordan_Callicoat · 18 September 2006 01:41

Actually, there's no reason to use a regexp for a delimiter here. A
string works fine:

date = row.split(',')[1]
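A hedged variation on the same idea: split also takes an optional limit, so the row can be cut into at most three pieces and everything after the date stays in one untouched string. The helper name date_key is made up for this sketch; the strip is there because the sample rows have a space after each comma.

```ruby
# Hypothetical helper: return the date (second column) of a CSV row.
def date_key(row)
  _symbol, date, _rest = row.split(',', 3)  # at most three pieces
  date.to_s.strip                           # drop the space after the comma
end

row = "abc, 20060901, 1.5, 2.1, 1.4, 1.9, 123456\n"
p row.split(',', 3)
# => ["abc", " 20060901", " 1.5, 2.1, 1.4, 1.9, 123456\n"]
p date_key(row)
# => "20060901"
```

In the hash-based merge this would be used as hash[date_key(row)] = row, and it behaves the same for longer symbols or for files with extra trailing columns.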

Regards,
Jordan

Snoopy_Dog · 18 September 2006 02:24

Jordan Callicoat wrote:

Actually, there's no reason to use a regexp for a delimiter here. A
string works fine:

date = row.split(',')[1]

Regards,
Jordan

Jordan,

Thanks. That is quicker than me finding the proper references online.

Thanks to you and all the other folks on the forum who have helped out.

I am now able to do my required processing task, and hopefully have
learned enough that I will be able to implement a few more "nice to
have" tasks soon.

When I run into more stumbling (learning) blocks along the way, I will
know where to look for EXCELLENT help.

Thanks again to you and all the forum.

Snoopy
