Parse csv similar file

Rebhan_Gilbert · 6 February 2007 14:32

Hi,

I have a text file in a format like this:

AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
...

I want to get a collection for every E followed by digits,
so with the example above, I want to get:

collections:
  E023889
  E052337
  E050441
  ...

Each collection should contain datasets with the rest of the line, so
for example E023889 would have:

[AP850KP;INCLIB;AP013;240107;0730,AP850SDI;INCLIB;AP013;240107;0730]

Questions:
What kind of collection is best? Is an array sufficient?

Right now I have:

efas=Array.new
File.open("mycsvfile", "r").each do |line|
        if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/

         efas<<$3.to_s<<',' unless efas.include?($3.to_s)

        end
     end
     puts efas.to_s.chop

So I have all the E\d+ values, but how do I go further?

Are there better ways than regular expressions?
Any ideas?

Regards, Gilbert

Brian_Candler · 6 February 2007 14:36

questions=
what kind of collection is the best ? is an array sufficient ?

Depends what you want to do with it. If you want to be able to find an entry
E123456 quickly, then you'd use a hash. If you want to keep only the
first/last entry for a particular key (as it seems you do), using a hash
speeds things up here too.

right now i have =

efas=Array.new
File.open("mycsvfile", "r").each do |line|
        if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/

         efas<<$3.to_s<<',' unless efas.include?($3.to_s)

        end
     end
     puts efas.to_s.chop

Try:

efas = Hash.new
...
efas[$3] = [$1,$2,$4,$5,$6] unless efas.has_key?($3)
...
puts efas.inspect

Are there better ways as regular expressions ?

You could look at String#split instead
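
As a rough sketch of the String#split route (not tested against the real file): the sample lines come from the original post, while the six-field check and the use of `delete_at` are my own assumptions. Like the `has_key?` version above, it keeps only the first line seen for each key.

```ruby
# Sample lines taken from the original post.
lines = [
  "AP850KP;INCLIB;E023889;AP013;240107;0730",
  "AP850SD$;INCLIB;E052337;AP013;240107;0730",
  "AP850SDI;INCLIB;E023889;AP013;240107;0730",
]

efas = {}
lines.each do |line|
  fields = line.chomp.split(";")
  next unless fields.size == 6   # skip malformed lines (assumed rule)
  ticket = fields.delete_at(2)   # remove and return the E... field
  efas[ticket] ||= fields        # keep only the first line per ticket
end

p efas["E023889"]
#=> ["AP850KP", "INCLIB", "AP013", "240107", "0730"]
```

Splitting on ";" also copes with names such as AP850SD$ without any special handling in a regular expression.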

HTH,

Brian.

···

On Tue, Feb 06, 2007 at 11:32:27PM +0900, Rebhan, Gilbert wrote:

Gavin_Kistner2 · 6 February 2007 15:30

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] = data
}
p lookup[ "E050441" ]
#=> ["AP850SDS", "INCLIB", "E050441", "AP013", "240107", "0730"]
__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730


Greg_Brown1 · 6 February 2007 15:52

Hi,

<newbie>

i have a txtfile with a format like that =

AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
...

i want to get a collection for every E followed by digits,
so with the example above, i want to get =

collections:
        E023889
        E052337
        E050441
        ...

each collection should contain datasets with the rest of the line, so
f.e.
E023889 would have =

[AP850KP;INCLIB;AP013;240107;0730,AP850SDI;AP013;240107;0730]

questions=
what kind of collection is the best ? is an array sufficient ?

Just for fun, here's a Ruport example:

require "rubygems"
require "ruport"
DATA = <<-EOS
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
EOS

table = Ruport::Data::Table.parse(DATA, :has_names => false,
:csv_options=>{:col_sep=>";"})

table.column_names = %w[c1 c2 c3 c4 c5 c6] # BUG! you shouldn't need colnames

e = table.column(2).uniq
e.each { |x| table.create_group(x) { |r| r[2].eql?(x) } }

groups = table.groups

groups.attributes
["E023889", "E052337", "E050441"]

groups["E023889"].map { |r| r[0] }
["AP850KP", "AP850SDI"]

groups.each { |t| p t[0].c1 }

"AP850KP"
"AP850SD$"
"AP850SDA"

···

On 2/6/07, Rebhan, Gilbert <Gilbert.Rebhan@huk-coburg.de> wrote:

===============

note that in making this example, I found a small bug in Ruport's
grouping support which I will fix

Erik_Veenstra1 · 7 February 2007 16:15

Just an idea...

gegroet,
Erik V. - http://www.erikveen.dds.nl/

···

----------------------------------------------------------------

hash =
File.open("input.txt") do |f|
   f.readlines.collect do |line|
     k = line.scan(/;(E\d+);/).flatten.shift
     v = line.scan(/;E\d+;(.*)/).flatten.shift

     [k, v]
   end.select do |k, v|
     k and v
   end.inject({}) do |h, (k, v)|
     (h[k] ||= []) << v ; h
   end.inject({}) do |h, (k, v)|
     h[k] = v.join(",") ; h
   end
end

p hash

----------------------------------------------------------------

Rebhan_Gilbert · 6 February 2007 14:54

Hi,

···

-----Original Message-----
From: Brian Candler [mailto:B.Candler@pobox.com]
Sent: Tuesday, February 06, 2007 3:37 PM
To: ruby-talk ML
Subject: Re: Parse csv similar file

what kind of collection is the best ? is an array sufficient ?

/*
Depends what you want to do with it. If you want to be able to find an
entry
E123456 quickly, then you'd use a hash. If you want to keep only the
first/last entry for a particular key (as it seems you do), using a hash
speeds things up here too.
*/

I don't need to look up individual E... entries; I need to collect all
the data that belongs to each distinct E...

I want a collection for every E... that occurs, holding all the lines
(minus the E... itself) that contain that E.

/*
Try:

efas = Hash.new
...
efas[$3] = [$1,$2,$4,$5,$6] unless efas.has_key?($3)
...
puts efas.inspect
*/

That gives me only one dataset per key in the hash, but there are more
entries that contain E123456.

Regards, Gilbert

Drew_Olson · 6 February 2007 15:36

Gavin Kistner wrote:

i want to get a collection for every E followed by digits,
so with the example above, i want to get =

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] = data
}
p lookup[ "E050441" ]
#=> ["AP850SDS", "INCLIB", "E050441", "AP013", "240107", "0730"]
__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730

I think he wants to append this array with information each time he sees
the same key, so modify your code like so:

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] ||= []
  lookup[ key ] << data
}

···

On Feb 6, 7:32 am, "Rebhan, Gilbert" <Gilbert.Reb...@huk-coburg.de> > wrote:

--
Posted via http://www.ruby-forum.com/.

Erik_Veenstra1 · 7 February 2007 18:27

Nice abstraction... ;]

(By heart: This group_by is part of one of the Rails packages.)

gegroet,
Erik V. - http://www.erikveen.dds.nl/

···

----------------------------------------------------------------

module Enumerable
   def hash_by(&block)
     inject({}){|h, o| (h[block[o]] ||= []) << o ; h}
   end

   def group_by(&block)
     #hash_by(&block).values
     hash_by(&block).sort.transpose.pop
   end
end

p hash

----------------------------------------------------------------
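
For readers on a current Ruby: Enumerable#group_by is available in the core library from Ruby 1.9 on, and it returns a Hash keyed by the block's result (closer to hash_by above than to the array-returning group_by). A minimal sketch, assuming the sample lines from the original post and a `\bE\d+` pattern to pick out the key:

```ruby
# Sample lines taken from the original post.
lines = [
  "AP850KP;INCLIB;E023889;AP013;240107;0730",
  "AP850SD$;INCLIB;E052337;AP013;240107;0730",
  "AP850SDA;INCLIB;E050441;AP013;240107;0730",
  "AP850SDI;INCLIB;E023889;AP013;240107;0730",
  "AP850SDO;INCLIB;E052337;AP013;240107;0730",
  "AP850SDS;INCLIB;E050441;AP013;240107;0730",
]

# Group whole lines by the first "E followed by digits" token they contain.
by_ticket = lines.group_by { |line| line[/\bE\d+/] }

p by_ticket.keys
#=> ["E023889", "E052337", "E050441"]
p by_ticket["E050441"].size
#=> 2
```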

Brian_Candler · 6 February 2007 15:13

I was just following your original example, which only kept the first line
for a particular E key.

If you want to keep them all, then I'd use a hash with each element being an
array.

efas[$3] ||= [] # create empty array if necessary
efas[$3] << [$1,$2,$4,$5,$6] # add a new line

So, given the following input

aaa,bbb,E123,ddd,eee,fff
ggg,hhh,E123,iii,jjj,kkk

you should get

efas = {
  "E123" => [
         ["aaa","bbb","ddd","eee","fff"],
         ["ggg","hhh","iii","jjj","kkk"],
  ],
}

puts efas["E123"].size # 2
puts efas["E123"][0][3] # "eee"
puts efas["E123"][1][3] # "jjj"

In practice, to make it easier to manipulate this data, you'd probably want
to create a class to represent each object, rather than using a 5-element
array.

You would give each attribute a sensible name. I don't know what these
values mean, so I've just called them a to e here.

class Myclass
  attr_accessor :a, :b, :c, :d, :e
  def initialize(a, b, c, d, e)
    @a = a
    @b = b
    @c = c
    @d = d
    @e = e
  end
end

...
efas[$3] ||= []
efas[$3] << Myclass.new($1,$2,$4,$5,$6)
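
A shorter way to get the same kind of object is Struct, which generates the initializer and accessors for you. This is only a sketch: `Entry` and its member names are hypothetical, borrowed from the field descriptions given later in the thread.

```ruby
# Struct builds the constructor and accessors that Myclass spells out by hand.
# Entry and its member names are assumptions, not part of the original code.
Entry = Struct.new(:filename, :folder, :user, :date, :time)

efas = {}
line = "AP850KP;INCLIB;E023889;AP013;240107;0730"
if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/
  (efas[$3] ||= []) << Entry.new($1, $2, $4, $5, $6)
end

entry = efas["E023889"].first
puts entry.filename   # AP850KP
puts entry.date       # 240107
```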

HTH,

Brian.


Gavin_Kistner2 · 6 February 2007 16:55

Curses, I didn't read carefully enough. Right you are. (And, though
it's not clear from his example, he might not even need to split the
original line into arrays of pieces, but just keep the lines.)

···

On Feb 6, 8:36 am, Drew Olson <olso...@gmail.com> wrote:

I think he wants to append this array with information each time he sees
the same key, so modify your code like so:

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )}

lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] ||= []
  lookup[ key ] << data

}

Gavin_Kistner2 · 6 February 2007 17:00

So here's another version:

lookup = Hash.new{ |h,k| h[k] = [] }

DATA.each_line{ |line|
  line.chomp!
  warn "No key in '#{line}'" unless key = line[ /\bE\w+/ ]
  lookup[ key ] << line
}

p lookup[ "E050441" ]
#=> ["AP850SDA;INCLIB;E050441;AP013;240107;0730",
#=>  "AP850SDS;INCLIB;E050441;AP013;240107;0730"]

require 'pp'
pp lookup
#=> {"E050441"=>
#=> ["AP850SDA;INCLIB;E050441;AP013;240107;0730",
#=> "AP850SDS;INCLIB;E050441;AP013;240107;0730"],
#=> "E052337"=>
#=> ["AP850SD$;INCLIB;E052337;AP013;240107;0730",
#=> "AP850SDO;INCLIB;E052337;AP013;240107;0730"],
#=> "E023889"=>
#=> ["AP850KP;INCLIB;E023889;AP013;240107;0730",
#=> "AP850SDI;INCLIB;E023889;AP013;240107;0730"]}

__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730

···

On Feb 6, 8:36 am, Drew Olson <olso...@gmail.com> wrote:

I think he wants to append this array with information each time he sees
the same key [...]

Rebhan_Gilbert · 7 February 2007 08:47

Hi,

···

-----Original Message-----
From: Phrogz [mailto:gavin@refinery.com]
Sent: Tuesday, February 06, 2007 6:00 PM
To: ruby-talk ML
Subject: Re: Parse csv similar file

On Feb 6, 8:36 am, Drew Olson <olso...@gmail.com> wrote:

I think he wants to append this array with information each time he sees
the same key [...]

I still don't know how to proceed, so here are some more notes.

I get a folder:

/timestamp
  metafile.txt
   /INCLIB
   /PLI

The metafile looks like this:
APLVZDT;INCLIB;E050441;AP013;240107;0730
AP400ER;INCLIB;E023889;AP013;240107;0730
AP540RBP;INCLIB;E052337;AP013;240107;0730
AP700PA;INCLIB;E050441;AP013;240107;0730
... more lines

field 1 is a filename
field 2 is a foldername, shows whether path is /INCLIB/file or /PLI/file
field 3 is a ticketnr
field 4 is a username
field 5 is a date
field 6 is a timestamp

I need to parse the metafile and:

1. Create a folder structure for every ticket number that occurs, e.g.

/E050441
   /INCLIB
   /PLI

and put all the files that belong to that ticket
(meaning the line with the filename contains that ticket number)
in the subfolder named by field 2.

2. Create a file in the root of the /ticketnr folder
   that contains the rest of a dataset (line), meaning:

field 4
field 5
field 6

These are the same for every file with the same ticket number.

The format might look like

user=...
date=...
time=...

I'll have to decide that later.

My idea was this:

File.open("mycsvfile", "r").each do |line|
        if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/

         efas<<$3.to_s<<',' unless efas.include?($3.to_s)

With that I get an array of all ticket numbers.
Then I want to create a folder structure for every entry in that array
and put the files in it, but I can't work out how.

Any ideas?

Regards, Gilbert

Brian_Candler · 7 February 2007 09:41

I'd do all the work on-the-fly. Untested code:

require 'fileutils'
SRCDIR="/path_to_src"
DSTDIR="/path_to_dst"

def copy_ticket(filename, folder, ticket, user, date, time)
  srcdir = SRCDIR + File::SEPARATOR + folder
  dstdir = DSTDIR + File::SEPARATOR + ticket + File::SEPARATOR + folder
  FileUtils.mkdir_p(dstdir)
  FileUtils.cp(srcdir + File::SEPARATOR + filename,
               dstdir + File::SEPARATOR + filename)

  # write out status file
  statusfile = dstdir + File::SEPARATOR + "status.txt"
  unless File.exist?(statusfile)
    File.open(statusfile, "w") do |sf|
      sf.puts "user=#{user}"
      sf.puts "date=#{date}"
      sf.puts "time=#{time}"
    end
  end
end

def process_meta(f)
  f.each_line do |line|
    next unless line =~ /^(\w+);(\w+);(\w+);(\w+);(\w+);(\w+)$/
    copy_ticket($1,$2,$3,$4,$5,$6)
  end
end

# Main program
File.open("mycsvfile") do |f|
  process_meta(f)
end

If you want to build up a hash of ticket IDs seen, you can do that in
process_meta as well. I'd pass in an empty hash, and update it in the
each_line loop.
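
Roughly like this (equally untested against the real data). The `seen` parameter name is made up for the sketch, the copy step is left out so it runs on its own, and StringIO stands in for the opened metafile:

```ruby
require 'stringio'

# Variant of process_meta that records which files belong to each ticket.
# `seen` is a hypothetical parameter; the call to copy_ticket is omitted here.
def process_meta(f, seen)
  f.each_line do |line|
    next unless line =~ /^(\w+);(\w+);(\w+);(\w+);(\w+);(\w+)$/
    (seen[$3] ||= []) << $1   # ticket => list of filenames
  end
  seen
end

# Sample metafile content from the post above.
meta = StringIO.new(<<EOS)
APLVZDT;INCLIB;E050441;AP013;240107;0730
AP400ER;INCLIB;E023889;AP013;240107;0730
AP540RBP;INCLIB;E052337;AP013;240107;0730
AP700PA;INCLIB;E050441;AP013;240107;0730
EOS

tickets = process_meta(meta, {})
p tickets["E050441"]
#=> ["APLVZDT", "AP700PA"]
```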

HTH,

Brian.


Rebhan_Gilbert · 7 February 2007 10:28

Hi,

···

-----Original Message-----
From: Brian Candler [mailto:B.Candler@pobox.com]
Sent: Wednesday, February 07, 2007 10:41 AM
To: ruby-talk ML
Subject: Re: Parse csv similar file

Thanks Brian, it works like a charm.

I had to add the extension .txt (this may change)
to the filename and did it like this:

require 'fileutils'
SRCDIR="/path_to_src"
DSTDIR="/path_to_dst"
#EXT=".extension"
EXT=".txt"

def copy_ticket(filename, folder, ticket, user, date, time)
  srcdir = SRCDIR + File::SEPARATOR + folder
  dstdir = DSTDIR + File::SEPARATOR + ticket + File::SEPARATOR + folder
  filename=filename<<EXT
...

Is there a better way?

What a pity it doesn't work with JRuby 0.9.2.

I have to go with JRuby because I'm using it in an Ant script
with the <script> task.

JRuby gives no error; it just doesn't work, nothing happens?!

Possible workaround:

I can create an executable via rubyscript2exe.rb and call
that .exe in my Ant script.

But for that the .exe has to accept the parameters
SRCDIR, DSTDIR and EXT when it is called.

How can I alter your code to achieve that?

Regards, Gilbert

P.S.:
I hope you are open to stupid questions here on the list ;-),
as I'm quite new to Ruby (I've used it for some months, but only for
small purposes in Ant scripts), coming from Java.


Rebhan_Gilbert · 7 February 2007 11:32

Hi,

···

-----Original Message-----
From: Rebhan, Gilbert [mailto:Gilbert.Rebhan@huk-coburg.de]
Sent: Wednesday, February 07, 2007 11:28 AM
To: ruby-talk ML
Subject: Re: Parse csv similar file

/*
But therefore the .exe has to accept the parameters

SRCDIR, DSTDIR,EXT when calling it

How to alter your class to achieve that ?
*/

OK, it works like this:

require 'fileutils'
SRCDIR=ARGV[0]
DSTDIR=ARGV[1]
EXT=ARGV[2]

After converting the *.rb to an *.exe, call it as:
*.exe "/path_to_src" "/path_to_dst" ".extension"

Thanks a lot for your help!
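
A possible refinement, sketched under my own assumptions: check the arguments before using them, so a missing parameter fails with a message instead of a nil error further down. `parse_args` is a hypothetical helper, and accepting the extension with or without the leading dot is an extra convenience, not something the script above does.

```ruby
# Hypothetical helper: validate the three command-line arguments up front.
def parse_args(argv)
  unless argv.size == 3
    raise ArgumentError, "usage: script SRCDIR DSTDIR EXT"
  end
  srcdir, dstdir, ext = argv
  ext = ".#{ext}" unless ext.start_with?(".")   # accept "txt" as well as ".txt"
  [srcdir, dstdir, ext]
end

p parse_args(["/path_to_src", "/path_to_dst", "txt"])
#=> ["/path_to_src", "/path_to_dst", ".txt"]
```

In the real script the call would be `SRCDIR, DSTDIR, EXT = parse_args(ARGV)`.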

Regards, Gilbert

Brian_Candler · 7 February 2007 14:31

That's OK, just beware that the way you've done it you've modified the
string which was passed in. e.g.

a="foobar"
copy_ticket(a, "/tmp", "E123", "x", "y", "z")
puts a

will print "foobar.txt"

To avoid that:

filename = filename + EXT

(which creates a new String object, and then updates the local variable
'filename' to point to this new object)
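
A small self-contained illustration of the difference. The two method names are invented for the demo, and String.new is used only to make sure the sample strings are mutable:

```ruby
EXT = ".txt"

def append_in_place(filename)
  filename << EXT   # mutates the String object the caller passed in
end

def append_copy(filename)
  filename + EXT    # builds a new String; the caller's object is untouched
end

a = String.new("foobar")   # explicitly mutable string
append_in_place(a)
p a   #=> "foobar.txt"

b = String.new("foobar")
c = append_copy(b)
p b   #=> "foobar"
p c   #=> "foobar.txt"
```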

This is an interesting "small" file-chomping task. I wonder what the
equivalent Java program would look like

B.


Rebhan_Gilbert · 7 February 2007 14:51

Hi,

filename=filename<<EXT
...

is there a better way ?

/*
That's OK, just beware that the way you've done it you've modified the
string which was passed in. e.g.
...
*/

Yup, I know, but I read somewhere that
string concatenation via << is better/quicker than +
because no new String object gets created.

Regards, Gilbert
