Seeking the Ruby way

Todd_Breiholz · 2 February 2006 21:57

I'm just getting my feet wet with Ruby and would like some advice on how you
"old-timers" would write the following script using Ruby idioms.

The intent of the script is to parse a CSV file that contains 2 fields per
row, sorted on the second field. There may be multiple rows for each value of
field 2. I want to get a list of all of the unique values of field 2 that have
more than 1 value for the first 6 characters of field 1.

Here's what I did:

require 'csv'

last_account_id = ''
last_adv_id = ''
parent_co_ids = []
cntr = 0
first = true
CSV::Reader.parse(File.open('e:\\tmp\\20060201\\bsa.csv', 'r')) do |row|
    if row[1] == last_account_id
        parent_co_ids << last_adv_id[0, 6] unless parent_co_ids.include?(last_adv_id[0, 6])
    else
        if !first
            parent_co_ids << last_adv_id[0, 6] unless parent_co_ids.include?(last_adv_id[0, 6])
            if parent_co_ids.size > 1
                puts "#{last_account_id} - (#{parent_co_ids.join(',')})"
                cntr = cntr + 1
            end
            parent_co_ids.clear
        else
            first = false
        end
    end
    last_account_id = row[1]
    last_adv_id = row[0]
end
puts "Found #{cntr} accounts with multiple parent companies"

Thanks in advance!

Todd Breiholz

Ara.T.Howard6 · 2 February 2006 22:21

   harp:~ > cat a.rb
   require "csv"
   require "yaml"

   path = ARGV.shift
   sum = Hash::new{|h,k| h[k] = 0}
   count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
   CSV::open(path,"r"){|row| count[row]}
   y sum.delete_if{|k,v| v == 1}

   harp:~ > cat in.csv
   0,aaaaaa___
   1,aaaaaa___
   2,aaabbb___
   3,aaabbb___
   4,aaabbb___
   5,aaaccc___

   harp:~ > ruby a.rb in.csv

   ---
   aaaaaa: 2
   aaabbb: 3

hth. regards.

-a

--
happiness is not something ready-made. it comes from your own actions.
- h.h. the 14th dalai lama

W_James · 3 February 2006 02:38

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on how you
"old-timers" would write the following script using Ruby idioms.

The intent of the script is to parse a CSV file that contains 2 fields per
row, sorted on the second field. There may be multiple rows for field 2. I
want to get a list of all of the unique values of field2 that has more than
1 value for the 1st 6 characters of field 1.

--- input data -----
123456ab,900
123456cd,900
123456ef,909
012345gh,909
--- end of input -----

--- Using a hash of arrays:

require 'csv'

h = Hash.new{ [] }
CSV::Reader.parse(File.open( ARGV.first )) { |row|
  h[row.last] |= [ row.first[0,6] ] }
p h.delete_if{|k,v| v.size == 1 }

--- output -----
{"909"=>["123456", "012345"]}
--- end of output -----

--- Using a hash of hashes:

require 'csv'

h = Hash.new{|h,k| h[k] = {} }
CSV::Reader.parse(File.open( ARGV.first )) { |row|
h[row.last][ row.first[0,6] ] = 8 }
p h.delete_if{|k,v| v.size == 1 }

--- output -----
{"909"=>{"012345"=>8, "123456"=>8}}
--- end of output -----

Robert · 3 February 2006 10:23

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on
how you "old-timers" would write the following script using Ruby
idioms.

The intent of the script is to parse a CSV file that contains 2
fields per row, sorted on the second field. There may be multiple
rows for field 2. I want to get a list of all of the unique values of
field2 that has more than 1 value for the 1st 6 characters of field 1.

There are two possible interpretations of what you state here:

1. You want all values of field 2 that occur more than once.

2. You want all values of field 2 that have more than one distinct field 1
value.

Implementations:

ad 1.

require 'csv'

h = Hash.new(0)
CSV::Reader.parse(ARGF) {|row| h[row[1]] += 1}
h.each {|k,v| puts k if v > 1}

ad 2.

require 'csv'
require 'set'

h = Hash.new {|h,k| h[k] = Set.new}
CSV::Reader.parse(ARGF) {|row| h[row[1]] << row[0]}
h.each {|k,v| puts k if v.size > 1}

Note: CSV::Reader can use ARGF which makes it easy to read from stdin as
well as multiple files.
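
For readers on a current Ruby, here is a rough sketch of interpretation 2 under a few assumptions. CSV::Reader belongs to the Ruby 1.8 csv library and is not available in later releases, so the sketch uses CSV.parse instead. It also applies the 6-character prefix from the original question, which the version above does not. The method name is hypothetical and the sample rows are the ones posted elsewhere in this thread.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: maps each field-2 value to the set of distinct
# 6-character prefixes of field 1 and keeps only the values that have
# more than one prefix.
def accounts_with_multiple_prefixes(csv_text)
  prefixes = Hash.new { |hash, key| hash[key] = Set.new }
  CSV.parse(csv_text) do |adv_id, account_id|
    prefixes[account_id] << adv_id[0, 6]
  end
  prefixes.select { |_account_id, set| set.size > 1 }
end

sample = <<~ROWS
  123456ab,900
  123456cd,900
  123456ef,909
  012345gh,909
ROWS

accounts_with_multiple_prefixes(sample).each do |account_id, set|
  puts "#{account_id} - (#{set.to_a.join(',')})"  # prints "909 - (123456,012345)"
end
```

For a file on disk, CSV.foreach(path) takes the same kind of block and reads row by row.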

Kind regards

robert

Jacob_Fugal · 2 February 2006 23:10

I'm curious why you decided to make `count` its own lambda when:

  1) It's only ever used once
  2) The block that uses it has only one statement, namely the call to `count`
  3) count and the block to CSV::open have the same signature

I think at a minimum, given 2) and 3), I'd just replace the block to
CSV::open with count itself:

count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
CSV::open(path,"r", &count)

Then, since count isn't used anywhere else, I'd join those together:

CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}
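
A small sketch of why the two forms are interchangeable: `&count` hands the lambda to the method as its block, so the named and inlined versions do the same work. The method names are made up for illustration, and a plain array of rows stands in for the CSV file so the snippet runs on its own.

```ruby
# Named lambda, passed as the block with &.
def tally_with_named_lambda(rows)
  sum = Hash.new(0)
  count = lambda { |row| sum[row.last.to_s[0, 6]] += 1 }
  rows.each(&count)
  sum
end

# The same body written inline as a block.
def tally_with_inline_block(rows)
  sum = Hash.new(0)
  rows.each { |row| sum[row.last.to_s[0, 6]] += 1 }
  sum
end

rows = [%w[0 aaaaaa___], %w[1 aaaaaa___], %w[2 aaabbb___]]
puts tally_with_named_lambda(rows) == tally_with_inline_block(rows)  # prints "true"
```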

After those transformations:

  galadriel:~ lukfugl$ cat a.rb
  require "csv"
  require "yaml"

  path = ARGV.shift
  sum = Hash::new{|h,k| h[k] = 0}
  CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}
  y sum.delete_if{|k,v| v == 1}

  galadriel:~ lukfugl$ cat in.csv
  0,aaaaaa___
  1,aaaaaa___
  2,aaabbb___
  3,aaabbb___
  4,aaabbb___
  5,aaaccc___

  galadriel:~ lukfugl$ ruby a.rb in.csv

  ---
  aaaaaa: 2
  aaabbb: 3

Just seems a little clearer to me over having an extra one-time use lambda.

Jacob Fugal

Robert · 3 February 2006 10:18

William James wrote:

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on
how you "old-timers" would write the following script using Ruby
idioms.

The intent of the script is to parse a CSV file that contains 2
fields per row, sorted on the second field. There may be multiple
rows for field 2. I want to get a list of all of the unique values
of field2 that has more than 1 value for the 1st 6 characters of
field 1.

--- input data -----
123456ab,900
123456cd,900
123456ef,909
012345gh,909
--- end of input -----

--- Using a hash of arrays:

require 'csv'

h = Hash.new{ [] }

I wonder how this works since the Hash never stores these arrays.

CSV::Reader.parse(File.open( ARGV.first )) { |row|
h[row.last] |= [ row.first[0,6] ] }
p h.delete_if{|k,v| v.size == 1 }

--- output -----
{"909"=>["123456", "012345"]}
--- end of output -----

Is this really the output of the script above?
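
A possible answer, sketched under the assumption that the hash was created as Hash.new { [] }: the default block itself stores nothing, but `h[key] |= [value]` expands to `h[key] = h[key] | [value]`, and that assignment is what puts the array into the hash. The method name and data below are made up for illustration.

```ruby
# Assumes the hash is built with Hash.new { [] }.
def collect_unique(pairs)
  h = Hash.new { [] }          # the block returns a new array but never stores it
  pairs.each do |key, value|
    h[key] |= [value]          # same as: h[key] = h[key] | [value]
  end
  h
end

probe = Hash.new { [] }
probe['900']                   # a plain lookup stores nothing
puts probe.empty?              # prints "true"

pairs = [%w[900 123456], %w[900 123456], %w[909 123456], %w[909 012345]]
result = collect_unique(pairs)
puts result['909'].join(',')   # prints "123456,012345"
puts result['900'].join(',')   # prints "123456"
```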

robert

Robert · 3 February 2006 10:38

Robert Klemme wrote:

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on
how you "old-timers" would write the following script using Ruby
idioms.

The intent of the script is to parse a CSV file that contains 2
fields per row, sorted on the second field. There may be multiple
rows for field 2. I want to get a list of all of the unique values of
field2 that has more than 1 value for the 1st 6 characters of field
1.

There are two possible interpretations of what you state here:

1. You want all values of field 2 that occur more than once.

Just remembered that the file is sorted. Then this implementation of case
1 is even more efficient as it does not store values in mem and works on
arbitrary large files:

require 'csv'

last = nil
CSV::Reader.parse(ARGF) do |row|
  last, k = row[1], last
  puts k if last == k
end
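
The same streaming idea can be sketched for the distinct-prefix case on a current Ruby (2.3 or later for chunk_while). This is only an illustration: the method name is hypothetical, the input is assumed to be sorted on field 2, and each value is reported once, whereas the loop above prints a value again for every additional repeat.

```ruby
require 'csv'

# Hypothetical helper: walks rows that are sorted on field 2, keeps only the
# current run of equal field-2 values in memory, and yields the value together
# with its distinct 6-character field-1 prefixes when there is more than one.
def each_multi_prefix_account(rows)
  return enum_for(__method__, rows) unless block_given?

  rows.chunk_while { |a, b| a[1] == b[1] }.each do |group|
    prefixes = group.map { |adv_id, _account_id| adv_id[0, 6] }.uniq
    yield group.first[1], prefixes if prefixes.size > 1
  end
end

sample = "123456ab,900\n123456cd,900\n123456ef,909\n012345gh,909\n"
each_multi_prefix_account(CSV.parse(sample)) do |account_id, prefixes|
  puts "#{account_id} - (#{prefixes.join(',')})"  # prints "909 - (123456,012345)"
end
```

For a large file, passing CSV.foreach(path) (which returns an enumerator when called without a block) in place of CSV.parse(sample) should avoid loading all rows at once.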

Kind regards

robert

Joel_VanderWerf1 · 2 February 2006 23:34

Jacob Fugal wrote:

sum = Hash::new{|h,k| h[k] = 0}

And for some reason, I tend to write

sum = Hash.new(0)

when dealing with an immediate value. (But maybe it's a better practice
to use Ara's form, so that if you ever replace 0 with, say, a matrix,
you don't reuse the same object for each key in the hash.)
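
A short sketch of the pitfall mentioned in the parentheses, using an array in place of a matrix. The method names are made up for illustration.

```ruby
# Hash.new(obj): one object serves as the default for every missing key.
def shared_default_demo
  h = Hash.new([])
  h[:a] << 1            # mutates the shared default; nothing is stored under :a
  h[:b] << 2            # mutates the very same array
  [h.size, h[:c]]       # the hash is still empty, and any key shows both values
end

# Hash.new { ... }: the block runs per missing key and stores a fresh array.
def block_default_demo
  h = Hash.new { |hash, key| hash[key] = [] }
  h[:a] << 1
  h[:b] << 2
  h
end

size, leaked = shared_default_demo
puts size                       # prints "0"
puts leaked.join(',')           # prints "1,2"
puts block_default_demo.size    # prints "2"
```

With an immediate value such as 0 the shared default is harmless, because `+=` assigns a new integer instead of mutating the default.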

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Ara.T.Howard6 · 2 February 2006 23:55

> require "csv"
> require "yaml"
>
> path = ARGV.shift
> sum = Hash::new{|h,k| h[k] = 0}
> count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
> CSV::open(path,"r"){|row| count[row]}
> y sum.delete_if{|k,v| v == 1}
>
> I'm curious why you decided to make `count` its own lambda when:
>
> 1) It's only ever used once
> 2) The block that uses it has only one statement, namely the call to `count`
> 3) count and the block to CSV::open have the same signature

it's for abstraction only. i wrote how to count before writing the csv open
line. when i wrote that line it ended up as something like

CSV::open(path,"r"){|row| p row; count[row]}

during editing - as i always seem to for debugging

basically i find

{{{{}}}}

tough to read sometimes and factor out things using lambda. it's rare that it
actually ends up being the only thing left as in this case - but here you
are quite right that it can be compacted.

> I think at a minimum, given 2) and 3), I'd just replace the block to
> CSV::open with count itself:
>
> count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
> CSV::open(path,"r", &count)
>
> Then, since count isn't used anywhere else, I'd join those together:
>
> CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}

but i disagree here. people, esp nubies will look at that and say - what?
whereas reading

count = lambda{|row| sum[row.last.to_s[0,6]] += 1}

... count[row] ...

is pretty clear. i often use variables as comments to others and myself. eg.
what does this do:

password = "#{ sifname }_#{ eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect ) }"

hard to say huh?

how about this?

four_random_printable_chars = eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect )
password = "#{ sifname }_#{ four_random_printable_chars }"

ugly (yes i'm hacking like crazy today) but at least anyone reading it (most
importantly me) knows what i'm trying to do if not how!
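
For comparison, a sketch of how the named intermediate value might look on a current Ruby without the eval. The assumptions are loud ones: only printable ASCII (codes 32 to 126) is considered, whereas the original filters bytes 0 to 255 with [[:print:]]; the value given to sifname is made up, since the post does not say what it holds; and for real passwords a source such as SecureRandom may be a better fit.

```ruby
# Assumption: "printable" means ASCII codes 32..126 (space through tilde).
PRINTABLE_ASCII = (32..126).map(&:chr).freeze

# Picks four distinct printable characters at random, like taking the first
# four elements of a shuffled list.
def four_random_printable_chars(random: Random.new)
  PRINTABLE_ASCII.sample(4, random: random).join
end

sifname = 'eth0'  # hypothetical value, for illustration only
password = "#{sifname}_#{four_random_printable_chars}"
puts password
```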

anyhow - same goes with 'count': it's all good until you start cutting and
pasting - then you want vars not wicked expressions to move around.

> Just seems a little clearer to me over having an extra one-time use lambda.

__iff__ you are good at reading ruby

cheers.

-a


Jacob_Fugal · 3 February 2006 00:17

> I'm curious why you decided to make `count` its own lambda when:
>
> 1) It's only ever used once
> 2) The block that uses it has only one statement, namely the call to `count`
> 3) count and the block to CSV::open have the same signature

it's for abstraction only.

<snip>

basically i find

{{{{}}}}

tough to read sometimes and factor out things using lambda. it's rare that it
actually ends up being the only thing left as in this case - but here you
are quite right that it can be compacted.

Yeah, I agree. I often use similar abstraction techniques for
readability. My brain just has the tendency to refactor code inwards
as well as outwards when an abstraction seems extraneous.

> CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}

but i disagree here. people, esp nubies will look at that and say - what?
whereas reading

count = lambda{|row| sum[row.last.to_s[0,6]] += 1}

... count[row] ...

is pretty clear. i often use variables as comments to others and myself.

Again, agreed. In this case though I don't think the abstraction of
naming sum[...] += 1 as count is a necessary one. If I were to
refactor part of the complex expression

sum[row.last.to_s[0,6]] += 1

to improve readability, it would be the index:

identifier_prefix = lambda{ |row| row.last.to_s[0,6] }
... sum[identifier_prefix[row]] += 1 ...
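
Spelled out as a runnable sketch, with an array of rows standing in for the CSV file. The method name is hypothetical, and the use of Enumerable#tally (Ruby 2.7 or later) in place of the explicit sum hash is a swapped-in technique, not something from the original script.

```ruby
# Names the index expression, then counts prefixes and drops those seen once.
def repeated_prefixes(rows)
  identifier_prefix = lambda { |row| row.last.to_s[0, 6] }
  rows.map(&identifier_prefix).tally.reject { |_prefix, n| n == 1 }
end

rows = [%w[0 aaaaaa___], %w[1 aaaaaa___], %w[2 aaabbb___],
        %w[3 aaabbb___], %w[4 aaabbb___], %w[5 aaaccc___]]
repeated_prefixes(rows).each { |prefix, n| puts "#{prefix}: #{n}" }
# prints "aaaaaa: 2" then "aaabbb: 3"
```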

what does this do:

password = "#{ sifname }_#{ eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect ) }"

hard to say huh?

Ick, yes, I'd definitely split that into chunks.

how about this?

four_random_printable_chars = eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect )
password = "#{ sifname }_#{ four_random_printable_chars }"

ugly (yes i'm hacking like crazy today) but at least anyone reading it (most
importantly me) knows what i'm trying to do if not how!

If you say so...

Jacob Fugal
