Seeking the Ruby way

I'm just getting my feet wet with Ruby and would like some advice on how you
"old-timers" would write the following script using Ruby idioms.

The intent of the script is to parse a CSV file that contains 2 fields per
row, sorted on the second field. There may be multiple rows for field 2. I
want to get a list of all of the unique values of field2 that has more than
1 value for the 1st 6 characters of field 1.

Here's what I did:

require 'csv'

last_account_id = ''
last_adv_id = ''
parent_co_ids = []
cntr = 0
first = true
CSV::Reader.parse(File.open('e:\\tmp\\20060201\\bsa.csv', 'r')) do |row|
    if row[1] == last_account_id
        parent_co_ids << last_adv_id[0, 6] unless parent_co_ids.include?(last_adv_id[0, 6])
    else
        if !first
            parent_co_ids << last_adv_id[0, 6] unless parent_co_ids.include?(last_adv_id[0, 6])
            if parent_co_ids.size > 1
                puts "#{last_account_id} - (#{parent_co_ids.join(',')})"
                cntr = cntr + 1
            end
            parent_co_ids.clear
        else
            first = false
        end
    end
    last_account_id = row[1]
    last_adv_id = row[0]
end
puts "Found #{cntr} accounts with multiple parent companies"

Thanks in advance!

Todd Breiholz
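
For readers on a current Ruby, where CSV::Reader no longer exists, here is a minimal sketch of the stated intent using CSV.parse and a Set per account. The sample rows are the ones posted later in this thread; the names SAMPLE and prefixes_by_account are made up for illustration, and the sketch assumes the first six characters of field 1 identify the parent company.

```ruby
require "csv"
require "set"

# Sample data in the layout the post describes: field 1 is an id whose
# first six characters name the parent company, field 2 is the account id.
SAMPLE = <<~DATA
  123456ab,900
  123456cd,900
  123456ef,909
  012345gh,909
DATA

# Map each account id to the set of distinct six-character prefixes seen for it.
# (prefixes_by_account is a hypothetical helper, not part of any library.)
def prefixes_by_account(csv_text)
  groups = Hash.new { |hash, key| hash[key] = Set.new }
  CSV.parse(csv_text) do |adv_id, account_id|
    groups[account_id] << adv_id[0, 6]
  end
  groups
end

# Keep only the accounts that have more than one distinct prefix.
multi = prefixes_by_account(SAMPLE).select { |_, prefixes| prefixes.size > 1 }
multi.each do |account_id, prefixes|
  puts "#{account_id} - (#{prefixes.to_a.join(',')})"
end
puts "Found #{multi.size} accounts with multiple parent companies"
```

Because the Set ignores repeated prefixes, the explicit include? checks and the first/last bookkeeping are not needed, and the input does not have to be sorted.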

harp:~ > cat a.rb
   require "csv"
   require "yaml"

   path = ARGV.shift
   sum = Hash::new{|h,k| h[k] = 0}
   count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
   CSV::open(path,"r"){|row| count[row]}
   y sum.delete_if{|k,v| v == 1}

   harp:~ > cat in.csv
   0,aaaaaa___
   1,aaaaaa___
   2,aaabbb___
   3,aaabbb___
   4,aaabbb___
   5,aaaccc___

   harp:~ > ruby a.rb in.csv

   ---
   aaaaaa: 2
   aaabbb: 3

hth. regards.

-a

--
happiness is not something ready-made. it comes from your own actions.
- h.h. the 14th dali lama

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on how you
"old-timers" would write the following script using Ruby idioms.

The intent of the script is to parse a CSV file that contains 2 fields per
row, sorted on the second field. There may be multiple rows for field 2. I
want to get a list of all of the unique values of field2 that has more than
1 value for the 1st 6 characters of field 1.

--- input data -----
123456ab,900
123456cd,900
123456ef,909
012345gh,909
--- end of input -----

--- Using a hash of arrays:

require 'csv'

h = Hash.new{ [] }
CSV::Reader.parse(File.open( ARGV.first )) { |row|
  h[row.last] |= [ row.first[0,6] ] }
p h.delete_if{|k,v| v.size == 1 }

--- output -----
{"909"=>["123456", "012345"]}
--- end of output -----

--- Using a hash of hashes:

require 'csv'

h = Hash.new{|h,k| h[k] = {} }
CSV::Reader.parse(File.open( ARGV.first )) { |row|
  h[row.last][ row.first[0,6] ] = 8 }
p h.delete_if{|k,v| v.size == 1 }

--- output -----
{"909"=>{"012345"=>8, "123456"=>8}}
--- end of output -----

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on
how you "old-timers" would write the following script using Ruby
idioms.

The intent of the script is to parse a CSV file that contains 2
fields per row, sorted on the second field. There may be multiple
rows for field 2. I want to get a list of all of the unique values of
field2 that has more than 1 value for the 1st 6 characters of field 1.

There are two possible interpretations of what you state here:

1. You want all values for field 2 that occur more than once.

2. You want all values for field 2 that have more than one distinct field 1
value.

Implementations:

ad 1.

require 'csv'

h = Hash.new(0)
CSV::Reader.parse(ARGF) {|row| h[row[1]] += 1}
h.each {|k,v| puts k if v > 1}

ad 2.

require 'csv'
require 'set'

h = Hash.new {|h,k| h[k] = Set.new}
CSV::Reader.parse(ARGF) {|row| h[row[1]] << row[0]}
h.each {|k,v| puts k if v.size > 1}

Note: CSV::Reader can use ARGF which makes it easy to read from stdin as
well as multiple files.

Kind regards

    robert

I'm curious why you decided to make `count` its own lambda when:

  1) It's only ever used once
  2) The block that uses it has only one statement, namely the call to `count`
  3) count and the block to CSV::open have the same signature

I think at a minimum, given 2) and 3), I'd just replace the block to
CSV::open with count itself:

  count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
  CSV::open(path,"r", &count)

Then, since count isn't used anywhere else, I'd join those together:

  CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}

After those transformations:

  galadriel:~ lukfugl$ cat a.rb
  require "csv"
  require "yaml"

  path = ARGV.shift
  sum = Hash::new{|h,k| h[k] = 0}
  CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}
  y sum.delete_if{|k,v| v == 1}

  galadriel:~ lukfugl$ cat in.csv
  0,aaaaaa___
  1,aaaaaa___
  2,aaabbb___
  3,aaabbb___
  4,aaabbb___
  5,aaaccc___

  galadriel:~ lukfugl$ ruby a.rb in.csv

  ---
  aaaaaa: 2
  aaabbb: 3

Just seems a little clearer to me over having an extra one-time use lambda.

Jacob Fugal
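
The three forms discussed above can be compared side by side. This is only a sketch: it iterates over an in-memory array with each instead of CSV::open, and the rows variable and the sum_a/sum_b/sum_c names are assumptions made for the comparison.

```ruby
# Stand-in for the parsed CSV rows (first three rows of in.csv).
rows = [["0", "aaaaaa___"], ["1", "aaaaaa___"], ["2", "aaabbb___"]]

# 1. A named lambda, called from inside a block.
sum_a = Hash.new(0)
count_a = lambda { |row| sum_a[row.last.to_s[0, 6]] += 1 }
rows.each { |row| count_a[row] }

# 2. The same lambda handed over as the block itself with &.
sum_b = Hash.new(0)
count_b = lambda { |row| sum_b[row.last.to_s[0, 6]] += 1 }
rows.each(&count_b)

# 3. The logic written inline in the block.
sum_c = Hash.new(0)
rows.each { |row| sum_c[row.last.to_s[0, 6]] += 1 }

# All three hashes end up with the same counts.
puts sum_a == sum_b && sum_b == sum_c
```

Which one reads best is the matter of taste being debated here; the behaviour is the same.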

William James wrote:

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on
how you "old-timers" would write the following script using Ruby
idioms.

The intent of the script is to parse a CSV file that contains 2
fields per row, sorted on the second field. There may be multiple
rows for field 2. I want to get a list of all of the unique values
of field2 that has more than 1 value for the 1st 6 characters of
field 1.

--- input data -----
123456ab,900
123456cd,900
123456ef,909
012345gh,909
--- end of input -----

--- Using a hash of arrays:

require 'csv'

h = Hash.new{ [] }

I wonder how this works since the Hash never stores these arrays.

CSV::Reader.parse(File.open( ARGV.first )) { |row|
  h[row.last] |= [ row.first[0,6] ] }
p h.delete_if{|k,v| v.size == 1 }

--- output -----
{"909"=>["123456", "012345"]}
--- end of output -----

Is this really the output of the script above?

    robert
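
On the question of how the arrays get stored: assuming the default block returns an empty array (Hash.new { [] }), the block itself never stores anything, but |= is an assignment. A small sketch, with the variable empty_after_read added only to make the point visible:

```ruby
h = Hash.new { [] }   # the block returns a fresh array but does not store it

h["909"]                      # a plain read leaves the hash empty
empty_after_read = h.empty?

# h[k] |= [x] is shorthand for h[k] = h[k] | [x]: the default array is
# unioned with [x] and the result is assigned, which is what stores it.
h["909"] |= ["123456"]
h["909"] |= ["123456"]        # union, so no duplicate is added
h["909"] |= ["012345"]

p empty_after_read
p h
```

So the hash ends up holding one array per key with each prefix listed once, which matches the output shown in the quoted post.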

Robert Klemme wrote:

Todd Breiholz wrote:

I'm just getting my feet wet with Ruby and would like some advice on
how you "old-timers" would write the following script using Ruby
idioms.

The intent of the script is to parse a CSV file that contains 2
fields per row, sorted on the second field. There may be multiple
rows for field 2. I want to get a list of all of the unique values of
field2 that has more than 1 value for the 1st 6 characters of field
1.

There are two possible interpretations of what you state here:

1. You want all values for field 2 that occur more than once.

Just remembered that the file is sorted. Then this implementation of case
1 is even more efficient as it does not store values in memory and works on
arbitrarily large files:

require 'csv'

last = nil
CSV::Reader.parse(ARGF) do |row|
  last, k = row[1], last
  puts k if last == k
end

Kind regards

    robert
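
A variation on the same idea, as a sketch only: Enumerable#chunk groups consecutive rows with the same key, so each repeated key is reported once per run rather than once per extra row. It holds one run of rows at a time plus the resulting keys. The helper name repeated_keys and the extra 910 row are assumptions for the example, and it uses the current CSV API rather than CSV::Reader.

```ruby
require "csv"
require "stringio"

# Return each field-2 value that appears on more than one consecutive row.
# `rows` is any Enumerable of [field1, field2] arrays sorted on field 2.
# Note that chunk drops rows whose key is nil (rows with no second field).
def repeated_keys(rows)
  rows.chunk { |row| row[1] }
      .select { |_key, group| group.size > 1 }
      .map { |key, _group| key }
end

# StringIO stands in for a file or ARGF here.
io = StringIO.new("123456ab,900\n123456cd,900\n123456ef,909\n012345gh,909\n555555zz,910\n")
keys = repeated_keys(CSV.new(io))
puts keys
```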

Jacob Fugal wrote:

  sum = Hash::new{|h,k| h[k] = 0}

And for some reason, I tend to write

sum = Hash.new(0)

when dealing with an immediate value. (But maybe it's a better practice
to use Ara's form, so that if you ever replace 0 with, say, a matrix,
you don't reuse the same object for each key in the hash.)
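
A short sketch of the difference, using an array in place of the matrix mentioned above (all variable names are made up for the illustration):

```ruby
# Hash.new(obj) hands back the same object for every missing key.
shared = Hash.new([])
shared[:a] << 1            # mutates the one default array; stores nothing
shared_b = shared[:b]      # a different key sees the same, mutated array
shared_keys = shared.keys  # still no keys in the hash

# The block form builds (and here stores) a fresh object per key.
fresh = Hash.new { |hash, key| hash[key] = [] }
fresh[:a] << 1
fresh_b = fresh[:b]        # a separate, empty array

# With an immediate value such as 0, += reassigns instead of mutating,
# so Hash.new(0) is safe for counters.
counts = Hash.new(0)
counts[:a] += 1
counts_b = counts[:b]

p shared_b, shared_keys, fresh_b, counts[:a], counts_b
```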

--
      vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

   require "csv"
   require "yaml"

   path = ARGV.shift
   sum = Hash::new{|h,k| h[k] = 0}
   count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
   CSV::open(path,"r"){|row| count[row]}
   y sum.delete_if{|k,v| v == 1}

I'm curious why you decided to make `count` its own lambda when:

1) It's only ever used once
2) The block that uses it has only one statement, namely the call to `count`
3) count and the block to CSV::open have the same signature

it's for abstraction only. i wrote how to count before writing the csv open
line. when i wrote it, i ended up with something like

   CSV::open(path,"r"){|row| p row; count[row]}

during editing - as i always seem to for debugging ;-)

basically i find

   {{{{}}}}

tough to read sometimes and factor out things using lambda. it's rare that it
actually ends up being the only thing left as in this case - but here you
are quite right that it can be compacted.

I think at a minimum, given 2) and 3), I'd just replace the block to
CSV::open with count itself:

count = lambda{|row| sum[row.last.to_s[0,6]] += 1}
CSV::open(path,"r", &count)

Then, since count isn't used anywhere else, I'd join those together:

CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}

but i disagree here. people, esp nubies will look at that and say - what?
whereas reading

   count = lambda{|row| sum[row.last.to_s[0,6]] += 1}

   ... count[row] ...

is pretty clear. i often use variables as comments to others and myself. eg.
what does this do:

   password = "#{ sifname }_#{ eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect ) }"

hard to say huh?

how about this?

   four_random_printable_chars = eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect )
   password = "#{ sifname }_#{ four_random_printable_chars }"

ugly (yes i'm hacking like crazy today) but at least anyone reading it (most
importantly me) knows what i'm trying to do if not how!

anyhow - same goes with 'count': it's all good until you start cutting and
pasting - then you want vars not wicked expressions to move around.
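
For comparison, a sketch of the same naming idea without eval. It is not equivalent to the one-liner above: it assumes printable ASCII only (space through tilde) instead of filtering 0...256 with [[:print:]], and the sifname value is a placeholder.

```ruby
# Printable ASCII only; an assumption narrower than the original filter.
PRINTABLE = (32..126).map(&:chr).freeze

# Pick four distinct printable characters at random and join them.
def four_random_printable_chars
  PRINTABLE.sample(4).join
end

sifname = "example"   # hypothetical value, not from the original post
password = "#{sifname}_#{four_random_printable_chars}"
puts password
```

Array#sample(4) picks four distinct elements, which mirrors taking the first four entries after sort_by{rand}.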

Just seems a little clearer to me over having an extra one-time use lambda.

__iff__ you are good at reading ruby ;-)

cheers.

-a

···

On Fri, 3 Feb 2006, Jacob Fugal wrote:

On 2/2/06, ara.t.howard@noaa.gov <ara.t.howard@noaa.gov> wrote:

--
happiness is not something ready-made. it comes from your own actions.
- h.h. the 14th dali lama

> I'm curious why you decided to make `count` its own lambda when:
>
> 1) It's only ever used once
> 2) The block that uses it has only one statement, namely the call to `count`
> 3) count and the block to CSV::open have the same signature

it's for abstraction only.

<snip>

basically i find

   {{{{}}}}

tough to read sometimes and factor out things using lambda. it's rare that it
actually ends up being the only thing left as in this case - but here you
are quite right that it can be compacted.

Yeah, I agree. I often use similar abstraction techniques for
readability. My brain just has the tendency to refactor code inwards
as well as outwards when an abstraction seems extraneous.

> CSV::open(path,"r"){|row| sum[row.last.to_s[0,6]] += 1}

but i disagree here. people, esp nubies will look at that and say - what?
whereas reading

   count = lambda{|row| sum[row.last.to_s[0,6]] += 1}

   ... count[row] ...

is pretty clear. i often use variables as comments to others and myself.

Again, agreed. In this case though I don't think the abstraction of
naming sum[...] += 1 as count is a necessary one. If I were to
refactor part of the complex expression

  sum[row.last.to_s[0,6]] += 1

to improve readability, it would be the index:

  identifier_prefix = lambda{ |row| row.last.to_s[0,6] }
  ... sum[identifier_prefix[row]] += 1 ...

what does this do:

   password = "#{ sifname }_#{ eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect ) }"

hard to say huh?

Ick, yes, I'd definitely split that into chunks. :-)

how about this?

   four_random_printable_chars = eval( ((0...256).to_a.map{|c| c.chr}.sort_by{rand}.select{|c| c =~ %r/[[:print:]]/})[0,4].join.inspect )
   password = "#{ sifname }_#{ four_random_printable_chars }"

ugly (yes i'm hacking like crazy today) but at least anyone reading it (most
importantly me) knows what i'm trying to do if not how!

If you say so... ;-)

Jacob Fugal
