# Sorting problem with an Array of Arrays

Hi there, I have an array of arrays that looks like the following:

arr1 = [["ABC-1", 1271768400, 2], ["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2], ["ABC-4", 1271869200, 2],
["AAA-1", 1271862000, 2], ["DEF-1", 1271772000, 1]]

Desired sort order:
(1) label (1st element = all unique)
(2) different labels (e.g. AAA, DEF) need to be inserted according to
timestamp (2nd element), but maintain (1) label sort order; so AAA-2
can't come before AAA-1. (I don't have that data point here but it
happens)

=> I don't know how to do (2), so I just have the AoA sorted by (1)
right now.

Is there a way to do this? I'm trying to find duplicate and
overlapping timestamps but my current method for checking this is
getting complicated, so I thought I'd check to see if there's a good
way to sort the data properly in the first place.

TIA.

You could define a class to represent each of your inner array elements.
That class could then define the <=> method in which you could compare
each item in the order you desire. Then your outer array, consisting of
instances of your class, could be trivially sorted with Array#sort.

-Jeremy

···

On 9/22/2010 5:35 PM, Paul wrote:

Hi there, I have an array of arrays that looks like the following:

arr1 = [["ABC-1", 1271768400, 2], ["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2], ["ABC-4", 1271869200, 2],
["AAA-1", 1271862000, 2], ["DEF-1", 1271772000, 1]]

Desired sort order:
(1) label (1st element = all unique)
(2) different labels (e.g. AAA, DEF) need to be inserted according to
timestamp (2nd element), but maintain (1) label sort order; so AAA-2
can't come before AAA-1. (I don't have that data point here but it
happens)

=> I don't know how to do (2), so I just have the AoA sorted by (1)
right now.

Is there a way to do this? I'm trying to find duplicate and
overlapping timestamps but my current method for checking this is
getting complicated, so I thought I'd check to see if there's a good
way to sort the data properly in the first place.

Silly me. You can also pass a block to the Array#sort method. That
block will receive 2 arguments which you may then compare as you wish,
returning -1 if the first argument sorts before the last, 0 if they are
equal, and 1 if the last argument sorts before the first.

-Jeremy

···

On 9/22/2010 5:43 PM, Jeremy Bopp wrote:

On 9/22/2010 5:35 PM, Paul wrote:

Hi there, I have an array of arrays that looks like the following:

arr1 = [["ABC-1", 1271768400, 2], ["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2], ["ABC-4", 1271869200, 2],
["AAA-1", 1271862000, 2], ["DEF-1", 1271772000, 1]]

Desired sort order:
(1) label (1st element = all unique)
(2) different labels (e.g. AAA, DEF) need to be inserted according to
timestamp (2nd element), but maintain (1) label sort order; so AAA-2
can't come before AAA-1. (I don't have that data point here but it
happens)

=> I don't know how to do (2), so I just have the AoA sorted by (1)
right now.

Is there a way to do this? I'm trying to find duplicate and
overlapping timestamps but my current method for checking this is
getting complicated, so I thought I'd check to see if there's a good
way to sort the data properly in the first place.

You could define a class to represent each of your inner array elements.
That class could then define the <=> method in which you could compare
each item in the order you desire. Then your outer array, consisting of
instances of your class, could be trivially sorted with Array#sort.

arr2 = arr1.sort { |x,y| (x[0] <=> y[0]) * 10 + (x[1] <=> y[1]) }

or

arr1.sort! { |x,y| (x[0] <=> y[0]) * 10 + (x[1] <=> y[1]) }

···

On Sep 22, 6:54 pm, Jeremy Bopp <jer...@bopp.net> wrote:

On 9/22/2010 5:43 PM, Jeremy Bopp wrote:

> On 9/22/2010 5:35 PM, Paul wrote:
>> Hi there, I have an array of arrays that looks like the following:

>> arr1 = [["ABC-1", 1271768400, 2], ["ABC-2", 1271773800, 1],
>> ["ABC-3", 1271863200, 2], ["ABC-4", 1271869200, 2],
>> ["AAA-1", 1271862000, 2], ["DEF-1", 1271772000, 1]]

>> Desired sort order:
>> (1) label (1st element = all unique)
>> (2) different labels (e.g. AAA, DEF) need to be inserted according to
>> timestamp (2nd element), but maintain (1) label sort order; so AAA-2
>> can't come before AAA-1. (I don't have that data point here but it
>> happens)

>> => I don't know how to do (2), so I just have the AoA sorted by (1)
>> right now.

>> Is there a way to do this? I'm trying to find duplicate and
>> overlapping timestamps but my current method for checking this is
>> getting complicated, so I thought I'd check to see if there's a good
>> way to sort the data properly in the first place.

> You could define a class to represent each of your inner array elements.
> That class could then define the <=> method in which you could compare
> each item in the order you desire. Then your outer array, consisting of
> instances of your class, could be trivially sorted with Array#sort.

Silly me. You can also pass a block to the Array#sort method. That
block will receive 2 arguments which you may then compare as you wish,
returning -1 if the first argument sorts before the last, 0 if they are
equal, and 1 if the last argument sorts before the first.

-Jeremy

Hi Jeremy, thanks for the reply.

> You could define a class to represent each of your inner array elements.
> That class could then define the <=> method in which you could compare
> each item in the order you desire. Then your outer array, consisting of
> instances of your class, could be trivially sorted with Array#sort.

I'm afraid I don't understand what you are saying here. All I
understood was "could be trivially sorted with Array#sort." I don't
understand classes.

Silly me. You can also pass a block to the Array#sort method. That
block will receive 2 arguments which you may then compare as you wish,
returning -1 if the first argument sorts before the last, 0 if they are
equal, and 1 if the last argument sorts before the first.

-Jeremy

Do you have an example of what this might look like? To clarify what
I want to see, here is how the data should be sorted:

[["ABC-1", 1271768400, 2],
["DEF-1", 1271772000, 1],
["ABC-2", 1271773800, 1],
["AAA-1", 1271862000, 2],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2]]

When sorted correctly, I can see the data input problem with the
'ABC-3' time stamp. The label sort is important except when the
labels change, then they need to be sorted numerically.

···

On Sep 22, 6:54 pm, Jeremy Bopp <jer...@bopp.net> wrote:

Hi Jeremy, thanks for the reply.

> You could define a class to represent each of your inner array elements.
> That class could then define the <=> method in which you could compare
> each item in the order you desire. Then your outer array, consisting of
> instances of your class, could be trivially sorted with Array#sort.

I'm afraid I don't understand what you are saying here. All I
understood was "could be trivially sorted with Array#sort." I don't
understand classes.

Silly me. You can also pass a block to the Array#sort method. That
block will receive 2 arguments which you may then compare as you wish,
returning -1 if the first argument sorts before the last, 0 if they are
equal, and 1 if the last argument sorts before the first.

-Jeremy

Do you have an example of what this might look like? To clarify what
I want to see, here is how the data should be sorted:

[["ABC-1", 1271768400, 2],
["DEF-1", 1271772000, 1],
["ABC-2", 1271773800, 1],
["AAA-1", 1271862000, 2],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2]]

Does this mean the array has to be sorted based on timestamps (i.e:
the second element of each subarray)?
If this is so, then this works:

irb(main):004:0> arr1.sort_by {|x| x[1]}
=> [["ABC-1", 1271768400, 2], ["DEF-1", 1271772000, 1], ["ABC-2",
1271773800, 1], ["AAA-1", 1271862000, 2], ["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2]]

When sorted correctly, I can see the data input problem with the
'ABC-3' time stamp. The label sort is important except when the
labels change, then they need to be sorted numerically.

I don't understand this. What do you mean by "when the labels change?"

Jesus.

···

On Thu, Sep 23, 2010 at 6:30 AM, Paul <tester.paul@gmail.com> wrote:

On Sep 22, 6:54 pm, Jeremy Bopp <jer...@bopp.net> wrote:

Classes are rather a central concept in Ruby, so it would be a really
good idea to learn about them; however, they are not necessary here.
I'll keep this short and say that using an array of arrays such as what
you have here can make your code cryptic and brittle as it grows and
ages. Using a class to represent your inner array items will help avoid
these issues.

-Jeremy

···

On 9/22/2010 11:30 PM, Paul wrote:

Hi Jeremy, thanks for the reply.

On Sep 22, 6:54 pm, Jeremy Bopp <jer...@bopp.net> wrote:

You could define a class to represent each of your inner array elements.
That class could then define the <=> method in which you could compare
each item in the order you desire. Then your outer array, consisting of
instances of your class, could be trivially sorted with Array#sort.

I'm afraid I don't understand what you are saying here. All I
understood was "could be trivially sorted with Array#sort." I don't
understand classes.

I'm not sure anybody really explained this well.

arr.sort! do |a,b|
#stuff to sort the array goes here
end

What that does is it puts one array element in a, and one in b. So, for example, you might have
a = ["DEF-1", 1271772000, 1]
and
b = ["ABC-2", 1271773800, 1]

So what you have to do in the "#stuff to sort the array goes here" is whatever code you need to have to figure out which one of those two comes first. It sounds like you need something like a[0].split("-")[0] to extract out the "DEF" or the "ABC" first, in order to sort by the label? Because a[0] is, of course, "DEF-1".

sortorder = (a[0].split("-")[0] <=> b[0].split("-")[0])
if sortorder == 0 then
a[1]<=>b[1]
else
sortorder
end

(You might need to look up what the <=> operator does...also, that's not how I would normally write that code. I was verbose for clarity's sake. ) If I understand your explanation correctly, then that code will NOT sort your array in the order you want, but it should help you understand how to use the .sort function to get there.

···

On Sep 22, 2010, at 21:30 , Paul wrote:

Silly me. You can also pass a block to the Array#sort method. That
block will receive 2 arguments which you may then compare as you wish,
returning -1 if the first argument sorts before the last, 0 if they are
equal, and 1 if the last argument sorts before the first.

Do you have an example of what this might look like? To clarify what
I want to see, here is how the data should be sorted:

[["ABC-1", 1271768400, 2],
["DEF-1", 1271772000, 1],
["ABC-2", 1271773800, 1],
["AAA-1", 1271862000, 2],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2]]

Does this mean the array has to be sorted based on timestamps (i.e:
the second element of each subarray)?

Yes and no. I can't sort by timestamps as a primary sort.

> When sorted correctly, I can see the data input problem with the
> 'ABC-3' time stamp. The label sort is important except when the
> labels change, then they need to be sorted numerically.

I don't understand this. What do you mean by "when the labels change?"

Jesus.

The arrays represent timesheet records for work done. Each person on
the crew submits their records but sometimes they work on projects
with others. The majority of the record labels will be for a single
person (e.g. the ABC-*) but from time to time a different label will
show up (AAA, DEF, ...) because they worked a project with someone
else.

The ruby script I have goes through the list of records and looks for
double-bookings and other mistakes made during data entry. For
example, sometimes someone writes 10 PM when it was 10 AM. That's why
the label order is primary - it represents the order in which the jobs
happened. The timestamps themselves might be wrong.

However, when a new/different label appears, the "best guess" scenario
is to insert it into the list according to timestamp and rerun the
analysis to check that the timestamps are correct (no double-bookings,
etc.)

I wrote a method that does a recursive analysis on the data using 2
separately sorted arrays - one by label and one by timestamp. It's
weird and a bit complex but it works. If I could find a way to sort
the data properly in the first place, then I can simplify the code
checking that takes place after it.

Does this help?

TIA.

···

On Sep 23, 3:22 am, Jesús Gabriel y Galán <jgabrielyga...@gmail.com> wrote:

\$cat h1.rb
#!/usr/bin/ruby
puts [
["DEF-1", 1271772000, 1],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["AAA-1", 1271862000, 2],
["ABC-4", 1271869200, 2],
["ABC-1", 1271768400, 2]
].sort{ |x,y| (x[1] <=> y[1]) * 10 + (x[0] <=> y[0]) }.inspect

\$./h1.rb
[["ABC-1", 1271768400, 2], ["DEF-1", 1271772000, 1], ["ABC-2",
1271773800, 1], ["AAA-1", 1271862000, 2], ["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2]]
\$

···

On Sep 23, 11:27 am, Jeremy Bopp <jer...@bopp.net> wrote:

On 9/22/2010 11:30 PM, Paul wrote:

> Hi Jeremy, thanks for the reply.

> On Sep 22, 6:54 pm, Jeremy Bopp <jer...@bopp.net> wrote:

>>> You could define a class to represent each of your inner array elements.
>>> That class could then define the <=> method in which you could compare
>>> each item in the order you desire. Then your outer array, consisting of
>>> instances of your class, could be trivially sorted with Array#sort.

> I'm afraid I don't understand what you are saying here. All I
> understood was "could be trivially sorted with Array#sort." I don't
> understand classes.

Classes are rather a central concept in Ruby, so it would be a really
good idea to learn about them; however, they are not necessary here.
I'll keep this short and say that using an array of arrays such as what
you have here can make your code cryptic and brittle as it grows and
ages. Using a class to represent your inner array items will help avoid
these issues.

-Jeremy

What you appear to be describing is actually an ordered insert operation
rather than a strict sort operation. Array#sort is abstracted such that
you cannot know at any given time what is already sorted in your array,
so you cannot conditionally change your primary sort key from label to
timestamp based upon what is currently the last entry of the sorted array.

You need to define a custom insert function. Beware that your
description of labels and how they "change" is ambiguous, so you may
need to modify the label comparison logic in the function to capture
what it really means to have different labels:

def special_insert(arr, new_item)
if arr.empty? || new_item[0] == arr.last[0] then
# If the array is empty or the label of the new item is the same
# as the label of the last item in the array, append the new item
# to the array.
arr << new_item
else
# Otherwise, insert the new item by its timestamp.
idx = arr.index { |item| new_item[1] < item[1] }
if idx.nil? then
# The new item is the oldest, so append it.
arr << new_item
else
# Otherwise, insert the new item before the first item younger
# than it is.
arr.insert(idx, new_item)
end
end
end

unsorted_arr = [
["AAA-1", 1271862000, 2],
["ABC-1", 1271768400, 2],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2],
["DEF-1", 1271772000, 1]
]

sorted_arr =

unsorted_arr.each do |item|
special_insert(sorted_arr, item)
end

puts sorted_arr.collect { |item| "[#{item.join(", ")}]" }.join("\n")

···

On 9/23/2010 9:20 AM, Paul wrote:

The arrays represent timesheet records for work done. Each person on
the crew submits their records but sometimes they work on projects
with others. The majority of the record labels will be for a single
person (e.g. the ABC-*) but from time to time a different label will
show up (AAA, DEF, ...) because they worked a project with someone
else.

The ruby script I have goes through the list of records and looks for
double-bookings and other mistakes made during data entry. For
example, sometimes someone writes 10 PM when it was 10 AM. That's why
the label order is primary - it represents the order in which the jobs
happened. The timestamps themselves might be wrong.

However, when a new/different label appears, the "best guess" scenario
is to insert it into the list according to timestamp and rerun the
analysis to check that the timestamps are correct (no double-bookings,
etc.)

I wrote a method that does a recursive analysis on the data using 2
separately sorted arrays - one by label and one by timestamp. It's
weird and a bit complex but it works. If I could find a way to sort
the data properly in the first place, then I can simplify the code
checking that takes place after it.

The approach with multiplying is fragile. I would prefer one of these
approaches:

10:03:18 tmp\$ cat x.rb

require 'pp'

data = [
["DEF-1", 1271772000, 1],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["AAA-1", 1271862000, 2],
["ABC-4", 1271869200, 2],
["ABC-1", 1271768400, 2]
]

pp data.sort_by {|x| [x[1], x[0]]}

pp data.sort do |(la,ta,na),(lb,tb,nb)|
c = ta <=> tb
c == 0 ? la <=> lb : c
end

10:03:24 tmp\$ ruby19 x.rb
[["ABC-1", 1271768400, 2],
["DEF-1", 1271772000, 1],
["ABC-2", 1271773800, 1],
["AAA-1", 1271862000, 2],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2]]
[["AAA-1", 1271862000, 2],
["ABC-1", 1271768400, 2],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2],
["DEF-1", 1271772000, 1]]
10:03:26 tmp\$

However, as Jeremy said already, using specific classes is much more
preferable because it makes the code easier to read and less error
prone. In this case it can be as simple as

10:13:34 tmp\$ cat y.rb

require 'pp'

Booking = Struct.new :label, :timestamp, :room do
include Comparable

# defines default ordering:
# 1. timestamp
# 2. label
def <=> o
c = timestamp <=> o.timestamp
c == 0 ? label <=> o.label : c
end
end

data = [
Booking["DEF-1", 1271772000, 1],
Booking["ABC-2", 1271773800, 1],
Booking["ABC-3", 1271863200, 2],
Booking["AAA-1", 1271862000, 2],
Booking["ABC-4", 1271869200, 2],
Booking["ABC-1", 1271768400, 2]
]

pp data.sort

10:13:37 tmp\$ ruby19 y.rb
[#<struct Booking label="ABC-1", timestamp=1271768400, room=2>,
#<struct Booking label="DEF-1", timestamp=1271772000, room=1>,
#<struct Booking label="ABC-2", timestamp=1271773800, room=1>,
#<struct Booking label="AAA-1", timestamp=1271862000, room=2>,
#<struct Booking label="ABC-3", timestamp=1271863200, room=2>,
#<struct Booking label="ABC-4", timestamp=1271869200, room=2>]
10:13:41 tmp\$

Btw, I'd also use Time instances for the timestamps but class Booking
also works with Fixnums which you have in your data.

Kind regards

robert

···

On Thu, Sep 23, 2010 at 10:45 PM, (r.*n){2} <perl2ruby@gmail.com> wrote:

On Sep 23, 11:27 am, Jeremy Bopp <jer...@bopp.net> wrote:

On 9/22/2010 11:30 PM, Paul wrote:

> Hi Jeremy, thanks for the reply.

> On Sep 22, 6:54 pm, Jeremy Bopp <jer...@bopp.net> wrote:

>>> You could define a class to represent each of your inner array elements.
>>> That class could then define the <=> method in which you could compare
>>> each item in the order you desire. Then your outer array, consisting of
>>> instances of your class, could be trivially sorted with Array#sort.

> I'm afraid I don't understand what you are saying here. All I
> understood was "could be trivially sorted with Array#sort." I don't
> understand classes.

Classes are rather a central concept in Ruby, so it would be a really
good idea to learn about them; however, they are not necessary here.
I'll keep this short and say that using an array of arrays such as what
you have here can make your code cryptic and brittle as it grows and
ages. Using a class to represent your inner array items will help avoid
these issues.

-Jeremy

\$cat h1.rb
#!/usr/bin/ruby
puts [
["DEF-1", 1271772000, 1],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["AAA-1", 1271862000, 2],
["ABC-4", 1271869200, 2],
["ABC-1", 1271768400, 2]
].sort{ |x,y| (x[1] <=> y[1]) * 10 + (x[0] <=> y[0]) }.inspect

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Wonder what the performance of this would be if there were 10,000
entries?

···

On Sep 23, 4:46 pm, Jeremy Bopp <jer...@bopp.net> wrote:

On 9/23/2010 9:20 AM, Paul wrote:

> The arrays represent timesheet records for work done. Each person on
> the crew submits their records but sometimes they work on projects
> with others. The majority of the record labels will be for a single
> person (e.g. the ABC-*) but from time to time a different label will
> show up (AAA, DEF, ...) because they worked a project with someone
> else.

> The ruby script I have goes through the list of records and looks for
> double-bookings and other mistakes made during data entry. For
> example, sometimes someone writes 10 PM when it was 10 AM. That's why
> the label order is primary - it represents the order in which the jobs
> happened. The timestamps themselves might be wrong.

> However, when a new/different label appears, the "best guess" scenario
> is to insert it into the list according to timestamp and rerun the
> analysis to check that the timestamps are correct (no double-bookings,
> etc.)

> I wrote a method that does a recursive analysis on the data using 2
> separately sorted arrays - one by label and one by timestamp. It's
> weird and a bit complex but it works. If I could find a way to sort
> the data properly in the first place, then I can simplify the code
> checking that takes place after it.

What you appear to be describing is actually an ordered insert operation
rather than a strict sort operation. Array#sort is abstracted such that
you cannot know at any given time what is already sorted in your array,
so you cannot conditionally change your primary sort key from label to
timestamp based upon what is currently the last entry of the sorted array.

You need to define a custom insert function. Beware that your
description of labels and how they "change" is ambiguous, so you may
need to modify the label comparison logic in the function to capture
what it really means to have different labels:

def special_insert(arr, new_item)
if arr.empty? || new_item[0] == arr.last[0] then
# If the array is empty or the label of the new item is the same
# as the label of the last item in the array, append the new item
# to the array.
arr << new_item
else
# Otherwise, insert the new item by its timestamp.
idx = arr.index { |item| new_item[1] < item[1] }
if idx.nil? then
# The new item is the oldest, so append it.
arr << new_item
else
# Otherwise, insert the new item before the first item younger
# than it is.
arr.insert(idx, new_item)
end
end
end

unsorted_arr = [
["AAA-1", 1271862000, 2],
["ABC-1", 1271768400, 2],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2],
["DEF-1", 1271772000, 1]
]

sorted_arr =

unsorted_arr.each do |item|
special_insert(sorted_arr, item)
end

puts sorted_arr.collect { |item| "[#{item.join(", ")}]" }.join("\n")

Hi Jeremy, I tried this code but I get an error on the line

idx = arr.index { |item| new_item[1] < item[1] }

:in `index': wrong number of arguments (0 for 1) (ArgumentError)

Can you tell me how I can get your code below to work?

You also mentioned working with classes (in a previous reply) as an
alternate way of organizing this data. Do you know where I can find
examples or docs to help me learn how to do this?

I still don't have an algorithm that finds all the problems that I can
catch by eye. It's just hard to catch them all when looking at lists
of 1000's of records.

TIA

···

On Sep 23, 4:46 pm, Jeremy Bopp <jer...@bopp.net> wrote:

What you appear to be describing is actually an ordered insert operation
rather than a strict sort operation. Array#sort is abstracted such that
you cannot know at any given time what is already sorted in yourarray,
so you cannot conditionally change your primary sort key from label to
timestamp based upon what is currently the last entry of the sortedarray.

You need to define a custom insert function. Beware that your
description of labels and how they "change" is ambiguous, so you may
need to modify the label comparison logic in the function to capture
what it really means to have different labels:

def special_insert(arr, new_item)
if arr.empty? || new_item[0] == arr.last[0] then
# If thearrayis empty or the label of the new item is the same
# as the label of the last item in thearray, append the new item
# to thearray.
arr << new_item
else
# Otherwise, insert the new item by its timestamp.
idx = arr.index { |item| new_item[1] < item[1] }
if idx.nil? then
# The new item is the oldest, so append it.
arr << new_item
else
# Otherwise, insert the new item before the first item younger
# than it is.
arr.insert(idx, new_item)
end
end
end

unsorted_arr = [
["AAA-1", 1271862000, 2],
["ABC-1", 1271768400, 2],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2],
["DEF-1", 1271772000, 1]
]

sorted_arr =

unsorted_arr.each do |item|
special_insert(sorted_arr, item)
end

puts sorted_arr.collect { |item| "[#{item.join(", ")}]" }.join("\n")

Probably pretty poor overall. The big problem is that timestamp search
in the ever-growing sorted array. Unfortunately, timestamps could well
be out of order since they could be erroneous but ignored when the
labels match up, so we can't do something more efficient like a binary
search to find the insert location.

Basically, this data is potentially pretty messy, and more effort should
probably be spent earlier in the chain to ensure that high quality data
is entered in the first place. That way the error detection heuristic
which needs this ordered array would likely be unnecessary.

-Jeremy

···

On 9/23/2010 4:15 PM, (r.*n){2} wrote:

On Sep 23, 4:46 pm, Jeremy Bopp <jer...@bopp.net> wrote:

On 9/23/2010 9:20 AM, Paul wrote:

The arrays represent timesheet records for work done. Each person on
the crew submits their records but sometimes they work on projects
with others. The majority of the record labels will be for a single
person (e.g. the ABC-*) but from time to time a different label will
show up (AAA, DEF, ...) because they worked a project with someone
else.

The ruby script I have goes through the list of records and looks for
double-bookings and other mistakes made during data entry. For
example, sometimes someone writes 10 PM when it was 10 AM. That's why
the label order is primary - it represents the order in which the jobs
happened. The timestamps themselves might be wrong.

However, when a new/different label appears, the "best guess" scenario
is to insert it into the list according to timestamp and rerun the
analysis to check that the timestamps are correct (no double-bookings,
etc.)

I wrote a method that does a recursive analysis on the data using 2
separately sorted arrays - one by label and one by timestamp. It's
weird and a bit complex but it works. If I could find a way to sort
the data properly in the first place, then I can simplify the code
checking that takes place after it.

What you appear to be describing is actually an ordered insert operation
rather than a strict sort operation. Array#sort is abstracted such that
you cannot know at any given time what is already sorted in your array,
so you cannot conditionally change your primary sort key from label to
timestamp based upon what is currently the last entry of the sorted array.

You need to define a custom insert function. Beware that your
description of labels and how they "change" is ambiguous, so you may
need to modify the label comparison logic in the function to capture
what it really means to have different labels:

def special_insert(arr, new_item)
if arr.empty? || new_item[0] == arr.last[0] then
# If the array is empty or the label of the new item is the same
# as the label of the last item in the array, append the new item
# to the array.
arr << new_item
else
# Otherwise, insert the new item by its timestamp.
idx = arr.index { |item| new_item[1] < item[1] }
if idx.nil? then
# The new item is the oldest, so append it.
arr << new_item
else
# Otherwise, insert the new item before the first item younger
# than it is.
arr.insert(idx, new_item)
end
end
end

unsorted_arr = [
["AAA-1", 1271862000, 2],
["ABC-1", 1271768400, 2],
["ABC-2", 1271773800, 1],
["ABC-3", 1271863200, 2],
["ABC-4", 1271869200, 2],
["DEF-1", 1271772000, 1]
]

sorted_arr =

unsorted_arr.each do |item|
special_insert(sorted_arr, item)
end

puts sorted_arr.collect { |item| "[#{item.join(", ")}]" }.join("\n")

Wonder what the performance of this would be if there were 10,000
entries?

Hi Jeremy, I tried this code but I get an error on the line

idx = arr.index { |item| new_item[1] < item[1] }

:in `index': wrong number of arguments (0 for 1) (ArgumentError)

Can you tell me how I can get your code below to work?

What version of Ruby are you using? Array#index as of Ruby 1.8.7
supports that usage. Maybe your Ruby is too old. If that's the case,
maybe you could rewrite that line and the one following it as follows
(not tested in context):

idx = arr.each_with_index do |item, index|
break index if new_item[1] < item[1]
end
if idx.object_id == arr.object_id then

You also mentioned working with classes (in a previous reply) as an
alternate way of organizing this data. Do you know where I can find
examples or docs to help me learn how to do this?

The standard library has a couple classes which can simplify this for
you a bit, namely Struct and OpenStruct. Search in the class list here:

http://rdoc.info/docs/ruby-stdlib/1.8.7/frames

I'm fond of OpenStruct, but those classes will only help you avoid using
the inner arrays to store the fields for each item of your main array.
That will provide some clarity to your code since you won't have to
memorize array indexes for your fields, but it won't do much more than that.

I still don't have an algorithm that finds all the problems that I can
catch by eye. It's just hard to catch them all when looking at lists
of 1000's of records.

I don't envy you your task one bit. You may want to see if you can
estimate the error rate in your data and then try to limit your work to
identifying and correcting a certain percentage of that error. 100% is
not likely from the sound of things, but 80% might be possible.

-Jeremy

···

On 10/5/2010 4:10 PM, Paul wrote:

Thanks Jeremy. I will try the new methods tomorrow.

As for your latter point, I hear you! I'm working on something to
hopefully address this point. This script _is_ a temporary measure,
and it provides value to me right now.

Cheers! Paul.

···

On Sep 23, 5:41 pm, Jeremy Bopp <jer...@bopp.net> wrote:

> Wonder what the performance of this would be if there were 10,000
> entries?

Probably pretty poor overall. The big problem is that timestamp search
in the ever-growing sorted array. Unfortunately, timestamps could well
be out of order since they could be erroneous but ignored when the
labels match up, so we can't do something more efficient like a binary
search to find the insert location.

Basically, this data is potentially pretty messy, and more effort should
probably be spent earlier in the chain to ensure that high quality data
is entered in the first place. That way the error detection heuristic
which needs this ordered array would likely be unnecessary.

-Jeremy