Sorted arrays

I'm a relative newcomer to Ruby. Most of my experience is in Delphi. And in Delphi one of the most commonly-used classes is TStringList, which is sort of analogous to ruby's Array (Delphi also has dynamic arrays and static arrays). TStringList has a property called Sorted, which if set to True makes it possible to insert strings into the list and have it maintain them as a sorted list (without having to re-sort it each time). Then you can use the IndexOf method (or the Find method) to do a binary search on the list, so you can quickly find the element you're looking for. My question is whether Ruby has anything like this. It seems like one could create a descendant of Array that does this.
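
Something like this rough sketch is what I had in mind (untested and written just to illustrate the idea -- the class and method names are my own invention, not from any existing library):

  class SortedArray < Array
    # Insert value at the position that keeps the array sorted.
    def add(value)
      insert(insertion_index(value), value)
      self
    end

    # Binary search; returns the index of value, or nil if it is absent.
    def index_of(value)
      i = insertion_index(value)
      i < size && self[i] == value ? i : nil
    end

    private

    # Index of the first element >= value (O(log n) comparisons;
    # the insert itself still pays Array's element shifting, of course).
    def insertion_index(value)
      lo, hi = 0, size
      while lo < hi
        mid = (lo + hi) / 2
        if self[mid] < value
          lo = mid + 1
        else
          hi = mid
        end
      end
      lo
    end
  end

  list = SortedArray.new
  %w[pear apple mango].each { |s| list.add(s) }
  list.index_of("mango")   # => 1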

What I'm trying to accomplish is this: I am processing a large number of items (almost 100,000 rows) of data and trying to find the duplicate items. I create an MD5 hash based on all of the elements within each row. Then I check to see if the MD5 value already exists in the list, and if it does, I know the item is a duplicate. If not, then I add it to the list. Very few of the items are duplicated, so most of the time it will be trying to locate the value in the list, and then immediately after that inserting the same value into the list. So there will be a lot of searching and a lot of inserting going on. For that reason, it won't be acceptable to re-sort it each time I insert an item.

Thanks in advance,

Dan

ruby@danb64.com wrote:

I'm a relative newcomer to Ruby. Most of my experience is in Delphi.
And in Delphi one of the most commonly-used classes is TStringList,
which is sort of analogous to ruby's Array (Delphi also has dynamic
arrays and static arrays). TStringList has a property called Sorted,
which if set to True makes it possible to insert strings into the
list and have it maintain them as a sorted list (without having to
re-sort it each time). Then you can use the IndexOf method (or the
Find method) to do a binary search on the list, so you can quickly
find the element you're looking for. My question is whether Ruby has
anything like this. It seems like one could create a descendant of
Array that does this.

This comes up once in a while. I think there is an implementation of a
tree as well as sorted arrays in the RAA. However...

What I'm trying to accomplish is this: I am processing a large
number of items (almost 100,000 rows) of data and trying to find the
duplicate items. I create an MD5 hash based on all of the elements
within each row. Then I check to see if the MD5 value already exists
in the list, and if it does, I know the item is a duplicate. If not,
then I add it to the list. Very few of the items are duplicated, so
most of the time it will be trying to locate the value in the list,
and then immediately after that inserting the same value into the
list. So there will be a lot of searching and a lot of inserting
going on. For that reason, it won't be acceptable to re-sort it each
time I insert an item.

This problem is most easily solved with a Hash or Set: lookups are much
faster than binary search on a sorted array. Only downside is that memory
usage of these is typically worse than that of an array for large data
sets. But it's worth a try as this is an easy change to your app.
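
For example, a minimal sketch with the Set from the standard library (untested; "rows" stands in for however you iterate your input, and the digest is computed with Digest::MD5 as you describe):

  require 'set'
  require 'digest/md5'

  seen = Set.new
  unique_rows = []

  rows.each do |row|                      # rows: your parsed input rows (arrays of fields)
    digest = Digest::MD5.hexdigest(row.join("\t"))
    # Set#add? returns nil if the digest was already present
    unique_rows << row if seen.add?(digest)
  end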

If you draw the data from an RDBMS you should use that for finding
duplicates. You can do something along the lines of

select key1, key2, key3, count(*)
from your_table
group by key1, key2, key3
having count(*) > 1

Kind regards

    robert

should your list be sorted in the end? if not, you could simply use
Hash instead of Array.

···

On 9/6/05, ruby@danb64.com <ruby@danb64.com> wrote:

I'm a relative newcomer to Ruby. Most of my experience is in Delphi. And in Delphi one of the most commonly-used classes is TStringList, which is sort of analogous to ruby's Array (Delphi also has dynamic arrays and static arrays). TStringList has a property called Sorted, which if set to True makes it possible to insert strings into the list and have it maintain them as a sorted list (without having to re-sort it each time). Then you can use the IndexOf method (or the Find method) to do a binary search on the list, so you can quickly find the element you're looking for. My question is whether Ruby has anything like this. It seems like one could create a descendant of Array that does this.

What I'm trying to accomplish is this: I am processing a large number of items (almost 100,000 rows) of data and trying to find the duplicate items. I create an MD5 hash based on all of the elements within each row. Then I check to see if the MD5 value already exists in the list, and if it does, I know the item is a duplicate. If not, then I add it to the list. Very few of the items are duplicated, so most of the time it will be trying to locate the value in the list, and then immediately after that inserting the same value into the list. So there will be a lot of searching and a lot of inserting going on. For that reason, it won't be acceptable to re-sort it each time I insert an item.

Thanks in advance,

Dan

What I'm trying to accomplish is this: I am processing a large
number of items (almost 100,000 rows) of data and trying to find
the duplicate items. [\n inserted by hgs]

If you wish to remove duplicate items, read about uniq ...

I create an MD5 hash based on all of the
elements within each row. Then I check to see if the MD5 value
already exists in the list, and if it does, I know the item is a

Which is basically a search. You may find in practice that for
a largish number of items, it is quicker to search a Hash or a Set
than an array.
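
If you want to convince yourself, a rough (and unscientific) comparison with the benchmark library might look like this -- the numbers on your machine will differ, of course:

  require 'benchmark'
  require 'digest/md5'

  digests  = (1..100_000).map { |i| Digest::MD5.hexdigest(i.to_s) }
  as_array = digests.dup
  as_hash  = {}
  digests.each { |d| as_hash[d] = true }

  probe = digests[50_000]
  Benchmark.bm(7) do |x|
    x.report("array:") { 100.times { as_array.include?(probe) } }
    x.report("hash:")  { 100.times { as_hash.has_key?(probe) } }
  end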

duplicate. If not, then I add it to the list. Very few of them

         [...]

searching and a lot of inserting going on. For that reason, it
won't be acceptable to re-sort it each time I insert an item.

and Hashes don't get sorted, so they are more suited to this.

Thanks in advance,

Other speed tips I've been gathering are at:

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/

you'll see I was trying to solve a similar problem....

Dan

         Hugh

···

On Wed, 7 Sep 2005, ruby@danb64.com wrote:

I'm a relative newcomer to Ruby. Most of my experience is in Delphi.
And in Delphi one of the most commonly-used classes is TStringList,
which is sort of analogous to ruby's Array (Delphi also has dynamic
arrays and static arrays). TStringList has a property called Sorted,
which if set to True makes it possible to insert strings into the
list and have it maintain them as a sorted list (without having to
re-sort it each time). Then you can use the IndexOf method (or the
Find method) to do a binary search on the list, so you can quickly
find the element you're looking for. My question is whether Ruby has
anything like this. It seems like one could create a descendant of
Array that does this.

This comes up once in a while. I think there is an implementation of a
tree as well as sorted arrays in the RAA. However...

this one is great:

   http://raa.ruby-lang.org/project/ruby-rbtree/

it'd be great if it were part of the stdlib imho...
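
from memory the interface is basically a Hash that keeps its keys sorted, roughly like this (untested sketch):

  require 'rbtree'

  t = RBTree.new
  t["banana"] = 2
  t["apple"]  = 1
  t["cherry"] = 3

  t.each { |k, v| puts "#{k} => #{v}" }  # iterates in key order
  p t.lower_bound("b")                   # first pair with key >= "b"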

What I'm trying to accomplish is this: I am processing a large number of
items (almost 100,000 rows) of data and trying to find the duplicate items.
I create an MD5 hash based on all of the elements within each row. Then I
check to see if the MD5 value already exists in the list, and if it does, I
know the item is a duplicate. If not, then I add it to the list. Very few
of the items are duplicated, so most of the time it will be trying to
locate the value in the list, and then immediately after that inserting the
same value into the list. So there will be a lot of searching and a lot of
inserting going on. For that reason, it won't be acceptable to re-sort it
each time I insert an item.

This problem is most easily solved with a Hash or Set: lookups are much
faster than binary search on a sorted array. Only downside is that memory
usage of these is typically worse than that of an array for large data sets.
But it's worth a try as this is an easy change to your app.

If you draw the data from an RDBMS you should use that for finding
duplicates. You can do something along the lines of

select key1, key2, key3, count(*)
from your_table
group by key1, key2, key3
having count(*) > 1

great idea - with sqlite you can even do this in-memory and, of course, w/o a
server.
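
with the sqlite3 bindings it is roughly this (untested sketch - "rows" stands in for however you read your input):

  require 'sqlite3'

  db = SQLite3::Database.new(":memory:")   # no file on disk, no server
  db.execute("create table t (key1 text, key2 text, key3 text)")

  rows.each do |key1, key2, key3|
    db.execute("insert into t values (?, ?, ?)", [key1, key2, key3])
  end

  dups = db.execute("select key1, key2, key3, count(*)
                     from t
                     group by key1, key2, key3
                     having count(*) > 1")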

-a

···

On Wed, 7 Sep 2005, Robert Klemme wrote:

ruby@danb64.com wrote:

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

what did you end up doing? i forgot you were working with csv data - did you
consider using sqlite, loading everything into an in-memory table, and going
from there? could be very fast and quick to code...

on a related note, i did some research into lookups in c using hashes versus
sorted arrays/bsearch; much to my surprise (as a computer scientist) i found
that, with the exception of HUGE (millions of) entries, lookup by bsearch was
an order of magnitude faster than any hashing mechanism i could find. my
test looked at cdb, hsearch, glib hashing functions, gperf and, for bsearch,
the c library bsearch. profiling the different programs showed that the
reason the bsearch was so speedy was the lack of function calls,
and this is easy to imagine - flipping a pointer around memory with only one
stack frame is about as lightweight as one can get... anyhow, my tests were
very specific to my application but interesting nonetheless - thought you'd be
interested.

cheers.

-a

···

On Wed, 7 Sep 2005, Hugh Sasse wrote:

On Wed, 7 Sep 2005, ruby@danb64.com wrote:

What I'm trying to accomplish is this: I am processing a large
number of items (almost 100,000 rows) of data and trying to find
the duplicate items. [\n inserted by hgs]

If you wish to remove duplicate items, read about uniq ...

I create an MD5 hash based on all of the
elements within each row. Then I check to see if the MD5 value
already exists in the list, and if it does, I know the item is a

Which is basically a search. You may find in practice that for
a largish number of items, it is quicker to search a Hash or a Set
than an array.

duplicate. If not, then I add it to the list. Very few of them

       [...]

searching and a lot of inserting going on. For that reason, it
won't be acceptable to re-sort it each time I insert an item.

and Hashes don't get sorted, so they are more suited to this.

Thanks in advance,

Other speed tips I've been gathering are at:

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/

you'll see I was trying to solve a similar problem....

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

Thanks for the help. I switched it to use a Hash, and that performed MUCH better. The data is not coming from any SQL server--it is just a flat file. The Ruby script processes it into a format so that it can be BCP'd into MS SQL Server. And the table that it gets BCP'd into has a unique constraint on the MD5 hash field--hence the need for me to eliminate the duplicate values beforehand. Granted, I could have changed the process of importing the data so that this unique constraint was not necessary, but it seemed better to me to eliminate the duplicates using Ruby in this case.

Thanks again,

Dan

From Robert Klemme <bob.news@gmx.net> on 6 Sep 2005:

···

ruby@danb64.com wrote:
> I'm a relative newcomer to Ruby. Most of my experience is in Delphi.
> And in Delphi one of the most commonly-used classes is TStringList,
> which is sort of analogous to ruby's Array (Delphi also has dynamic
> arrays and static arrays). TStringList has a property called Sorted,
> which if set to True makes it possible to insert strings into the
> list and have it maintain them as a sorted list (without having to
> re-sort it each time). Then you can use the IndexOf method (or the
> Find method) to do a binary search on the list, so you can quickly
> find the element you're looking for. My question is whether Ruby has
> anything like this. It seems like one could create a descendant of
> Array that does this.

This comes up once in a while. I think there is an implementation of a
tree as well as sorted arrays in the RAA. However...

> What I'm trying to accomplish is this: I am processing a large
> number of items (almost 100,000 rows) of data and trying to find the
> duplicate items. I create an MD5 hash based on all of the elements
> within each row. Then I check to see if the MD5 value already exists
> in the list, and if it does, I know the item is a duplicate. If not,
> then I add it to the list. Very few of the items are duplicated, so
> most of the time it will be trying to locate the value in the list,
> and then immediately after that inserting the same value into the
> list. So there will be a lot of searching and a lot of inserting
> going on. For that reason, it won't be acceptable to re-sort it each
> time I insert an item.

This problem is most easily solved with a Hash or Set: lookups are much
faster than binary search on a sorted array. Only downside is that
memory
usage of these is typically worse than that of an array for large data
sets. But it's worth a try as this is an easy change to your app.

If you draw the data from an RDBMS you should use that for finding
duplicates. You can do something along the lines of

select key1, key2, key3, count(*)
from your_table
group by key1, key2, key3
having count(*) > 1

Kind regards

    robert

Ara.T.Howard schrieb:

on a related note, i did some research into lookups in c using hashes
versus sorted arrays/bsearch; much to my surprise (as a computer
scientist) i found that, with the exception of HUGE (millions of)
entries, lookup by bsearch was an order of magnitude faster than any
hashing mechanism i could find. my test looked at cdb, hsearch, glib
hashing functions, gperf and, for bsearch, the c library bsearch.
profiling the different programs showed that the reason the bsearch
was so speedy was the lack of function calls, and this is
easy to imagine - flipping a pointer around memory with only one
stack frame is about as lightweight as one can get... anyhow, my
tests were very specific to my application but interesting
nonetheless - thought you'd be interested.

Interesting. What types of keys and values did you use?

Regards,
Pit

Other speed tips I've been gathering are at:

http://www.eng.cse.dmu.ac.uk/~hgs/ruby/performance/

you'll see I was trying to solve a similar problem....

what did you end up doing? i forgot you were working with csv data - did you
consider using sqlite, loading everything into an in-memory table, and going
from there? could be very fast and quick to code...

I found that the real time-killer was the CSV library itself. This
is fair enough: it takes account of all sorts of things like if the
strings contain quotes and commas. For the input I've been given
this never happens, so it was much faster to use split to separate
the fields -- by which I mean that several hours dropped to about 5
minutes. This was sufficient speedup for my purposes.
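
In essence the fast path is just a split per line, which is only safe because I know the input never contains quoted or escaped commas (a sketch; "input.csv" is a made-up name):

  rows = []
  File.open("input.csv") do |f|
    f.each_line do |line|
      # -1 keeps trailing empty fields that split would otherwise drop
      rows << line.chomp.split(",", -1)
    end
  end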

on a related note, i did some research into lookups in c using hashes versus
sorted arrays/bsearch; much to my surprise (as a computer scientist) i found
that, with the exception of HUGE (millions of) entries, lookup by bsearch was
an order of magnitude faster than any hashing mechanism i could find. my

I'm not too surprised (with this hindsight :-)) because the address
calculations are probably cheaper than the full hashing
we'd-have-to-cope-with-a-whole-bunch-of-hashable-things function.

test looked at cdb, hsearch, glib hashing functions, gperf and, for bsearch,
the c library bsearch. profiling the different programs showed that the
reason the bsearch was so speedy was the lack of function calls,
and this is easy to imagine - flipping a pointer around memory with only one
stack frame is about as lightweight as one can get... anyhow, my tests were
very specific to my application but interesting nonetheless - thought you'd be
interested.

Yes, and I do wonder (as someone who has not looked at the code to
this extent) how much more speed we could wring out of the ruby
implementations.

cheers.

-a

         Thank you,
         Hugh

···

On Wed, 7 Sep 2005, Ara.T.Howard wrote:

On Wed, 7 Sep 2005, Hugh Sasse wrote:

ruby@danb64.com wrote:

Thanks for the help. I switched it to use a Hash, and that performed
MUCH better. The data is not coming from any SQL server--it is just
a flat file. The Ruby script processes it into a format so that it
can be BCP'd into MS SQL Server. And the table that it gets
BCP'd into has a unique constraint on the MD5 hash field--hence the
need for me to eliminate the duplicate values beforehand. Granted, I
could have changed the process of importing the data so that this
unique constraint was not necessary, but it seemed better to me to
eliminate the duplicates using Ruby in this case.

If you're on MS SQL Server there's another option: you can use an index
with IGNORE DUPLICATES: that way you don't have to preprocess the file at
all:

Kind regards

    robert

http://www.rcrchive.net/rcr/show/306

martin

···

Ara.T.Howard <Ara.T.Howard@noaa.gov> wrote:

   http://raa.ruby-lang.org/project/ruby-rbtree/

it'd be great if it were part of the stdlib imho...

strings => lists of strings

the keys were around 32 bytes, the values around 256. the format was custom,
basically the key was null terminated and the list was null terminated and
comma separated (commas were not allowed in list elements). the updates were
very infrequent and i ended up writing a ruby script that read a text file
representing the data, turned it into c programs which declared the data
structure in an ordered way, and compiled this into a shared library that
included a custom search function. it was kind of an object oriented approach
in that the data and functions were together, and programs took advantage of
the fact that the libs are mmap'd into memory so that searches read no more than they
need to find any element. i actually found my weekly report from that week
and i included it below...

cheers.

-a

···

On Wed, 7 Sep 2005, Pit Capitain wrote:

Ara.T.Howard schrieb:

on a related note, i did some research into lookups in c using hashes
versus sorted arrays/bsearch; much to my surprise (as a computer
scientist) i found that, with the exception of HUGE (millions of)
entries, lookup by bsearch was an order of magnitude faster than any
hashing mechanism i could find. my test looked at cdb, hsearch, glib
hashing functions, gperf and, for bsearch, the c library bsearch.
profiling the different programs showed that the reason the bsearch
was so speedy was the lack of function calls, and this is
easy to imagine - flipping a pointer around memory with only one
stack frame is about as lightweight as one can get... anyhow, my
tests were very specific to my application but interesting
nonetheless - thought you'd be interested.

Interesting. What types of keys and values did you use?

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

29.07.2002
______________________________________________________________________________

<snip>

Task 567 35 hrs - Metadata

  * spent a few days running **exhaustive** analysis of several search
     methodologies for libmm

     this all started because i was amazed at the speed at which my search ran

     brief synopsis of tested methods, all methods did 65536 lookups of every
     icao name, averaging the results -- storage formats varied with method,
     but all mapped an icao name to its position in a table (int value)

     #####################################################################
     # METHOD : sqlite
     #####################################################################

       -------------------------------------------------------------------
       AVG TIME FOR 65536 LOOKUPS
   32.623693s
       -------------------------------------------------------------------

     #####################################################################
     # METHOD : berkely db (latest version), hash db type
     #####################################################################

       -------------------------------------------------------------------
       AVG TIME FOR 65536 LOOKUPS
   1.260673268s
       -------------------------------------------------------------------

     #####################################################################
     # METHOD : gperf function encoded into shareable library
     #####################################################################

       -------------------------------------------------------------------
       AVG TIME FOR 65536 LOOKUPS
   0.1376355546
       -------------------------------------------------------------------

     #####################################################################
     # METHOD : libmm roll-my-own binary search
     #####################################################################

       -------------------------------------------------------------------
       AVG TIME FOR 65536 LOOKUPS
   0.01562877542
       -------------------------------------------------------------------

     i also tested a hashtab c library which i plan on using to accomplish
     lookups when the db (shared lib) is not available and the cfg file itself
     need be parsed -- note that these last tests do not include the time to
     populate the db; all above cases are working with pre-populated db's...

     #####################################################################
     # METHOD : lib hashtab
     #####################################################################

       -------------------------------------------------------------------
       AVG TIME FOR 65536 LOOKUPS
   0.03334630323
       -------------------------------------------------------------------

     #####################################################################
     # METHOD : a similar test using stl std::map<std::string, int>
     #####################################################################

       -------------------------------------------------------------------
       AVG TIME FOR 65536 LOOKUPS
   0.242236081
       -------------------------------------------------------------------

   there's a lot i'm not mentioning here, like ease of use, maintenance time,
   how each storage format could be updated and synchronized with remote nodes,
   scalability, available language apis, ease of understanding, reliance on
   outside packages, etc -- which i've also been factoring in to
   the design choices.

   however, the one real conclusion i came to was that not only does my search
   facility provide searches for lt, gt, le, ge, eq, floor, and ceiling, it
   does so at very good performance (best log(n), worst log(n)) -- i was
   generally surprised that the gperf-shareable lib method was not faster, and
   that the speed differential (order of magnitude) remained constant up to
   searches over large (rows numbering 65536) tables (i did no testing above
   this number) and was even a little worse for small (like grib) tables! it
   must be the case that the time for hash based lookups (best 1, worst n)
   under these test conditions does not outperform my simple binary search on
   a sorted, shared-library-stored array of index structs:

     struct index { char *key; int idx; };

   or simply that the function call overhead (three calls for gperf versus one for
   mm_search) is responsible. in any case, i can use yet another third party
   piece of software to generate a search function in the modules, or create a
   simple table of two element structs (done by cfg2c at compile time -- not
   runtime) and an associated search function (mm_search). both methods result in
   a virtual memory resident searchable embedded database (no page pool, os
   responsible for paging -- e.g. the db should fit into real memory for good
   performance) shared by all processes on the same system; gperf slows down the
   search by about a factor of ten and slightly simplifies code generation --
   mm_search is super fast, is only slightly more complicated to code (but only because
   gperf generates the hashing function automatically!!), and adds no
   additional requirements on the system.

   it seems like my first hack at this problem (searching libmm tables) is the
   fastest and easiest overall... this was *really* a surprise -- like having
   something compile the first time... something must be wrong ;-)

</snip>

I found that the real time-killer was the CSV library itself. This is fair
enough: it takes account of all sorts of things like if the strings contain
quotes and commas. For the input I've been given this never happens, so it
was much faster to use split to separate the fields -- by which I mean that
several hours dropped to about 5 minutes. This was sufficient speedup for
my purposes.

this is exactly the approach i always use with csv: i split the first line,
remember the number of fields, and continue splitting - checking that the
number of fields is the same; if it is not i parse the failed line using csv.
this approach is fast and handles certain conditions like quoted/escaped
commas. it doesn't handle csv lines that span multiple lines... but the
speedup is worth it.
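
in code it is roughly this (untested sketch; CSV.parse_line is the fallback for the odd line that the naive split gets wrong, and "input.csv" is a made-up name):

  require 'csv'

  rows = []
  expected = nil

  File.open("input.csv") do |f|
    f.each_line do |line|
      fields = line.chomp.split(",", -1)
      expected ||= fields.size             # field count taken from the first line
      if fields.size == expected
        rows << fields
      else
        # quoting/escaping probably fooled the naive split -
        # let the real csv parser handle this one line
        rows << CSV.parse_line(line)
      end
    end
  end
  # (as noted above, records spanning multiple physical lines still break this)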

on a related note, i did some research into lookups in c using hashes
versus sorted arrays/bsearch; much to my surprise (as a computer scientist)
i found that, with the exception of HUGE (millions of) entries, lookup by
bsearch was an order of magnitude faster than any hashing mechanism i
could find. my

I'm not too surprised (with this hindsight :-)) because the address
calculations are probably cheaper than the full hashing
we'd-have-to-cope-with-a-whole-bunch-of-hashable-things function.

exactly right.

test looked at cdb, hsearch, glib hashing functions, gperf and, for
bsearch, the c library bsearch. profiling the different programs showed
that the reason the bsearch was so speedy was the lack of
function calls and this is easy to imagine - flipping a pointer around
memory with only one stack frame is about as lightweight as one can get...
anyhow, my tests were very specific to my application but interesting
nonetheless - thought you'd be interested.

Yes, and I do wonder (as someone who has not looked at the code to this
extent) how much more speed we could wring out of the ruby implementations.

trees are pretty nice - good performance for everything - ruby could use one.

cheers.

-a

···

On Wed, 7 Sep 2005, Hugh Sasse wrote:
--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

Yes, but that would sort of defeat the purpose. The point is, I don't WANT the duplicates. Otherwise I could simply do away with the unique constraint altogether and then just write something in T-SQL to remove any duplicates after the data is BCP'd in. But, since I'm creating this .DAT file for BCP which is going to be BCP'd into quite a few SQL databases, it is much better for me to just eliminate the duplicates from the .DAT file.

But anyway, thanks for the suggestion. It's always good to know what all the alternatives are.

Dan

From Robert Klemme <bob.news@gmx.net> on 7 Sep 2005:

···

ruby@danb64.com wrote:
> Thanks for the help. I switched it to use a Hash, and that performed
> MUCH better. The data is not coming from any SQL server--it is just
> a flat file. The Ruby script processes it into a format so that it
> can be BCP'd into MS SQL Server. And when the table that it gets
> BCP'd into has a unique constraint on the MD5 hash field--hence the
> need for me to eliminate the duplicate values beforehand. Granted, I
> could have changed the process of importing the data so that this
> unique constraint was not necessary, but it seemed better to me to
> eliminate the duplicates using Ruby in this case.

If you're on MS SQL Server there's another option: you can use an index
with IGNORE DUPLICATES: that way you don't have to preprocess the file
at
all:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_create_64l4.asp

Kind regards

    robert

thanks! i'll vote - just as soon as i can remember my password... ;-)

-a

···

On Thu, 8 Sep 2005, Martin DeMello wrote:

Ara.T.Howard <Ara.T.Howard@noaa.gov> wrote:

   http://raa.ruby-lang.org/project/ruby-rbtree/

it'd be great if it were part of the stdlib imho...

RCR 306: include rbtree in the stdlib

martin

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

I found that the real time-killer was the CSV library itself. This is fair

         [...]

was much faster to use split to separate the fields -- by which I mean that

this is exactly the approach i always use with csv: i split the first line,
remember the number of fields, and continue splitting - checking that the
number of fields is the same; if it is not i parse the failed line using csv.
this approach is fast and handles certain conditions like quoted/escaped

Oh, that's rather neat.

commas. it doesn't handle csv lines that span multiple lines... but the
speedup is worth it.

         [...]

Yes, and I do wonder (as someone who has not looked at the code to this
extent) how much more speed we could wring out of the ruby implementations.

trees are pretty nice - good performance for everything - ruby could use one.

But there are so many types to choose from... :-)

cheers.

-a

         Hugh

···

On Wed, 7 Sep 2005, Ara.T.Howard wrote:

On Wed, 7 Sep 2005, Hugh Sasse wrote:

ruby@danb64.com wrote:

Yes, but that would sort of defeat the purpose. The point is, I
don't WANT the duplicates. Otherwise I could simply do away with the
unique constraint altogether and then just write something in T-SQL
to remove any duplicates after the data is BCP'd in.

You don't get duplicates with the IGNORE_DUP_KEY option (see the page I referenced): with
this option your duplicates are simply ignored during bcp'ing. You just
see a warning but the import is otherwise fine. So you actually do not
have to preprocess your files; there will be no duplicates in the DB.
It's different with a constraint: in that case the import will fail so you
end up with no records imported into the DB.

But, since I'm
creating this .DAT file for BCP which is going to be BCP'd into quite
a few SQL databases, it is much better for me to just eliminate the
duplicates from the .DAT file.

That depends on how many DB's you are writing to - it might be more
efficient with or without preprocessing.

But anyway, thanks for the suggestion. It's always good to know what
all the alternatives are.

Anytime.

Kind regards

    robert

Dan

From Robert Klemme <bob.news@gmx.net> on 7 Sep 2005:

Thanks for the help. I switched it to use a Hash, and that
performed MUCH better. The data is not coming from any SQL
server--it is just a flat file. The Ruby script processes it into
a format so that it can be BCP'd into MS SQL Server. And the
table that it gets BCP'd into has a unique constraint on the MD5
hash field--hence the need for me to eliminate the duplicate values
beforehand. Granted, I could have changed the process of importing
the data so that this unique constraint was not necessary, but it
seemed better to me to eliminate the duplicates using Ruby in this
case.

If you're on MS SQL Server there's another option: you can use an
index with IGNORE DUPLICATES: that way you don't have to preprocess
the file at
all:

···

ruby@danb64.com wrote:

Kind regards

    robert

Ah, OK. I'm finally getting this through my thick skull. :-) That sounds pretty cool; I wasn't aware you could do that (have it just skip the duplicates). But, hey, I've already coded it in Ruby using a hash, which seems to work fairly well. And this isn't the kind of thing that I'll be running every day--maybe once a month, if that. So the Ruby solution works just fine for me. It's a little slower than it was before, but not a lot. Believe me, it was *really* bad the way I was doing it using an unsorted array.

Dan

From Robert Klemme <bob.news@gmx.net> on 7 Sep 2005:

···

ruby@danb64.com wrote:
> Yes, but that would sort of defeat the purpose. The point is, I
> don't WANT the duplicates. Otherwise I could simply do away with the
> unique constraint altogether and then just write something in T-SQL
> to remove any duplicates after the data is BCP'd in.

You don't get duplicates with the IGNORE_DUP_KEY option (see the page I referenced): with
this option your duplicates are simply ignored during bcp'ing. You just
see a warning but the import is otherwise fine. So you actually do not
have to preprocess your files; there will be no duplicates in the DB.
It's different with a constraint: in that case the import will fail so
you
end up with no records imported into the DB.

> But, since I'm
> creating this .DAT file for BCP which is going to be BCP'd into quite
> a few SQL databases, it is much better for me to just eliminate the
> duplicates from the .DAT file.

That depends on how many DB's you are writing to - it might be more
efficient with or without preprocessing.

> But anyway, thanks for the suggestion. It's always good to know what
> all the alternatives are.

Anytime.

Kind regards

    robert

>
> Dan
>
> From Robert Klemme <bob.news@gmx.net> on 7 Sep 2005:
>
>> ruby@danb64.com wrote:
>>> Thanks for the help. I switched it to use a Hash, and that
>>> performed MUCH better. The data is not coming from any SQL
>>> server--it is just a flat file. The Ruby script processes it into
>>> a format so that it can be BCP'd into MS SQL Server. And the
>>> table that it gets BCP'd into has a unique constraint on the MD5
>>> hash field--hence the need for me to eliminate the duplicate values
>>> beforehand. Granted, I could have changed the process of importing
>>> the data so that this unique constraint was not necessary, but it
>>> seemed better to me to eliminate the duplicates using Ruby in this
>>> case.
>>
>> If you're on MS SQL Server there's another option: you can use an
>> index with IGNORE DUPLICATES: that way you don't have to preprocess
>> the file at
>> all:
>>
>>
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_create_64l4.asp
>>
>> Kind regards
>>
>> robert