Database speed issues

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much to select from the different
tables where the specific column matches my search string. This is
really slow! Obviously, if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

Thanks,

~Jeremy

You could sort both lists and then find duplicates with a single O(n + m) merge pass.
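
Roughly like this, assuming both sides can be dumped to plain text with
one key per line (the file names here are made up; at your data sizes you
would sort on disk or let the database do the sorting):

# Sketch of the sorted-merge duplicate check (hypothetical file names).
new_keys = File.readlines("new_job.txt").map(&:chomp).sort
old_keys = File.readlines("existing_records.txt").map(&:chomp).sort

dups = []
i = j = 0
while i < new_keys.size && j < old_keys.size
  case new_keys[i] <=> old_keys[j]
  when -1 then i += 1            # new key is smaller, advance it
  when 1  then j += 1            # old key is smaller, advance it
  else                           # equal keys => duplicate
    dups << new_keys[i]
    i += 1
    j += 1
  end
end
puts "#{dups.size} duplicates found"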

···

On Nov 2, 7:45 pm, "JeremyWoert...@gmail.com" <JeremyWoert...@gmail.com> wrote:

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much to select from the different
tables where the specific column matches my search string. This is
really slow! Obviously, if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

Jeremy Woertink wrote:

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much to select from the different
tables where the specific column matches my search string. This is
really slow! Obviously, if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

I dunno how fast it is compared to your select statement, but if you can
build arrays out of the data you want to compare, you can call the %
method on them. Example...

[ 5, 10 ] & [ 10, 15 ]

=> [10]

It works with more complex arrays, too...

[ [ 5, 10 ], [ 15, 20 ] ] & [ [ 10, 15 ], [ 5, 10] ]

=> [[5, 10]]

Hope that helps...

···

--
Posted via http://www.ruby-forum.com/.

You could let the database do the work in the following way, though it's
relatively dirty:

First, duplicate the database you're comparing to. Then change the indexes
on the tables so that you can force uniqueness on the records based on the
data you're comparing. Then simply insert the records you're checking. If
the insert did not work, it's a duplicate.
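
A rough sketch of that flow, assuming SQLite via the sqlite3 gem and
made-up table, column, and file names (adapt to whatever database you
actually use):

require "sqlite3"   # illustration only; substitute your real DB driver

db = SQLite3::Database.new("cards_copy.db")   # the duplicated database

# Force uniqueness on the column(s) you compare on.
db.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_cards_account ON cards (account_number)")

duplicates = []
File.foreach("new_job.txt") do |line|         # one account number per line
  account = line.chomp
  begin
    db.execute("INSERT INTO cards (account_number) VALUES (?)", [account])
  rescue SQLite3::ConstraintException         # insert rejected => duplicate
    duplicates << account
  end
end
puts "#{duplicates.size} duplicates found"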

HTH,

Felix

···

-----Original Message-----
From: JeremyWoertink@gmail.com [mailto:JeremyWoertink@gmail.com]
Sent: Friday, November 02, 2007 10:45 AM
To: ruby-talk ML
Subject: Database speed issues

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much to select from the different
tables where the specific column matches my search string. This is
really slow! Obviously, if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

Thanks,

~Jeremy

Haha, you said "call the % method on them." and then used the "&"
method.
That may work, but how long would it take to build an array from 500GB
of text files? And then every time no duplicate is found, that file
gets added to the array?

But wait... THAT'S AWESOME... hmmmm, so I could probably do something
like IO.readlines("somefile") & IO.readlines("anotherfile") and it
would return the lines that are the same.

The "anotherfile" would have to be an iteration through a complete
network drive of folders, so I could find each folder and then go
through each file inside that folder doing the same thing.
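
Something like this, maybe (the network path and file layout here are
totally made up, just to sketch the idea):

# Rough idea only: compare the new job against every old job file on the drive.
new_lines = IO.readlines("new_job.txt")

dups = []
Dir.glob("/mnt/jobs/**/*.txt").each do |old_file|   # hypothetical network path
  dups |= (new_lines & IO.readlines(old_file))      # keep any matching lines
end
puts "#{dups.size} duplicate lines found"

Although with 500GB of files, reading everything over the network like
that is going to be painfully slow, so the database suggestions are
probably still the better route.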

Ughh. I hate dealing with these stupid mass quantities of records.
Thanks for the advice :)

~Jeremy

···

On Nov 2, 12:14 pm, Daniel Waite <rabbitb...@gmail.com> wrote:

Jeremy Woertink wrote:
> So, I have a bit of a design problem. I have an application that works,
> but not as well as I would like it to. My problem is that I had to
> write an application that checks millions of records against hundreds
> of millions of records to see if there are any duplicates. The only
> way I could think to do this is pretty much to select from the different
> tables where the specific column matches my search string. This is
> really slow! Obviously, if it returns something, then there is a dup;
> if not, then YAY!

> I know there has to be a better way than just doing a SQL select
> statement. Any ideas?

I dunno how fast it is compared to your select statement, but if you can
build arrays out of the data you want to compare, you can call the %
method on them. Example...

> [ 5, 10 ] & [ 10, 15 ]
=> [10]

It works with more complex arrays, too...

> [ [ 5, 10 ], [ 15, 20 ] ] & [ [ 10, 15 ], [ 5, 10] ]
=> [[5, 10]]

Hope that helps...
--
Posted via http://www.ruby-forum.com/.

"Felix Windt" <fwmailinglists@gmail.com> writes:

···

-----Original Message-----
From: JeremyWoertink@gmail.com [mailto:JeremyWoertink@gmail.com]
Sent: Friday, November 02, 2007 10:45 AM
To: ruby-talk ML
Subject: Database speed issues

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much to select from the different
tables where the specific column matches my search string. This is
really slow! Obviously, if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

You could let the database do the work in the following way, though it's
relatively dirty:

First, duplicate the database you're comparing to. Then change the indexes
on the tables so that you can force uniqueness on the records based on the
data you're comparing. Then simply insert the records you're checking. If
the insert did not work, it's a duplicate.

Rather than duplicating the database, create a temporary table, load
the second set of data there, create the appropriate indexes, and do:

select second.* from first join second on (...) limit 1;

If the result is not an empty set, then there is a duplicate.
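
Spelled out with made-up table and column names (sqlite3 gem here purely
for illustration; any database with temporary tables works the same way):

require "sqlite3"

db = SQLite3::Database.new("cards.db")   # assumed to hold the existing records

# Load the incoming job into a temporary table and index it.
db.execute("CREATE TEMP TABLE new_cards (account_number TEXT)")
db.transaction do
  File.foreach("new_job.txt") do |line|
    db.execute("INSERT INTO new_cards (account_number) VALUES (?)", [line.chomp])
  end
end
db.execute("CREATE INDEX idx_new_cards ON new_cards (account_number)")

# Any row coming back from the join is a duplicate.
row = db.get_first_row("SELECT n.account_number FROM cards c JOIN new_cards n ON c.account_number = n.account_number LIMIT 1")
puts(row ? "duplicate found: #{row[0]}" : "no duplicates")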

YS.

Could you be more specific?

First, if your data is in some kind of SQL database, you should look
at how well you are exploiting the power of SQL.

In my experience, if you set up your indexes right, especially the
multiple-column indexes that match your select statement, you can gain
a real performance boost.
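
For example (sqlite3 gem and all names invented, just to illustrate): a
multiple-column index that matches the WHERE clause of the duplicate
check lets the database answer it from the index alone instead of
scanning the table:

require "sqlite3"

db = SQLite3::Database.new("cards.db")
db.execute("CREATE INDEX IF NOT EXISTS idx_cards_customer_account ON cards (customer_id, account_number)")

# The duplicate check filters on exactly the indexed columns.
row = db.get_first_row("SELECT 1 FROM cards WHERE customer_id = ? AND account_number = ? LIMIT 1",
                       [42, "1234567890"])
puts(row ? "duplicate" : "ok")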

The second thing is helping the SQL database find the right way to do
the JOINs. This has proven worthwhile for MySQL in particular.

If you give us some more details, perhaps we can give you a more
meaningful response.

···

On Nov 2, 8:14 pm, rabbitb...@gmail.com wrote:

Jeremy Woertink wrote:
> So, I have a bit of a design problem. I have an application that works,
> but not as well as I would like it to. My problem is that I had to
> write an application that checks millions of records against hundreds
> of millions of records to see if there are any duplicates. The only
> way I could think to do this is pretty much to select from the different
> tables where the specific column matches my search string. This is
> really slow! Obviously, if it returns something, then there is a dup;
> if not, then YAY!

> I know there has to be a better way than just doing a SQL select
> statement. Any ideas?

I dunno how fast it is compared to your select statement, but if you can
build arrays out of the data you want to compare, you can call the %
method on them. Example...

> [ 5, 10 ] & [ 10, 15 ]
=> [10]

It works with more complex arrays, too...

> [ [ 5, 10 ], [ 15, 20 ] ] & [ [ 10, 15 ], [ 5, 10] ]
=> [[5, 10]]

Hope that helps...
--
Posted via http://www.ruby-forum.com/.

Jeremy Woertink wrote:

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much to select from the different
tables where the specific column matches my search string. This is
really slow! Obviously, if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

I dunno how fast it is compared to your select statement, but if you can
build arrays out of the data you want to compare, you can call the %
method on them. Example...

> [ 5, 10 ] & [ 10, 15 ]
=> [10]

It works with more complex arrays, too...

> [ [ 5, 10 ], [ 15, 20 ] ] & [ [ 10, 15 ], [ 5, 10] ]
=> [[5, 10]]

Hope that helps...
--
Posted via http://www.ruby-forum.com/.

Could you be more specific?

The information given is really weak.

First, if your data is in some kind of SQL database, you should look
at how well you are exploiting the power of SQL.

In my experience, if you set up your indexes right, especially the
multiple-column indexes that match your select statement, you can gain
a real performance boost.

The second thing is helping the SQL database find the right way to do
the JOINs. This has proven worthwhile for MySQL in particular.

If you give us some more details, perhaps we can give you a more
meaningful response.

Completely agree with everything stated above! This should be offloaded to the database.

Kind regards

  robert

···

On 02.11.2007 20:43, dima wrote:

On Nov 2, 8:14 pm, rabbitb...@gmail.com wrote:

Cool. OK, I will explain the whole situation from the beginning as
best I can.

I work for a company that produces plastic cards (i.e. credit cards,
debit cards, gift cards, ID cards, etc.).

I receive files from customers containing anywhere from 1..100000000
records. These files get formatted and put into a network drive in
their respective job number folders. One customer is Starbucks,
another CVS Pharmacy; others include McDonald's, Jack in the Box, and
Mexican government driver's licenses.

About a year ago there was an idiot programmer who formatted the data
incorrectly, causing 150,000 duplicate cards. So, because of this, the
company wants to put in place a system of checking for duplicates. The
way it has to work (because I don't have a say in this even though I'm
the developer >.<) is that the past 18 months of jobs will be loaded
into a database (or something useful), and when we (the programmers)
get a new file, we will generate our data from these files, then run
our application to see if there are any duplicate records in what we
created compared to what we have created in the past 18 months,
according to their respective companies. (i.e. we get a Starbucks job,
program it, and then test it against the last 18 months of Starbucks
records.)

If there are no duplicates, then we will load that job into the
database (or whatever we need) and continue on from there. If there
are duplicates, then we have to re-program that job and generate a
report stating how many duplicates there were and what exactly was
duplicated (i.e. account number, mag stripe, PIN, customer name). The
report is then automatically e-mailed to our boss, who then walks next
door to our office to yell at us and tell us to re-program the job.

So, for a figure, we did 100 million Starbucks cards in the past 12
months. We need to check the last 18 months. Starbucks jobs come in
from a company called Valuelink, which generates the data for us. CVS
Pharmacy, Disney, and McDonald's roughly equal another 100 million
records combined. These also come from Valuelink.
This is important, because if we program it correctly but Valuelink
sends over a duplicate card number, then we need to be able to report
this. So when they send over another job of 5 million cards, we are
now checking that against 200 million+ cards to see if anything is a
duplicate.

We program roughly 20-40 jobs a day, so we need a quick way to go
through all this and not lose time, or else we will end up working
longer days and not be compensated for it :(

I hope that explains it a little better. Thanks again for all the
advice.

~Jeremy

Cool. OK, I will explain the whole situation from the beginning as
best I can.

I work for a company that produces plastic cards (i.e. credit cards,
debit cards, gift cards, ID cards, etc.).

I receive files from customers containing anywhere from 1..100000000
records. These files get formatted and put into a network drive in
their respective job number folders. One customer is [...].

First of all, a bit of general advice: in my experience, customers are not happy to read their names somewhere public without their prior permission, so you should probably omit them in future postings to avoid trouble. Company names do not actually help in resolving a technical issue.

About a year ago there was an idiot programmer who formatted the data
incorrectly, causing 150,000 duplicate cards. So, because of this, the
company wants to put in place a system of checking for duplicates.

Even in the absence of "idiot programmers", duplicate checking is a good idea. There is so much that can go wrong - and you indicated yourself that you might get buggy input data. It is always a good idea to verify that data you receive from somewhere else complies with your requirements or otherwise meets your expectations. And this holds true at all levels (i.e. input files obtained from somewhere, method arguments, etc.). So you might as well be thankful for this guy's mistake, because the checking you are doing might save your company a lot of hassle in another situation. :)

The
way it has to work (because I don't have a say in this even though I'm
the developer >.<) is that the past 18 months of jobs will be loaded
into a database (or something useful), and when we (the programmers)
get a new file, we will generate our data from these files, then run
our application to see if there are any duplicate records in what we
created compared to what we have created in the past 18 months,
according to their respective companies. (i.e. we get a Starbucks job,
program it, and then test it against the last 18 months of Starbucks
records.)

Do you actually reload the past 18 months' job data for every new job? That would be a major source of inefficiency. If you do, I would rather have a large table that stays there and add entries with timestamps. Then with a proper index (or with a table partitioned by month) you can efficiently delete old data (i.e. data from jobs older than 18 months). If your DB product supports partitioning you should definitely use it, as it makes deletions much more efficient.
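
A minimal sketch of that rolling window, assuming a created_at column
stored as an ISO-8601 timestamp (sqlite3 gem and names only for
illustration; with partitioning you would drop the oldest partition
instead):

require "sqlite3"   # illustration only

db = SQLite3::Database.new("cards.db")
db.execute("CREATE INDEX IF NOT EXISTS idx_cards_created_at ON cards (created_at)")

# Purge everything older than 18 months; the index supports the range predicate.
db.execute("DELETE FROM cards WHERE created_at < date('now', '-18 months')")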

Btw, I can't find a reference to the DB vendor that you are using. *That* would be an interesting brand name. :) Seriously, this can have a major impact on your options.

If there are no duplicates, then we will load that job into the
database (or whatever we need) and continue on from there. If there
are duplicates, then we have to re-program that job and generate a
report stating how many duplicates there were and what exactly was
duplicated (i.e. account number, mag stripe, PIN, customer name). The
report is then automatically e-mailed to our boss, who then walks next
door to our office to yell at us and tell us to re-program the job.

So, for a figure, we did 100 million Starbucks cards in the past 12
months. We need to check the last 18 months. Starbucks jobs come in
from a company called Valuelink, which generates the data for us. CVS
Pharmacy, Disney, and McDonald's roughly equal another 100 million
records combined. These also come from Valuelink.
This is important, because if we program it correctly but Valuelink
sends over a duplicate card number, then we need to be able to report
this. So when they send over another job of 5 million cards, we are
now checking that against 200 million+ cards to see if anything is a
duplicate.

We program roughly 20-40 jobs a day, so we need a quick way to go
through all this and not lose time, or else we will end up working
longer days and not be compensated for it :(

You do not give details about your jobs and what it is that gets duplicated. I will just assume that it is some kind of "key", which might be just a single character sequence or a number of fields. Also, we do not know what other information comprises a "job", so take the following with a grain of salt.

There are a few ways you could tackle this. I will try to sketch some.

1. Avoid invalid jobs

You could do this by having a table with job definitions (maybe one per customer if keys are different) where the key has a UNIQUE constraint on it. Then you prepare your data (possibly in CSV files) and load it into the DB. The unique constraint will prevent duplicates.

Now it depends on the DB vendor you are using. IIRC, with Oracle you can use SQL*Loader and have it report duplicate records, i.e. records that were not inserted. Same for SQL Server: there is a key property IGNORE DUPLICATES which will prevent duplicate insertion. I am not sure about duplicate reporting, though.

If your DB vendor does not support this, you can load data into a temporary table and insert from there. You can then use an approach from 2 to detect duplicates:

2. Detect duplicates

Approach as above, but do not create a UNIQUE constraint; instead, index the key (if your key contains multiple fields you just have a covering index with several columns). Now you can check for duplicates with a query like this:

select key1, key2, ... keyn, count(*) occurrences
from job_table
group by key1, key2, ... keyn
having count(*) > 1

Now this query will return all key fields which occur more than once. If you also need other info you can do this:

select *
from job_table jt join (
    select key1, key2, ... keyn
    from job_table
    group by key1, key2, ... keyn
    having count(*) > 1
  ) jt_dup on jt.key1 = jt_dup.key1
          and jt.key2 = jt_dup.key2
          and ...
          and jt.keyn = jt_dup.keyn

(The joined table is an "inline view" in case you want to look further into this concept.)

Using the index you can also delete duplicates pretty efficiently. Alternatively, delete all entries from the new job, modify the data outside, and load again. Which approach is better depends on the nature of your other processing. Either way, you could also generate your job data from this table, even with duplicates in the table, by using SELECT DISTINCT or GROUP BY.

I hope I have given you some food for thought and that this gets you started digging deeper.

Kind regards

  robert

PS: I'm traveling the next three days so please do not expect further replies too early.

···

On 03.11.2007 01:53, JeremyWoertink@gmail.com wrote: