Cool. Ok, I will explain the whole situation as best I can from
the beginning.
I work for a company that produces plastic cards (e.g. credit cards,
debit cards, gift cards, ID cards, etc.).
I receive files from customers containing anywhere from 1 to
100,000,000 records. These files get formatted and put onto a network
drive in their respective job number folders. One customer is [...].
First of all a bit of general advice: in my experience customers are not happy reading their names somewhere public without their prior permission. So you probably should omit them in future postings to avoid trouble. Company names do not actually help in resolving a technical issue.
About a year ago there was an idiot programmer who formatted the data
incorrectly, causing 150,000 duplicate cards. Because of this, the
company wants to put in place a system for checking for duplicates.
Even in the absence of "idiot programmers", duplicate checking is a good idea. There is so much that can go wrong - and you indicated yourself that you might get buggy input data. It is always a good idea to verify that data you receive from somewhere else complies with your requirements or otherwise meets your expectations. And this holds true on all levels (e.g. input files obtained from somewhere, method arguments etc.). So you might as well be thankful for this guy's mistake, because the checking you are doing might save your company a lot of hassle in another situation.
The way it has to work (because I don't have a say in this even though
I'm the developer >.<) is that the past 18 months of jobs will be
loaded into a database (or something useful), and when we (the
programmers) get a new file, we will generate our data from it, then
run our application to see if there are any duplicate records in what
we created compared to what we have created in the past 18 months for
the respective company (i.e. we get a Starbucks job, program it and
then test it against the last 18 months of Starbucks records).
Do you actually reload the past 18 months' job data for every new job? That would be a major source of inefficiency. If you do, I would rather have one large table that stays there and add entries with timestamps. Then, with a proper index (or with a table partitioned by month), you can efficiently delete old data (i.e. data from jobs older than 18 months). If your DB product supports partitioning you should definitely use it, as it makes deletions much more efficient.
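As a rough sketch of what I mean - the syntax below is Oracle-flavored, and the table and column names (card_records, card_number, job_ts) are only placeholders, so adapt this to your product and schema:

CREATE TABLE card_records (
  card_number VARCHAR2(30) NOT NULL,
  customer    VARCHAR2(30) NOT NULL,
  job_ts      DATE         NOT NULL
)
PARTITION BY RANGE (job_ts) (
  PARTITION p_2007_10 VALUES LESS THAN (TO_DATE('2007-11-01', 'YYYY-MM-DD')),
  PARTITION p_2007_11 VALUES LESS THAN (TO_DATE('2007-12-01', 'YYYY-MM-DD'))
);

-- Dropping a month that has fallen out of the 18-month window is then
-- a cheap metadata operation instead of a huge DELETE:
ALTER TABLE card_records DROP PARTITION p_2007_10;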
Btw, I can't find a reference to the DB vendor that you are using. *That* would be an interesting brand name.
Seriously, this can have a major impact on your options.
If there are no duplicates, then we will load that job into the
database (or whatever we need) and continue on from there. If there
are duplicates, then we have to re-program that job and generate a
report stating how many duplicates there are and what exactly was
duplicated (i.e. account number, mag stripe, PIN, customer name). The
report is then automatically e-mailed to our boss, who then walks next
door to our office to yell at us and tell us to re-program the job.
To give a figure, we did 100 million Starbucks cards in the past 12
months. We need to check the last 18 months. Starbucks jobs come in
from a company called Valuelink, who generates the data for us. CVS
Pharmacy, Disney and McDonald's roughly equal another 100 million
records combined. These also come from Valuelink.
This is important, because if we program it correctly but Valuelink
sends over a duplicate card number, then we need to be able to report
this. So when they send over another job of 5 million cards, we are
now checking that against 200+ million cards to see if anything is
duplicated.
We program roughly 20-40 jobs in a day, so we need a quick way to go
through all of this without losing time, or else we will end up
working longer days and not be compensated for it.
You do not give details about your jobs and what it is that gets duplicated. I will just assume that it is some kind of "key", which might be just a single character sequence or a number of fields. Also, we do not know what other information comprises a "job", so take the following with a grain of salt.
There are a few ways you could tackle this. I will try to sketch some.
1. Avoid invalid jobs
You could do this by having a table with job definitions (maybe one per customer if keys are different) where the key has a UNIQUE constraint on it. Then you prepare your data (possibly in CSV files) and load it into the DB. The unique constraint will prevent duplicate records from being inserted.
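A minimal sketch of such a table - the table and column names are placeholders, and the real key column(s) depend on your data:

CREATE TABLE starbucks_cards (
  account_number VARCHAR(30)  NOT NULL,
  mag_stripe     VARCHAR(100),
  pin            VARCHAR(10),
  customer_name  VARCHAR(100),
  loaded_at      DATE         NOT NULL,
  -- the UNIQUE constraint is what rejects duplicate keys on load
  CONSTRAINT uq_starbucks_account UNIQUE (account_number)
);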
Now it depends on the DB vendor you are using. IIRC with Oracle you can use SQL*Loader and have it report duplicate records, i.e. records that were not inserted. SQL Server has something similar: unique indexes can be created with the IGNORE_DUP_KEY option, which prevents duplicate insertion. I am not sure about duplicate reporting though.
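On SQL Server that would look roughly like this (index and table names are again placeholders):

CREATE UNIQUE INDEX ux_starbucks_account
  ON starbucks_cards (account_number)
  WITH (IGNORE_DUP_KEY = ON);
-- rows violating the index are skipped with a warning instead of
-- failing the whole INSERT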
If your DB vendor does not support this, you can load the data into a temporary table and insert from there. You can then use the approach from 2 to detect duplicates:
2. Detect duplicates
Approach as above, but do not create a UNIQUE constraint; instead, index the key (if your key contains multiple fields you just use a multi-column index covering all of them). Now you can check for duplicates with a query like this:
select key1, key2, ... keyn, count(*) occurrences
from job_table
group by key1, key2, ... keyn
having count(*) > 1
Now this query will return all key combinations which occur more than once. If you also need other info you can do this:
select *
from job_table jt
  join (
    select key1, key2, ... keyn
    from job_table
    group by key1, key2, ... keyn
    having count(*) > 1
  ) jt_dup
    on  jt.key1 = jt_dup.key1
    and jt.key2 = jt_dup.key2
    and ...
    and jt.keyn = jt_dup.keyn
(The joined table is an "inline view" in case you want to look further into this concept.)
Using the index you can also delete duplicates pretty efficiently. Alternatively, delete all entries from the new job, fix the data outside the DB and load it again; which approach is better depends on the nature of your other processing. Either way, you could also generate your job data from this table, even with duplicates still in it, by using SELECT DISTINCT or GROUP BY.
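For the delete, an Oracle-flavored sketch (ROWID is Oracle-specific, so with another product you would need a different way of picking the surviving row; the key columns are placeholders as before):

-- keep the row with the smallest ROWID per key and delete every other occurrence
DELETE FROM job_table
WHERE rowid NOT IN (
  SELECT MIN(rowid)
  FROM job_table
  GROUP BY key1, key2   -- list all key columns here
);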
I hope I have given you some food for thought and this gets you started digging deeper.
Kind regards
robert
PS: I'm traveling the next three days so please do not expect further replies too early.