How to best approach deleting records with both duplicate a addressand a duplicate name in a large f

Mike C · Mar 15, 2008

Hello,
I have a medium sized file (approx 30,000 records) that has many names
and addresses that have been entered twice. I need to find an
efficient way to delete all records that have both a duplicate name
and a duplicate address. Records with a duplicate name but not a
duplicate address should stay and vice versa.

Is there a best way to approach this? Should this typically be
handled with a query. Does the Find duplicates query handle this
problem well?

Thanks for any insight anyone can provide. I am sure this is a common
problem, but i have not had to do it before and my first few attempts
don't seem to work.

(I will also need to attempt to delete Like records as the next step,
but for now I guess I will just figure out how to get the main file's
exact duplicates out of the equation.)

Allen Browne · Mar 15, 2008

This example assumes:
- duplicate defined on combination of:
Surname + FirstName + Address + City
- you can ignore records that have nulls in those fields
- a primary key field of ClientID
- The record you want to keep is the one with the lower primary key value.

Use a subquery like this:
DELETE FROM tblClient
WHERE tblClient.ClientID <>
(SELECT Min(ClientID) AS MinClient
FROM tblClient AS Dupe
WHERE (Dupe.Surname = tblClient.Surname)
AND (Dupe.FirstName = tblClient.FirstName)
AND (Dupe.Address = tblClient.Address)
AND (Dupe.City = tblClient.City));

If subqueries are new, here's an introduction:
http://allenbrowne.com/subquery-01.html

gllincoln · Mar 15, 2008

Hi Mike,

Build one query at a time (warning air code - might be a typo or something)

(qs_DupeNames) SELECT * FROM myTable WHERE Name IN(SELECT Name FROM myTable AS tmp GROUP BY Name HAVING Count(*)>1);

(qs_DupeAddress) SELECT * FROM qs_DupeNames WHERE Address IN(SELECT Address FROM qs_DupeNames as tmp GROUP BY Address HAVING Count(*)>1);

Take a look at the output of qs_DupeAddress - if it looks right, you can change it to a DELETE * etc. and that takes care of that.

There is a way to join those two into a single query but I'm a little shaky on the syntax and frankly, I think the above is a lot cleaner/easier to follow.

Hope this helps,
Gordon

Sylvain Lafontaine · Mar 15, 2008

Sorry, but this code is not exactly right: it will delete all duplicates but without leaving behind one value. Usually, when you want to delete duplicates, you also want to keep at least one record from each duplicate.

--
Sylvain Lafontaine, ing.
MVP - Technologies Virtual-PC
E-mail: sylvain aei ca (fill the blanks, no spam please)

Hi Mike,

Build one query at a time (warning air code - might be a typo or something)

(qs_DupeNames) SELECT * FROM myTable WHERE Name IN(SELECT Name FROM myTable AS tmp GROUP BY Name HAVING Count(*)>1);

(qs_DupeAddress) SELECT * FROM qs_DupeNames WHERE Address IN(SELECT Address FROM qs_DupeNames as tmp GROUP BY Address HAVING Count(*)>1);

Take a look at the output of qs_DupeAddress - if it looks right, you can change it to a DELETE * etc. and that takes care of that.

There is a way to join those two into a single query but I'm a little shaky on the syntax and frankly, I think the above is a lot cleaner/easier to follow.

Hope this helps,
Gordon

viktor chuzhakin · Mar 17, 2008

Mike C said:
Hello,
I have a medium sized file (approx 30,000 records) that has many names
and addresses that have been entered twice. I need to find an
efficient way to delete all records that have both a duplicate name
and a duplicate address. Records with a duplicate name but not a
duplicate address should stay and vice versa.

Is there a best way to approach this? Should this typically be
handled with a query. Does the Find duplicates query handle this
problem well?

Thanks for any insight anyone can provide. I am sure this is a common
problem, but i have not had to do it before and my first few attempts
don't seem to work.

(I will also need to attempt to delete Like records as the next step,
but for now I guess I will just figure out how to get the main file's
exact duplicates out of the equation.)

How to best approach deleting records with both duplicate a addressand a duplicate name in a large f

Mike C

Allen Browne

gllincoln

Sylvain Lafontaine

viktor chuzhakin