Sample Header Ad - 728x90

Find Duplicate Customers

1 vote
1 answer
2694 views
Okay... I have a table that has customers: -- Individual Table * ID (Internal Unique ID) * IndividualID (External Unique Individual Identifier) * Last Name * First Name * Birth Date * SSN * ... The issue is that sometimes a person gets multiple Individual ID's. Say the person doesn't provide a SSN for one Encounter, the last name changes, typo in birthday, etc. So you end up with the same person in the individual table multiple times: 1, Frost, Jack, 1/1/2000, 000-00-0008 2, Frost, Jack, 1/1/2000, 000-00-0003 3, Doe, Jane, 1/1/2000, 000-00-0005 4, Doe, Janet, 1/1/2000, 000-00-0005 5, Frost, Janet, 1/1/2000, 000-00-0005 Those are just some examples. The basic idea is that I need to find individuals that are potential matches, so that the right person can merge the individuals into a single account. The particular query I'm currently on is on SS2008-SP1, but I have other queries on SS2005 through SS2012. Is there any way I can improve this? Initially I had a single select statement (instead of 2 temp tables, 5 inserts and a select statement), but the "This or This or This or..." took many minutes and this takes ~10 seconds. Population of Customers is ~144k (Select count(*) from Data) Current I'm using a simple attempt to try and match four parts: Last Name, First Name, DOB, SSN. If 3 or 4 of them match on different individuals, the need to be inspected closer to determine if they really are the same person. IF object_id('tempdb..#DATA') IS NOT NULL DROP TABLE #DATA; GO CREATE TABLE #DATA ( EXTID VARCHAR(30) NOT NULL , LNAME VARCHAR(30) NULL , FNAME VARCHAR(30) NULL , SSN VARCHAR(11) NULL , DOB VARCHAR(8) NULL ) GO INSERT INTO #DATA SELECT EXTID = D1.EXTERNALID , LNAME = D1.LASTNAME , FNAME = D1.FIRSTNAME , SSN = CASE WHEN D1.SSN = '000-00-0000' THEN NULL ELSE D1.SSN END , DOB = convert(VARCHAR, D1.DOB, 112) FROM Data D1 WHERE Type = 1 and STATUS = 1 GO SELECT D1.*, [Splitter] = 'MATCH', D2.* FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID AND ( D1.LNAME = D2.LNAME AND D1.FNAME = D2.FNAME AND D1.SSN = D2.SSN AND D1.DOB = D2.DOB) UNION SELECT D1.*, 'LName', D2.* FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID AND ( D1.LNAME D2.LNAME AND D1.FNAME = D2.FNAME AND D1.SSN = D2.SSN AND D1.DOB = D2.DOB) UNION SELECT D1.*, 'FName', D2.* FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID AND ( D1.LNAME = D2.LNAME AND D1.FNAME D2.FNAME AND D1.SSN = D2.SSN AND D1.DOB = D2.DOB) UNION SELECT D1.*, 'SSN ', D2.* FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID AND ( D1.LNAME = D2.LNAME AND D1.FNAME = D2.FNAME AND D1.SSN D2.SSN AND D1.DOB = D2.DOB) UNION SELECT D1.*, 'DOB ', D2.* FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID AND ( D1.LNAME = D2.LNAME AND D1.FNAME = D2.FNAME AND D1.SSN = D2.SSN AND D1.DOB D2.DOB); Edit to add Distinct Counts: LName FName SSN DOB Count 36737 14539 115073 34284 144044 Edit: Cleaned up a bit to get rid of second temp table. Poking around the Estimated Execution plan, the above query - broken into 5 parts - uses hash map inner joins and takes about 10 seconds. My initial query, and other variations seem to use loop joins and is still chugging along at 10+ minutes.
Asked by WernerCD (1245 rep)
Aug 22, 2013, 05:23 PM
Last activity: Aug 22, 2013, 08:50 PM