Sample Header Ad - 728x90

Prevent inserting duplicate entries under READ COMMITTED Isolation

5 votes
3 answers
356 views
We have a SQL Server table called dbo.Dim_Target. Multiple threads try to INSERT new data into this table in parallel. Each parallel thread first checks for matching data and then only inserts new data (non-matching), using LEFT OUTER JOIN. Isolation Level is READ COMMITTED. ### Problem On rare occasions, duplicate data is inserted into this table by two different threads a couple of milliseconds apart. According to this post , this issue is expected RC isolation level. ### Things I tried - Different isolation levels (RC, Repeatable Read, etc) all could not prevent this problem. - Serializable isolation prevents duplicates, but also causes a lot of deadlock issues. - Unique index is not possible on this large size column ### Question Is there an approach to prevent duplicates from multiple threads? ### Sample code to recreate the scenario #### 1. Preparation Create target table & two STG tables and load dummy data
DROP TABLE IF EXISTS dbo.Dim_Target
GO
CREATE TABLE dbo.Dim_Target
(    ID           INT               NOT NULL   IDENTITY(1,1)
   , Code         VARCHAR(3000)     NOT NULL
   , Created_Date DATETIME          NOT NULL   DEFAULT(GETDATE()))
GO
INSERT INTO dbo.Dim_Target(Code) SELECT NEWID()    -- some dummy data
GO 10000

--
-- STG table 1
--
DROP TABLE IF EXISTS dbo.STG_Test1
GO
CREATE TABLE dbo.STG_Test1
(    ID     INT            NOT NULL IDENTITY(1,1)
   , Code   VARCHAR(3000)
)
GO
INSERT INTO dbo.STG_Test1 (Code) SELECT NEWID()    -- some dummy data
GO 10000
INSERT INTO dbo.STG_Test1 (Code) SELECT 'abc1'     -- this is the duplicate entry that matches STG2 table
GO

--
-- STG table 2
--
DROP TABLE IF EXISTS dbo.STG_Test2
GO
CREATE TABLE dbo.STG_Test2
(    ID     INT            NOT NULL IDENTITY(1,1)
   , Code   VARCHAR(3000)
)
GO
INSERT INTO dbo.STG_Test2 (Code) SELECT NEWID()    -- some dummy data
GO 10000
INSERT INTO dbo.STG_Test2 (Code) SELECT 'abc1'     -- this is the duplicate entry that matches STG1 table
GO

--
-- Sample query to check for duplicates
--
SELECT Code, COUNT(*) 
FROM dbo.Dim_Target
GROUP BY Code
HAVING COUNT(*) > 1
#### 2.Thread 1: Try to insert from first parallel thread Run this below script from one SSMS window. Make sure the WAITFOR time is set the same in both windows.
WAITFOR TIME '16:04:00'
BEGIN TRANSACTION
    INSERT INTO dbo.Dim_Target (Code) 
    SELECT T.Code
    FROM dbo.STG_Test1 AS T
    LEFT OUTER JOIN dbo.Dim_Target AS D
        ON D.Code = T.Code
    WHERE D.Code IS NULL
COMMIT
#### 3. Thread 2: Try to insert from second parallel thread Run this below script from second SSMS window. Make sure the WAITFOR time is set the same in both windows.
WAITFOR TIME '16:04:00'
BEGIN TRANSACTION
	INSERT INTO dbo.Dim_Target (Code) 
	SELECT T.Code
	FROM dbo.STG_Test2 AS T
	LEFT OUTER JOIN dbo.Dim_Target AS D
		ON D.Code = T.Code
	WHERE D.Code IS NULL
COMMIT
#### NOTES - When I run this script, the duplicates do **not** occur every time. They happen once every two or 3 attempts, but they DO happen. The Created_Date column shows that duplicate entries were created just a couple of milliseconds apart. - We have 10 threads running in parallel loading all the Dimension tables, so the duplicates are occurring more frequently. --- With a temporary solution working, I want to take time to re-think to make this scalable. All parts of the ETL are processed with multiple threads, except this one step -- which is single threaded (for now). So any ideas would be welcome.
Asked by ToC (727 rep)
Aug 14, 2024, 05:24 PM
Last activity: Aug 17, 2024, 02:09 PM