Prevent inserting duplicate entries under READ COMMITTED Isolation
5
votes
3
answers
356
views
We have a SQL Server table called
dbo.Dim_Target
. Multiple threads try to INSERT
new data into this table in parallel. Each parallel thread first checks for matching data and then only inserts new data (non-matching), using LEFT OUTER JOIN
.
Isolation Level is READ COMMITTED
.
### Problem
On rare occasions, duplicate data is inserted into this table by two different threads a couple of milliseconds apart. According to this post , this issue is expected RC isolation level.
### Things I tried
- Different isolation levels (RC, Repeatable Read, etc) all could not prevent this problem.
- Serializable isolation prevents duplicates, but also causes a lot of deadlock issues.
- Unique index is not possible on this large size column
### Question
Is there an approach to prevent duplicates from multiple threads?
### Sample code to recreate the scenario
#### 1. Preparation
Create target table & two STG tables and load dummy data
DROP TABLE IF EXISTS dbo.Dim_Target
GO
CREATE TABLE dbo.Dim_Target
( ID INT NOT NULL IDENTITY(1,1)
, Code VARCHAR(3000) NOT NULL
, Created_Date DATETIME NOT NULL DEFAULT(GETDATE()))
GO
INSERT INTO dbo.Dim_Target(Code) SELECT NEWID() -- some dummy data
GO 10000
--
-- STG table 1
--
DROP TABLE IF EXISTS dbo.STG_Test1
GO
CREATE TABLE dbo.STG_Test1
( ID INT NOT NULL IDENTITY(1,1)
, Code VARCHAR(3000)
)
GO
INSERT INTO dbo.STG_Test1 (Code) SELECT NEWID() -- some dummy data
GO 10000
INSERT INTO dbo.STG_Test1 (Code) SELECT 'abc1' -- this is the duplicate entry that matches STG2 table
GO
--
-- STG table 2
--
DROP TABLE IF EXISTS dbo.STG_Test2
GO
CREATE TABLE dbo.STG_Test2
( ID INT NOT NULL IDENTITY(1,1)
, Code VARCHAR(3000)
)
GO
INSERT INTO dbo.STG_Test2 (Code) SELECT NEWID() -- some dummy data
GO 10000
INSERT INTO dbo.STG_Test2 (Code) SELECT 'abc1' -- this is the duplicate entry that matches STG1 table
GO
--
-- Sample query to check for duplicates
--
SELECT Code, COUNT(*)
FROM dbo.Dim_Target
GROUP BY Code
HAVING COUNT(*) > 1
#### 2.Thread 1: Try to insert from first parallel thread
Run this below script from one SSMS window. Make sure the WAITFOR
time is set the same in both windows.
WAITFOR TIME '16:04:00'
BEGIN TRANSACTION
INSERT INTO dbo.Dim_Target (Code)
SELECT T.Code
FROM dbo.STG_Test1 AS T
LEFT OUTER JOIN dbo.Dim_Target AS D
ON D.Code = T.Code
WHERE D.Code IS NULL
COMMIT
#### 3. Thread 2: Try to insert from second parallel thread
Run this below script from second SSMS window. Make sure the WAITFOR
time is set the same in both windows.
WAITFOR TIME '16:04:00'
BEGIN TRANSACTION
INSERT INTO dbo.Dim_Target (Code)
SELECT T.Code
FROM dbo.STG_Test2 AS T
LEFT OUTER JOIN dbo.Dim_Target AS D
ON D.Code = T.Code
WHERE D.Code IS NULL
COMMIT
#### NOTES
- When I run this script, the duplicates do **not** occur every time. They happen once every two or 3 attempts, but they DO happen. The Created_Date
column shows that duplicate entries were created just a couple of milliseconds apart.
- We have 10 threads running in parallel loading all the Dimension tables, so the duplicates are occurring more frequently.
---
With a temporary solution working, I want to take time to re-think to make this scalable. All parts of the ETL are processed with multiple threads, except this one step -- which is single threaded (for now). So any ideas would be welcome.
Asked by ToC
(727 rep)
Aug 14, 2024, 05:24 PM
Last activity: Aug 17, 2024, 02:09 PM
Last activity: Aug 17, 2024, 02:09 PM