Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

15 votes

4 answers

6701 views

Foreign keys - link using surrogate or natural key?

database-design foreign-key surrogate-key natural-key

Is there a best practice for whether a foreign key between tables should link to a natural key or a surrogate key? The only discussion I've really found (unless my google-fu is lacking) is Jack Douglas' answer in this question, and his reasoning seems sound to me. I'm aware of the wider argument that rules change, but that is something that would need to be considered in any situation.

The main reason for asking is that I have a legacy application that makes use of FKs with natural keys, but there is a strong push from developers to move to an OR/M (NHibernate in our case), and a fork has already produced some breaking changes. I'm therefore looking to either push them back on track using the natural key, or move the legacy app to use surrogate keys for the FK. My gut says to restore the original FK, but I'm honestly not sure if this is really the right path to follow.

The majority of our tables already have both a surrogate and a natural key defined (through a unique constraint and PK), so having to add extra columns is a non-issue for us in this instance. We're using SQL Server 2008, but I'd hope this is generic enough for any DB.
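To make the two options concrete, here is a minimal T-SQL sketch. The table and column names (`dbo.Customer`, `CustomerCode`, the two invoice tables) are hypothetical and not taken from the legacy application. The only assumption carried over from the question is that the parent table has a surrogate `PRIMARY KEY` plus a natural key enforced by a `UNIQUE` constraint. SQL Server accepts a foreign key that references either one.

```sql
-- Hypothetical parent: surrogate PK plus natural key under a UNIQUE constraint.
CREATE TABLE dbo.Customer
(
    CustomerID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED
    , CustomerCode varchar(20) NOT NULL
        CONSTRAINT UQ_Customer_Code UNIQUE
);

-- Option A: the child links through the surrogate key.
CREATE TABLE dbo.InvoiceBySurrogate
(
    InvoiceID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_InvoiceBySurrogate PRIMARY KEY CLUSTERED
    , CustomerID int NOT NULL
        CONSTRAINT FK_InvoiceBySurrogate_Customer
        FOREIGN KEY REFERENCES dbo.Customer (CustomerID)
);

-- Option B: the child links through the natural key.
-- ON UPDATE CASCADE lets a change to CustomerCode propagate to child rows.
CREATE TABLE dbo.InvoiceByNatural
(
    InvoiceID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_InvoiceByNatural PRIMARY KEY CLUSTERED
    , CustomerCode varchar(20) NOT NULL
        CONSTRAINT FK_InvoiceByNatural_Customer
        FOREIGN KEY REFERENCES dbo.Customer (CustomerCode)
        ON UPDATE CASCADE
);
```

Option A requires a join to show the customer code next to an invoice. Option B carries the code in the child row, at the cost of wider keys and cascaded updates when the code changes.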

Callie J (492 rep)

Apr 2, 2013, 12:17 PM • Last activity: Sep 6, 2023, 07:07 AM

2 votes

1 answers

1049 views

How to get the "best of both worlds" with natural and surrogate keys? And could DBMSs be better?

foreign-key primary-key enum surrogate-key natural-key

I am designing my first database, and I find myself frustrated by the choice between storing an integer or a string for each instance of a categorical variable.

My understanding is that if I have a table containing cities that I want to make a child of a table of countries, the most performant way to do that is to have the PK of the countries table as a FK in the table of cities. However, for ease of use and debugging, it's nice to always have the string name associated with the country PK. Every solution I have considered either is not recommended or seems overly complex.

I'd like opinions on the merits of these approaches (or to hear about new ones) and also to understand if it has to be this way or if databases simply are this way because of tradition.

Possible approaches:

1. Use a string as a PK for countries. Then I will have a human-readable FK for it in any child tables. Obviously less performant than using integers, but I suspect it may be the least worst way to have the convenience I desire.

2. Create a view using application logic that joins the string name of each country to the states table.
- I don't love this because if the application logic breaks, the tables become less readable. Also I would expect large join operations to have an even worse performance penalty than string PK/FKs.

3. Create a separate table to connect numeric IDs with the appropriate string ID. I'm not sure if it would be better to have a table coding each type of relation, or one big table with one big pool of IDs that covers all integer key-string value relations. I could then use application logic to look up the appropriate strings and fill the appropriate PK into the child table when its string name is given by a user.
- I feel like this might be pretty resource intensive too, as there would have to be a lookup every time a new row was added to the child. It also means that I would still have to create the views I want.

4. Use enum data type. Instinctively, this would be my go-to approach, as it seems the ideal balance between natural and synthetic keys: Use integer IDs and give the IDs a string label so that the string itself need not be repeated. 
- Unfortunately my research has found that this is not recommended. One reason for that is that categories cannot be deleted easily. I'm not sure if that is a dealbreaker for me, but I also wonder why DBMSs are designed this way. Aren't categorical variables commonly used enough to add convenience features for them?
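Approaches 2 and 3 can be combined inside the database itself, without application logic. The following is only a sketch in generic SQL; the table, column and view names are hypothetical, and identity/auto-increment syntax is left out because it differs between DBMSs. Some systems (SQL Server, for example) require `CREATE VIEW` to be run as its own batch.

```sql
-- Hypothetical lookup table: integer key for joins, unique name for readability.
CREATE TABLE countries
(
    country_id   INT          NOT NULL PRIMARY KEY,
    country_name VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE cities
(
    city_id    INT          NOT NULL PRIMARY KEY,
    city_name  VARCHAR(100) NOT NULL,
    country_id INT          NOT NULL,
    CONSTRAINT fk_cities_countries
        FOREIGN KEY (country_id) REFERENCES countries (country_id)
);

INSERT INTO countries (country_id, country_name) VALUES (1, 'France');

-- Resolve the country name to its key at insert time, in SQL.
INSERT INTO cities (city_id, city_name, country_id)
SELECT 1, 'Lyon', country_id
FROM countries
WHERE country_name = 'France';

-- A view stored in the database restores the human-readable form.
CREATE VIEW cities_readable AS
SELECT ci.city_id, ci.city_name, co.country_name
FROM cities ci
INNER JOIN countries co ON co.country_id = ci.country_id;
```

Because the view lives in the database, it stays available even if the application breaks, which addresses the concern raised under approach 2.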

Stonecraft (125 rep)

Jul 24, 2022, 06:43 PM • Last activity: Jul 25, 2022, 12:29 PM

2 votes

1 answers

204 views

Good Natural Key For a Physical Mailing Address

mysql natural-key

I'm trying to figure out what a good natural key for a Physical Mailing (PO Box) address would be. I haven't designed the fields for the table yet, and initially I was going to go with a surrogate key and a text field for the address, allowing free input. But the other dev I'm working with is interested in having a natural key (just for this one table). However, it's not quite making sense to me, because I'm not sure how to ensure, with an address, properly input data (when keyed in either on create or to do some kind of manual lookup). I know that you can often format the same address in more than one way. For example:

Suite vs Ste.
Apt. 36 vs #36 vs Unit 36
1110 West 300 South vs 1110 W 300 South vs 1110 W. 300 S. vs....

All valid, but all different. Granted PO Boxes are more standardized, but I feel like it may still suffer from some of the same issues. Any thoughts on the matter would be *greatly* appreciated!
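For illustration only, one way a natural key for PO Boxes could be expressed is as a `UNIQUE` constraint over structured columns instead of free text. This MySQL sketch rests on a loudly stated assumption that is not established anywhere above: that box number, postal code and country together identify a PO Box. All names are hypothetical.

```sql
-- Hypothetical MySQL sketch: structured PO Box columns with a UNIQUE
-- constraint acting as the candidate natural key.
CREATE TABLE po_box_address
(
    po_box_address_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    box_number        VARCHAR(10)  NOT NULL,  -- the number only, no 'PO Box' prefix
    postal_code       VARCHAR(10)  NOT NULL,
    country_code      CHAR(2)      NOT NULL,
    raw_input         VARCHAR(255) NULL,      -- what the user originally typed
    CONSTRAINT uq_po_box UNIQUE (box_number, postal_code, country_code)
);
```

The constraint only helps if the input is normalized before it is stored (for example, by stripping "P.O. Box" variants); otherwise the formatting problem described above simply moves into `box_number`.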

Mr Mikkél (123 rep)

Mar 23, 2021, 03:52 AM • Last activity: Mar 23, 2021, 03:45 PM

2 votes

1 answers

583 views

Natural Keys vs Surrogate Keys part 2

sql-server surrogate-key natural-key

A while back, I asked if [surrogate keys provide better performance than natural keys in SQL Server](https://dba.stackexchange.com/questions/50708/do-natural-keys-provide-higher-or-lower-performance-in-sql-server-than-surrogate). [@sqlvogel](https://dba.stackexchange.com/users/1357/sqlvogel) provided an answer to that question yesterday that caused me to revisit it. This question is an attempt to "upgrade" the prior question, and hopefully provide the opportunity for thoughtful answers that help the community.

Consider a system for storing details about computers. Each computer has an architecture and an operating system. In SQL Server, we could create these tables using natural keys like this:

    CREATE TABLE dbo.Architecture
    (
        ArchitectureName varchar(10) NOT NULL
        , ArchitectureVersion decimal(5,2) NOT NULL
        , ReleaseDate date NOT NULL
        , CONSTRAINT PK_Architecture PRIMARY KEY CLUSTERED (ArchitectureName, ArchitectureVersion)
    );

    CREATE TABLE dbo.Manufacturer
    (
        ManufacturerName varchar(10) NOT NULL
            CONSTRAINT PK_Manufacturer PRIMARY KEY CLUSTERED
    );

    CREATE TABLE dbo.OS
    (
        OSName varchar(30) NOT NULL
        , ManufacturerName varchar(10) NOT NULL
            CONSTRAINT FK_OS_Manufacturer FOREIGN KEY (ManufacturerName)
            REFERENCES dbo.Manufacturer(ManufacturerName)
        , ArchitectureName varchar(10) NOT NULL
        , ArchitectureVersion decimal(5,2) NOT NULL
        , CONSTRAINT FK_OS_Architecture FOREIGN KEY (ArchitectureName, ArchitectureVersion)
            REFERENCES dbo.Architecture(ArchitectureName, ArchitectureVersion)
        , CONSTRAINT PK_OS PRIMARY KEY CLUSTERED (OSName)
    );

    CREATE TABLE dbo.Computers
    (
        ComputerID varchar(10) NOT NULL
            CONSTRAINT PK_Computers PRIMARY KEY CLUSTERED
        , OSName varchar(30) NOT NULL
            CONSTRAINT FK_Computers_OSName FOREIGN KEY
            REFERENCES dbo.OS(OSName)
        , ComputerManufacturerName varchar(10) NOT NULL
            CONSTRAINT FK_Computers_Manufacturer FOREIGN KEY
            REFERENCES dbo.Manufacturer(ManufacturerName)
        , EffectiveDate datetime NOT NULL
            CONSTRAINT DF_Computers_EffectiveDate DEFAULT (GETDATE())
        , ExpiryDate datetime NULL
    );

To query the dbo.Computers table, with 2 rows in dbo.Computers, showing various details, we could do this:

    SELECT Computers.ComputerID
        , Computers.ComputerManufacturerName
        , OSManufacturer = OS.ManufacturerName
        , Computers.OSName
        , OS.ArchitectureName
        , OS.ArchitectureVersion
    FROM dbo.Computers
        INNER JOIN dbo.OS ON Computers.OSName = OS.OSName
    WHERE Computers.EffectiveDate <= GETDATE()
        AND (Computers.ExpiryDate >= GETDATE() OR Computers.ExpiryDate IS NULL)
    ORDER BY Computers.ComputerID;

The query output is:

╔════════════╦══════════════════════════╦════════════════╦════════════╦══════════════════╦═════════════════════╗
║ ComputerID ║ ComputerManufacturerName ║ OSManufacturer ║ OSName     ║ ArchitectureName ║ ArchitectureVersion ║
╠════════════╬══════════════════════════╬════════════════╬════════════╬══════════════════╬═════════════════════╣
║ CM700-01   ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
║ CM700-02   ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
╚════════════╩══════════════════════════╩════════════════╩════════════╩══════════════════╩═════════════════════╝

The query plan for this is quite simple:

Or, we could choose to use surrogate keys, like this:

    CREATE TABLE dbo.Architecture
    (
        ArchitectureID int NOT NULL IDENTITY(1,1)
            CONSTRAINT PK_Architecture PRIMARY KEY CLUSTERED
        , ArchitectureName varchar(10) NOT NULL
        , ArchitectureVersion decimal(5,2) NOT NULL
        , ReleaseDate date NOT NULL
        , CONSTRAINT UQ_Architecture_Name UNIQUE (ArchitectureName, ArchitectureVersion)
    );

    CREATE TABLE dbo.Manufacturer
    (
        ManufacturerID int NOT NULL IDENTITY(1,1)
            CONSTRAINT PK_Manufacturer PRIMARY KEY CLUSTERED
        , ManufacturerName varchar(10) NOT NULL
    );

    CREATE TABLE dbo.OS
    (
        OS_ID int NOT NULL IDENTITY(1,1)
            CONSTRAINT PK_OS PRIMARY KEY CLUSTERED
        , OSName varchar(30) NOT NULL
            CONSTRAINT UQ_OS_Name UNIQUE
        , ManufacturerID int NOT NULL
            CONSTRAINT FK_OS_Manufacturer FOREIGN KEY
            REFERENCES dbo.Manufacturer(ManufacturerID)
        , ArchitectureID int NOT NULL
            CONSTRAINT FK_OS_Architecture FOREIGN KEY
            REFERENCES dbo.Architecture(ArchitectureID)
    );

    CREATE TABLE dbo.Computers
    (
        ComputerID int NOT NULL IDENTITY(1,1)
            CONSTRAINT PK_Computers PRIMARY KEY CLUSTERED
        , ComputerName varchar(10) NOT NULL
            CONSTRAINT UQ_Computers_Name UNIQUE
        , OS_ID int NOT NULL
            CONSTRAINT FK_Computers_OS FOREIGN KEY
            REFERENCES dbo.OS(OS_ID)
        , ComputerManufacturerID int NOT NULL
            CONSTRAINT FK_Computers_Manufacturer FOREIGN KEY
            REFERENCES dbo.Manufacturer(ManufacturerID)
        , EffectiveDate datetime NOT NULL
            CONSTRAINT DF_Computers_EffectiveDate DEFAULT (GETDATE())
        , ExpiryDate datetime NULL
    );

In the design above, you may notice we have to include several new unique constraints to ensure our data model is consistent across both approaches.

Querying this surrogate-key approach with 2 rows in dbo.Computers looks like:

    SELECT Computers.ComputerName
        , ComputerManufacturerName = cm.ManufacturerName
        , OSManufacturer = om.ManufacturerName
        , OS.OSName
        , Architecture.ArchitectureName
        , Architecture.ArchitectureVersion
    FROM dbo.Computers
        INNER JOIN dbo.OS ON Computers.OS_ID = OS.OS_ID
        INNER JOIN dbo.Manufacturer cm ON Computers.ComputerManufacturerID = cm.ManufacturerID
        INNER JOIN dbo.Architecture ON OS.ArchitectureID = Architecture.ArchitectureID
        INNER JOIN dbo.Manufacturer om ON OS.ManufacturerID = om.ManufacturerID
    WHERE Computers.EffectiveDate <= GETDATE()
        AND (Computers.ExpiryDate >= GETDATE() OR Computers.ExpiryDate IS NULL)
    ORDER BY Computers.ComputerID;

The results:

╔══════════════╦══════════════════════════╦════════════════╦════════════╦══════════════════╦═════════════════════╗
║ ComputerName ║ ComputerManufacturerName ║ OSManufacturer ║ OSName     ║ ArchitectureName ║ ArchitectureVersion ║
╠══════════════╬══════════════════════════╬════════════════╬════════════╬══════════════════╬═════════════════════╣
║ CM700-01     ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
║ CM700-02     ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
╚══════════════╩══════════════════════════╩════════════════╩════════════╩══════════════════╩═════════════════════╝

The I/O statistics are even more telling. For the natural keys, we have:

Table 'OS'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
Table 'Computers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

For the surrogate key setup, we get:

Table 'Manufacturer'. Scan count 0, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Architecture'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'OS'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Computers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
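Output in this form is what SQL Server prints to the Messages tab when `SET STATISTICS IO` is switched on for the session, which is presumably how the figures above were captured. A minimal sketch against the natural-key tables defined earlier, with a shortened column list and no `WHERE` clause for brevity:

```sql
-- Report logical/physical reads per table for each statement in this session.
SET STATISTICS IO ON;

SELECT Computers.ComputerID
    , Computers.OSName
    , OS.ArchitectureName
FROM dbo.Computers
    INNER JOIN dbo.OS ON Computers.OSName = OS.OSName
ORDER BY Computers.ComputerID;

SET STATISTICS IO OFF;
```

With only two rows per table, the read counts mostly reflect how many tables the query has to touch, so the gap between the two designs may look different at realistic data volumes.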

Quite clearly, in the above admittedly very simple setup, the surrogate key is lagging in both ease of use and performance. Having said that, what happens if we need to change the name of one of the manufacturers? Here's the T-SQL for the natural key version:

    UPDATE dbo.Manufacturer
    SET ManufacturerName = 'Microsoft™'
    WHERE ManufacturerName = 'Microsoft';

And the plan:

The T-SQL for the surrogate key version:

    UPDATE dbo.Manufacturer
    SET ManufacturerName = 'Microsoft™'
    WHERE ManufacturerID = 1;

And that plan:

The natural key version has an estimated subtree cost that is nearly three times greater than the surrogate key version. Am I correct in saying that both natural keys and surrogate keys offer benefits; deciding which methodology to use should be carefully considered? Are there common situations where the comparisons I made above don't work? What other considerations should be made when choosing natural or surrogate keys?
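One consideration the rename example touches on: in SQL Server a foreign key defaults to `ON UPDATE NO ACTION`, so changing a referenced natural key value is rejected while child rows still point at it. The sketch below assumes the natural-key tables shown earlier and redefines `FK_OS_Manufacturer` with `ON UPDATE CASCADE`. Whether the original test used cascading constraints is not stated, so treat this as an illustration only.

```sql
-- Recreate the OS -> Manufacturer foreign key so key changes cascade.
ALTER TABLE dbo.OS DROP CONSTRAINT FK_OS_Manufacturer;

ALTER TABLE dbo.OS
    ADD CONSTRAINT FK_OS_Manufacturer
    FOREIGN KEY (ManufacturerName)
    REFERENCES dbo.Manufacturer (ManufacturerName)
    ON UPDATE CASCADE;

-- The rename now also rewrites ManufacturerName in every matching dbo.OS row.
-- dbo.Computers has its own foreign key to dbo.Manufacturer and would need the
-- same treatment if any computer referenced the renamed manufacturer.
UPDATE dbo.Manufacturer
SET ManufacturerName = 'Microsoft™'
WHERE ManufacturerName = 'Microsoft';
```

The cascaded writes to child tables are part of the extra work a natural-key rename can involve, whereas the surrogate-key version touches a single row in `dbo.Manufacturer`.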

Hannah Vernon (70988 rep)

Aug 30, 2017, 08:04 PM • Last activity: Oct 22, 2019, 06:02 AM

29 votes

3 answers

6548 views

Do natural keys provide higher or lower performance in SQL Server than surrogate integer keys?

sql-server sql-server-2012 performance surrogate-key natural-key performance-testing

I'm a fan of surrogate keys. There is a risk my findings are confirmation biased.

Many questions I've seen both here and at http://stackoverflow.com use natural keys instead of surrogate keys based on `IDENTITY()` values.

My background in computer systems tells me performing any comparative operation on an integer will be faster than comparing strings.

This comment made me question my beliefs, so I thought I would create a system to investigate my thesis that integers are faster than strings for use as keys in SQL Server.

Since there is likely to be very little discernible difference in small datasets, I immediately thought of a two table setup where the primary table has 1,000,000 rows and the secondary table has 10 rows for each row in the primary table for a total of 10,000,000 rows in the secondary table.  The premise of my test is to create two sets of tables like this, one using natural keys and one using integer keys, and run timing tests on a simple query like:

    SELECT *
    FROM Table1
        INNER JOIN Table2 ON Table1.Key = Table2.Key;

The following is the code I created as a test bed:

	USE Master;
	IF (SELECT COUNT(database_id) FROM sys.databases d WHERE d.name = 'NaturalKeyTest') = 1
	BEGIN
		ALTER DATABASE NaturalKeyTest SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
		DROP DATABASE NaturalKeyTest;
	END
	GO
	CREATE DATABASE NaturalKeyTest 
    	ON (NAME = 'NaturalKeyTest', FILENAME = 
            'C:\SQLServer\Data\NaturalKeyTest.mdf', SIZE=8GB, FILEGROWTH=1GB) 
    	LOG ON (NAME='NaturalKeyTestLog', FILENAME = 
            'C:\SQLServer\Logs\NaturalKeyTest.mdf', SIZE=256MB, FILEGROWTH=128MB);
	GO
	ALTER DATABASE NaturalKeyTest SET RECOVERY SIMPLE;
	GO
	USE NaturalKeyTest;
	GO
	CREATE VIEW GetRand
	AS 
		SELECT RAND() AS RandomNumber;
	GO
	CREATE FUNCTION RandomString
	(
		@StringLength INT
	)
	RETURNS NVARCHAR(max)
	AS
	BEGIN
		DECLARE @cnt INT = 0
		DECLARE @str NVARCHAR(MAX) = '';
		DECLARE @RandomNum FLOAT = 0;
		WHILE @cnt < @StringLength
		BEGIN
			SELECT @RandomNum = RandomNumber
			FROM GetRand;
			SET @str = @str + CAST(CHAR((@RandomNum * 64.) + 32) AS NVARCHAR(MAX)); 
			SET @cnt = @cnt + 1;
		END
		RETURN @str;
	END;
	GO
	CREATE TABLE NaturalTable1
	(
		NaturalTable1Key NVARCHAR(255) NOT NULL 
		    CONSTRAINT PK_NaturalTable1 PRIMARY KEY CLUSTERED 
		, Table1TestData NVARCHAR(255) NOT NULL 
	);
	CREATE TABLE NaturalTable2
	(
		NaturalTable2Key NVARCHAR(255) NOT NULL 
		    CONSTRAINT PK_NaturalTable2 PRIMARY KEY CLUSTERED 
		, NaturalTable1Key NVARCHAR(255) NOT NULL 
		    CONSTRAINT FK_NaturalTable2_NaturalTable1Key 
		    FOREIGN KEY REFERENCES dbo.NaturalTable1 (NaturalTable1Key) 
		    ON DELETE CASCADE ON UPDATE CASCADE
		, Table2TestData NVARCHAR(255) NOT NULL  
	);
	GO

	/* insert 1,000,000 rows into NaturalTable1 */
	INSERT INTO NaturalTable1 (NaturalTable1Key, Table1TestData) 
		VALUES (dbo.RandomString(25), dbo.RandomString(100));
	GO 1000000 

	/* insert 10,000,000 rows into NaturalTable2 */
	INSERT INTO NaturalTable2 (NaturalTable2Key, NaturalTable1Key, Table2TestData)
	SELECT dbo.RandomString(25), T1.NaturalTable1Key, dbo.RandomString(100)
	FROM NaturalTable1 T1
	GO 10 

	CREATE TABLE IDTable1
	(
		IDTable1Key INT NOT NULL CONSTRAINT PK_IDTable1 
		PRIMARY KEY CLUSTERED IDENTITY(1,1)
		, Table1TestData NVARCHAR(255) NOT NULL 
		CONSTRAINT DF_IDTable1_TestData DEFAULT dbo.RandomString(100)
	);
	CREATE TABLE IDTable2
	(
		IDTable2Key INT NOT NULL CONSTRAINT PK_IDTable2 
		    PRIMARY KEY CLUSTERED IDENTITY(1,1)
		, IDTable1Key INT NOT NULL 
		    CONSTRAINT FK_IDTable2_IDTable1Key FOREIGN KEY 
		    REFERENCES dbo.IDTable1 (IDTable1Key) 
		    ON DELETE CASCADE ON UPDATE CASCADE
		, Table2TestData NVARCHAR(255) NOT NULL 
		    CONSTRAINT DF_IDTable2_TestData DEFAULT dbo.RandomString(100)
	);
	GO
	INSERT INTO IDTable1 DEFAULT VALUES;
	GO 1000000
	INSERT INTO IDTable2 (IDTable1Key)
	SELECT T1.IDTable1Key
	FROM IDTable1 T1
	GO 10

The code above creates a database and 4 tables, and fills the tables with data, ready to test.  The test code I ran is:

	USE NaturalKeyTest;
	GO
	DECLARE @loops INT = 0;
	DECLARE @MaxLoops INT = 10;
	DECLARE @Results TABLE (
		FinishedAt DATETIME DEFAULT (GETDATE())
		, KeyType NVARCHAR(255)
		, ElapsedTime FLOAT
	);
	WHILE @loops < @MaxLoops
	BEGIN
		DBCC FREEPROCCACHE;
		DBCC FREESESSIONCACHE;
		DBCC FREESYSTEMCACHE ('ALL');
		DBCC DROPCLEANBUFFERS;
		WAITFOR DELAY '00:00:05';
		DECLARE @start DATETIME = GETDATE();
		DECLARE @end DATETIME;
		DECLARE @count INT;
		SELECT @count = COUNT(*) 
		FROM dbo.NaturalTable1 T1
			INNER JOIN dbo.NaturalTable2 T2 ON T1.NaturalTable1Key = T2.NaturalTable1Key;
		SET @end = GETDATE();
		INSERT INTO @Results (KeyType, ElapsedTime)
		SELECT 'Natural PK' AS KeyType, CAST((@end - @start) AS FLOAT) AS ElapsedTime;

		DBCC FREEPROCCACHE;
		DBCC FREESESSIONCACHE;
		DBCC FREESYSTEMCACHE ('ALL');
		DBCC DROPCLEANBUFFERS;
		WAITFOR DELAY '00:00:05';
		SET @start = GETDATE();
		SELECT @count = COUNT(*) 
		FROM dbo.IDTable1 T1
			INNER JOIN dbo.IDTable2 T2 ON T1.IDTable1Key = T2.IDTable1Key;
		SET @end = GETDATE();
		INSERT INTO @Results (KeyType, ElapsedTime)
		SELECT 'IDENTITY() PK' AS KeyType, CAST((@end - @start) AS FLOAT) AS ElapsedTime;

		SET @loops = @loops + 1;
	END
	SELECT KeyType, FORMAT(CAST(AVG(ElapsedTime) AS DATETIME), 'HH:mm:ss.fff') AS AvgTime 
	FROM @Results
	GROUP BY KeyType;

These are the results:



Am I doing something wrong here, or are INT keys 3 times faster than 25 character natural keys?
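One detail worth knowing when reading figures from the test harness above: casting a `DATETIME` to `FLOAT` yields a number of days, so `ElapsedTime` is stored in fractional days and the final `CAST(... AS DATETIME)` turns the average back into a time of day. A hedged alternative, not part of the original test, is to record milliseconds directly with `DATEDIFF`:

```sql
DECLARE @start DATETIME = GETDATE();

-- Stand-in for the query being timed.
WAITFOR DELAY '00:00:01';

DECLARE @end DATETIME = GETDATE();

SELECT CAST(@end - @start AS FLOAT) AS ElapsedDays
    , CAST(@end - @start AS FLOAT) * 86400000.0 AS ElapsedMsFromFloat
    , DATEDIFF(MILLISECOND, @start, @end) AS ElapsedMs;
```

Both approaches are limited by the resolution of `DATETIME` (roughly 3 ms), which is negligible for joins over ten million rows.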

Note, I've written a follow-up question [here](https://dba.stackexchange.com/questions/184756/natural-keys-vs-surrogate-keys-part-2-the-showdown).

Hannah Vernon (70988 rep)

Sep 29, 2013, 01:56 AM • Last activity: Mar 18, 2018, 03:22 PM

5 votes

2 answers

2189 views

Surrogate key vs Natural key

database-design best-practices surrogate-key natural-key

I have a table called devices. Most of the devices that will get stored in this table can be uniquely identified by their serial number and part number. But there are some device types that do not have a serial number and part number assigned to them. Instead they can be uniquely identified by another field (internal id).

Should I create a surrogate key for this table, or should I create a composite primary key (serial number, part number, internal id) and insert default values into the serial number and part number columns when they are not supplied? The device types that do not have a part number and serial number now will have the numbers assigned to them in future releases (maybe 5 years from now). Should I create a surrogate key or a composite key in this scenario? Or, using the three unique attributes, should I create a hash in the program and use that as a surrogate key for the tables?
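As a sketch of the surrogate-key option without placeholder default values, the following T-SQL keeps the identifying columns nullable and enforces uniqueness only where values exist. It assumes SQL Server 2008 or later (for filtered indexes), which the question does not specify, and all names and column sizes are hypothetical.

```sql
CREATE TABLE dbo.Devices
(
    DeviceID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Devices PRIMARY KEY CLUSTERED
    , SerialNumber varchar(50) NULL
    , PartNumber varchar(50) NULL
    , InternalID varchar(50) NULL
    -- Every row must carry at least one complete identifier.
    , CONSTRAINT CK_Devices_HasIdentifier CHECK
        (
            (SerialNumber IS NOT NULL AND PartNumber IS NOT NULL)
            OR InternalID IS NOT NULL
        )
);

-- Serial number + part number must be unique when both are present.
CREATE UNIQUE INDEX UQ_Devices_Serial_Part
    ON dbo.Devices (SerialNumber, PartNumber)
    WHERE SerialNumber IS NOT NULL AND PartNumber IS NOT NULL;

-- The internal id must be unique when present.
CREATE UNIQUE INDEX UQ_Devices_InternalID
    ON dbo.Devices (InternalID)
    WHERE InternalID IS NOT NULL;
```

When serial and part numbers are assigned to the remaining device types later, the rows can simply be updated; `DeviceID` and any foreign keys pointing at it stay unchanged.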

DBK (378 rep)

Aug 26, 2017, 04:07 AM • Last activity: Mar 18, 2018, 12:43 PM

0 votes

1 answers

58 views

coping with long natural string key

mysql natural-key

I have inherited a (quite) big database (by my standards at least, since by no means do I call myself a DBA or DB dev).  
So, on this DB they used a unique natural string as the primary key on EVERY table. In my novice experience it seems that this is slowing joins down.  
Just to give some more info on this, our main table has roughly 1.5 million rows. The key is a taxing office id number, which might have some leading 0s, so they declared it as a varchar(12) in order not to lose the leading 0s.  
One of our secondary tables has more than 3 million rows, and another one has roughly the same number of rows as the main table. Each of these tables (and many more of smaller size) uses the same varchar(12) as a foreign key to the main table mentioned before.  

Questions  

 1. Would it make any difference using an int instead of this varchar as the key?
 2. If the answer to 1 is yes: there are a few applications that use this database, and they all use this key for joining tables. Is there a way to keep the old columns for joining the tables, but get some advantage from using an int column for the key? I.e., can I change the key but keep the same old queries?

Edit:  
One major limitation on changing the queries is that most are not stored procedures. They are hard-coded inside the actual applications.
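A rough MySQL sketch of what question 2 could look like: the existing varchar(12) columns stay in place so the hard-coded queries keep working, and an integer column is added alongside them. The names `main_table`, `secondary_table`, `tax_office_id` and `main_id` are hypothetical stand-ins, since the real schema is not shown.

```sql
-- Why the key was declared as varchar: a numeric type drops leading zeros.
SELECT CAST('000123456789' AS UNSIGNED) AS as_number;  -- yields 123456789

-- Add an integer surrogate next to the existing natural key.
-- MySQL requires an AUTO_INCREMENT column to be indexed, hence the unique key.
ALTER TABLE main_table
    ADD COLUMN main_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD UNIQUE KEY uq_main_id (main_id);

-- Carry the new integer into a child table and fill it via the old key.
ALTER TABLE secondary_table
    ADD COLUMN main_id INT UNSIGNED NULL;

UPDATE secondary_table s
INNER JOIN main_table m ON m.tax_office_id = s.tax_office_id
SET s.main_id = m.main_id;
```

Existing queries that join on the varchar column are unaffected by this change, but they also gain nothing from it; only queries rewritten to join on `main_id` would use the integer column.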
                                

Skaros Ilias (131 rep)

Jan 26, 2018, 10:02 AM • Last activity: Jan 26, 2018, 04:41 PM

1 votes

0 answers

85 views

Is the source system typically one of the fields used to uniquely identify business key?

database-design best-practices master-data-services natural-key data-vault

One of the key components of designing a data vault is identifying **enterprise-wide unique business keys** ("business key" AKA "natural key"). It's not enough to use `OrderID` to identify records in an `Orders` table, because when you add another orders table from another source system (e.g., a brick-and-mortar store launches an online store), there may well be collisions.

In Data Vault best practices, should one include the name of the source system as one of the fields used to uniquely identify a record, when adding it to a hub table?  Rather than searching for some combination of fields in Orders which will *hopefully* be universally unique ({ CustomerID, OrderID, and OrderDate }, maybe), one could identify brick-and-mortar orders with { "Brick-and-Mortar Order System", OrderID }, and one could identify online orders with { "Online Order System", OrderID }:

This seems like the obvious solution, but I've not seen it mentioned in any of the Data Vault modeling documents I've found, so I wonder if there's some reason not to use this approach.

The Data Vault approach seems to be an outgrowth of Master Data Management practices, so if this is a solved problem in MDM, that solution probably applies here too.
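The proposal can be sketched as a hub table whose business key is the pair of source system and order id. This is a T-SQL illustration of the idea as described above, not a statement of Data Vault best practice; the table name, column names and data types are assumptions.

```sql
-- Hypothetical hub: the business key is the pair (SourceSystem, OrderID).
CREATE TABLE dbo.HubOrder
(
    HubOrderKey int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_HubOrder PRIMARY KEY CLUSTERED
    , SourceSystem varchar(50) NOT NULL
    , OrderID int NOT NULL
    , LoadDate datetime NOT NULL
        CONSTRAINT DF_HubOrder_LoadDate DEFAULT (GETDATE())
    , CONSTRAINT UQ_HubOrder_BusinessKey UNIQUE (SourceSystem, OrderID)
);

-- The same OrderID from two systems no longer collides.
INSERT INTO dbo.HubOrder (SourceSystem, OrderID)
VALUES ('Brick-and-Mortar Order System', 1001)
    , ('Online Order System', 1001);
```

A trade-off worth weighing is that if both systems ever describe the same real-world order, this key treats them as two distinct hub rows.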

Jon of All Trades (5987 rep)

Feb 17, 2017, 11:19 PM

4 votes

1 answers

800 views

Normalized Data Store - Confused with prefixes to use

sql-server database-design foreign-key naming-convention natural-key

I'm designing a Staging+NDS+DDS Data Warehouse system, where an ETL is going to normalize data from `[Staging]` and load it into `[NDS]`, which will hold all history.

I've pretty much finished the T-SQL script that will create the tables and constraints in the [NDS] database, which contains *Master* and *Transactional* tables, that will respectively feed [DDS] *Dimension* and *Fact* tables in what I'm intending to be a star schema.

I've given myself the following rules to follow:

 - Tables sourcing [DDS] dimensions are prefixed with DWD_
 - Tables sourcing [DDS] facts are prefixed with DWF_
 - Foreign key columns are prefixed with DWK_
 - Surrogate key column is prefixed with the same prefix as the table. Which means the surrogate key is always either:
    - DWD_Key for a DWD_ table, or
    - DWF_Key for a DWF_ table.
 - Control columns are prefixed with the same prefix as the table. For example...
    - The DWD_Customers table has control columns:
        - DWD_IsLastImage
        - DWD_EffectiveFrom
        - DWD_EffectiveTo
        - DWD_DateInserted
        - DWD_DateUpdated
        - DWD_DateDeleted
    - The DWF_InvoiceHeaders table has control columns:
        - DWF_DateInserted
        - DWF_DateUpdated
        - DWF_DateDeleted
 - Primary keys (/surrogate keys) are always prefixed with PK_ followed by the table name (including the table prefix) - e.g. PK_DWD_Customers and PK_DWF_InvoiceHeaders.
 - I also added a unique constraint on *natural keys*, and those are always prefixed with NK_ followed by the table name (including the table prefix) - e.g. NK_DWD_Customers and NK_DWF_InvoiceHeaders.
 - Foreign key columns are always prefixed with DWK_ followed by the name of the referenced table (without its prefix) and the word "Key" - e.g. DWK_CustomerKey.
 - Foreign key constraints are always named FK_[ParentTableNameWithPrefix]_[ChildTableNameWithPrefix].
 - When a table has multiple FK's to the same table, the name of the FK column is appended to the constraint's name, e.g. FK_DWD_FiscalCalendar_DWF_OrderDetails_DeliveryDate.

All prefixed columns have no business meaning and should never appear in views; this leaves me with, I find, a pretty clean and consistent design, and create table scripts looking like this:

    create table DWD_SubCategories (
    	 DWD_Key int not null identity(1,1)
    	,DWD_DateInserted datetime not null
    	,DWD_DateUpdated datetime null
    	,DWK_CategoryKey int not null
    	,Code nvarchar(5) not null
    	,Name nvarchar(50) not null
    	,constraint PK_DWD_SubCategories primary key clustered (DWD_Key asc)
    	,constraint NK_DWD_SubCategories unique (Code)
    );

---

So, my question is, is there anything I should know (or *unlearn*) before I continue and implement the ETL to load data into this database? Would anyone inheriting this database want to chase me down and rip my head off in the future? What should I change to avoid this? The reason I'm asking about prefixes, is because I'm using DWD and DWF, but the tables are technically not "dimension" and "fact" tables. Is that confusing?

Also, I'm unsure about the concept of *natural key* - am I correct to presume it should be a *unique* combination of columns that the source system might consider its "key" columns, that I can use in the ETL process to locate, say, a specific record to update?
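To illustrate that use of a natural key, here is a hedged T-SQL sketch of an ETL update that locates rows in `DWD_SubCategories` through the `Code` column covered by `NK_DWD_SubCategories`. The staging table `[Staging].dbo.SubCategories` and its columns are hypothetical; only the target table comes from the script above.

```sql
-- Hypothetical staging source: [Staging].dbo.SubCategories (Code, Name).
UPDATE tgt
SET tgt.Name = src.Name
    , tgt.DWD_DateUpdated = GETDATE()
FROM dbo.DWD_SubCategories AS tgt
    INNER JOIN [Staging].dbo.SubCategories AS src
        ON src.Code = tgt.Code   -- the natural key locates the row
WHERE tgt.Name <> src.Name;      -- only touch rows that actually changed
```

The surrogate `DWD_Key` never appears in the matching logic; it exists only for other tables in the warehouse to reference.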
                                

Mathieu Guindon (914 rep)

Nov 19, 2014, 04:19 PM • Last activity: Nov 19, 2014, 05:38 PM

1 votes

2 answers

592 views

Column suitable for natural key?

mysql primary-key natural-key

I have read many articles now about *natural* vs *surrogate* primary keys, and came to the conclusion there is no single best practice.

I have a table that will have around 2000 definite unique values, each value will range from 5 characters to 40 in length.

This seems like a partial choice as a natural key, although the values which are 40 characters in length may cause some performance and storage issues when they are referenced elsewhere.

As the maximum number of rows in this table is fixed at 2000, and 35% of these rows contain values 25-40 characters long (65% are 6-25 characters long), should I go with a natural key here?

With your experience, what would you do here?
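For comparison, a minimal MySQL sketch of the surrogate alternative. The table and column names are hypothetical. `SMALLINT UNSIGNED` covers 0 to 65535, which comfortably fits 2000 rows, and each referencing row then stores 2 bytes instead of a string of up to 40 characters.

```sql
-- Hypothetical names; the natural value stays unique but is no longer the key.
CREATE TABLE category
(
    category_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    category_name VARCHAR(40)       NOT NULL,
    CONSTRAINT uq_category_name UNIQUE (category_name)
);

CREATE TABLE item
(
    item_id     INT UNSIGNED      NOT NULL AUTO_INCREMENT PRIMARY KEY,
    category_id SMALLINT UNSIGNED NOT NULL,   -- 2 bytes per referencing row
    CONSTRAINT fk_item_category FOREIGN KEY (category_id)
        REFERENCES category (category_id)
);
```

The cost is a join to `category` whenever the readable value is needed; with the natural key as the primary key, the value would already be present in `item`.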

cecilli0n (305 rep)

Mar 2, 2014, 10:14 PM • Last activity: Jul 21, 2014, 11:15 PM

Showing page 1 of 10 total questions