Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

0 votes

1 answers

145 views

Working with varchar field in Clustered Index. (Performance Test)

I am using Microsoft SQL Server 2016. For my products table; I need to define a ClusteredIndex with 2 fields of Product Type and Product Code. ProductType(tinyint), ProductCode(varchar(32)). I know that ProductId is the right option for this, but it affects my software development speed. But in term...

                                  I am using Microsoft SQL Server 2016.

For my products table;

I need to define a ClusteredIndex with 2 fields of Product Type and Product Code. ProductType(tinyint), ProductCode(varchar(32)).

I know that ProductId is the right option for this, but it affects my software development speed. 

But in terms of speed, I want my Product table to run fast.

Estimately there will be 1,000,000 records in the table.

What I want to ask is if using varchar(16) instead of varchar(32) in ProductCode, how much will it affect performance in queries?

Is there a software that I can test as ProductCode varchar(32) and varchar(16) in 1,000,000 lines? Can I do this on sql Management studio?

16 characters is enough right now. However, the user may want the product code to be given by the system itself  in this case I want to automatically set a product code with newid.

Product Id(int) is not the right solution for me. Because many transactions (orders, warehouses, sales) that have their own product codes are executed through this code.

Edit:
I tested with SqQueryStress based on the comment.

I am sharing the results.

I created 2 tables as products and sales.

productsID,salesID;
- productsID.id clustered index
- salesID.productId non clustered index

products16,sales16 ; 
- products16.productCode clustered index
- sales16.productCode non clustered index

products36,sales36; 
- products36.productCode clustered index
- sales36.productCode non clustered index

(https://www.db-fiddle.com/f/3qJM9uupQoXAtLgL7u8YaE/0) 

I entered 100,000 products, random results

productsId 06:03
[![enter image description here]]

products16 05:46
[![enter image description here]]

products36 05:42

I entered a sales record from 10,000 random products tables.

salesID 01:42

sales16 02:34

sales36 02:01

I listed the sales reports according to the data I entered. (1000 times)

Does that mean reporting 1000 * 10,000 (sales item) 10,000,000 rows?

salesId sales report (join) 01:07

sales16 sales report (join) 02:05

sales36 sales report (join) 02:55

I used SqlQueryStress for the first time, I don't know much about interpretation.

**When I pull 10,000,000 sales data.
Reporting sales36 and sales16 as well 02:55 (175 seconds) 02:05 (125 seconds) Is there a 40% performance difference?**

omerix (101 rep)

Aug 31, 2022, 09:42 AM • Last activity: Jul 21, 2025, 05:01 AM

0 votes

5 answers

162 views

Log table without primary key

sql-server primary-key clustered-index identity

I have a SQL Server log table that includes a column "Id" This is an identity column, but not a primary key, it is not even indexed. This would have just been set up from some tutorial for adding DB logging, and I'm not sure if the PK was just missed or intentionally left out. Given its logging all...

                                  I have a SQL Server log table that includes a column "Id"
This is an identity column, but not a primary key, it is not even indexed.

This would have just been set up from some tutorial for adding DB logging, and I'm not sure if the PK was just missed or intentionally left out.

Given its logging all concern is to write performance, not read.

I would have thought if its an identity column it would have to be checking the data anyway to do an insert at correct numbers, should this be set up as a pk for that reason?

Is there any benefit to leaving it with no pk?

F Dev (1 rep)

Sep 12, 2024, 11:15 PM • Last activity: Jul 16, 2025, 06:01 AM

0 votes

2 answers

175 views

Non-clustered index covering a clustered index

sql-server index index-tuning clustered-index

I have a table that has the following indexes: **IXC** clustered index that includes a single column : A asc. **IX1** that includes two columns : A asc, B asc. In this case ICX seems like a duplicate index of IX1. Is it a good idea to delete ICX and make IX1 the clustered index in that case? Or is t...

                                  I have a table that has the following indexes:

**IXC** clustered index that includes a single column : A asc.

**IX1** that includes two columns : A asc, B asc.

In this case ICX seems like a duplicate index of IX1. Is it a good idea to delete ICX and make IX1 the clustered index in that case? Or is there scenarios where this would be a bad idea?

DDL example : 

    CREATE TABLE [dbo].[foo] (
        [A]      NVARCHAR (20)  NOT NULL,
        [B]      INT           NULL,
        [C]      NVARCHAR (20)  NOT NULL,
        [D]      NVARCHAR (20)  NOT NULL,
        [E]      NVARCHAR (20)  NOT NULL,
    );
    
    
    GO
    CREATE CLUSTERED INDEX [IXC]
        ON [dbo].[foo]([A] ASC);
    
    GO
    CREATE NONCLUSTERED INDEX [IX1]
        ON [dbo].[foo]([A] ASC, [B] ASC);


                                

Kimo (229 rep)

Jan 16, 2020, 07:10 AM • Last activity: Jul 15, 2025, 12:43 PM

0 votes

1 answers

172 views

Optimizing PostgreSQL Query with DISTINCT ON, Filtering, and Ordering

postgresql index index-tuning clustered-index

I'm working on optimizing a query for a PostgreSQL database used in an educational platform. The query retrieves distinct `question_id` values along with their `is_seen` status from the `question_interaction` table based on specific `user`, `customer_id`, `student_id`, and `tag` criteria. Here's the...

I'm working on optimizing a query for a PostgreSQL database used in an educational platform. The query retrieves distinct question_id values along with their is_seen status from the question_interaction table based on specific user, customer_id, student_id, and tag criteria. Here's the query:

SELECT DISTINCT ON (question_id) question_id, is_seen
FROM question_interaction
WHERE user_id = 'USR234342'
  AND customer_id = '39874513-8bb1-461e-a73f-9e73fa31870d'
  AND student_id = '8179483214'
  AND tags @> ARRAY ['history']::TEXT[]
  AND is_seen IS NOT NULL
ORDER BY question_id, timestamp DESC;

Here is the selectivity order of the columns:

question_id > student_id > user_id > customer_id

*Can someone provide guidance on how to create an effective index for this query?* Specifically, I'm looking for recommendations on: - The choice of columns for the index - The order of columns in the index - Any additional conditions to consider

sujeet (257 rep)

Aug 21, 2023, 07:06 AM • Last activity: Jul 13, 2025, 09:04 PM

40 votes

6 answers

12256 views

Why do sequential GUID keys perform faster than sequential INT keys in my test case?

sql-server clustered-index

After asking [this][1] question comparing sequential and non-sequential GUIDs, I tried to compare the INSERT performance on 1) a table with a GUID primary key initialized sequentially with `newsequentialid()`, and 2) a table with an INT primary key initialized sequentially with `identity(1,1)`. I wo...

                                  After asking this  question comparing sequential and non-sequential GUIDs, I tried to compare the INSERT performance on 1) a table with a GUID primary key initialized sequentially with newsequentialid(), and 2) a table with an INT primary key initialized sequentially with identity(1,1). I would expect the latter to be fastest because of the smaller width of integers, and it also seems simpler to generate a sequential integer than a sequential GUID. But to my surprise, INSERTs on the table with the integer key were significantly slower than the sequential GUID table.

This shows the average time usage (ms) for the test runs: 

    NEWSEQUENTIALID()  1977
    IDENTITY()         2223

Can anyone explain this?

The following experiment was used:

    SET NOCOUNT ON
      
    CREATE TABLE TestGuid2 (Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))
    
    CREATE TABLE TestInt (Id Int NOT NULL identity(1,1) PRIMARY KEY,
    SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))
    
    DECLARE @BatchCounter INT = 1
    DECLARE @Numrows INT = 100000
    
    
    WHILE (@BatchCounter <= 20)
    BEGIN 
    BEGIN TRAN
    
    DECLARE @LocalCounter INT = 0
    
        WHILE (@LocalCounter <= @NumRows)
        BEGIN
        INSERT TestGuid2 (SomeDate,batchNumber) VALUES (GETDATE(),@BatchCounter)
        SET @LocalCounter +=1
        END
    
    SET @LocalCounter = 0
    
        WHILE (@LocalCounter <= @NumRows)
        BEGIN
        INSERT TestInt (SomeDate,batchNumber) VALUES (GETDATE(),@BatchCounter)
        SET @LocalCounter +=1
        END
    
    SET @BatchCounter +=1
    COMMIT 
    END
    
    DBCC showcontig ('TestGuid2')  WITH tableresults
    DBCC showcontig ('TestInt')  WITH tableresults
    
    SELECT batchNumber,DATEDIFF(ms,MIN(SomeDate),MAX(SomeDate)) AS [NEWSEQUENTIALID()]
    FROM TestGuid2
    GROUP BY batchNumber
    
    SELECT batchNumber,DATEDIFF(ms,MIN(SomeDate),MAX(SomeDate)) AS [IDENTITY()]
    FROM TestInt
    GROUP BY batchNumber
    
    DROP TABLE TestGuid2
    DROP TABLE TestInt

**UPDATE:**
Modifying the script to perform the insertions based on a TEMP table, like in the examples by by Phil Sandler, Mitch Wheat and Martin below, I also find that IDENTITY is faster as it should be. But that is not the conventional way of inserting rows, and I still do not understand why the experiment went wrong at first: 
even if I omit GETDATE() from my original example, IDENTITY() is still way slower. So it seems that the only way to make IDENTITY() outperform NEWSEQUENTIALID() is to prepare the rows to insert in a temporary table and perform the many insertions as a batch-insert using this temp table. All in all, I don't think we have found an explanation to the phenomenon, and IDENTITY() still seems to be slower for most practical usages. Can anyone explain this?
                                

someName (591 rep)

May 13, 2011, 10:02 PM • Last activity: Jun 23, 2025, 02:49 PM

1 votes

2 answers

345 views

Creating a clustered index on a view, minimising contention

sql-server sql-server-2016 clustered-index concurrency materialized-view

I have a table of 100 billion rows with an identity bigint column that is the clustered primary key. I have a schema bound view on that table that is filtered down to the last 500 million rows or so. I want to create a clustered index on the view that incorporates the primary key and a few other col...

                                  I have a table of 100 billion rows with an identity bigint column that is the clustered primary key.

I have a schema bound view on that table that is filtered down to the last 500 million rows or so. I want to create a clustered index on the view that incorporates the primary key and a few other columns.

The table is highly transactional, so can't really be locked for more than a minute at a time.

Will the creation of the index on the view cause contention directly on the underlying table during the creation?

If so, is it possible for me to create the index on my schema bound view in an incremental manner (i.e. somehow pause part of the way through creation, to allow the table to catch up on its transaction backlog, then resume creation).

I have to be careful with overall server contention as well; I've seen heavy operations against one table like this slow up my whole server before.

Essentially, I'm looking for the most efficient way to create the index on my view to minimize contention.

I'm using an indexed view because I'm using the HASHBYTES function to generate a hash of the row, then creating an index on that hash and the primary key of the table. This allows me to quickly compare 500 million rows of data with another large dataset for any changes in data. (I realize I probably could do this in a computed column on the table itself as well, but I figured there may be less contention if the data is a materialized copy as an indexed view.)

The table is not partitioned.

J.D. (40893 rep)

Dec 16, 2019, 08:26 PM • Last activity: May 8, 2025, 08:25 AM

0 votes

1 answers

46 views

What happens if I insert a row in the middle of an sqlite table `WITHOUT ROWID`?

optimization sqlite clustered-index sqlite3 physical-design

`WITHOUT ROWID` is an sqlite optimization. Doc states, it is making the primary key of the table to a clustered index. As far I know, clustered index means that the physical ordering of the rows will follow the index, making linear scans effective on it. However, what will happen if I insert a row i...

                                  WITHOUT ROWID is an sqlite optimization. Doc states, it is making the primary key of the table to a clustered index.

As far I know, clustered index means that the physical ordering of the rows will follow the index, making linear scans effective on it.

However, what will happen if I insert a row into the middle of such a table? Will it be really needed to shift the whole table file after that?

peterh (2137 rep)

May 4, 2025, 08:14 AM • Last activity: May 4, 2025, 02:15 PM

0 votes

1 answers

1037 views

High CPU usage in AWS RDS

mysql query-performance amazon-rds clustered-index wordpress

When I visit the woocommerce orders page in WordPress, MySQL RDS CPU usage goes to 100% but the website is working perfectly fine. In 'active sessions' section, "wait/io/tables/sql/handler" is showing cpu usage to 99%. I looked at the performance insights of the database and saw this query: SELECT S...

                                  When I visit the woocommerce orders page in WordPress, MySQL RDS CPU usage goes to 100% but the website is working perfectly fine.
In 'active sessions' section, "wait/io/tables/sql/handler" is showing cpu usage to 99%.
I looked at the performance insights of the database and saw this query:

    SELECT SQL_CALC_FOUND_ROWS  hbm_posts.*, low_stock_amount_meta.meta_value AS low_stock_amount, MAX( product_lookup.date_created ) AS last_order_date FROM hbm_posts  LEFT JOIN hbm_wc_product_meta_lookup wc_product_meta_lookup ON hbm_posts.ID = wc_product_meta_lookup.product_id  LEFT JOIN hbm_postmeta AS low_stock_amount_meta ON hbm_posts.ID = low_stock_amount_meta.post_id AND low_stock_amount_meta.meta_key = '_low_stock_amount'  LEFT JOIN hbm_wc_order_product_lookup product_lookup ON hbm_posts.ID = CASE
    				WHEN hbm_posts.post_type = 'product' THEN product_lookup.product_id
    				WHEN hbm_posts.post_type = 'product_variation' THEN product_lookup.variation_id
    			END WHERE 1=1  AND hbm_posts.post_type IN ('product', 'product_variation') AND ((hbm_posts.post_status = 'publish'))
    			AND wc_product_meta_lookup.stock_quantity IS NOT NULL
    			AND wc_product_meta_lookup.stock_status IN('instock','outofstock')
    			AND (
    				(
    					low_stock_amount_meta.meta_value > ''
    					AND wc_product_meta_lookup.stock_quantity  ''
					AND wc_product_meta_lookup.stock_quantity <= CAST(low_stock_amount_meta.meta_value AS SIGNED)
				)
				OR (
					(
						low_stock_amount_meta.meta_value IS NULL OR low_stock_amount_meta.meta_value <= ''
					)
					AND wc_product_meta_lookup.stock_quantity <= 2
				)
			) GROUP BY hbm_posts.ID ORDER BY hbm_posts.post_date DESC, hbm_posts.ID DESC LIMIT 0, 1_posts.ID ORDER BY hbm_posts.post_date DESC, hbm_posts.ID DESC LIMIT 0, 1

I have tried tracing back the query in WordPress, but can't find it anywhere. I searched it with 'string locator', saw query logs in 'query monitor', tried disabling all plugins, and also tried "define('SAVEQUERIES', true);" as stated in this post:

https://stackoverflow.com/questions/4660692/is-it-possible-to-print-a-log-of-all-database-queries-for-a-page-request-in-word 

What can I do to trace back this?
Mysql version of server and client is 5.7.34
                                

Anish Sapkota (101 rep)

Jul 19, 2021, 07:21 AM • Last activity: Apr 13, 2025, 11:05 PM

10 votes

1 answers

485 views

Clustered Index Update operator performing excessive Logical Reads after query change

sql-server execution-plan sql-server-2019 clustered-index

If I have the following query in the Stack Overflow2010 database UPDATE dbo.Posts SET Title = CASE WHEN CreationDate <= '2008-01-01T00:00:00' THEN 'A' ELSE 'B' END FROM dbo.Posts The abridged STATISTICS IO output is below Table 'Posts'. Scan count 1, logical reads 445699, physical reads 375822, page server reads 0, read-ahead reads 445521, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0. (1878895 rows affected) and the execution plan is here If I create a table to store the value I want use in my comparison: CREATE TABLE dbo.Canary ( TheDate DATETIME ) INSERT INTO dbo.Canary VALUES ('2008-01-01T00:00:00') and then if I change the query as follows: UPDATE dbo.Posts SET Title = CASE WHEN CreationDate <= Canary.TheDate THEN 'A' ELSE 'B' END FROM dbo.Posts CROSS JOIN dbo.Canary The STATISTICS IO output is

Table 'Canary'. Scan count 1, logical reads 1, physical reads 1, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 16787757, physical reads 3127, page server reads 0, read-ahead reads 784795, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 6291, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

(1878895 rows affected)

and the execution plan is here We can see that the number of logical reads has increased massively, from 445k to 16m. It took me some time to find in the execution plan where the source of this is but I tracked it down using the **Actual I/O statistics / Actual Logical Reads** property and can see the extra reads are on the clustered index update operator, I can also see this operator now has a value for **Actual Number of Rows for all executions** whereas the plan for the first query does not. What is happening here? what is happening in the clustered index update that is causing the increase in reads? I appreciate the query should use a variable if this "value in a config table" pattern is to be used, however, this is a query from a vendor app so I want to feedback what is happening as a result of the query being written the way it has.

SE1986 (2182 rep)

Mar 17, 2025, 02:47 PM • Last activity: Mar 18, 2025, 08:21 AM

2 votes

2 answers

237 views

Why does select query change order of rows in Postgresql?

postgresql index clustered-index

```sql test=# create table r (id serial primary key, x int, y int); CREATE TABLE test=# insert into r (x, y) select random() * 1000, random() * 1000 from generate_series(0, 1000000); INSERT 0 1000001 test=# test=# create index r_x_idx on r using btree (x); CREATE INDEX test=# cluster r using r_x_idx...

test=# create table r (id serial primary key, x int, y int);
CREATE TABLE
test=# insert into r (x, y) select random() * 1000, random() * 1000 from generate_series(0, 1000000);
INSERT 0 1000001
test=# 
test=# create index r_x_idx on r using btree (x);
CREATE INDEX
test=# cluster r using r_x_idx ;
CLUSTER
test=# select * from r limit 5;
  id  | x |  y  
------+---+-----
   79 | 0 | 556
 5997 | 0 | 774
 6104 | 0 |  75
 6937 | 0 | 818
 7859 | 0 | 598
(5 rows)

test=# select * from r limit 5 offset 10000;
   id   | x  |  y  
--------+----+-----
 483314 | 10 | 842
 484136 | 10 | 741
 484568 | 10 | 729
 488499 | 10 | 311
 489022 | 10 | 613
(5 rows)

test=# select * from r limit 5;
   id   | x |  y  
--------+---+-----
 330361 | 9 | 614
 330928 | 9 |  48
 331658 | 9 | 712
 332175 | 9 | 448
 332818 | 9 | 920
(5 rows)

As shown in the above, I created table r, then created a b+tree index on column x, and then clustered the table using that index. As expected, when I run select * from r limit 5; I got records with least x value, because the physical order of records are now based on the x column (more precisely, based on the b+tree index on column x). But after running select * from r limit 5 offset 10000;, the order of records are changed, without any insertion. Can someone explain this behavior?

Amir reza Riahi (155 rep)

Mar 6, 2025, 04:16 PM • Last activity: Mar 7, 2025, 08:41 AM

22 votes

3 answers

17147 views

What index to use with lots of duplicate values?

postgresql index performance clustered-index postgresql-9.6 query-performance

Let's make a few assumptions: I have table that looks like this: a | b ---+--- a | -1 a | 17 ... a | 21 c | 17 c | -3 ... c | 22 Facts about my set: * Size of the whole table is ~ 10 10 rows. * I have ~ 100k rows with value `a` in column `a`, similar for other values (e.g. `c`). * That means ~ 100k...

                                  Let's make a few assumptions:

I have table that looks like this:


     a | b
    ---+---
     a | -1
     a | 17
      ...
     a | 21
     c | 17
     c | -3
      ...
     c | 22

Facts about my set:

* Size of the whole table is ~ 1010 rows.

* I have ~ 100k rows with value a in column a, similar for other values (e.g. c).

* That means ~ 100k distinct values in column 'a'.

* Most of my queries will read all or most of the values for a given value in a, e.g. select sum(b) from t where a = 'c'.

* The table is written in such a way that consecutive values are physically close (either it's written in order, or we assume CLUSTER  was used on that table and column a).

* The table is rarely if ever updated, we're only concerned about read speed.

* The table is relatively narrow (say ~25 bytes per tuple, + 23 bytes overhead).

Now the question is, what kind of index should I be using? My understanding is:

* **BTree** My issue here is that the BTree index will be huge since as far as I know it will store duplicate values (it has to, since it can't assume the table is physically sorted). If the BTree is huge, I end up having to read both the index and the parts of the table that the index points to. (We can use fillfactor = 100 to decrease the size of the index a bit.)

* **BRIN** My understanding is that I can have a small index here at the expense of reading useless pages. Using a small pages_per_range means that the index is bigger (which is a problem with BRIN since I need to read the whole index), having a big pages_per_range means that I'll read a lot of useless pages. Is there a magic formula to find a good value of pages_per_range that takes into account those trade-offs?

* **GIN/GiST** Not sure those are relevant here since they're mostly used for full-text search, but I also hear that they're good at dealing with duplicate keys. Would either a GIN or GiST index help here?

Another question is, will Postgres use the fact that a table is CLUSTERed (assuming no updates) in the query planner (e.g. by binary searching for the relevant start/end pages)? Somewhat related, can I just store all my columns in a BTree and drop the table altogether (or achieve something equivalent, I believe those are clustered indices in SQL server)? Is there some hybrid BTree/BRIN index that would help here?

I'd rather avoid using arrays to store my values since my query will end up less readable that way (I understand this would reduce the cost of the 23 bytes per tuple overhead by reducing the number of tuples).
                                

foo (323 rep)

Mar 10, 2017, 03:24 PM • Last activity: Nov 21, 2024, 09:12 PM

1 votes

2 answers

238 views

Heap with Non Clustered Index vs. Non-Sequential Clustered Index

sql-server performance clustered-index sql-server-2022 heap

I work with an enterprise application (code not available) that currently uses heaps. It uses a terrible pseudo-sequential varchar key that ends up not being sequential due to string sorting not taking length into account (ex. ABCD is sorted after ABC but before BCD). For SELECT performance, we stil...

                                  I work with an enterprise application (code not available) that currently uses heaps. It uses a terrible pseudo-sequential varchar key that ends up not being sequential due to string sorting not taking length into account (ex. ABCD is sorted after ABC but before BCD).

For SELECT performance, we still have covering non-clustered indexes on the key column (among many other non-clustered indexes for other columns).

My first question is, given that INSERT, UPDATE, and DELETE (rare) still need to update both the heap and the non-clustered index, is there any performance benefit in using the heap, or would a clustered index (even on this non-sequential key) have the same 'relative' performance?

My reasoning is that while, yes, inserts into the heap are faster for non-sequential keys, the non-clustered index will also require updating which would take (I believe) an equivalent time - but at least all of the other benefits of having a clustered index will be available.

My second question is what will fragmentation look like with such a pseudo-sequential key? Will SQL-Server continually fragment the clustered index? Or is the fact that inserts are mostly sequential helpful to keep fragmentation low?

Andrew Hanlon (163 rep)

Nov 19, 2024, 05:34 PM • Last activity: Nov 21, 2024, 03:16 PM

0 votes

2 answers

749 views

Understanding of the %%lockres%% and %%physloc%% pseudo columns with index hint

sql-server index clustered-index database-internals

I am experimenting with clustered and non-clustered indexes in SQL Server and I have noticed something interesting. Here is a simplified version of my T-SQL code: IF OBJECT_ID('dbo.dept') IS NOT NULL DROP TABLE dbo.dept; CREATE TABLE dept(id INT, name NVARCHAR(20), address NVARCHAR(50)); GO INSERT I...

                                  I am experimenting with clustered and non-clustered indexes in SQL Server and I have noticed something interesting.

Here is a simplified version of my T-SQL code:

    IF OBJECT_ID('dbo.dept') IS NOT NULL DROP TABLE dbo.dept;
    CREATE TABLE dept(id INT, name NVARCHAR(20), address NVARCHAR(50));
    GO
    
    INSERT INTO dept(id, name, address) VALUES (1, 'Dept 1', 'Sarjapur Road');
    INSERT INTO dept(id, name, address) VALUES (2, 'Dept 2', 'Whitefield');
    INSERT INTO dept(id, name, address) VALUES (3, 'Dept 3', 'Electronic City');
    INSERT INTO dept(id, name, address) VALUES (4, 'Dept 4', 'Koramangala');
    GO
    
    CREATE CLUSTERED INDEX cl ON dbo.dept(id);
    CREATE INDEX ncl ON dbo.dept(address);
    GO
    
    SELECT *, %%lockres%% lock, %%physloc%% physloc, sys.fn_PhysLocFormatter(%%physloc%%) formatted
    FROM dbo.dept WITH (NOLOCK, INDEX (cl))
    
    SELECT *, %%lockres%% lock, %%physloc%% physloc, sys.fn_PhysLocFormatter(%%physloc%%) formatted
    FROM dbo.dept WITH (NOLOCK, INDEX (ncl))

And here is the result:

    +----+--------+-----------------+----------------+--------------------+------------+
    | id | name   | address         | lock           | physloc            | formatted  |
    +----+--------+-----------------+----------------+--------------------+------------+
    |  1 | Dept 1 | Sarjapur Road   | (de42f79bc795) | 0xB01F000004000000 | (4:8112:0) |
    |  2 | Dept 2 | Whitefield      | (9d6bf8154a2a) | 0xB01F000004000100 | (4:8112:1) |
    |  3 | Dept 3 | Electronic City | (052c8c7d9727) | 0xB01F000004000200 | (4:8112:2) |
    |  4 | Dept 4 | Koramangala     | (1a39e6095155) | 0xB01F000004000300 | (4:8112:3) |
    +----+--------+-----------------+----------------+--------------------+------------+

    +----+--------+-----------------+----------------+--------------------+-----------+
    | id | name   | address         | lock           | physloc            | formatted |
    +----+--------+-----------------+----------------+--------------------+-----------+
    |  3 | Dept 3 | Electronic City | (b64f1cd4ff4f) | 0x1800000003000000 | (3:24:0)  |
    |  4 | Dept 4 | Koramangala     | (4471456166ef) | 0x1800000003000100 | (3:24:1)  |
    |  1 | Dept 1 | Sarjapur Road   | (7948805432b9) | 0x1800000003000200 | (3:24:2)  |
    |  2 | Dept 2 | Whitefield      | (584262fe5906) | 0x1800000003000300 | (3:24:3)  |
    +----+--------+-----------------+----------------+--------------------+-----------+

As you can see, the lock and physloc for the two resultsets are very different. In the past, I always believed that these pseudo columns are revealing information about the dbo.dept table (the clustered index or the heap), since I'm selecting from it. But this testing seems proved that they are showing information about how the data is accessed. Is my understanding correct? 
                                

Fajela Tajkiya (1239 rep)

Oct 5, 2023, 03:42 PM • Last activity: Jul 5, 2024, 09:57 AM

3 votes

3 answers

516 views

Dropping a unique clustered index on identity colum

sql-server index-tuning clustered-index

I'm currently working with a SQL Server database that came from an inherited system. The previous system used a lot of auto incrementing bigint identity fields, let's call the field "KeyID". These KeyID columns are interspersed throughout the database but in our implementation of the app, KeyID is b...

                                  I'm currently working with a SQL Server database that came from an inherited system. The previous system used a lot of auto incrementing bigint identity fields, let's call the field "KeyID". These KeyID columns are interspersed throughout the database but in our implementation of the app, KeyID is being ignored. There are values for the fields in the tables, but KeyID is not used in any way.

I've noticed that for each table that has a KeyID field, a unique clustered index has been been created for it, IX_Tablename.

I am considering dropping these indices to optimize the number of indices the server needs to maintain, but am not sure what the long term effects will be. The resources I find online seem to assume, I believe, that the identity column is the primary key, and say that deleting the clustered index for that column will result in the table becoming a heap. But KeyID is not a primary key, the PK's are separate Guid columns with their own unique non-clustered indices. The tables with the KeyID index are clustered on a value that is not referenced and so no seeks are performed on their values.

Is their any advantage to keeping them or should I remove these indices?

I am not ready to delete the columns themselves, so they will still need to be there and auto-increment, will their missing indices affect the creation of the next auto-incremented value?

LeArgus (31 rep)

May 31, 2024, 06:12 PM • Last activity: Jun 1, 2024, 10:51 PM

-1 votes

2 answers

58 views

How important is the "clustered" property of MySQL PK index?

mysql mysql-8.0 clustered-index

I am importing ~50M rows into MySQL 8, InnoDB. It's on AWS RDS with GP3 storage. The unique key of the rows is a uuid-like string. When querying we will never care about this unique key, except when upserting new/modified rows from the primary source. Normally the unique id would be the PK. But I ha...

                                  I am importing ~50M rows into MySQL 8, InnoDB. It's on AWS RDS with GP3 storage.

The unique key of the rows is a uuid-like string.

When querying we will never care about this unique key, except when upserting new/modified rows from the primary source.

Normally the unique id would be the PK. But I have read that PK index in MySQL is special because it aims to 'cluster' the data for similar values, to enhance performance.

It seems that by using uuid-like string as PK the clustering will not help our queries.

If I was to partition the table I would do it by date range.

I could imagine defining a synthetic PK, or a composite PK, that combines the date field and uuid to get a clustering that is more likely to support the queries we actually do.

My question is this: how important is it to have a PK clustering that supports the typical queries (i.e. fetched results likely to be 'close' in the index)?

Presumably the typical case of an auto-incrementing id for PK also results in clustering that has little relation to typical queries (often no reason to select adjacent ids).

I am thinking specifically about whether modern SSD storage makes this type of optimisation less important, obsolete... or even more important?

### More context

https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html 

> #### How the Clustered Index Speeds Up Queries
> Accessing a row through the
> clustered index is fast because the index search leads directly to the
> page that contains the row data. If a table is large, the clustered
> index architecture often saves a disk I/O operation when compared to
> storage organizations that store row data using a different page from
> the index record.

It seems like the "clustered"-ness of the PK index is only of value for queries which select by PK.

It's about co-locating the row data with the index (?)

So if all the application queries that I care about use secondary indexes I guess it doesn't really matter what the properties of the PK are? e.g. including a date partition column in the PK isn't going to magically speed up queries using a different index.

Is that right?

Anentropic (558 rep)

Apr 20, 2024, 07:48 AM • Last activity: Apr 28, 2024, 06:59 PM

0 votes

1 answers

24 views

clustered index detection with EXPLAIN statement

mysql clustered-index

I created two tables with one-to-one relationship as below, CREATE TABLE if not exists User ( id bigint AUTO_INCREMENT, coins bigint not null, level bigint not null, PRIMARY KEY (id) ); CREATE TABLE if not exists Claimingcoins ( userid bigint UNIQUE NOT NULL, claimingCoins bigint ); ALTER TABLE Clai...

                                  I created two tables with one-to-one relationship as below,


    CREATE TABLE if not exists User (
        id bigint AUTO_INCREMENT,
        coins bigint not null,
        level bigint not null,
        PRIMARY KEY (id)
    );
    
    CREATE TABLE if not exists Claimingcoins (
        userid bigint UNIQUE NOT NULL,
        claimingCoins bigint
    );

    ALTER TABLE Claimingcoins ADD CONSTRAINT fk_user_id FOREIGN KEY (userid) REFERENCES User (id);

Now I run EXPLAIN query on some statements to see if CLUSTERED INDEX is set on both these entities without me having to set it. I use mysql 8.0.28, I suppose it sets it by default. Does it also set it for UNIQUE KEY which is referenced like above?


    mysql> EXPLAIN UPDATE Claimingcoins SET claimingCoins = 100 where userid IN(2999999);
    +----+-------------+---------------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+
    | id | select_type | table         | partitions | type  | possible_keys | key    | key_len | ref   | rows | filtered | Extra       |
    +----+-------------+---------------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+
    |  1 | UPDATE      | Claimingcoins | NULL       | range | userid        | userid | 8       | const |    1 |   100.00 | Using where |
    +----+-------------+---------------+------------+-------+---------------+--------+---------+-------+------+----------+-------------+
    1 row in set, 1 warning (0,00 sec)
    
    mysql> EXPLAIN SELECT id,coins,level FROM User WHERE id IN (29999, 299, 299999);
    +----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    | id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
    +----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    |  1 | SIMPLE      | User  | NULL       | range | PRIMARY       | PRIMARY | 8       | NULL |    3 |   100.00 | Using where |
    +----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+

I ran the EXPLAIN statements for the queries and I get the results above. How can I understand by looking at this table if clustered index is set? I see possible_keys and key values are exposed, so I believed those should be the one used for CLUSTERED INDEXES, right? 



                                

umarkaa (47 rep)

Apr 4, 2024, 05:58 AM • Last activity: Apr 4, 2024, 09:08 AM

4 votes

4 answers

6502 views

Difference between a unique clustered index and a clustered primary key?

sql-server primary-key clustered-index

I'd like to ask two supplementary/follow-on questions, further to this previous/existing question: [Is there any tangible difference between a unique clustered index and a clustered primary key?](https://dba.stackexchange.com/q/771/105622) - That question starts with, "I understand that there may be...

                                  I'd like to ask two supplementary/follow-on questions, further to this previous/existing question: [Is there any tangible difference between a unique clustered index and a clustered primary key?](https://dba.stackexchange.com/q/771/105622) 

- That question starts with, "I understand that there may be a difference in meaning or intent between the two". I'm a programmer not a DBA, and this fundamental might be unclear to me: what *is* the difference in meaning or intent between the two?

- My summary of the accepted answer, i.e. its most important statement IMO, is that it says, "I don't think there's any difference". If that's so then why did Microsoft implement "clustered indexes"? Why not just say instead, "It's always clustered on the primary key, and you should define as the primary key whatever you want it to be clustered on"?

It seems to me that a primary key (already) is a unique clustered index.

---

Furthermore, here's a specific problem by way of example.

Let's say I have a table of Users (with a userId as its primary key), and a second table (e.g. Items) which defines items owned by each user. A user can own many items; each item is owned by one user, and has an itemId.

So the itemId could be the primary key of the Items table; and each row in the Items table (which has an itemId) also has a userId to identify its owner.

That's a good way to define a 1-N relationship, isn't it? Assume a foreign key contraint on userId, with Users being the parent table.

At run-time I usually want to retrieve all the items owned by a user, therefore the Items table should be clustered on its userId column.

    [Users]
      userId
      + plus other user-specific fields
    
    [Items]
      userId
      itemId
      + plus other item-specific fields

I think there are two ways to define this Items table:

- itemId is **primary non-clustered key**, and (userId,itemId) is **unique clustered index**

... or:

- (userId,itemId) is **primary clustered key**, and itemId is **unique non-clustered index**.

Which of the above two is better or more correct, semantically and/or practically, and why?

---

In case it makes a difference, the itemID is an artificial key: its purpose is to disambiguate/identify the item (and/or identify the item, within the set of items owned by the user).

- 'Physically' the itemID is probably globally unique (or unique within the table, anyway): because databases make it easy to create a globally-unique artificial key.

- 'Logically' I wouldn't mind if it were not-globally-unique, but were instead only unique-within-each-user, such that I needed both userId and itemId to uniquely identify an item, i.e.

  (userId,itemId) is **primary clustered key** and itemId is **unique non-clustered index**.

So I think it isn't altogether wrong to see (userId,itemId) as a composite primary key?

Apparently it's fine and normal to use "two separate attributes" as [the primary key of an associative table](https://en.wikipedia.org/wiki/Associative_entity#Using_associative_tables)  when it's an N-to-N relationship. Is it wrong (e.g. harmful in some way, for some reason) to use two attributes as the primary key of an object in a 1-to-N relationship? Is it wrong to say that the owner ID is part of the object's identity?

ChrisW (187 rep)

Sep 12, 2016, 07:31 PM • Last activity: Mar 29, 2024, 05:03 PM

0 votes

2 answers

249 views

Is default ordering affected by clustered index?

sql-server select order-by clustered-index

My colleague and I are debating whether the default ordering of rows from a `SELECT` statement (without specifying any `ORDER BY`) is controlled by the clustered index or not. He demoed this by creating a clustered index on a column with `DESC`, then `SELECT` would return the result according to tha...

                                  My colleague and I are debating whether the default ordering of rows from a SELECT statement (without specifying any ORDER BY) is controlled by the clustered index or not.

He demoed this by creating a clustered index on a column with DESC, then SELECT would return the result according to that column as if there is a hidden ORDER BY column DESC.

But I doubt this holds true as the table grows, even given we only ever INSERT into this table and never any UPDATE nor DELETE, however I can't prove this to him.

Can anyone show a demo disproving this?

We were testing this on Microsoft SQL Server, but the question is for any DBMS.

In my case, there is only just one clustered index on the table.

user1589188 (151 rep)

Mar 25, 2024, 02:25 PM • Last activity: Mar 27, 2024, 09:24 AM

0 votes

1 answers

205 views

In Postgres, can I CLUSTER by multiple indexes?

postgresql table clustered-index

I want to order the records in my `tasks` table according to `completed` and then according to `workOrderID`, just like I can to with the `ORDER BY` clause of a `SELECT` query, so that this | taskID | workOrderID | completed | | -------- | -------------- | --- | | 1 | 1 | t | | 2 | 1 | f | | 3 | 3 |...

                                  I want to order the records in my tasks table according to completed and then according to workOrderID, just like I can to with the ORDER BY clause of a SELECT query, so that this

| taskID | workOrderID | completed |
| -------- | -------------- | --- |
| 1    | 1            | t |
| 2   | 1            | f |
| 3   | 3            | t |
| 4   | 2            | t |
| 5   | 9            | t |
| 6   | 4            | f |
| 7   | 5            | f |
| 8   | 4            | t |
| 9   | 1            | f |

is stored like this

| taskID | workOrderID | completed |
| -------- | -------------- | --- |
| 1    | 1            | t |
| 4   | 2            | t |
| 3   | 3            | t |
| 8   | 4            | t |
| 5   | 9            | t |
| 2   | 1            | f |
| 9   | 1            | f |
| 6   | 4            | f |
| 7   | 5            | f |

When I try, I get a syntax error, and I don't see that functionality mentioned in the docs.  I think this could speed up some queries.
                                

aswine (153 rep)

Feb 9, 2024, 08:11 PM • Last activity: Feb 10, 2024, 11:45 PM

2 votes

3 answers

557 views

Azure SQL index rebuild after huge size reduction

azure-sql-database clustered-index size index-rebuild

We have a table in an Azure SQL database that used to have an nvarchar(max) column storing pdf files. (The wonders of external developers.) The table grew to 156 GB. It has 476000 rows. After changing the logic, we no longer needed the pdf column. The only reasonable way to get rid of the data in it...

                                  We have a table in an Azure SQL database that used to have an nvarchar(max) column storing pdf files. (The wonders of external developers.) The table grew to 156 GB. It has 476000 rows.
After changing the logic, we no longer needed the pdf column. The only reasonable way to get rid of the data in it was to drop the column and recreate the column (in case some weird process was still referencing it).

However, the table size is still reported as 156 GB.
The backup table I just created (SELECT INTO) is 128 MB, so that seems to be the real size of the data.

I let an index rebuild (ONLINE) run overnight on the clustered PK index. It failed with a TCP error sometime between 8 and 12 hours. The index is still 95% fragmented, the size is still reported as 156 GB.

Is there an explanation why this is so eye-wateringly slow? **Is there a better way?** Production database, table is used by a website, has to be accessible, so can't do it OFFLINE unless it takes less than 10 minutes - which nobody can guarantee.

Can I just *build all the indexes on the backup table, drop the original table and rename the backup?* That sounds risky (small risk of losing a record created just the wrong time).

-----

I'm trying to make Azure realize it is no longer used. Allocated, I'm OK with that. Used, not so much:

The table in question:

Again, it's not the reserved space that is the issue, it is the used space.

vacip (133 rep)

Jan 13, 2024, 08:30 AM • Last activity: Jan 17, 2024, 01:16 PM

Showing page 1 of 20 total questions