
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

1 vote
1 answer
148 views
What does column family-oriented datastore mean in practice
I was reading this article about HBase. Within it, the storage format of HBase is described: *HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup.* So what does column family-oriented mean? Is it the same as column-oriented? In that case, how can it be both column family-oriented and row-oriented? Row-oriented storage looks like this:

001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;

Column-oriented storage looks like this:

10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;

So how is data actually stored in HBase? Do you have an example?
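A loose relational sketch of the idea (the table and column names here are illustrative only, and HBase's real on-disk format, sorted KeyValue files per family, is more involved):

-- One store per column family. Within a family, every cell is a
-- (row key, column qualifier, timestamp) -> value entry, kept sorted by
-- row key: data is split by family (column-family-oriented), yet ordered
-- and looked up by row key (row-oriented).
CREATE TABLE hbase_cf_personal
(
    row_key    VARCHAR(64)  NOT NULL,  -- e.g. '001'
    qualifier  VARCHAR(64)  NOT NULL,  -- e.g. 'last_name', 'salary'
    ts         BIGINT       NOT NULL,  -- cell version
    cell_value VARCHAR(255) NULL,
    PRIMARY KEY (row_key, qualifier, ts)
);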
bsky (111 rep)
Dec 6, 2017, 11:29 PM • Last activity: Jul 18, 2025, 01:03 PM
0 votes
1 answer
150 views
Error loading data in MariaDB Columnstore
I have a MariaDB ColumnStore database running on the cloud in Linux. I'm trying to export a CSV file from my Windows machine to the database. I created the file with DBeaver, setting the new line as \n. When I execute the cpimport command (provided by MariaDB), I get the following error:

line 2: $'\r': command not found

I cannot figure out where the problem is: in the CSV file or in cpimport? This is how I import into MariaDB:

sudo cpimport -s '|' -E '"' -C '\n' my_database dates dates.csv
ps0604 (51 rep)
Jul 27, 2021, 12:49 PM • Last activity: Jul 15, 2025, 10:05 PM
5 votes
3 answers
2294 views
How do Ola Hallengren's index maintenance scripts handle columnstore indexes?
I'd like to know how Ola's script handles columnstore indexes. I couldn't find any information about that; only the version history mentions that the script has support for them. My experience is that the script didn't do anything to a fragmented columnstore index. Is that by design?
VeePee (53 rep)
Aug 28, 2018, 07:41 AM • Last activity: Jun 21, 2025, 07:27 PM
1 vote
1 answer
38 views
How to efficiently manage data retention in MariaDB ColumnStore?
I’m using MariaDB 10.11+ with ColumnStore and have created a ColumnStore table to handle large volumes of analytical data. My use case requires keeping only the last 90 days of data. The table includes a created_at column of type DATETIME for tracking data age. However, I’ve run into the following limitations:

- ColumnStore does not support user-defined partitioning, making time-based partitioning difficult.
- Deleting old records does not reclaim disk space (deletes only write metadata).
- The documentation suggests workarounds like creating daily tables or truncating and recreating tables with only the required data, which feels inefficient and operationally complex.

My question is: is there a more effective or officially recommended way to manage the data lifecycle and enforce data retention in MariaDB ColumnStore, preferably one that allows reclaiming disk space and minimizes operational overhead? Any best practices, tools, or workflows others are using would be greatly appreciated.
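For illustration, a minimal sketch of the recreate-and-swap pattern the documentation hints at, assuming a table named events and that CREATE TABLE ... LIKE carries over the ColumnStore definition (both the names and that assumption are mine):

CREATE TABLE events_new LIKE events;

INSERT INTO events_new
SELECT * FROM events
WHERE created_at >= NOW() - INTERVAL 90 DAY;  -- keep only the retention window

RENAME TABLE events TO events_old, events_new TO events;  -- near-atomic swap

DROP TABLE events_old;  -- dropping the table releases its extents on disk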
Amalka (13 rep)
Jun 3, 2025, 06:58 AM • Last activity: Jun 11, 2025, 02:55 PM
0 votes
1 answer
33 views
Insufficient memory issue in MariaDB 10.5 ColumnStore
I’m using MariaDB 10.5.29 with the ColumnStore engine configured. I’ve noticed that memory usage keeps increasing over time and is not released after query execution completes. Eventually, queries that previously worked start failing, especially ones involving large datasets. After restarting the MariaDB service, the same queries work again temporarily, until the memory (held by PrimProc) accumulates once more. Has anyone encountered a similar issue with memory not being freed in ColumnStore? Are there any recommended configurations or best practices to manage or release memory properly? Exact error:
ERROR 1815 (HY000): Internal error: IDB-2057: There is insufficient memory to process the query (PrimProc)
Amalka (13 rep)
May 29, 2025, 01:39 AM • Last activity: Jun 3, 2025, 06:41 AM
0 votes
1 answer
257 views
Foreign keys without any checking or index
I have a problem on SQL Server that led me to the question of whether it is possible to define a foreign key that is not enforced, not backed by an index, and does not need the referenced columns to constitute an (enforced) key. Foreign keys do two things:

- They enforce integrity, which I am willing to give up in my scenario, and
- they can be used by the query planner to drop joins that are guaranteed not to change the result set's number of rows.

The last thing is what I'd still want; the integrity I'll enforce at the application layer. (Note that what you can have the database enforce is more limited than what you could have as a droppable join condition. A frequent case is one where you'd want to specify a foreign key with ...foreign key (some_constant, almost_key) references... where some_constant is an illegal immediate constant.) The reason why I need this is a SQL Server-specific problem: it produces bad query plans in the presence of such an index when the clustered index is a columnstore index (at least I saw one case of such a bad query plan). Since the feature of combining a clustered columnstore index with nonclustered btree indexes is brand-new, I take it that there are still some issues they need to iron out. However, I'm also curious whether SQL databases generally always demand foreign keys to be backed by indexes on the foreign and/or the domestic side, or whether there are some that take them as hints but allow the user to enforce them.
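For what it's worth, the closest SQL Server gets is a disabled (untrusted) foreign key; a minimal sketch with hypothetical table names:

-- SQL Server still requires a primary key or unique constraint on the
-- referenced side, so only the enforcement half can be relaxed:
ALTER TABLE dbo.Child WITH NOCHECK
    ADD CONSTRAINT FK_Child_Parent
    FOREIGN KEY (ParentId) REFERENCES dbo.Parent (Id);

ALTER TABLE dbo.Child NOCHECK CONSTRAINT FK_Child_Parent;

-- Caveat: an untrusted constraint is precisely what the optimizer will not
-- use for join elimination, so this trades away the half the question wants to keep.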
John (775 rep)
Apr 7, 2017, 11:16 AM • Last activity: Jun 1, 2025, 10:04 AM
3 votes
1 answer
192 views
How does row versioning impact the size of a non-clustered columnstore index?
I have a table with a non-clustered columnstore index. The database has a readable secondary AG replica ([filled with problems...](https://dba.stackexchange.com/q/346216/277982)) and the table has triggers. Neither snapshot isolation nor RCSI is enabled. I infer that this means the table must be paying the 14-bytes-per-row penalty for row versioning. However, I generally regard compressed columnstore segments as immutable. This gives me my question: **how does row versioning impact the size of a non-clustered columnstore index?** I've consulted the [documentation for row versioning](https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-locking-and-row-versioning-guide?view=sql-server-ver16), but it mentions columnstore only once, and in an irrelevant way. I have also read [Niko Neugebauer's entire 131-part series](https://www.nikoport.com/columnstore/), but I do not recall the topic ever being mentioned there, even in [this very relevant part](https://www.nikoport.com/2015/09/22/columnstore-indexes-part-67-clustered-columstore-isolation-levels-transactional-locking/). This [blog post](https://sql-sasquatch.blogspot.com/2021/06/sqlserver-snapshot-isolation-level.html) suggests that snapshot isolation inflicts a massive performance penalty on columnstore, but does not explain the internals. You may assume that I do not care about the in-memory OLTP version of columnstore.
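One way to observe the size impact empirically is to compare rowgroup sizes before and after versioned DML; a diagnostic sketch (dbo.My_NCCI_Table is a placeholder):

SELECT rg.index_id, rg.row_group_id, rg.state_desc,
       rg.total_rows, rg.deleted_rows, rg.size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
WHERE rg.object_id = OBJECT_ID(N'dbo.My_NCCI_Table')
ORDER BY rg.index_id, rg.row_group_id;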
J. Mini (1225 rep)
Apr 18, 2025, 10:14 PM • Last activity: May 13, 2025, 05:41 PM
2 votes
2 answers
93 views
When reading from a columnstore index, is PERSIST_SAMPLE_PERCENT ignored in UPDATE STATISTICS?
I have a partitioned table with a rowstore primary key and a nonclustered columnstore index. The leading key of both (not that key order matters for columnstore) is the same. When updating statistics, SQL Server regularly chooses to scan the columnstore index. This takes far too long for my taste, so I tried
UPDATE STATISTICS My_Table
WITH PERSIST_SAMPLE_PERCENT = ON, SAMPLE 3 PERCENT
To my surprise, the statistics updates still took a very long time. The execution plans very much suggest that the columnstore index is being scanned in the same way as it previously was. So, I have my question: when reading from a columnstore index, is PERSIST_SAMPLE_PERCENT ignored in UPDATE STATISTICS? Automatic statistics updates seem to share this issue. Update: I suspect that it might actually be the deleted bitmap that causes all of this to go wrong.
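A quick check of whether the sample percent was persisted at all, as a diagnostic sketch (My_Table is the question's table):

SELECT s.name, sp.last_updated, sp.rows, sp.rows_sampled, sp.persisted_sample_percent
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.My_Table');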
J. Mini (1225 rep)
Mar 29, 2025, 12:33 AM • Last activity: Apr 20, 2025, 06:39 PM
0 votes
0 answers
125 views
Why would an online columnstore rebuild be much faster than a reorganize on a very fragmented table?
On a table with lots of deletes spanning the past six months, I have an **extremely** fragmented non-clustered columnstore index. It is about 500 GB and has 50 partitions. If I either rebuild it online, re-create it with DROP_EXISTING, or just drop and recreate it the hard way, it goes down to 80 GB. When testing this, I've observed the following:

* If I drop the columnstore index and recreate it offline (no DROP_EXISTING, just drop and create), it takes one hour.
* If I re-create it with DROP_EXISTING offline, it takes two hours.
* If I rebuild the columnstore index online, it takes two hours.
* If I reorganize the columnstore index without any special settings, I get bored and quit after five hours and see no evidence that the size of the index has decreased.
* If I reorganize the columnstore index with COMPRESS_ALL_ROW_GROUPS = ON, then it completes after four and a half hours. The index does not get smaller. It gets 20 GB **bigger**.

I'm not doing anything smart with the partitions. I'm using a one-liner to hit the entire index. I am on SQL Server 2022, Enterprise Edition. The clustered index is 2 TB rowstore. The biggest column in the table is 8 bytes. The non-clustered index only consists of two columns, one of which is the partitioning column. The biggest column in the index is an int. I have seen no evidence that the tuple mover is disabled, so I cannot explain how the fragmentation got this bad. My only guess is that I happen to know that the rows were deleted based on the column that is not the partitioning column, so it could be that the alignment of the columnstore index is so terrible that no rowgroups crossed [whatever threshold it is that the background merge task uses](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver16#reorganize-an-index) (I am unsure whether that merge is part of the tuple mover or not). I have no idea why REORGANIZE made the columnstore index bigger. [Niko Neugebauer said during the pre-release for SQL Server 2016](https://www.nikoport.com/2015/12/25/columnstore-indexes-part-74-row-groups-merging-cleanup-sql-server-2016-edition/) that you have to invoke the tuple mover twice to really delete from a non-clustered columnstore index, but I did not think that behaviour even lasted until the real release of 2016 (I'm far past that, I'm on SQL Server 2022). My question is this: **Why would rebuilding a columnstore index online be dramatically faster than reorganizing it?** Would it have anything to do with the extreme fragmentation? For what it's worth, this same table has me suspicious of SQL Server [bugs](https://dba.stackexchange.com/q/345635/277982) such as [this](https://dba.stackexchange.com/q/345953/277982).
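If the one-liner itself is part of the problem, a partition-at-a-time rebuild is the usual way to spread the work out; a sketch with placeholder names:

-- Rebuild one partition at a time, online, instead of hitting the whole index:
ALTER INDEX NCCI_BigTable ON dbo.BigTable
REBUILD PARTITION = 17
WITH (ONLINE = ON);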
J. Mini (1225 rep)
Apr 16, 2025, 07:04 PM • Last activity: Apr 19, 2025, 12:27 AM
2 votes
1 answer
72 views
performance overhead of writing mostly NULL values in clustered columnstore index
We have a columnstore table with over 2 billion records, and we want to add 3 new columns that are very sparse. Maybe 0.01% of records will have these fields populated (all int fields). It already has 75 columns. We insert/update about 20 million records per day into this table. I understand that storage is not an issue because the data will compress efficiently and take up little space. My main concern is writing to this table: it's already wide, and I think adding more fields will impact write performance. Am I correct in this assessment? It still has to write to the delta store and compress. The other approach is to create a new rowstore table for these fields that are seldom populated (and used) and just join between the two when needed.

---

I partially regret that we have 75 columns. Part of it relates to columns that aren't used much. We use change tracking to load from a different system hourly. Sometimes batches are 2 million rows, sometimes 100k, etc. I guess the main question is: if adding a new column that is usually 100% populated makes a load take 1 additional minute, would adding a new column that is usually only 0.01% populated also take 1 additional minute, or much less?
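For reference, the side-table alternative mentioned above might look like this (all names hypothetical):

-- A narrow rowstore table holding only the ~0.01% of rows that have values;
-- LEFT JOIN from the columnstore fact table when these columns are needed.
CREATE TABLE dbo.FactSparseExtras
(
    FactId  bigint NOT NULL PRIMARY KEY,  -- same key as the wide fact table
    NewIntA int NULL,
    NewIntB int NULL,
    NewIntC int NULL
);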
Gabe (1396 rep)
Mar 7, 2025, 02:55 PM • Last activity: Mar 11, 2025, 08:59 AM
2 votes
1 answer
501 views
Why is it considered best practice to partition columnstore tables?
Prior to SQL Server 2016, partitioning columnstore indexes was considered pretty much mandatory because the locks taken on them during inserts/updates/deletes were extreme. However, as of SQL Server 2016, the locks taken on columnstore indexes are [much less intense](https://web.archive.org/web/20160603180253/http://www.nikoport.com:80/2015/09/22/columnstore-indexes-part-67-clustered-columstore-isolation-levels-transactional-locking). Why, then, is it still considered best practice to partition them? I am aware of [the trick](https://web.archive.org/web/20170224141946/http://www.nikoport.com/2014/12/02/clustered-columnstore-indexes-part-45-multi-dimensional-clustering/) where you can improve alignment by switching out a partition, building a clustered rowstore index, replacing that index with columnstore, and then switching the partition back in. However, that's just a mild performance optimisation that rarely survives a rebuild and was largely made redundant by [SQL Server 2022's ordering features](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/ordered-columnstore-indexes?view=sql-server-ver16) .
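For concreteness, the switch-out trick mentioned above reduces to roughly these moves (a sketch; the table, staging table, and sort column are placeholders):

-- 1. Move one partition to an empty staging table with an identical definition:
ALTER TABLE dbo.Fact SWITCH PARTITION 5 TO dbo.Fact_Staging;

-- 2. Sort the data by building a clustered rowstore index:
CREATE CLUSTERED INDEX CX_Fact_Staging ON dbo.Fact_Staging (SortColumn);

-- 3. Replace it with a columnstore index, preserving that order:
CREATE CLUSTERED COLUMNSTORE INDEX CX_Fact_Staging ON dbo.Fact_Staging
WITH (DROP_EXISTING = ON);

-- 4. Switch the partition back in:
ALTER TABLE dbo.Fact_Staging SWITCH TO dbo.Fact PARTITION 5;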
J. Mini (1225 rep)
Jan 20, 2025, 07:29 AM • Last activity: Jan 20, 2025, 01:30 PM
1 vote
1 answer
1027 views
What is the most performant way to group by and join high cardinality columns in column based table?
I currently have a SQL Server, 20 column fact table that could grow even wider into the hundreds. The current row count is about 9 million records. The indexing strategy is currently row-based. The query joins to every lookup table totaling about a dozen joins (most columns in the table are FKs). This is an analytical table that is generated nightly. Query filters are ad hoc so row-indexing most permutations is not feasible. I would like to explore removing these joins and using a columnstore index, since for the most part, only a handful of filter columns will be specified per query request. **Given I have three tables:** Table 1: SomeFactTable (~9 million records)
| Column         | Description          | Cardinality                   |
| --------       | --------------       | --------------                |
| Id             | int32 PK             | Unique                        |
| Timestamp      | datetime, non-null   | medium                        |
| LookupTableA_ID| int32, non-null, FK  | Medium-high (40k records)     |
| LookupTableB_ID| int32, non-null, FK  | Low (dozens records)          |
| NumberA        | int32, non-null      | Very high (8 million records) |
Table 2: LookupTableA (~40,000 records)
| Column      | Description            | Cardinality    |
| --------    | --------------         | -------------- |
| Id          | int32 PK               | Unique         |
| Description | varchar(255), non-null | high           |
Table 3: LookupTableB (~50 records)
| Column      | Description            | Cardinality    |
| --------    | --------------         | -------------- |
| Id          | int32 PK               | Unique         |
| Description | varchar(255), non-null | high           |
**I have two questions.**

1. How can I combine the benefits of a columnstore index with also grouping on the high-cardinality NumberA column? For example, given this query, it performs in about 100 ms (which is great!). If I uncomment the group by NumberA, then the performance degrades to ~900 ms (good, but I'd like to do better).
SELECT 
  max(Id) Id
  ,[Timestamp] 
FROM 
  [dbo].[SomeFactTable]
group by 
  [Timestamp]  
  --,NumberA
ORDER BY 
  [Timestamp] ASC 
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY
2. Most of the time I need to order by one or more of the lookup tables which requires a join. Sometimes these tables have many records (e.g. LookupTableA) and sometimes they only have a handful (e.g. LookupTableB). When there are many records, then the performance degrades. When there are only a handful, then the performance does not degrade. How can I maintain high performance when the table has many records? As an example, this is a join with medium cardinality. It also has the high cardinality column from question 1. The query returns in about 2,900 ms.
SELECT 
  max(SomeFactTable.Id) Id, 
  [Timestamp],
  [Description]
FROM 
  dbo.SomeFactTable
  INNER JOIN dbo.LookupTableA ON SomeFactTable.LookupTableA_ID = LookupTableA.Id 
group by 
  [Timestamp], 
  [Description], 
  NumberA
ORDER BY 
  [Description] ASC, 
  [DateAbsent] ASC 
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY
This is a join with low cardinality. It also has the high cardinality column from question 1. The query returns in about 1,000 ms (and returns even much faster if NumberA wasn't grouped).
SELECT 
  max(SomeFactTable.Id) Id, 
  [Timestamp],
  [Description]
FROM 
  dbo.SomeFactTable
  INNER JOIN dbo.LookupTableB ON SomeFactTable.LookupTableB_ID = LookupTableB.Id 
group by 
  [Timestamp], 
  [Description], 
  NumberA
ORDER BY 
  [Description] ASC, 
  [DateAbsent] ASC 
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY
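Since the question is about exploring a columnstore index, a hedged starting point over the filter and group columns (the index name is mine):

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SomeFactTable
ON dbo.SomeFactTable (Id, [Timestamp], LookupTableA_ID, LookupTableB_ID, NumberA);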
alex (111 rep)
Apr 15, 2021, 11:47 PM • Last activity: Jan 18, 2025, 02:21 PM
0 votes
1 answer
495 views
Reducing Fragmentation on a clustered columnstore index
I have a database with a clustered columnstore index on it. The table has 1,685,123,846 records in it. I use the Ola Hallengren script monthly to rebuild the index, but it takes so long to run before being killed after 8 hours that I wasn't convinced it was actually processing most of the indexes. I manually recreated the index using WITH (DROP_EXISTING = ON); this took 10 hours. After running this I decided to check the fragmentation using this command:
select database_id, object_id, index_id, partition_number, index_type_desc, index_depth, index_level,avg_fragmentation_in_percent, fragment_count, page_count  from sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('ConsumerReading'), null, null, null)
It's showing me the following data:

|database_id|object_id|index_id|partition_number|index_type_desc|index_depth|index_level|avg_fragmentation_in_percent|fragment_count|page_count|
|---:|---:|---:|---:|:---|---:|---:|---:|---:|---:|
|5|626101271|1|1|CLUSTERED INDEX|2|0|97.59036144578313|166|166|
|5|626101271|1|1|CLUSTERED INDEX|3|0|0.9500950095009502|1258|9999|
|5|626101271|2|1|NONCLUSTERED INDEX|5|0|2.9880120735355384|815946|9788682|
|5|626101271|3|1|NONCLUSTERED INDEX|4|0|0.17827134824641383|808018|14853761|
|5|626101271|4|1|NONCLUSTERED INDEX|4|0|6.488527005131855|546542|3983354|

Index id 1 is the index I recreated, so I think that's as defragmented as it can get. However, it's showing a fragmentation of 97%. And that index is in there twice... the second time it's only 0.95%. The issue I have is that if I rebuild the index and then run the Ola Hallengren script on it straight away, it thinks it needs to be rebuilt again, and unless I'm missing something it clearly shouldn't need to be. My question is: why is that index in there twice? Is there a trick I'm missing with the Ola Hallengren script and columnstore indexes? I'm running this in Azure, so I don't *think* partitioning the columnstore index makes sense, but I could be wrong. Any advice would be appreciated!

[edit] The table has the following indexes on it:
CREATE CLUSTERED COLUMNSTORE INDEX IX_CONSUMER
ON ConsumerReading;

GO

CREATE UNIQUE NONCLUSTERED INDEX IX_CONSUMER_PRIMARYKEY ON [dbo].ConsumerReading
(
	[SensorReading] ASC,
	[VesselId] ASC,
	[Name] ASC
)

GO

CREATE NONCLUSTERED INDEX [IX_Consumer_ConsumerTypes] ON [dbo].[ConsumerReading]
(
	[SensorReading] ASC,
	[VesselId] ASC
)
INCLUDE ( 	[Name],
	[ConsumerType]) 
WHERE ([ConsumingFuel]=(1))
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_ConsumerReading_ByVesselAndReadingDate] ON [dbo].[ConsumerReading]
(
	[VesselId] ASC,
	[SensorReading] ASC
)
INCLUDE ( 	[Name],
	[ConsumerType],
	[ReturnVolumeFlow],
	[SupplyVolumeFlow],
	[SupplyDensity]) WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
GO
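As a side note on measurement: for the columnstore portion, the rowgroup DMV is usually more telling than sys.dm_db_index_physical_stats, which mostly describes the B-tree structures; a sketch:

SELECT index_id, row_group_id, state_desc, total_rows, deleted_rows, size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('ConsumerReading')
ORDER BY index_id, row_group_id;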
Ross Dargan (101 rep)
Nov 12, 2023, 11:56 AM • Last activity: Jan 7, 2025, 03:04 PM
4 votes
1 answer
97 views
Why are grouped aggregate functions on a columnstore index fast?
Suppose that a table only contains integers and has a clustered columnstore index.
CREATE TABLE cci_account
(  
    AccountKey int NOT NULL,  
    UnitSold int,
    INDEX ColStore CLUSTERED COLUMNSTORE
);
then it appears that summing one column grouped by another is always fast on such a table, e.g.
SELECT
    AccountKey,
    SUM(UnitSold)
FROM
    cci_account
GROUP BY
    AccountKey;
Why is this so fast? My understanding of the columnstore architecture is that the segments containing AccountKey have no idea what UnitSold value is on their rows. How, then, is SQL Server able to so quickly group by AccountKey? I cannot imagine any algorithm that would achieve this.
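One hedged way to feel out how much work the scan itself absorbs: aggregate pushdown generally requires the aggregate to sit directly on a base column, so wrapping the column in an expression (worth verifying against the actual plan) gives a comparison point:

-- Adding a no-op expression typically disables aggregate pushdown,
-- forcing every row out of the scan and into the hash aggregate:
SELECT
    AccountKey,
    SUM(UnitSold + 0)
FROM
    cci_account
GROUP BY
    AccountKey;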
J. Mini (1225 rep)
Jan 1, 2025, 07:18 PM • Last activity: Jan 2, 2025, 03:55 PM
3 votes
1 answer
489 views
How do nonclustered columnstore indexes in SQL Server handle linked updates and deletes with rowstore tables (either heap or clustered index tables)?
I have read several articles and blog posts describing that columnstore indexes store all fields redundantly, including the row IDs from the underlying rowstore table (either RID for heaps or clustered index keys for clustered tables). Some articles mention that within a columnstore, each data row gets its own unique identifier internal to the columnstore index, and a delete bitmap or delete buffer uses this identifier (rowgroup ID and row number within the rowgroup) to mark specific rows as deleted. However, these sources do not provide detailed explanations of how the columnstore index correlates an updated or deleted row in the rowstore table to the corresponding row in the columnstore index. If the process truly relies on the rowstore-based identifiers (like RID or clustered key), it seems impractical given that the columnstore index stores its own structure of data.

References reviewed:

* https://aboutsqlserver.com/2019/10/10/sql-server-2016-2017-critical-columnstore-index-bug-has-been-fixed/
* https://aboutsqlserver.com/2017/11/28/splitting-and-merging-non-empty-partitions-in-columnstore-indexes/
* https://sqlserverfast.com/epr/table-update/
* https://aboutsqlserver.com/2013/10/15/sql-server-storage-engine-data-pages-and-data-rows/
* https://www.mssqltips.com/sqlservertip/7759/sql-server-columnstore-indexes-tuple-mover-background-process/
* https://www.sqlshack.com/sql-server-2017-columnstore-in-place-updates/
* http://www.nikoport.com/columnstore/

Could someone explain in detail the actual mechanism SQL Server uses to tie a row in the columnstore index to the corresponding row in the rowstore table for the purpose of correct and consistent updates and deletes, and how these operations are performed behind the scenes? I’ve tried using the DBCC PAGE command to inspect the data pages, but the only concrete information I’ve found is that the delete bitmap stores a unique 9-byte identifier for each row. I still don’t understand how these identifiers map back to the corresponding row in the original rowstore table or how the internal linking mechanism works for updates and deletes. Could someone provide a detailed explanation of this process?
Steven (31 rep)
Dec 10, 2024, 02:25 PM • Last activity: Dec 12, 2024, 11:44 AM
0 votes
1 answer
80 views
Must fact tables be designed for aggregation?
I have been reading up on Microsoft SQL Server's columnstore indexes recently and I have become concerned that my experience does not match that of the industry. Columnstore indexes are designed to make aggregates easy. It is very frequently said that _for this reason_ they are an ideal fit for fact tables. This makes me uneasy. **In my experience, fact tables are not aggregated**. The column list for every fact table I have ever made has been of the form "primary key followed by several dimension columns, followed by every bit of info we store about the primary key". They then get dumped into some BI tool like PowerBI or ThoughtSpot, so non-technical people can filter based on the dimensions and find out whatever they want to know. They might aggregate from time to time, but this certainly isn't the main use of the table. I expect such tables to have well over 100 columns. Is the table design and access pattern I am describing what is expected of a fact table? Or am I describing an entirely different type of table, the name of which I am ignorant?
J. Mini (1225 rep)
Dec 1, 2024, 01:30 AM • Last activity: Dec 1, 2024, 03:26 AM
25 votes
1 answer
2599 views
Why doesn't MERGE insert more than 277 records into a table configured as a temporal table with a non-clustered index on the history table?
I found another issue with SQL Server and the MERGE statement and need some confirmation. I can reproduce my issue consistently on an Azure database (but not on an on-premises SQL Server 2017/2019). Please execute the following steps (step by step, not in one command execution)! **1) Script for schema:**
CREATE TABLE [dbo].[ImpactValueHistory]
	(
		[Rn] BIGINT NOT NULL,

		[ImpactId] UNIQUEIDENTIFIER NOT NULL,
		[ImpactValueTypeId] INT NOT NULL,

		[Date] DATE NOT NULL,
		[Value] DECIMAL(38, 10) NOT NULL,

		[ValidFrom] DATETIME2 NOT NULL CONSTRAINT [DF_ImpactValueHistory_ValidFrom] DEFAULT CONVERT(DATETIME2, '0001-01-01'),
		[ValidTo] DATETIME2 NOT NULL CONSTRAINT [DF_ImpactValueHistory_ValidTo] DEFAULT CONVERT(DATETIME2, '9999-12-31 23:59:59.9999999'),

		[ImpactPeriodId] INT NOT NULL,

		[NormalizedValue] DECIMAL(38, 10) NOT NULL,
	)
	GO

	CREATE CLUSTERED COLUMNSTORE INDEX [COLIX_ImpactValueHistory]
		ON [dbo].[ImpactValueHistory];
	GO

	CREATE NONCLUSTERED INDEX [IX_ImpactValueHistory_ValidFrom_ValidTo_ImpactId_DimensionItemId]
		ON [dbo].[ImpactValueHistory] ([ValidFrom], [ValidTo], [ImpactId], [ImpactValueTypeId], [Date]);
	GO


	CREATE TABLE [dbo].[ImpactValue]
	(
		[Rn] BIGINT NOT NULL IDENTITY(1,1),

		[ImpactId] UNIQUEIDENTIFIER NOT NULL,
		[ImpactValueTypeId] INT NOT NULL,

		[Date] DATE NOT NULL,
		[Value] DECIMAL(38, 10) NOT NULL,

		[ValidFrom] DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL CONSTRAINT [DF_ImpactValue_ValidFrom] DEFAULT CONVERT(DATETIME2, '0001-01-01'),
		[ValidTo] DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL CONSTRAINT [DF_ImpactValue_ValidTo] DEFAULT CONVERT(DATETIME2, '9999-12-31 23:59:59.9999999'),

		[ImpactPeriodId] INT NOT NULL,

		[NormalizedValue] DECIMAL(38, 10) NOT NULL,

		PERIOD FOR SYSTEM_TIME ([ValidFrom], [ValidTo]),

		CONSTRAINT [PK_ImpactValue] PRIMARY KEY NONCLUSTERED ([ImpactId], [ImpactValueTypeId], [Date], [ImpactPeriodId])
	)
	WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[ImpactValueHistory]))
	GO

	CREATE UNIQUE CLUSTERED INDEX [IX_ImpactValue_Id] ON [dbo].[ImpactValue]([Rn])
	GO

	CREATE COLUMNSTORE INDEX [CIX_ImpactValue] ON [dbo].[ImpactValue] ([ImpactId], [ImpactValueTypeId], [Date], [Value], [NormalizedValue])
	GO
**2) Script for inserting some random data**

DECLARE @inserted0 TABLE ([Date] DATE, [ImpactId] uniqueidentifier, [ImpactPeriodId] int, [ImpactValueTypeId] int);

MERGE [dbo].[ImpactValue]
USING
(
    SELECT TOP 278
    ...

Any one of the following changes makes it work:

1) If I use SELECT TOP 277 -- it's then working
2) If I delete the non-clustered index [IX_ImpactValueHistory_ValidFrom_ValidTo_ImpactId_DimensionItemId] --> it's working
3) If I use SELECT TOP (@BatchSize) in step 2) --> it's working
4) If I use only OUTPUT instead of OUTPUT INTO @inserted0 --> it's working

Without the COLUMNSTORE index on the history table it is working. By only removing the COLUMNSTORE index on the main table I see the same issue. The actual execution plan for a case that (a) repros the issue with TOP 278 and (b) doesn't repro with TOP (@BatchSize) is available at https://1drv.ms/u/s!AsOa6e9ukBWQlIRg9_9eySDFp5hvEA?e=KBQBsP. I also added the actual execution plans for a batch size of 277. Both work at that size!
Daniel C. (353 rep)
Jun 20, 2022, 01:34 PM • Last activity: Nov 16, 2024, 08:18 AM
5 votes
3 answers
1042 views
Trim_reason = DICTIONARY_SIZE on Clustered ColumnStore index
On a partitioned table with 17 billion rows, I'm getting a very high number of rowgroups. I've heard that I should try to get 1,048,576 rows in each rowgroup. How do I achieve that? The trim_reason is DICTIONARY_SIZE, as reported by sys.dm_db_column_store_row_group_physical_stats. I get on average 100,000 rows per rowgroup, when my batch job is inserting 1,048,576 rows per batch. I'm fairly new to columnstore indexes, but the only reference I can find to DICTIONARY_SIZE is for string columns. My table only has bit, bigint, and numeric columns. My table looks like this (column names obscured), and I know that a lot of the columns contain mostly NULLs. If there are values, they are likely to vary a lot (large standard deviation). It is vibration data.

CREATE TABLE fct.MeasureV2
(
    MeasureV2ID bigint NOT NULL IDENTITY(1, 1),
    DT1ModelID int NOT NULL CONSTRAINT DF_MeasureV2_DT1ModelID DEFAULT ((0)),
    DT1TID int NOT NULL CONSTRAINT DF_MeasureV2_DT1TID DEFAULT ((0)),
    TimeStampUTC datetime NOT NULL,
    AChkSum bigint NULL,
    OChkSum bigint NULL,
    SChkSum bigint NULL,
    ATp numeric (18, 10) NULL,
    ATpAvg numeric (18, 10) NULL,
    AWDAbsAvg numeric (18, 10) NULL,
    G1TMF numeric (18, 10) NULL,
    G2Ps_1TMF numeric (18, 10) NULL,
    G2Ps_2TMF numeric (18, 10) NULL,
    /* about 600 more numeric (18, 10) NULL columns */
    WdSdv numeric (18, 10) NULL,
    UpdatedDate datetime NOT NULL,
    UpdatedID bigint NULL,
    IsCorrected bit NOT NULL CONSTRAINT DF_MeasureV2_IsCorrected DEFAULT ((0)),
    TRATransformJobID int NOT NULL
) ON PS_FctMeasure (TimeStampUTC)
GO

CREATE CLUSTERED COLUMNSTORE INDEX CCI_MeasureV2 ON fct.MeasureV2 ON PS_FctMeasure (TimeStampUTC)
GO

There are a lot of columns, but the alternative attribute-value table would contain >200 billion rows and be almost impossible to query. This format is slow, but not too slow. We've also tried SPARSE columns, but that took a lot of disk space.
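To see which columns' dictionaries are approaching the cap, a diagnostic sketch against the dictionary catalog view (the joins are kept simple; adjust for partitioning as needed):

SELECT c.name, d.dictionary_id, d.type, d.entry_count, d.on_disk_size
FROM sys.column_store_dictionaries AS d
JOIN sys.partitions AS p
    ON p.hobt_id = d.hobt_id
JOIN sys.columns AS c
    ON c.object_id = p.object_id AND c.column_id = d.column_id
WHERE p.object_id = OBJECT_ID('fct.MeasureV2')
ORDER BY d.on_disk_size DESC;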
Henrik Staun Poulsen (2291 rep)
Aug 29, 2022, 12:08 PM • Last activity: Nov 5, 2024, 10:11 AM
6 votes
1 answer
564 views
Columnstore Aggregate Pushdown doesn't work for float/real data types
I am having an issue with Aggregate Pushdown for float/real data types. According to the documentation, Aggregate Pushdown is supported when:

* Supported aggregate operators are MIN, MAX, SUM, COUNT, AVG
* Any datatype <= 64 bits is supported. For example, bigint is supported as its size is 8 bytes, but decimal(38,6) is not because its size is 17 bytes. Also, no string types are supported
* Aggregate operator must be on top of SCAN node or SCAN node with group by

It doesn't work no matter what I do. I tried making the column nullable and not null, grouping and without grouping. We are running on the latest version of SQL Server 2016 (SP1-CU3). I wonder whether anyone has experienced the same? It seems like a bug to me. Am I missing anything? If you experience the same issue, please upvote my SQL Server Feedback Request. Now I am faced with a choice to convert the float columns to numeric. But operations with numeric types are slower in general, so I may gain in one place and lose in another. I have successfully tested numeric(15,12). Here is a script illustrating the issue (please enable the Actual Execution Plan to see it):

DROP TABLE IF EXISTS dbo.TestTable;

CREATE TABLE dbo.TestTable
(
    cKey INT NOT NULL
    , cGroup INT NOT NULL
    , cNumeric36_3 NUMERIC(36, 3) NULL
    , cNumeric18_3 DECIMAL(18, 3) NULL
    , cNumeric18_9 DECIMAL(18, 9) NULL
    , cNumeric15_12 DECIMAL(15, 12) NULL
    , cMoney MONEY NULL
    , cFloat53 FLOAT(53) NULL
    , cFloat53Less1 FLOAT(53) NULL
    , cFloat24 FLOAT(24) NULL
    , cReal REAL NULL
);

;WITH _Numbers0 AS
(
    SELECT TOP 3000 column_id FROM sys.all_columns
)
, _Numbers AS
(
    SELECT cKey = ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    FROM _Numbers0 a
    CROSS JOIN _Numbers0 b
)
, _Keys AS
(
    SELECT cKey = n.cKey
        , Divder10 = CONVERT(INT, FLOOR(CHECKSUM(NewId()) % 10))
        , Divder100 = CONVERT(INT, FLOOR(CHECKSUM(NewId()) % 100))
        , Divder10000 = CONVERT(INT, FLOOR(CHECKSUM(NewId()) % 10000))
    FROM _Numbers n
)
, _RandomValues AS
(
    SELECT cKey
        , cGroup = ABS(CHECKSUM(NewId())) % 100
        , cNumeric36_3 = CONVERT(NUMERIC(36,3), CHECKSUM(NewId())) / CONVERT(NUMERIC(36,3), NULLIF(Divder10000, 0.00))
        , cNumeric18_3 = CONVERT(NUMERIC(18,3), CHECKSUM(NewId())) / CONVERT(NUMERIC(36,3), NULLIF(Divder100, 0.00))
        , cNumeric18_9 = CONVERT(NUMERIC(18,9), CHECKSUM(NewId()) % 1000000) / CONVERT(NUMERIC(36,3), NULLIF(Divder10000, 0.00))
        , cNumeric15_12 = CONVERT(NUMERIC(15,12), CHECKSUM(NewId()) % 100) / CONVERT(NUMERIC(36,3), NULLIF(Divder10000, 0.00))
        , cMoney = CONVERT(MONEY, CHECKSUM(NewId())) / CONVERT(MONEY, NULLIF(Divder10000, 0.00))
        , cFloat53 = CONVERT(FLOAT, CHECKSUM(NewId())) * CONVERT(FLOAT, CHECKSUM(NewId())) / CONVERT(FLOAT(53), NULLIF(Divder10000, 0.00))
        , cFloat53Less1 = CONVERT(FLOAT, 1.00) / CONVERT(FLOAT(53), NULLIF(CHECKSUM(NewId()), 0.00))
        , cFloat24 = CONVERT(FLOAT(24), CHECKSUM(NewId())) * CONVERT(FLOAT(24), CHECKSUM(NewId())) / CONVERT(FLOAT(24), NULLIF(Divder10000, 0.00))
        , cReal = CONVERT(REAL, CHECKSUM(NewId())) * CONVERT(REAL, CHECKSUM(NewId())) / CONVERT(REAL, NULLIF(Divder10000, 0.00))
    FROM _Keys
)
INSERT INTO dbo.TestTable
SELECT * FROM _RandomValues
GO

CHECKPOINT;
GO

CREATE CLUSTERED COLUMNSTORE INDEX IDXCC_dboTestTable ON dbo.TestTable WITH (MAXDOP = 4);
GO

SELECT COUNT(*) FROM dbo.TestTable tt
GO

SELECT MAX(tt.cNumeric36_3) FROM dbo.TestTable tt;
SELECT MAX(tt.cNumeric18_3) FROM dbo.TestTable tt;
SELECT MAX(tt.cNumeric18_9) FROM dbo.TestTable tt;
SELECT MAX(tt.cNumeric15_12) FROM dbo.TestTable tt;
SELECT MAX(tt.cMoney) FROM dbo.TestTable tt;
SELECT MAX(tt.cFloat53) FROM dbo.TestTable tt;
SELECT MAX(tt.cFloat53Less1) FROM dbo.TestTable tt;
SELECT MAX(tt.cFloat24) FROM dbo.TestTable tt;
SELECT MAX(tt.cReal) FROM dbo.TestTable tt;
GO
Alexander Sharovarov (233 rep)
Jul 31, 2017, 09:27 PM • Last activity: Oct 3, 2024, 04:05 PM
1 vote
1 answer
157 views
Does inserting new data into a sorted clustered columnstore index in SQL Server disrupt the order?
I'm working with clustered columnstore indexes in SQL Server and have a question about maintaining data order. I understand that by default, clustered columnstore indexes don't sort data. I've seen a technique where people:

- Create a clustered rowstore index to sort the data
- Then use the DROP_EXISTING option to create a clustered columnstore index

This results in a clustered columnstore index with initially sorted data. However, I'm concerned about what happens when new data is inserted. My questions are:

* Does inserting new data into this table disrupt the existing order?
* If so, are there any ways to prevent this or maintain the order over time?

I'd appreciate insights from anyone who has experience with this scenario or knowledge about the internal workings of columnstore indexes in SQL Server. Thanks in advance for your help!
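For concreteness, the technique in the bullet list reduces to something like this (table, index, and column names are placeholders):

-- 1. Sort the data once with a clustered rowstore index:
CREATE CLUSTERED INDEX CX_Orders ON dbo.Orders (OrderDate);

-- 2. Convert it to columnstore while preserving that order:
CREATE CLUSTERED COLUMNSTORE INDEX CX_Orders ON dbo.Orders
WITH (DROP_EXISTING = ON);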
Just a learner (2082 rep)
Aug 12, 2024, 07:20 PM • Last activity: Aug 14, 2024, 10:54 PM