Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
5
votes
2
answers
1555
views
Should I snowflake or duplicate it across my facts?
I'm building up a data warehouse to be used by SSAS to create cubes on, and I'm debating between two possible schemas.
In my warehouse, I've got two different fact tables that track daily changes in dollar values. Each entity in these fact tables has an underlying Sales Order and Line to which it relates. These SOs and Lines then have other related dimensions, such as customer, product, etc. About 12 sub-dimensions total so far.
My question is whether I should roll all these sub-dimensions up directly into the fact tables, or use a little snowflaking in my warehouse and have them branch off the Sales Order and Line dimension instead.
The first option obviously follows the star-schema model better. However, if changes are made, such as adding additional dimensions, it means more maintenance: the ETL basically has to be done twice, once for each fact table, rather than just once on the SO dimension. Likewise, if a new fact table is added that relates to Sales Orders, I'd have to go through the whole process again.
As this is my first DW/OLAP project, I'm not sure where the line should be drawn on snowflaking, and other people's thoughts would be highly appreciated.
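For illustration, a minimal sketch of the two layouts being weighed; every table and column name below is hypothetical, not the actual schema:
-- Option 1: pure star - every sub-dimension key is duplicated on each fact table.
CREATE TABLE FactDailyValue (
    DateKey        int NOT NULL,
    SalesOrderKey  int NOT NULL,
    LineKey        int NOT NULL,
    CustomerKey    int NOT NULL,   -- repeated here...
    ProductKey     int NOT NULL,   -- ...and again on the second fact table
    DollarAmount   money NOT NULL
);

-- Option 2: light snowflake - the facts carry only the SO/Line key,
-- and the sub-dimension keys hang off the Sales Order / Line dimension.
CREATE TABLE DimSalesOrderLine (
    LineKey        int NOT NULL PRIMARY KEY,
    SalesOrderKey  int NOT NULL,
    CustomerKey    int NOT NULL,
    ProductKey     int NOT NULL
);

CREATE TABLE FactDailyValueSnowflaked (
    DateKey        int NOT NULL,
    LineKey        int NOT NULL,   -- single hop to DimSalesOrderLine, then out to the sub-dimensions
    DollarAmount   money NOT NULL
);
In option 2 the sub-dimension ETL runs once against DimSalesOrderLine; in option 1 it runs per fact table.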
Evan M.
(231 rep)
Mar 26, 2013, 03:45 PM
• Last activity: Jun 28, 2025, 10:06 PM
1
votes
1
answers
38
views
How to efficiently manage data retention in MariaDB ColumnStore?
I’m using MariaDB 10.11+ with ColumnStore and have created a ColumnStore table to handle large volumes of analytical data. My use case requires keeping only the last 90 days of data. The table includes a created_at column of type DATETIME for tracking data age.
However, I’ve run into the following limitations:
- ColumnStore does not support user-defined partitioning, making
time-based partitioning difficult.
- Deleting old records does not reclaim disk space (the delete is only
recorded in metadata).
- The documentation suggests workarounds like creating daily tables or
truncating and recreating tables with only the required data, which
feels inefficient and operationally complex.
My question is:
Is there a more effective or officially recommended way to manage data lifecycle and enforce data retention in MariaDB ColumnStore—preferably one that allows for reclaiming disk space and minimizing operational overhead?
Any best practices, tools, or workflows others are using would be greatly appreciated.
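For concreteness, a minimal sketch of the "recreate with only the required data" workaround mentioned above; the table name analytics and its columns are hypothetical, and the exact swap mechanics are worth verifying against the ColumnStore docs:
-- Build a replacement table with only the last 90 days (column list must match the original).
CREATE TABLE analytics_keep (
    id         BIGINT,
    created_at DATETIME
    -- ... remaining columns identical to the original table ...
) ENGINE=ColumnStore;

INSERT INTO analytics_keep
SELECT * FROM analytics
WHERE created_at >= NOW() - INTERVAL 90 DAY;

-- Swap and drop; dropping the old table is what actually releases the extents on disk.
RENAME TABLE analytics TO analytics_old, analytics_keep TO analytics;
DROP TABLE analytics_old;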
Amalka
(13 rep)
Jun 3, 2025, 06:58 AM
• Last activity: Jun 11, 2025, 02:55 PM
3
votes
3
answers
5928
views
Should dates in dimensional tables use dimDate?
Assuming my dimDate has a surrogate key, should all the date columns in the *dimension tables* (not the fact tables) store the surrogate key of the date dimension, or just a plain date?
For example, the dimension table dimCustomer may have birthday, join date, graduation date, etc.
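To make the two options concrete, a small sketch (column names are only illustrative):
-- Option A: role-playing date keys pointing at dimDate.
CREATE TABLE dimCustomer_A (
    CustomerKey       int NOT NULL PRIMARY KEY,
    BirthDateKey      int NOT NULL,   -- FK to dimDate.DateKey
    JoinDateKey       int NOT NULL,   -- FK to dimDate.DateKey
    GraduationDateKey int NOT NULL    -- FK to dimDate.DateKey
);

-- Option B: plain dates stored directly on the dimension.
CREATE TABLE dimCustomer_B (
    CustomerKey    int NOT NULL PRIMARY KEY,
    BirthDate      date NULL,
    JoinDate       date NULL,
    GraduationDate date NULL
);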
u23432534
(1565 rep)
Jun 5, 2017, 07:54 PM
• Last activity: Apr 5, 2025, 11:08 PM
2
votes
1
answers
2227
views
Why do OLAP cubes perform better than Relational databases on certain tasks?
My team uses a Multidimensional OLAP database for a lot of its analytics and predictive modeling. The standard line in all the documentation is "OLAP cubes are better suited for this type of application than relational databases".
On the other hand, you frequently hear that a relational database with a star or snowflake schema (sometimes called ROLAP) can do anything that a true OLAP database can do.
1. Why is it exactly that OLAP is better for analytical processing, aggregating, etc... than the relational model?
2. Is it true that ROLAP can do anything that a true OLAP can do, or are there still limitations or performance issues that can only be solved by using a true OLAP db?
Alex Kinman
(149 rep)
Oct 11, 2017, 09:09 PM
• Last activity: Mar 16, 2025, 11:01 PM
0
votes
0
answers
15
views
Have I converted this DB table to the correct dimension tables for a star schema?
I'm trying to replace a pretty complicated query for a dashboard. We currently use a system where we track practices, phone numbers, and start/end dates partially in an Excel spreadsheet and also in a database. Whenever we want to update any of the info, like which practice uses which number during which time period, it is a very involved process that includes checking numbers and practices in Excel to make sure that nothing is duplicated, etc. If we want to change something so that practices share a line, we have to go into the DB and combine the names, etc. I think the entire process could be simplified if we used a star schema instead of the system that we currently have.
We currently have a database table with the following schema
| Group Practice | Practice | TransferType | DialedNumber | DigitsDialed | StartDate | EndDate|
I think the table should be broken out into the following dimensions:
* Practice
| PK | Group Practice | Practice |
* Phone Number
| PK | DialedNumber | DigitsDialed |
* TransferType
| PK| TransferType |
* Calendar Table
There would be a fact table at the center that would track "Transfer Volume by Practice".
A phone number can have one or more transfer types for one or more practices simultaneously.
For example, (123)456-7890 could be transfertype1 for practice A from Jan 1, 2024 - Feb 1, 2024. It could then be used as transfertype2 for practice B from Feb 2, 2024 - Mar 1, 2024.
Or Number (234)567-8901 could be transfertype 2 for practice C from Jan 1, 2024 - Feb 1, 2024 and it could simultaneously be transfertype 3 for practice D from Jan 1, 2024 - Mar 1, 2024.
I've read textbook definitions of how to set up a star schema but I want to make sure that I'm doing this correctly. Do my dimensions make sense? Would this capture the "slowly changing dimensions" in the data?
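For illustration, a rough DDL sketch of the dimensions and fact described above; all names and types are guesses, not a finished design:
CREATE TABLE DimPractice (
    PracticeKey    int PRIMARY KEY,   -- surrogate key
    GroupPractice  varchar(100),
    Practice       varchar(100)
);

CREATE TABLE DimPhoneNumber (
    PhoneNumberKey int PRIMARY KEY,
    DialedNumber   varchar(20),
    DigitsDialed   varchar(20)
);

CREATE TABLE DimTransferType (
    TransferTypeKey int PRIMARY KEY,
    TransferType    varchar(50)
);

-- The fact records which number served which practice, with which transfer type,
-- over which period; one number can therefore hold several overlapping assignments.
CREATE TABLE FactTransferAssignment (
    PracticeKey     int NOT NULL REFERENCES DimPractice (PracticeKey),
    PhoneNumberKey  int NOT NULL REFERENCES DimPhoneNumber (PhoneNumberKey),
    TransferTypeKey int NOT NULL REFERENCES DimTransferType (TransferTypeKey),
    StartDateKey    int NOT NULL,   -- FK to the calendar dimension
    EndDateKey      int NULL
);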
j.jerrod.taylor
(101 rep)
Oct 18, 2024, 05:11 PM
0
votes
2
answers
126
views
When would a columnstore index be unsuitable for an OLAP environment with large and wide tables?
Suppose that your SQL Server 2019 machine has a query that hits a disk-based table that is both long (let's say 3,000,000 rows) and wide (let's say 150 columns). Suppose that you retrieve most of the rows and most of the columns. Finally, suppose that the table is infrequently changed. Consider it a typical reporting server table that is only updated once a day.
**Are there any circumstances at all where the above conditions apply and a traditional rowstore index would be preferable to a columnstore index, clustered or otherwise?**
I worry that I have been sold too easily on columnstore indexes and will begin using them on all of my large and wide tables.
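For reference, the two index shapes being weighed look roughly like this (table and column names are hypothetical):
-- Option 1: make the whole table columnstore.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_ReportingWide
    ON dbo.ReportingWide;

-- Option 2: keep the rowstore clustered index and add a nonclustered
-- columnstore over the reporting columns.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_ReportingWide
    ON dbo.ReportingWide (Col1, Col2, Col3 /* ... */);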
J. Mini
(1225 rep)
Jun 22, 2024, 08:29 PM
• Last activity: Jun 22, 2024, 11:41 PM
0
votes
1
answers
53
views
what is the difference between data warehouse engine, data warehouse, olap engine, olap database, data storage engine vs data storage?
I need to clarify the keyword "engine" and what functionality stands behind it. I assume an engine is a smaller component of a bigger thing, the database, like a car engine inside a car.
So can a DWH engine be plugged into a bigger data warehouse?
Can an OLAP engine be used as a component of an actual OLAP DB?
I found this data landscape image on Medium.com with two different categories: OLAP DB and OLAP engine.
What is the difference, since they have overlapping functionality?

ERJAN
(483 rep)
Apr 8, 2024, 07:58 PM
• Last activity: Apr 30, 2024, 11:27 AM
4
votes
2
answers
3563
views
Star Schema from Relational Database
I'm a newbie to data warehousing and I've been reading articles and watching videos on the principles but I'm a bit confused as to how I would take the design below and convert it into a star schema.
In all the examples I've seen the fact table references the dim tables, so I'm assuming the questionId and responseId would be part of the fact table? Any advice would be much appreciated.

saj
(93 rep)
Jul 25, 2016, 01:49 PM
• Last activity: Nov 27, 2023, 11:05 PM
0
votes
0
answers
83
views
Handling updates in OLAP system for e-commerce system using immutable stores
I have a digital commerce system that we have built in-house and it has been working fine. We now want to implement OLAP to provide our customers with advanced reporting and BI features.
While looking for the needed components, I came across several "immutable" data stores which seem to be popular choices as data warehouses, e.g. Apache Druid, ClickHouse, etc. These seem to work pretty well with a visualization tool like Metabase (or other similar ones).
What I am unable to figure out is how to handle updates to records in such immutable stores. As in any order management system, orders in our system also get updated (status, line items, quantities, etc.) over the course of their lifetime. In such a case, how would one use these immutable stores?
Even if I were to use something like a star schema, I can see needing updates to both fact and dimension tables. For example, if a sale is recorded as a fact with its total revenue, the amount might need an update, or the order might get cancelled, items can change, etc.
The only way I can think of is to only push a transaction to the data warehouse after it is CLOSED, aka frozen. In this case, all metrics and reporting for in-process orders would have to be based on the main DB (aka the OLTP DB).
Is that the only way out? Is there a better approach?
brahmana
(101 rep)
Oct 3, 2023, 05:06 AM
0
votes
1
answers
621
views
Restore from .db folder for Analysis Services
There is an objective to recover a cube from a long-retired MSAS instance. The issue I'm currently facing is that the actual .abf backups are now in cold storage and recovering them would take a while; however, I do have a filesystem backup of the Data (as well as Logs) folder, including the .db folder (seemingly containing all the .dim, .ds and .dsv files).
How feasible is it to attempt to recover the cube from said folder rather than proceed with searching for the .abf files?
JagdCrab
(101 rep)
Oct 31, 2019, 09:03 PM
• Last activity: Aug 7, 2023, 05:07 AM
20
votes
2
answers
22562
views
Difference between star schema and data cube?
I am involved in a new project where I have to create a data cube from the existing relational database system. I understood that the existing system is not properly designed, and I am not sure where to start.
My questions are:
- What is the difference between a star schema and a data cube?
- Where do I have to start: from the star schema, or directly from the data cube?
- Is the data cube generated from the star schema?
Rathish Kumar B
(2320 rep)
May 5, 2017, 09:52 AM
• Last activity: Jul 26, 2023, 05:44 PM
-1
votes
1
answers
149
views
Real Time Analytics: Which database?
We are currently using MongoDB, and it is performing well for our needs. However, we are looking to support better real-time analytics and aggregations, which MongoDB doesn't handle effectively. Therefore, we are exploring other possibilities.
The biggest problem is that my data is very flexible. We store:
- contacts
- events
Every contact has different attributes. For example, besides standard attributes such as first_name, last_name, email, phone, gender, etc. there could be additional/custom attributes that users can create.
Also, we have events such as order, add to cart, the coupon applied, etc. Every user can push different events -- like in Google Analytics, for example.
We want to support the following queries:
- Select all users from London
- Select all users from London who created between 3 and 5 orders whose minimum value is $100, in the last 90 days.
- Select all users from Germany who purchased T-Shirt in the last X days
and so on...
Currently, we don't have such aggregations (count, average, etc).
As you can see, our data is quite unstructured. We are exploring several options to improve this, including MongoDB + Spark, ClickHouse, MariaDB ColumnStore, Apache Druid, Apache Pinot, and others.
Since different attributes have different input types, we could create two tables:
**contacts**
- workspace_id
- attribute_string_1
- attribute_string_2
- attribute_string_3
- ...
- attribute_integer_1
- attribute_integer_2
- attribute_integer_3
...
- created_at
- updated_at
and have a few hundreds of such columns.
The same is for **events**:
- workspace_id
- event
- event_attribute_string_1
- event_attribute_string_2
- event_attribute_string_3
..
- event_attribute_integer_1
- event_attribute_integer_2
- event_attribute_integer_3
..
etc.
This means every column can be used in the WHERE clause for filtering... plus we need to join them (or use dictionaries in ClickHouse, for example). Currently, we have a nested array in MongoDB for easier handling.
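Sketched as DDL, the wide generic-column layout described above would look roughly like this; types are assumptions, and the closing comment is only an implication of the design:
CREATE TABLE contacts (
    workspace_id        BIGINT NOT NULL,
    attribute_string_1  VARCHAR(255),
    attribute_string_2  VARCHAR(255),
    -- ... up to a few hundred typed slots ...
    attribute_integer_1 BIGINT,
    attribute_integer_2 BIGINT,
    created_at          DATETIME,
    updated_at          DATETIME
);

CREATE TABLE events (
    workspace_id              BIGINT NOT NULL,
    event                     VARCHAR(100) NOT NULL,
    event_attribute_string_1  VARCHAR(255),
    event_attribute_integer_1 BIGINT,
    created_at                DATETIME
);
-- A separate per-workspace mapping of logical attribute name -> physical slot is implied by this design.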
Questions:
1. What do you think might be the best choice for such a problem?
2. How to organize indexes to support such dynamic data?
3. What about time series databases?
Thanks
Nedim
(99 rep)
Feb 19, 2023, 12:12 PM
• Last activity: Feb 19, 2023, 02:03 PM
0
votes
1
answers
62
views
SQL Server OLAP tables performance
I have some tables in my SQL Server database that I only insert into / update nightly from our services, and they are read-only during the day for our clients (there are no updates / inserts / deletes from the end user).
What is the best way to set up / configure these tables so that the inserts / updates are faster at night (while no client uses our service) and reads are faster during the day (when clients actually use the tables for read purposes)?
These tables have over 100 million rows.
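One common pattern for this nightly-load / daytime-read shape, sketched with hypothetical object names (whether it helps depends on how the nightly load is done):
-- Before the nightly load: disable nonclustered indexes so the inserts/updates don't maintain them.
ALTER INDEX IX_BigTable_Customer ON dbo.BigTable DISABLE;

-- ... run the nightly inserts / updates ...

-- After the load: rebuild (which re-enables the index) with compression, and refresh statistics for the day's reads.
ALTER INDEX IX_BigTable_Customer ON dbo.BigTable
    REBUILD WITH (DATA_COMPRESSION = PAGE);

UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;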
beto
(1 rep)
Sep 24, 2022, 03:56 AM
• Last activity: Sep 27, 2022, 09:35 AM
0
votes
0
answers
93
views
Setting up an HTAP Environment
I've had success in the past with setting up a CIX/PK on an identity column and then setting up supporting NCIX and NCCsIX on tables to help support an HTAP environment.
I'm curious: going forward, what would be the best combo of ColumnStore and RowStore indexes?
Should I create a clustered ColumnStore index and then create supporting nonclustered RowStore indexes?
Or should I stick with my original design of a CIX/PK on an identity column with supporting NC RowStore & ColumnStore indexes? The main question I have in this scenario is whether I should include my CIX/PK column in the NC ColumnStore index(es).
My current position will be supporting a transactional app that does batch/trickle inserts with few updates/deletes (I'm currently working with them to get to zero, but there is a vNext product that I want to get right the first time...) and a heavy emphasis on OLAP SELECTs. Most tables have a skinny schema with a few INT/Numeric/Float columns and an ever-increasing DateTime field.
TIA :)
Edit: (Thanks J.D.) The number of rows varies but tends toward the millions for some of our larger customers: mid-range 1-2 million, large currently 8-9 million. Currently each customer has their own copy of "our" database, so it's a nightmare to tune, but we are moving to Azure soon-ish so we can collect metrics/usage data on each of our customers; do the math, it's going to get huge when we migrate over.
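Sketched with hypothetical table and column names, the two combinations being compared look roughly like this:
-- Combo A: clustered columnstore for the analytic scans,
-- plus a nonclustered rowstore B-tree for point lookups / seeks.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Readings ON dbo.Readings;
CREATE NONCLUSTERED INDEX IX_Readings_Id ON dbo.Readings (ReadingId);

-- Combo B: keep the rowstore CIX/PK on the identity column and add a
-- nonclustered columnstore over the date + measure columns for the OLAP selects
-- (the identity key can be listed too if the analytic queries join or filter on it).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Readings
    ON dbo.Readings (ReadingId, ReadingDate, MeasureValue1, MeasureValue2);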
Miguel Sanchez
(11 rep)
May 27, 2022, 12:28 AM
• Last activity: May 27, 2022, 05:35 PM
0
votes
2
answers
2199
views
What version of MSOLAP provider am i using?
We've just upgraded a server from SQL Server 2016 to 2019.
We have linked servers connecting to an OLAP cube.
When I look under providers for MSOLAP the label of the dialog box shows:
> Microsoft OLE DB Provider for Analysis Services 14.0
If I look at a linked server, the drop-down for provider also says:
> Microsoft OLE DB Provider for Analysis Services 14.0
When I look in: C:\Program Files\Microsoft Analysis Services\AS OLEDB I have 2 folders:
- 130
- 140
If I look under 130 at the properties for MSOLAP130.dll it says it is version 13.0.5622.0
If I look under 140 at the properties for MSOLAP.dll it says it is version 15.1.65.24
Which DLL Is my linked server using? 13.xx or 15.xx?
Sir Swears-a-lot
(3253 rep)
Jul 12, 2021, 10:39 PM
• Last activity: Feb 22, 2022, 04:04 AM
0
votes
0
answers
460
views
Cube won't deploy
I was able to deploy a large cube (with 2 dimensions and 1 measure) to the Analysis Services server, but only after setting the error configuration options for both dimensions and the measure (through the properties dialog box of each one of them) to specify:
1. KeyNotFound : IgnoreError
2. NullKeyNotAllowed : IgnoreError
3. NullKeyConvertedToUnknown : ReportAndContinue
4. CalculationError : IgnoreError
My problem now is that, after setting the right dimension usage options and adding admin access, the measure still returns no values when I try to browse the cube (adding the dimensions to the browser changes nothing).
The cube processing task ends successfully after reporting the number of measure rows read in the partition, and the data explorer for the measure's table view returns the data correctly.
My question is: how do I make browsing the cube return the measure rows when I execute the sample query provided by VS 2019?
PS: the measure I defined has the **float** type.
Hint: when I specify KeyNotFound : ReportAndContinue for the measure, processing of the cube stops with an error message saying the error limit was reached.
Thanks for the help.
UPDATE: full error when I process the cube (translated from a foreign language):
> Errors in the OLAP storage engine: Attribute key not found during
> processing of: Table: 'dbo_MyFacttable', Column: 'Field7', Value:
> '7811'; Table: 'dbo_MyFacttable', Column: 'Champ14', Value: '3.64000'.
> The attribute is 'Field1'. Errors in the OLAP storage engine: The
> record was skipped because the attribute key could not be found.
> Attribute: Field1 of the dimension: Customer Address of the database:
> MyDB, Cube: Mycube, Group of measures: Facts, Partition: Facts,
> Record: 1. Errors in the OLAP storage engine: The processing operation
> has ended because the number of errors that occurred during the
> operation has reached the error limit. Errors in the OLAP storage
> engine: An error occurred while processing the 'Facts' partition of
> the 'Facts' measure group for the 'Mycube' cube from the MyDB
> database.
Patrick Schulz
(1 rep)
Aug 4, 2021, 12:49 PM
• Last activity: Aug 5, 2021, 04:04 PM
0
votes
1
answers
66
views
Data modeling for Warehouse
I am learning about data warehouses and data modeling. I came across this blog post. In the section "***DIFFERENT MODELS HAVE DIFFERENT PURPOSES***", the author mentions that, depending upon the needs of the business, there can be two types of data modeling:
1. Relational
2. Dimensional
However, in both examples the author says that the modeling is done in 3NF. I am confused: how can 3NF and a star schema both be used in a dimensional model? Aren't they supposed to be fundamentally different?
Lastly, if I understand correctly, relational data modeling is done for OLTP systems and dimensional modeling is done for OLAP systems, right?
adam
(1 rep)
Apr 5, 2021, 01:55 AM
• Last activity: Apr 6, 2021, 01:39 PM
3
votes
2
answers
294
views
CUBE and calculating the number of dimensions
If data is presented with 4 dimensions, in which each dimension has a hierarchical 3-level aggregate like (country, city, street), then we can summarize it in 4096 ways!
> We know that for a cube with n dimensions, in which none of the dimensions is hierarchical, we have 2^n summarizing ways; but in this case, where each dimension has a hierarchical 3-level aggregate, we have 4^4 = 256 ways.
This is my formula; however, my answer is very different from 4096! Why does the above statement mention that there are 4096 ways?
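One possible way to reconcile the two counts, shown as a worked calculation; this is only an interpretation, not a confirmed answer. Treating each dimension as a single attribute with 3 hierarchy levels plus the apex "all" level gives
\[ \prod_{i=1}^{4} (L_i + 1) = (3+1)^4 = 256 \ \text{cuboids,} \]
matching the 256 above, while treating every hierarchy level as an independent group-by attribute (4 dimensions times 3 levels = 12 attributes, each either grouped on or rolled up) gives
\[ 2^{4 \times 3} = 2^{12} = 4096 \ \text{groupings.} \]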
If I couldn't state my problem very well because of a flaw in my English, please see:


Betty Andersson
(31 rep)
Jan 21, 2021, 11:14 PM
• Last activity: Jan 25, 2021, 01:37 PM
3
votes
2
answers
1332
views
How does a column-oriented DBMS filter on multiple columns?
I'm learning how column-oriented DBMS / "columnars" work for OLAP situations.
Let's say we have a log of millions of transactions with 3 columns: timestamp, shop, product, and we want to know the products sold in the shop A in a certain time range:
SELECT DISTINCT product FROM data
WHERE timestamp BETWEEN 1600010000 AND 1602678400
AND shop = 'A'
This will be stored like this (admittedly, this is more or less an abstraction):
timestamp [1600000000, 1600000005, 1600000005, 1600000017, 1600000018, ... ]
shop [A, A, B, D, C, ...]
product [X153, S76D, TYA6, LKJ6, SDH7, ...]
For this query:
* I totally get how we can achieve fast lookup *by timestamp*, since this column is sorted: with 2 [dichotomic searches](https://en.wikipedia.org/wiki/Dichotomic_search) we can find the **index** for which timestamp=1600010000 and 1602678400. With **less than 30 read operations of a few bytes**, it's done, we have rowid_start, rowid_end (I don't know if it's still called rowid in the context of a columnar) that make the boundaries of this time range. The key thing is that we *haven't* had to read megabytes of data, but just a few bytes.
* **Question: then, how can a columnar filter by shop = 'A'? Do we have to read each entry of the column shop in the range rowid_start .. rowid_end to test if it's A or not?** This could potentially be hundreds of MB or GB of data.
TL;DR: once we have filtered by one column, how can a columnar do a second-column-filtering, without doing a FULL SCAN?
Basj
(171 rep)
Jan 5, 2021, 12:02 PM
• Last activity: Jan 18, 2021, 02:22 PM
0
votes
1
answers
91
views
How to combine Oracle RAC with in-memory database to isolate OLAP from OLTP
I am going through a document https://www.oracle.com/technetwork/database/options/database-in-memory-data-sheet-2338744.pdf There is a thing written "Oracle Real Application Clusters (RAC) can be combined with Oracle Database InMemory to isolate analytic workloads from OLTP workloads by running them...
I am going through a document
https://www.oracle.com/technetwork/database/options/database-in-memory-data-sheet-2338744.pdf
It states: "Oracle Real Application Clusters (RAC) can be combined with Oracle Database In-Memory to isolate analytic workloads from OLTP workloads by running them on different servers of a cluster."
Can anyone explain how this would be possible, given that RAC has the same storage under it? How can different nodes take responsibility for the OLAP and OLTP workloads independently?
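For context, a minimal sketch of how that separation is typically set up; the instance names, size, table name, and service names below are hypothetical. The datafiles stay shared, but the In-Memory column store is per-instance, so it can be sized (or disabled) node by node and services can route the two workloads to different nodes:
-- Size the IM column store only on the node meant for analytics (takes effect after restart):
ALTER SYSTEM SET inmemory_size = 64G SCOPE=SPFILE SID='orcl2';
ALTER SYSTEM SET inmemory_size = 0   SCOPE=SPFILE SID='orcl1';

-- Mark the analytic tables for in-memory population:
ALTER TABLE sales INMEMORY;

-- Route sessions with services so OLAP connects to orcl2 and OLTP to orcl1, e.g.:
--   srvctl add service -db orcl -service olap_svc -preferred orcl2
--   srvctl add service -db orcl -service oltp_svc -preferred orcl1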
user3102917
(23 rep)
Jan 4, 2021, 07:26 PM
• Last activity: Jan 5, 2021, 01:09 AM