How does a column-oriented DBMS filter on multiple columns?
3
votes
2
answers
1332
views
I'm learning how column-oriented DBMS / "columnars" work for OLAP situations.
Let's say we have a log of millions of transactions with 3 columns:
timestamp
, shop
, product
, and we want to know the products sold in the shop A
in a certain time range:
SELECT DISTINCT product FROM data
WHERE timestamp BETWEEN 1600010000 AND 1602678400
AND shop = 'A'
This will be stored like this (admittedly, this is more or less an abstraction):
timestamp [1600000000, 1600000005, 1600000005, 1600000017, 1600000018, ... ]
shop [A, A, B, D, C, ...]
product [X153, S76D, TYA6, LKJ6, SDH7, ...]
For this query:
* I totally get how we can achieve fast lookup *by timestamp*, since this column is sorted: with 2 [dichotomic searches](https://en.wikipedia.org/wiki/Dichotomic_search) we can find the **index** for which timestamp=1600010000 and 1602678400. With **less than 30 read operations of a few bytes**, it's done, we have rowid_start
, rowid_end
(I don't know if it's still called rowid in the context of a columnar) that make the boundaries of this time-range. The key thing is that we *haven't* had to read megabytes of data, but just a few bytes.
* **Question: then, how can a columnar filter by shop = 'A'
? Do we have to read **each entry** of the column shop
in the range rowid_start .. rowid_end
to test if it's A
or not?** This could potentially be hundreds of MB or GB of data.
TL;DR: once we have filtered by one column, how can a columnar do a second-column-filtering, without doing a FULL SCAN?
Asked by Basj
(171 rep)
Jan 5, 2021, 12:02 PM
Last activity: Jan 18, 2021, 02:22 PM
Last activity: Jan 18, 2021, 02:22 PM