Sample Header Ad - 728x90

How does a column-oriented DBMS filter on multiple columns?

3 votes
2 answers
1332 views
I'm learning how column-oriented DBMS / "columnars" work for OLAP situations. Let's say we have a log of millions of transactions with 3 columns: timestamp, shop, product, and we want to know the products sold in the shop A in a certain time range: SELECT DISTINCT product FROM data WHERE timestamp BETWEEN 1600010000 AND 1602678400 AND shop = 'A' This will be stored like this (admittedly, this is more or less an abstraction): timestamp [1600000000, 1600000005, 1600000005, 1600000017, 1600000018, ... ] shop [A, A, B, D, C, ...] product [X153, S76D, TYA6, LKJ6, SDH7, ...] For this query: * I totally get how we can achieve fast lookup *by timestamp*, since this column is sorted: with 2 [dichotomic searches](https://en.wikipedia.org/wiki/Dichotomic_search) we can find the **index** for which timestamp=1600010000 and 1602678400. With **less than 30 read operations of a few bytes**, it's done, we have rowid_start, rowid_end (I don't know if it's still called rowid in the context of a columnar) that make the boundaries of this time-range. The key thing is that we *haven't* had to read megabytes of data, but just a few bytes. * **Question: then, how can a columnar filter by shop = 'A'? Do we have to read **each entry** of the column shop in the range rowid_start .. rowid_end to test if it's A or not?** This could potentially be hundreds of MB or GB of data. TL;DR: once we have filtered by one column, how can a columnar do a second-column-filtering, without doing a FULL SCAN?
Asked by Basj (171 rep)
Jan 5, 2021, 12:02 PM
Last activity: Jan 18, 2021, 02:22 PM