How does a column-oriented DBMS filter on multiple columns?

3 votes

2 answers

1332 views

                          I'm learning how column-oriented DBMS / "columnars" work for OLAP situations.

Let's say we have a log of millions of transactions with 3 columns: timestamp, shop, product, and we want to know the products sold in the shop A in a certain time range:

    SELECT DISTINCT product FROM data
    WHERE timestamp BETWEEN 1600010000 AND 1602678400 
                AND shop = 'A'

This will be stored like this (admittedly, this is more or less an abstraction):

    timestamp [1600000000, 1600000005, 1600000005, 1600000017, 1600000018, ... ]
    shop      [A, A, B, D, C, ...]
    product   [X153, S76D, TYA6, LKJ6, SDH7, ...]
For this query:

 * I totally get how we can achieve fast lookup *by timestamp*, since this column is sorted: with 2 [dichotomic searches](https://en.wikipedia.org/wiki/Dichotomic_search)  we can find the **index** for which timestamp=1600010000 and 1602678400. With **less than 30 read operations of a few bytes**, it's done, we have rowid_start, rowid_end (I don't know if it's still called rowid in the context of a columnar) that make the boundaries of this time-range. The key thing is that we *haven't* had to read megabytes of data, but just a few bytes.

* **Question: then, how can a columnar filter by shop = 'A'? Do we have to read **each entry** of the column shop in the range rowid_start .. rowid_end to test if it's A or not?** This could potentially be hundreds of MB or GB of data.

TL;DR: once we have filtered by one column, how can a columnar do a second-column-filtering, without doing a FULL SCAN?
                        

Asked by Basj (171 rep)

Jan 5, 2021, 12:02 PM
Last activity: Jan 18, 2021, 02:22 PM

How does a column-oriented DBMS filter on multiple columns?

Related Questions