Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0 votes • 1 answer • 963 views
How to check user settings in ClickHouse
I have a user created by SQL command. I know there is the `getSetting()` function to check user settings. e.g. : ```sql SELECT getSetting('async_insert'); ``` But how to check other users' settings if you are a DBA? Is there any view/function/request for this purpose?
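For reference, a minimal sketch of how a DBA might inspect another user's settings on a recent ClickHouse version, assuming access entities are managed via SQL (`system.settings_profile_elements` and `SHOW CREATE USER` are standard; the user name `some_user` is a placeholder):
```sql
-- Settings attached directly to a user created by SQL (or via a settings profile)
SELECT setting_name, value, min, max
FROM system.settings_profile_elements
WHERE user_name = 'some_user';

-- Full SQL definition of the user, including its SETTINGS clause
SHOW CREATE USER some_user;
```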
Mikhail Aksenov
(430 rep)
Dec 12, 2023, 02:20 PM
• Last activity: Aug 6, 2025, 09:06 AM
0 votes • 1 answer • 35 views
Clickhouse - Oracle ODBC connection error
I am trying to create a connection between my Oracle and ClickHouse databases, so I can query Oracle through ClickHouse like this:
SELECT * FROM odbc('DSN=OracleODBC-21', 'sys', 'test')
I have successfully installed unixODBC, Oracle Instant Client, and the Oracle ODBC driver for the client.
I also configured my .odbc.ini and .ini, so I can access Oracle:
[oracle@host ~]$ isql -v OracleODBC-21
+---------------------------------------+
| Connected! |
...
SQL> select * from sys.test;
+-----------------------------------------+-----------------------------------------------------------------------------------------------------+
| ID | DATA |
+-----------------------------------------+-----------------------------------------------------------------------------------------------------+
| 0 | 123 |
+-----------------------------------------+-----------------------------------------------------------------------------------------------------+
The clickhouse user can also do this, but with some environment variables set:
[oracle@host ~]$ sudo -u clickhouse bash -c "export LD_LIBRARY_PATH=/opt/oracle/instantclient_21_19; isql -v OracleODBC-21"
+---------------------------------------+
| Connected! |
...
But when I try to query Oracle from ClickHouse:
host :) select * from odbc('DSN=OracleODBC-21','sys','test');
SELECT *
FROM odbc('DSN=OracleODBC-21', 'sys', 'test')
Query id: d263cc54-bd51-4a97-94c0-085177149947
Elapsed: 9.529 sec.
Received exception from server (version 25.6.2):
Code: 86. DB::Exception: Received from localhost:9000. DB::HTTPException. DB::HTTPException: Received error from remote server http://127.0.0.1:9018/columns_info?use_connection_pooling=1&version=1&connection_string=DSN%3DOracleODBC-21&schema=sys&table=test&external_table_functions_use_nulls=1 . HTTP status code: 500 'Internal Server Error', body length: 267 bytes, body: 'Error getting columns from ODBC 'std::exception. Code: 1001, type: nanodbc::database_error, e.what() = contrib/nanodbc/nanodbc/nanodbc.cpp:1275: IM004: [unixODBC][Driver Manager]Driver's SQLAllocHandle on SQL_HANDLE_HENV failed (version 25.1.5.31 (official build))'
'. (RECEIVED_ERROR_FROM_REMOTE_IO_SERVER)
Will be grateful for any advice.
pashkin5000
(101 rep)
Jul 22, 2025, 05:58 AM
• Last activity: Jul 22, 2025, 06:14 PM
2 votes • 1 answer • 763 views
Correlated subqueries. Count Visits after Last Purchase Date
I'm pretty new to SQL and have been trying to solve this task for a while...still no luck. I would appreciate if someone here could help me out.
I have a database with columns:
- ClientID
- VisitID
- Date
- PurchaseID (array)
- etc.
What I'm trying to achieve is to retrieve a list containing the following data:
- ClientID
- Last Visit Date
- First Visit Date
- Last Purchase Date
- Visits Count
- Purchases Count
- Visits After Last Purchase Count
Retrieving a value for Visits After Last Purchase Count is where I am stuck.
SELECT
ClientID,
FirstVisit,
LastVisit,
LastPurchaseDate,
Visits,
Purchases,
VisitsAfterPurchase
FROM
(
SELECT
h.ClientID,
max(h.Date) AS LastVisit,
min(h.Date) AS FirstVisit,
count(VisitID) AS Visits
FROM s7_visits AS h
WHERE Date > '2017-12-01'
GROUP BY h.ClientID
LIMIT 100
)
ANY LEFT JOIN
(
SELECT
d.ClientID,
max(d.Date) AS LastPurchaseDate,
sum(length(d.PurchaseID)) AS Purchases,
sum(
(
SELECT count(x.VisitID)
FROM s7_visits AS x
WHERE x.ClientID = d.ClientID
HAVING x.Date >= max(d.Date)
)) AS VisitsAfterPurchase
FROM s7_visits AS d
WHERE (length(PurchaseID) > 0) AND (Date > '2017-12-01')
GROUP BY d.ClientID
) USING (ClientID)
The database system I'm using is Yandex ClickHouse .
The USING syntax is absolutely normal for ClickHouse; it is used instead of the ON clause in other RDBMSs.
This query is giving me the following error:
> DB::Exception: Column Date is not under aggregate function and not in GROUP BY..
Sample Data:
+----------+---------+------------+------------+
| ClientID | VisitID | Date       | PurchaseID |
+----------+---------+------------+------------+
| 123      | 136     | 01.12.2017 |            |
| 123      | 522     | 05.12.2017 |            |
| 123      | 883     | 08.12.2017 |            |
| 123      | 293     | 09.12.2017 | ['345']    |
| 123      | 278     | 12.12.2017 |            |
| 123      | 508     | 12.12.2017 |            |
| 123      | 562     | 15.12.2017 |            |
| 123      | 523     | 21.12.2017 |            |
| 456      | 736     | 29.11.2017 |            |
| 456      | 417     | 03.12.2017 |            |
| 456      | 950     | 04.12.2017 |            |
| 456      | 532     | 05.12.2017 | ['346']    |
| 456      | 880     | 09.12.2017 |            |
| 456      | 296     | 12.12.2017 |            |
| 456      | 614     | 15.12.2017 |            |
+----------+---------+------------+------------+
And the result should be:
+----------+-----------------+------------------+--------------------+--------------+-----------------+----------------------------------+
| ClientID | Last Visit Date | First Visit Date | Last Purchase Date | Visits Count | Purchases Count | Visits After Last Purchase Count |
+----------+-----------------+------------------+--------------------+--------------+-----------------+----------------------------------+
| 123      | 21.12.2017      | 01.12.2017       | 09.12.2017         | 8            | 1               | 4                                |
| 456      | 15.12.2017      | 29.11.2017       | 05.12.2017         | 7            | 1               | 3                                |
+----------+-----------------+------------------+--------------------+--------------+-----------------+----------------------------------+
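For what it's worth, a sketch of one way to express this without the correlated subquery on a reasonably recent ClickHouse: compute each client's last purchase date once, join it back onto the visits, and finish with countIf in a single aggregation (column and table names follow the question; the date filter is left out, and clients with no purchase get the default date, so all of their visits count as "after purchase"):
```sql
SELECT
    ClientID,
    min(Date)                        AS FirstVisit,
    max(Date)                        AS LastVisit,
    any(LastPurchaseDate)            AS LastPurchase,
    count()                          AS Visits,
    sum(length(PurchaseID))          AS Purchases,
    countIf(Date > LastPurchaseDate) AS VisitsAfterPurchase
FROM s7_visits
ANY LEFT JOIN
(
    -- one row per client: the date of the last visit that contained a purchase
    SELECT ClientID, max(Date) AS LastPurchaseDate
    FROM s7_visits
    WHERE length(PurchaseID) > 0
    GROUP BY ClientID
) USING (ClientID)
GROUP BY ClientID
```
On the sample data above this yields 4 and 3 for VisitsAfterPurchase for clients 123 and 456, matching the expected result.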
Edgard Gomez Sennovskaya
(21 rep)
Dec 25, 2017, 07:40 PM
• Last activity: Apr 24, 2025, 06:02 AM
0 votes • 0 answers • 23 views
PeerDB Initial Snapshot Performance Impact on Standby PostgreSQL
I have set up a Change Data Capture (CDC) pipeline using PeerDB to mirror tables from a PostgreSQL standby read replica to ClickHouse.
• The PostgreSQL database contains terabytes of data.
• The initial snapshot of the existing data needs to be loaded into ClickHouse.
• PeerDB is configured to pull from the standby read replica.
Questions:
1. How long will the initial snapshot take? Are there any benchmarks or estimations based on database size?
2. Will the initial snapshot affect the standby PostgreSQL server’s performance?
• Since it is a read replica, will PeerDB’s snapshot queries (e.g., COPY, SELECT * FROM) put significant load on it?
• Would it impact replication lag from the primary database?
3. Are there any best practices to optimize the initial snapshot process to minimize impact on the standby server?
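Not an answer to the sizing question, but a quick way to watch the effects asked about in point 2 while the snapshot runs, using only built-in PostgreSQL functions and views (nothing PeerDB-specific):
```sql
-- On the standby: how far behind WAL replay is (grows if snapshot reads slow the replica down)
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;

-- On the primary (PostgreSQL 10+): per-standby replication lag
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```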
Tselmen Tugsbayar
(1 rep)
Mar 17, 2025, 01:43 AM
• Last activity: Mar 17, 2025, 06:12 AM
3 votes • 2 answers • 334 views
More efficient accumulator in SQL?
I'm writing a ledger system where every transaction can have multiple classifications. For example, if someone purchases a widget for $50, I can categorize that transaction as having an account of "Revenue" and an SKU as "SKU1".
Users can then select the dimensions they wish to report on, and I can generate aggregates.
When my database has 10M+ transactions, the following query is prohibitively slow. After about 10s I receive a Memory limit exceeded error on my 8GB laptop.
Thus the question: I don't actually care about the individual rows, I only care about the accumulation of these values. In my test, I only expect about 10 rows returned after aggregation.
Here is a fiddle: http://sqlfiddle.com/#!17/4a7d8/10/0
select
year,
sum(amount),
t1.value as account,
t2.value as sku
from
transactions
left join
tags t1 on transactions.id = t1.transaction_id and t1.name ='account'
left join
tags t2 on transactions.id = t2.transaction_id and t2.name = 'sku'
group by
year,
t1.value,
t2.value;
Here is the query plan:
Expression ((Projection + Before ORDER BY))
Aggregating
Expression (Before GROUP BY)
Join (JOIN)
Expression ((Before JOIN + (Projection + Before ORDER BY)))
Join (JOIN)
Expression (Before JOIN)
ReadFromMergeTree (default.transactions)
Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
ReadFromMergeTree (default.tags)
Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
ReadFromMergeTree (default.tags)
And, finally, here is the schema:
CREATE TABLE default.transactions
(
    id Int32,
    date Date,
    amount Float32
)
ENGINE = MergeTree
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192

CREATE TABLE default.tags
(
    transaction_id Int32,
    name String,
    value String,
    INDEX idx_tag_value value TYPE set(0) GRANULARITY 4,
    INDEX idx_tag_name name TYPE set(0) GRANULARITY 4
)
ENGINE = MergeTree
PRIMARY KEY (transaction_id, name)
ORDER BY (transaction_id, name)
SETTINGS index_granularity = 8192
My questions are:
- Is there a different schema, or different set of Clickhouse features I might use?
- Should I instead pre-compute aggregates?
- Is there a different DB which can perform this kind of calculation more efficiently?
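On the pre-computation question, a minimal sketch of one ClickHouse-native option, assuming the two dimensions can be written onto the fact row at insert time (the flattened table and the year column are illustrative, not part of the schema above): maintain a rollup with a materialized view so reports never join the tags table at query time.
```sql
-- Fact table with the dimensions denormalized at insert time
CREATE TABLE transactions_flat
(
    year    UInt16,
    account LowCardinality(String),
    sku     LowCardinality(String),
    amount  Float32
)
ENGINE = MergeTree
ORDER BY (year, account, sku);

-- Incrementally maintained rollup; rows with the same key are summed during merges
CREATE MATERIALIZED VIEW transactions_by_dim
ENGINE = SummingMergeTree
ORDER BY (year, account, sku)
AS SELECT year, account, sku, sum(amount) AS amount
FROM transactions_flat
GROUP BY year, account, sku;

-- Report query: a final aggregation over an already tiny table
SELECT year, account, sku, sum(amount) AS amount
FROM transactions_by_dim
GROUP BY year, account, sku;
```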
poundifdef
(141 rep)
Jun 10, 2022, 01:15 PM
• Last activity: Mar 11, 2025, 05:02 AM
0 votes • 0 answers • 242 views
Calculate the sum of minutes between statuses Clickhouse
There is a table in ClickHouse that is constantly updated, format:
date_time | shop_id | item_id | status | balance
---------------------------------------------------------------
2022-09-09 13:00:01 | abc | 1234 | 0 | 0
2022-09-09 13:00:00 | abc | 1234 | 1 | 3
2022-09-09 12:50:00 | abc | 1234 | 1 | 10
The table stores statuses and balances for each item_id; when the balance changes, a new record with the status, time and balance is added. If the balance = 0, the status changes to 0.
I need to calculate how much time (how many minutes) each item_id in the shop was available during the day. The status may change several times a day.
Please help me calculate this.
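A hedged sketch of one way to do it with window functions (ClickHouse 21.x or later), assuming the table is called stock_log: pair every row with the next status change of the same item and sum the lengths of the intervals that start in status 1. The interval still open at the end of the day, and intervals crossing midnight, are ignored here for brevity.
```sql
SELECT
    shop_id,
    item_id,
    toDate(date_time) AS day,
    sumIf(dateDiff('minute', date_time, next_time), status = 1) AS minutes_available
FROM
(
    SELECT
        shop_id,
        item_id,
        date_time,
        status,
        -- timestamp of the next change for the same shop/item
        leadInFrame(date_time) OVER (
            PARTITION BY shop_id, item_id
            ORDER BY date_time
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
        ) AS next_time
    FROM stock_log
)
WHERE next_time > date_time        -- drops the last (open) interval per item
GROUP BY shop_id, item_id, day
ORDER BY shop_id, item_id, day
```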
Kirill_K
(1 rep)
Sep 12, 2022, 09:59 AM
• Last activity: Sep 3, 2024, 11:17 AM
-1 votes • 1 answer • 606 views
How to create pre-computed tables in order to speed up the query speed
One of the issues that I am encountering at present is that we have certain very large tables (>10 million rows). When we reference these large tables or create joins, the query speed is extremely slow.
One hypothesis for solving the issue is to create pre-computed tables, where the computation for the use cases is done in advance, and instead of referencing the raw data we query the pre-computed table instead.
Are there any resources on how to implement this? Do we only use MySQL, or can we also use Pandas or other such modules to accomplish the same?
Which is the optimal way?
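A minimal sketch of the pattern in plain SQL (table and column names are made up; the refresh can be driven by cron, a MySQL event, or a Pandas/ETL job — the principle is the same either way):
```sql
-- Pre-computed rollup, rebuilt or incrementally refreshed on a schedule
CREATE TABLE daily_sales_summary AS
SELECT order_date, product_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM orders
GROUP BY order_date, product_id;

-- Reports read the small summary instead of the 10M+ row fact table
SELECT order_date, SUM(total_amount) AS revenue
FROM daily_sales_summary
GROUP BY order_date;
```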
databasequestion
(1 rep)
Sep 7, 2022, 01:49 PM
• Last activity: Sep 7, 2022, 05:30 PM
-1 votes • 1 answer • 228 views
ClickHouse MV is not working perfectly as I need
I’m new to ClickHouse and having an issue with a materialized view (MV). I have a records table which is the data source; I’m inserting all the data there. Then I created another table called adv_general_report using the mv_adv_general_report materialized view.
This is my schema, and here is the records table data.
The odd part is that after inserting data into the records table, the sum of impressions is correctly added to both adv_general_report and the mv_adv_general_report materialized view, but views and clicks always show zero.
You can see it by running this query, which shows the number of views:
SELECT sum(views) AS views FROM records;
But if you run this:
SELECT sum(views) AS views FROM adv_general_report;
it is 0. Also, the SELECT query used for the materialized view shows the sum of views correctly. Any idea why?
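A common cause of this symptom (offered as a guess, since the schema is only linked): a materialized view with a TO target inserts by column name, so any target column that the view's SELECT does not produce under exactly that name is filled with its default, i.e. 0. A minimal sketch of the pattern with the names matched up — the table and column names here are illustrative, not taken from the linked schema:
```sql
CREATE TABLE adv_general_report_example
(
    event_date  Date,
    impressions UInt64,
    views       UInt64,
    clicks      UInt64
)
ENGINE = SummingMergeTree
ORDER BY event_date;

CREATE MATERIALIZED VIEW mv_adv_general_report_example
TO adv_general_report_example
AS SELECT
    toDate(created_at)                 AS event_date,
    countIf(event_type = 'impression') AS impressions,
    countIf(event_type = 'view')       AS views,   -- alias must equal the target column name
    countIf(event_type = 'click')      AS clicks
FROM records
GROUP BY event_date;
```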


Aniruddha Chakraborty
(101 rep)
Aug 30, 2021, 02:39 PM
• Last activity: Aug 31, 2021, 08:44 PM
1 vote • 1 answer • 649 views
How to backup clickhouse over SSH?
In PostgreSQL, I usually run this command to back up and compress (since my country has really low bandwidth) from server to local:
mkdir -p tmp/backup
ssh sshuser@dbserver -p 22 "cd /tmp; pg_dump -U dbuser -Fc -C dbname | xz - -c" \
  | pv -r -b > tmp/backup/db_backup_`date +%Y-%m-%d_%H%M%S`.sql.xz
and to restore:
fname=`ls -w 1 tmp/backup/*sql.xz | tail -n 1`
echo $fname
echo "select 'drop table \"' || tablename || '\" cascade;' from pg_tables WHERE schemaname = 'public';" |
  psql -U dbuser |
  tail -n +3 |
  head -n 2 |
  psql -U dbuser
# sudo -u postgres dropdb dbname
# sudo -u postgres createdb --owner dbuser dbname
xzcat $fname | pg_restore --clean --if-exists --no-acl --no-owner -U dbuser -d dbname
How to do similar thing in Clickhouse (backup, compress on the fly, compress to a file)?
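Not the same streaming approach, but as a hedged aside: ClickHouse 22.8 and later ship a native BACKUP/RESTORE statement that writes a zip-compressed archive, which can then be copied over SSH with scp/rsync. It assumes a local disk named backups is declared and allowed for backups in the server configuration:
```sql
-- On the server: write a compressed archive of the database
BACKUP DATABASE dbname TO Disk('backups', 'dbname_2021-07-29.zip');

-- On the target machine, after copying the archive into its backups disk
RESTORE DATABASE dbname FROM Disk('backups', 'dbname_2021-07-29.zip');
```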
Kokizzu
(1403 rep)
Jul 29, 2021, 11:48 AM
• Last activity: Aug 30, 2021, 01:58 PM
0 votes • 0 answers • 53 views
How do I design a schema with a proper DB engine to accumulate data for this need in ClickHouse or any other database?
We're a new Adtech company and I was planning to design a database where I'll pull all the data into a single table and then build new tables with materialized views for others to generate multiple reports. Say we have inventory, impressions, and views for multiple reasons.


Aniruddha Chakraborty
(101 rep)
Aug 24, 2021, 10:20 AM
• Last activity: Aug 24, 2021, 10:48 AM
1 vote • 1 answer • 3083 views
Clickhouse OPTIMIZE performance for deduplication
I want to try and understand the performance of the OPTIMIZE query in ClickHouse.
I am planning on using it to remove duplicates right after a bulk insert into a MergeTree, hence I have the options of:
OPTIMIZE TABLE db.table DEDUPLICATE
or
OPTIMIZE TABLE db.table FINAL DEDUPLICATE
I understand that the first statement only deduplicates the insert if it hasn't already been merged, whereas the second will do it to the whole table. However, I am concerned about performance; from a dirty analysis of OPTIMIZE TABLE db.table FINAL DEDUPLICATE on different-sized tables I can see it getting exponentially worse as the table gets bigger (0.1s for 0.1M rows, 1s for 0.3M rows, 12s for 10M rows). I am assuming OPTIMIZE TABLE db.table DEDUPLICATE depends instead on the insert size and table size, so it should be more performant?
Can anyone point to some literature on these performances?
In addition, do these problems go away if I replace the table with a ReplacingMergeTree? I imagine the same process will happen under the hood, so it doesn't matter either way.
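Not literature, but a sketch of the two usual ways to keep the deduplication work bounded on a recent ClickHouse (table, partition and column names are placeholders): scope OPTIMIZE to the partition that just received the bulk insert, or let a ReplacingMergeTree deduplicate by sorting key during background merges and collapse leftovers at read time with FINAL.
```sql
-- Rewrite only the partition the bulk insert landed in
OPTIMIZE TABLE db.table PARTITION '2021-08' FINAL DEDUPLICATE;

-- Or let the engine deduplicate rows with the same sorting key on merge
CREATE TABLE db.table_dedup
(
    id  UInt64,
    ts  DateTime,
    val Float64
)
ENGINE = ReplacingMergeTree
ORDER BY id;

-- Reads that must not see not-yet-merged duplicates
SELECT * FROM db.table_dedup FINAL;
```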
AmyChodorowski
(113 rep)
Aug 19, 2021, 11:47 AM
• Last activity: Aug 23, 2021, 04:40 AM
1 vote • 1 answer • 1690 views
Mounting Clickhouse data directory to another partition: DB::Exception: Settings profile `default` not found
I'm trying to move the ClickHouse data directory to another partition, /dev/sdb1. So here's what I've done:
sudo systemctl stop clickhouse-server
mv /var/lib/clickhouse /var/lib/clickhouse-orig
mkdir /var/lib/clickhouse
chown clickhouse:clickhouse /var/lib/clickhouse
mount -o user /dev/sdb1 /var/lib/clickhouse
cp -Rv /var/lib/clickhouse-orig/* /var/lib/clickhouse/
chown -Rv clickhouse:clickhouse /var/lib/clickhouse
sudo systemctl start clickhouse-server
but it shows an error when starting:
Processing configuration file '/etc/clickhouse-server/config.xml'.
Sending crash reports is disabled
Starting ClickHouse 21.6.4.26 with revision 54451, build id: 12B138DBA4B3F1480CE8AA18884EA895F9EAD439, PID 10431
starting up
OS Name = Linux, OS Version = 5.4.0-1044-gcp, OS Architecture = x86_64
Calculated checksum of the binary: 26864E69BE34BA2FCCE2BD900CF631D4, integrity check passed.
Setting max_server_memory_usage was set to 882.18 MiB (980.20 MiB available * 0.90 max_server_memory_usage_to_ram_ratio)
DB::Exception: Settings profile `default` not found
shutting down
Stop SignalListener thread
**EDIT**: apparently it doesn't start even without the new partition, so probably the config.xml or the macro.xml is the culprit.
Kokizzu
(1403 rep)
Jun 15, 2021, 07:50 AM
• Last activity: Jun 15, 2021, 08:35 AM
1 vote • 1 answer • 1988 views
Clickhouse Replication without Sharding
How to make replication (1 master, 2 slaves for example) in ClickHouse without sharding?
All the examples I can find always include sharding:
- [Altinity Presentation](https://www.slideshare.net/Altinity/introduction-to-the-mysteries-of-clickhouse-replication-by-robert-hodges-and-altinity-engineering-team)
- [Docker Compose Example](https://github.com/abraithwaite/clickhouse-replication-example/blob/master/docker-compose.yaml)
- [ProgrammerSought Blog](https://www.programmersought.com/article/9452156798/)
- [QuidQuid Blog](http://blog.quidquid.fr/2020/06/clickhouse-multi-master-replication/)
- [FatalErrors Blog](https://www.fatalerrors.org/a/clickhouse-replicas-and-shards.html)
- [zergon321 article on dev.to](https://dev.to/zergon321/creating-a-clickhouse-cluster-part-i-sharding-4j20)
- [Clickhouse issue 2161](https://github.com/ClickHouse/ClickHouse/issues/2161) but no example
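For what it's worth, a minimal sketch of the usual single-shard setup: the remote_servers section of the server config declares one <shard> that contains all replicas (the cluster name my_cluster and the table are placeholders), and the table uses ReplicatedMergeTree so each replica holds a full copy — no Distributed table is needed:
```sql
-- Run once; ON CLUSTER creates the table on every replica.
-- {shard} and {replica} come from the macros section of each server's config.
CREATE TABLE db.events ON CLUSTER my_cluster
(
    id UInt64,
    ts DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY id;
```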
Kokizzu
(1403 rep)
May 31, 2021, 08:52 AM
• Last activity: May 31, 2021, 09:44 AM
0 votes • 1 answer • 117 views
Is Azure Managed Disks enough to ensure high-durability for a database?
I want to set up a database in a high durability set-up on Azure. I've previously relied on DB-as-a-service offerings, but can't do that in this case, so I'd like your feedback on the plan below. Is this enough to ensure reliable storage of data?
1) An Azure Web App takes in metric data from the web, does some minor processing and sampling, and sends the data in batches to VM2.
2) VM2 runs the Clickhouse database, and stores data on an Azure Managed Disk
3) Some periodical job takes snapshots of the disk using Clickhouse built-in backup functionality and stores them to cold storage
The periodical backup is meant to mitigate human error, i.e. accidentally running "DROP TABLE xx" on the wrong data.
The big question is if managed disks are an acceptable substitute for database replication, to ensure data durability. Azure Managed Disks are advertised as being very durable forms of storage, with built in triple-redundant replication. They are advertised as good for database use. It seems that this should be enough to take away any concerns of data loss due to hardware failure. Is this correct? Do you see any potential problems with this?
The recovery plan is that if VM2 fails, some monitoring process catches this and spins up a new VM2 instance attached to the same managed disk. The Web App similarly restarts if it fails.
I understand that this setup isn't high-availability, if a VM fails there will be some window of time before it is able to store new data. This is acceptable to me. But I want to ensure that data that gets stored will not be lost, i.e. is durably stored with very high probability. Is this enough to ensure that? Do you see any problems?
ServableSoup
(3 rep)
Apr 5, 2021, 11:50 AM
• Last activity: Apr 5, 2021, 12:27 PM
2 votes • 1 answer • 151 views
In what cases is using ClickHouseDb and the like a necessity?
An open source project for website analytics - https://github.com/plausible/analytics
They use PostgreSQL and ClickHouseDb.
When it comes to web analytics, there are tons of events that need to be tracked. From the point of view of the database, why is using ClickHouseDb in this project a necessity? Why wouldn't PostgreSQL, which is a relational database, alone do?
Yes, ClickHouseDb has been created specifically for analytical processing. But still, why wouldn't PostgreSQL **alone** do? Are PostgreSQL, MySQL and the like incapable of handling lots of inserts that occur simultaneously?
kosmosu05
(23 rep)
Aug 22, 2020, 05:02 AM
• Last activity: Aug 22, 2020, 02:07 PM
0 votes • 1 answer • 2120 views
Clickhouse create database structure for json data
I'm new to ClickHouse and stuck on the database creation structure for importing JSON data which is nested.
Take for example JSON data that looks like the following when there is data populated:
"FirewallMatchesActions": [
"allow"
],
"FirewallMatchesRuleIDs": [
"1234abc"
],
"FirewallMatchesSources": [
"firewallRules"
],
or
"FirewallMatchesActions": [
"allow",
"block"
],
"FirewallMatchesRuleIDs": [
"1234abc",
"1235abb"
],
"FirewallMatchesSources": [
"firewallRules"
],
but there may be JSON data which doesn't have them populated:
"FirewallMatchesActions": [],
"FirewallMatchesRuleIDs": [],
"FirewallMatchesSources": [],
What would the ClickHouse CREATE TABLE structure look like?
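A minimal sketch, assuming these fields arrive as JSON arrays of strings (the table name, the timestamp column and the import command are illustrative); Array(String) covers both the populated and the empty-array cases:
```sql
CREATE TABLE firewall_events
(
    event_time             DateTime,
    FirewallMatchesActions Array(String),
    FirewallMatchesRuleIDs Array(String),
    FirewallMatchesSources Array(String)
)
ENGINE = MergeTree
ORDER BY event_time;

-- Example import of newline-delimited JSON:
--   cat events.json | clickhouse-client --query "INSERT INTO firewall_events FORMAT JSONEachRow"
```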
p4guru
(296 rep)
Jun 12, 2020, 01:52 AM
• Last activity: Jun 12, 2020, 06:14 AM
3 votes • 3 answers • 283 views
Continuously move data from server A to Server B while deleting the data in server A?
I'm developing an ad server that is expected to handle ad impressions/billion clicks per day.
The most difficult challenge I am facing is moving data from one server to another.
Basically, the flow is like this:
1. Multiple front-facing load balancers distribute traffic (HTTP load balancing) to several servers called traffic handler nodes.
2. These traffic handler nodes' job is to store the click logs in a MySQL table (data like geo, device, offer id, user id etc.) and then redirect traffic to the offer landing page.
3. Every minute a cron job runs on all traffic nodes which transfers the click logs to the reporting server (the server where all report generation is done) in batches of 10000 rows per minute, and then deletes the data after confirming that it was successfully received by the reporting server. The reporting server uses the ClickHouse database engine.
I need to replace the MySQL database engine on the traffic nodes as I'm facing a lot of issues with MySQL. Between the heavy inserts and then the heavy deletes it's getting slow. Plus, the data is being transferred via cron job, so there is a 2-minute average delay.
I can't use ClickHouse on these servers as Yandex ClickHouse does not support updates and deletes yet, and the click logs are supposed to be updated many times (how many events happened on the visit).
I'm looking at Kafka but again I'm not sure how to achieve one-way data transfer and then deletion of data.
Maybe my whole approach is wrong. I would be very grateful for any expert to guide me in the right direction.
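Since Kafka was mentioned: as a hedged sketch, ClickHouse can consume a Kafka topic directly with the Kafka table engine plus a materialized view, which gives one-way transfer without any delete step on the producer side (broker, topic and column names below are placeholders):
```sql
-- Queue table: rows are read from Kafka and consumed by the materialized view below
CREATE TABLE click_logs_queue
(
    click_time DateTime,
    offer_id   UInt32,
    user_id    String,
    geo        String,
    device     String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'click_logs',
         kafka_group_name  = 'clickhouse_reporting',
         kafka_format      = 'JSONEachRow';

-- Durable storage on the reporting server
CREATE TABLE click_logs
(
    click_time DateTime,
    offer_id   UInt32,
    user_id    String,
    geo        String,
    device     String
)
ENGINE = MergeTree
ORDER BY (offer_id, click_time);

-- Continuously moves rows from the queue into storage as they arrive
CREATE MATERIALIZED VIEW click_logs_consumer TO click_logs
AS SELECT * FROM click_logs_queue;
```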
Sourabh Swarnkar
(33 rep)
Mar 19, 2018, 05:28 PM
• Last activity: Mar 21, 2018, 07:10 PM