Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0 votes • 1 answer • 963 views
How to check user settings in ClickHouse
I have a user created by SQL command. I know there is the `getSetting()` function to check user settings. e.g. : ```sql SELECT getSetting('async_insert'); ``` But how to check other users' settings if you are a DBA? Is there any view/function/request for this purpose?
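For reference, a minimal sketch of how a DBA might inspect another user's settings on a recent ClickHouse version, assuming access entities are managed via SQL (`system.settings_profile_elements` and `SHOW CREATE USER` are standard; the user name `some_user` is a placeholder):
```sql
-- Settings attached directly to a user created by SQL (or via a settings profile)
SELECT setting_name, value, min, max
FROM system.settings_profile_elements
WHERE user_name = 'some_user';

-- Full SQL definition of the user, including its SETTINGS clause
SHOW CREATE USER some_user;
```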
Mikhail Aksenov
(430 rep)
Dec 12, 2023, 02:20 PM
• Last activity: Aug 6, 2025, 09:06 AM
0 votes • 1 answer • 35 views
Clickhouse - Oracle ODBC connection error
I am trying to create a connection between my Oracle and ClickHouse databases, so I can query Oracle through ClickHouse like this:
SELECT * FROM odbc('DSN=OracleODBC-21', 'sys', 'test')
I have successfully installed unixODBC, Oracle Instant Client, and the Oracle ODBC driver for the client.
I also configured my .odbc.ini and .ini, so I can access Oracle:
[oracle@host ~]$ isql -v OracleODBC-21
+---------------------------------------+
| Connected! |
...
SQL> select * from sys.test;
+-----------------------------------------+-----------------------------------------------------------------------------------------------------+
| ID | DATA |
+-----------------------------------------+-----------------------------------------------------------------------------------------------------+
| 0 | 123 |
+-----------------------------------------+-----------------------------------------------------------------------------------------------------+
The clickhouse user can also do this, but with some environment variables set:
[oracle@host ~]$ sudo -u clickhouse bash -c "export LD_LIBRARY_PATH=/opt/oracle/instantclient_21_19; isql -v OracleODBC-21"
+---------------------------------------+
| Connected! |
...
But when I try to query Oracle from ClickHouse:
host :) select * from odbc('DSN=OracleODBC-21','sys','test');
SELECT *
FROM odbc('DSN=OracleODBC-21', 'sys', 'test')
Query id: d263cc54-bd51-4a97-94c0-085177149947
Elapsed: 9.529 sec.
Received exception from server (version 25.6.2):
Code: 86. DB::Exception: Received from localhost:9000. DB::HTTPException. DB::HTTPException: Received error from remote server http://127.0.0.1:9018/columns_info?use_connection_pooling=1&version=1&connection_string=DSN%3DOracleODBC-21&schema=sys&table=test&external_table_functions_use_nulls=1 . HTTP status code: 500 'Internal Server Error', body length: 267 bytes, body: 'Error getting columns from ODBC 'std::exception. Code: 1001, type: nanodbc::database_error, e.what() = contrib/nanodbc/nanodbc/nanodbc.cpp:1275: IM004: [unixODBC][Driver Manager]Driver's SQLAllocHandle on SQL_HANDLE_HENV failed (version 25.1.5.31 (official build))'
'. (RECEIVED_ERROR_FROM_REMOTE_IO_SERVER)
Will be grateful for any advice.
pashkin5000
(101 rep)
Jul 22, 2025, 05:58 AM
• Last activity: Jul 22, 2025, 06:14 PM
2 votes • 1 answer • 763 views
Correlated subqueries. Count Visits after Last Purchase Date
I'm pretty new to SQL and have been trying to solve this task for a while...still no luck. I would appreciate if someone here could help me out.
I have a database with columns:
- ClientID
- VisitID
- Date
- PurchaseID (array)
- etc.
What I'm trying to achieve is to retrieve a list containing the following data:
- ClientID
- Last Visit Date
- First Visit Date
- Last Purchase Date
- Visits Count
- Purchases Count
- Visits After Last Purchase Count
Retrieving a value for Visits After Last Purchase Count is where I am stuck.
SELECT
ClientID,
FirstVisit,
LastVisit,
LastPurchaseDate,
Visits,
Purchases,
VisitsAfterPurchase
FROM
(
SELECT
h.ClientID,
max(h.Date) AS LastVisit,
min(h.Date) AS FirstVisit,
count(VisitID) AS Visits
FROM s7_visits AS h
WHERE Date > '2017-12-01'
GROUP BY h.ClientID
LIMIT 100
)
ANY LEFT JOIN
(
SELECT
d.ClientID,
max(d.Date) AS LastPurchaseDate,
sum(length(d.PurchaseID)) AS Purchases,
sum(
(
SELECT count(x.VisitID)
FROM s7_visits AS x
WHERE x.ClientID = d.ClientID
HAVING x.Date >= max(d.Date)
)) AS VisitsAfterPurchase
FROM s7_visits AS d
WHERE (length(PurchaseID) > 0) AND (Date > '2017-12-01')
GROUP BY d.ClientID
) USING (ClientID)
The database system I'm using is Yandex ClickHouse .
The USING syntax is absolutely normal for ClickHouse; it is used instead of the ON clause in other RDBMSs.
This query is giving me the following error:
> DB::Exception: Column Date is not under aggregate function and not in GROUP BY..
Sample Data:
+----------+---------+------------+------------+
| ClientID | VisitID | Date       | PurchaseID |
+----------+---------+------------+------------+
| 123      | 136     | 01.12.2017 |            |
| 123      | 522     | 05.12.2017 |            |
| 123      | 883     | 08.12.2017 |            |
| 123      | 293     | 09.12.2017 | ['345']    |
| 123      | 278     | 12.12.2017 |            |
| 123      | 508     | 12.12.2017 |            |
| 123      | 562     | 15.12.2017 |            |
| 123      | 523     | 21.12.2017 |            |
| 456      | 736     | 29.11.2017 |            |
| 456      | 417     | 03.12.2017 |            |
| 456      | 950     | 04.12.2017 |            |
| 456      | 532     | 05.12.2017 | ['346']    |
| 456      | 880     | 09.12.2017 |            |
| 456      | 296     | 12.12.2017 |            |
| 456      | 614     | 15.12.2017 |            |
+----------+---------+------------+------------+
And the result should be:
+----------+-----------------+------------------+--------------------+--------------+-----------------+----------------------------------+
| ClientID | Last Visit Date | First Visit Date | Last Purchase Date | Visits Count | Purchases Count | Visits After Last Purchase Count |
+----------+-----------------+------------------+--------------------+--------------+-----------------+----------------------------------+
| 123      | 21.12.2017      | 01.12.2017       | 09.12.2017         | 8            | 1               | 4                                |
| 456      | 15.12.2017      | 29.11.2017       | 05.12.2017         | 7            | 1               | 3                                |
+----------+-----------------+------------------+--------------------+--------------+-----------------+----------------------------------+
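For what it's worth, a sketch of one way to express this without the correlated subquery on a reasonably recent ClickHouse: compute each client's last purchase date once, join it back onto the visits, and finish with countIf in a single aggregation (column and table names follow the question; the date filter is left out, and clients with no purchase get the default date, so all of their visits count as "after purchase"):
```sql
SELECT
    ClientID,
    min(Date)                        AS FirstVisit,
    max(Date)                        AS LastVisit,
    any(LastPurchaseDate)            AS LastPurchase,
    count()                          AS Visits,
    sum(length(PurchaseID))          AS Purchases,
    countIf(Date > LastPurchaseDate) AS VisitsAfterPurchase
FROM s7_visits
ANY LEFT JOIN
(
    -- one row per client: the date of the last visit that contained a purchase
    SELECT ClientID, max(Date) AS LastPurchaseDate
    FROM s7_visits
    WHERE length(PurchaseID) > 0
    GROUP BY ClientID
) USING (ClientID)
GROUP BY ClientID
```
On the sample data above this yields 4 and 3 for VisitsAfterPurchase for clients 123 and 456, matching the expected result.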
Edgard Gomez Sennovskaya
(21 rep)
Dec 25, 2017, 07:40 PM
• Last activity: Apr 24, 2025, 06:02 AM
0 votes • 0 answers • 23 views
PeerDB Initial Snapshot Performance Impact on Standby PostgreSQL
I have set up a Change Data Capture (CDC) pipeline using PeerDB to mirror tables from a PostgreSQL standby read replica to ClickHouse.
• The PostgreSQL database contains terabytes of data.
• The initial snapshot of the existing data needs to be loaded into ClickHouse.
• PeerDB is configured to pull from the standby read replica.
Questions:
1. How long will the initial snapshot take? Are there any benchmarks or estimations based on database size?
2. Will the initial snapshot affect the standby PostgreSQL server’s performance?
• Since it is a read replica, will PeerDB’s snapshot queries (e.g., COPY, SELECT * FROM) put significant load on it?
• Would it impact replication lag from the primary database?
3. Are there any best practices to optimize the initial snapshot process to minimize impact on the standby server?
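Not an answer to the sizing question, but a quick way to watch the effects asked about in point 2 while the snapshot runs, using only built-in PostgreSQL functions and views (nothing PeerDB-specific):
```sql
-- On the standby: how far behind WAL replay is (grows if snapshot reads slow the replica down)
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;

-- On the primary (PostgreSQL 10+): per-standby replication lag
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```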
Tselmen Tugsbayar
(1 rep)
Mar 17, 2025, 01:43 AM
• Last activity: Mar 17, 2025, 06:12 AM
3 votes • 2 answers • 334 views
More efficient accumulator in SQL?
I'm writing a ledger system where every transaction can have multiple classifications. For example, if someone purchases a widget for $50, I can categorize that transaction as having an account of "Revenue" and an SKU as "SKU1".
Users can then select the dimensions they wish to report on, and I can generate aggregates.
When my database has 10M+ transactions, the following query is prohibitively slow. After about 10s I receive a Memory limit exceeded error on my 8GB laptop.
Thus the question: I don't actually care about the individual rows, I only care about the accumulation of these values. In my test, I only expect about 10 rows returned after aggregation.
Here is a fiddle: http://sqlfiddle.com/#!17/4a7d8/10/0
select
year,
sum(amount),
t1.value as account,
t2.value as sku
from
transactions
left join
tags t1 on transactions.id = t1.transaction_id and t1.name ='account'
left join
tags t2 on transactions.id = t2.transaction_id and t2.name = 'sku'
group by
year,
t1.value,
t2.value;
Here is the query plan:
Expression ((Projection + Before ORDER BY))
Aggregating
Expression (Before GROUP BY)
Join (JOIN)
Expression ((Before JOIN + (Projection + Before ORDER BY)))
Join (JOIN)
Expression (Before JOIN)
ReadFromMergeTree (default.transactions)
Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
ReadFromMergeTree (default.tags)
Expression ((Joined actions + (Rename joined columns + (Projection + Before ORDER BY))))
ReadFromMergeTree (default.tags)
And, finally, here is the schema:
CREATE TABLE default.transactions
(
    id Int32,
    date Date,
    amount Float32
)
ENGINE = MergeTree
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192

CREATE TABLE default.tags
(
    transaction_id Int32,
    name String,
    value String,
    INDEX idx_tag_value value TYPE set(0) GRANULARITY 4,
    INDEX idx_tag_name name TYPE set(0) GRANULARITY 4
)
ENGINE = MergeTree
PRIMARY KEY (transaction_id, name)
ORDER BY (transaction_id, name)
SETTINGS index_granularity = 8192
My questions are:
- Is there a different schema, or different set of Clickhouse features I might use?
- Should I instead pre-compute aggregates?
- Is there a different DB which can perform this kind of calculation more efficiently?
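On the pre-computation question, a minimal sketch of one ClickHouse-native option, assuming the two dimensions can be written onto the fact row at insert time (the flattened table and the year column are illustrative, not part of the schema above): maintain a rollup with a materialized view so reports never join the tags table at query time.
```sql
-- Fact table with the dimensions denormalized at insert time
CREATE TABLE transactions_flat
(
    year    UInt16,
    account LowCardinality(String),
    sku     LowCardinality(String),
    amount  Float32
)
ENGINE = MergeTree
ORDER BY (year, account, sku);

-- Incrementally maintained rollup; rows with the same key are summed during merges
CREATE MATERIALIZED VIEW transactions_by_dim
ENGINE = SummingMergeTree
ORDER BY (year, account, sku)
AS SELECT year, account, sku, sum(amount) AS amount
FROM transactions_flat
GROUP BY year, account, sku;

-- Report query: a final aggregation over an already tiny table
SELECT year, account, sku, sum(amount) AS amount
FROM transactions_by_dim
GROUP BY year, account, sku;
```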
poundifdef
(141 rep)
Jun 10, 2022, 01:15 PM
• Last activity: Mar 11, 2025, 05:02 AM
0 votes • 0 answers • 242 views
Calculate the sum of minutes between statuses Clickhouse
There is a table in ClickHouse that is constantly updated, format:
date_time | shop_id | item_id | status | balance
---------------------------------------------------------------
2022-09-09 13:00:01 | abc | 1234 | 0 | 0
2022-09-09 13:00:00 | abc | 1234 | 1 | 3
2022-09-09 12:50:00 | abc | 1234 | 1 | 10
The table stores statuses and balances for each item_id; when the balance changes, a new record with the status, time and balance is added. If the balance = 0, the status changes to 0.
I need to calculate how much time (how many minutes) each item_id in the shop was available during the day. The status may change several times a day.
Please help me calculate this.
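A hedged sketch of one way to do it with window functions (ClickHouse 21.x or later), assuming the table is called stock_log: pair every row with the next status change of the same item and sum the lengths of the intervals that start in status 1. The interval still open at the end of the day, and intervals crossing midnight, are ignored here for brevity.
```sql
SELECT
    shop_id,
    item_id,
    toDate(date_time) AS day,
    sumIf(dateDiff('minute', date_time, next_time), status = 1) AS minutes_available
FROM
(
    SELECT
        shop_id,
        item_id,
        date_time,
        status,
        -- timestamp of the next change for the same shop/item
        leadInFrame(date_time) OVER (
            PARTITION BY shop_id, item_id
            ORDER BY date_time
            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
        ) AS next_time
    FROM stock_log
)
WHERE next_time > date_time        -- drops the last (open) interval per item
GROUP BY shop_id, item_id, day
ORDER BY shop_id, item_id, day
```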
Kirill_K
(1 rep)
Sep 12, 2022, 09:59 AM
• Last activity: Sep 3, 2024, 11:17 AM
-1 votes • 1 answer • 606 views
How to create pre-computed tables in order to speed up the query speed
One of the issues that I am encountering at present is that we have certain very large tables (>10 million rows). When we reference these large tables or create joins, the query speed is extremely slow.
One hypothesis for solving the issue is to create pre-computed tables, where the computation for the use cases is done in advance, and instead of referencing the raw data we query the pre-computed table instead.
Are there any resources on how to implement this? Do we only use MySQL, or can we also use Pandas or other such modules to accomplish the same?
Which is the optimal way?
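A minimal sketch of the pattern in plain SQL (table and column names are made up; the refresh can be driven by cron, a MySQL event, or a Pandas/ETL job — the principle is the same either way):
```sql
-- Pre-computed rollup, rebuilt or incrementally refreshed on a schedule
CREATE TABLE daily_sales_summary AS
SELECT order_date, product_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM orders
GROUP BY order_date, product_id;

-- Reports read the small summary instead of the 10M+ row fact table
SELECT order_date, SUM(total_amount) AS revenue
FROM daily_sales_summary
GROUP BY order_date;
```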
databasequestion
(1 rep)
Sep 7, 2022, 01:49 PM
• Last activity: Sep 7, 2022, 05:30 PM
-1 votes • 1 answer • 228 views
ClickHouse MV is not working perfectly as I need
I’m new to ClickHouse and having an issue with a materialized view (MV). I have a records table which is the data source; I’m inserting all the data there. Then I created another table called adv_general_report using the mv_adv_general_report materialized view.
This is my schema, and here is the records table data.
The odd part is that after inserting data into the records table, the sum of impressions is correctly added to both adv_general_report and the mv_adv_general_report materialized view, but views and clicks always show zero.
You can see it by running this query, which shows the number of views:
SELECT sum(views) AS views FROM records;
But if you run this:
SELECT sum(views) AS views FROM adv_general_report;
it is 0. Also, the SELECT query used for the materialized view shows the sum of views correctly. Any idea why?
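A common cause of this symptom (offered as a guess, since the schema is only linked): a materialized view with a TO target inserts by column name, so any target column that the view's SELECT does not produce under exactly that name is filled with its default, i.e. 0. A minimal sketch of the pattern with the names matched up — the table and column names here are illustrative, not taken from the linked schema:
```sql
CREATE TABLE adv_general_report_example
(
    event_date  Date,
    impressions UInt64,
    views       UInt64,
    clicks      UInt64
)
ENGINE = SummingMergeTree
ORDER BY event_date;

CREATE MATERIALIZED VIEW mv_adv_general_report_example
TO adv_general_report_example
AS SELECT
    toDate(created_at)                 AS event_date,
    countIf(event_type = 'impression') AS impressions,
    countIf(event_type = 'view')       AS views,   -- alias must equal the target column name
    countIf(event_type = 'click')      AS clicks
FROM records
GROUP BY event_date;
```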


Aniruddha Chakraborty
(101 rep)
Aug 30, 2021, 02:39 PM
• Last activity: Aug 31, 2021, 08:44 PM
1 vote • 1 answer • 649 views
How to backup clickhouse over SSH?
In PostgreSQL, I usually run this command to back up and compress (since my country has really low bandwidth) from server to local:
mkdir -p tmp/backup
ssh sshuser@dbserver -p 22 "cd /tmp; pg_dump -U dbuser -Fc -C dbname | xz - -c" \
  | pv -r -b > tmp/backup/db_backup_`date +%Y-%m-%d_%H%M%S`.sql.xz
and to restore:
fname=`ls -w 1 tmp/backup/*sql.xz | tail -n 1`
echo $fname
echo "select 'drop table \"' || tablename || '\" cascade;' from pg_tables WHERE schemaname = 'public';" |
  psql -U dbuser |
  tail -n +3 |
  head -n 2 |
  psql -U dbuser
# sudo -u postgres dropdb dbname
# sudo -u postgres createdb --owner dbuser dbname
xzcat $fname | pg_restore --clean --if-exists --no-acl --no-owner -U dbuser -d dbname
How to do similar thing in Clickhouse (backup, compress on the fly, compress to a file)?
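Not the same streaming approach, but as a hedged aside: ClickHouse 22.8 and later ship a native BACKUP/RESTORE statement that writes a zip-compressed archive, which can then be copied over SSH with scp/rsync. It assumes a local disk named backups is declared and allowed for backups in the server configuration:
```sql
-- On the server: write a compressed archive of the database
BACKUP DATABASE dbname TO Disk('backups', 'dbname_2021-07-29.zip');

-- On the target machine, after copying the archive into its backups disk
RESTORE DATABASE dbname FROM Disk('backups', 'dbname_2021-07-29.zip');
```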
Kokizzu
(1403 rep)
Jul 29, 2021, 11:48 AM
• Last activity: Aug 30, 2021, 01:58 PM
0 votes • 0 answers • 53 views
How do I design a schema with a proper DB engine to accumulate data for this need in ClickHouse or any other database?
We're a new Adtech company and I was planning to design a database where I'll pull all the data into a single table and then build new tables with materialized views for others to generate multiple reports. Say we have inventory, impressions, and views for multiple reasons.


Aniruddha Chakraborty
(101 rep)
Aug 24, 2021, 10:20 AM
• Last activity: Aug 24, 2021, 10:48 AM
1 vote • 1 answer • 3083 views
Clickhouse OPTIMIZE performance for deduplication
I want to try and understand the performance of the OPTIMIZE query in ClickHouse.
I am planning on using it to remove duplicates right after a bulk insert into a MergeTree, hence I have the options of:
OPTIMIZE TABLE db.table DEDUPLICATE
or
OPTIMIZE TABLE db.table FINAL DEDUPLICATE
I understand that the first statement only deduplicates the insert if it hasn't already been merged, whereas the second will do it to the whole table. However, I am concerned about performance; from a dirty analysis of OPTIMIZE TABLE db.table FINAL DEDUPLICATE on different-sized tables I can see it getting exponentially worse as the table gets bigger (0.1s for 0.1M rows, 1s for 0.3M rows, 12s for 10M rows). I am assuming OPTIMIZE TABLE db.table DEDUPLICATE depends instead on the insert size and table size, so it should be more performant?
Can anyone point to some literature on these performances?
In addition, do these problems go away if I replace the table with a ReplacingMergeTree? I imagine the same process will happen under the hood, so it doesn't matter either way.
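Not literature, but a sketch of the two usual ways to keep the deduplication work bounded on a recent ClickHouse (table, partition and column names are placeholders): scope OPTIMIZE to the partition that just received the bulk insert, or let a ReplacingMergeTree deduplicate by sorting key during background merges and collapse leftovers at read time with FINAL.
```sql
-- Rewrite only the partition the bulk insert landed in
OPTIMIZE TABLE db.table PARTITION '2021-08' FINAL DEDUPLICATE;

-- Or let the engine deduplicate rows with the same sorting key on merge
CREATE TABLE db.table_dedup
(
    id  UInt64,
    ts  DateTime,
    val Float64
)
ENGINE = ReplacingMergeTree
ORDER BY id;

-- Reads that must not see not-yet-merged duplicates
SELECT * FROM db.table_dedup FINAL;
```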
AmyChodorowski
(113 rep)
Aug 19, 2021, 11:47 AM
• Last activity: Aug 23, 2021, 04:40 AM
1 vote • 1 answer • 1690 views
Mounting Clickhouse data directory to another partition: DB::Exception: Settings profile `default` not found
I'm trying to move the ClickHouse data directory to another partition, /dev/sdb1. So here's what I've done:
sudo systemctl stop clickhouse-server
mv /var/lib/clickhouse /var/lib/clickhouse-orig
mkdir /var/lib/clickhouse
chown clickhouse:clickhouse /var/lib/clickhouse
mount -o user /dev/sdb1 /var/lib/clickhouse
cp -Rv /var/lib/clickhouse-orig/* /var/lib/clickhouse/
chown -Rv clickhouse:clickhouse /var/lib/clickhouse
sudo systemctl start clickhouse-server
but it shows an error when starting:
Processing configuration file '/etc/clickhouse-server/config.xml'.
Sending crash reports is disabled
Starting ClickHouse 21.6.4.26 with revision 54451, build id: 12B138DBA4B3F1480CE8AA18884EA895F9EAD439, PID 10431
starting up
OS Name = Linux, OS Version = 5.4.0-1044-gcp, OS Architecture = x86_64
Calculated checksum of the binary: 26864E69BE34BA2FCCE2BD900CF631D4, integrity check passed.
Setting max_server_memory_usage was set to 882.18 MiB (980.20 MiB available * 0.90 max_server_memory_usage_to_ram_ratio)
DB::Exception: Settings profile `default` not found
shutting down
Stop SignalListener thread
**EDIT**: apparently it doesn't start even without the new partition, so probably the config.xml or the macro.xml is the culprit.
Kokizzu
(1403 rep)
Jun 15, 2021, 07:50 AM
• Last activity: Jun 15, 2021, 08:35 AM
1 vote • 1 answer • 1988 views
Clickhouse Replication without Sharding
How to make replication (1 master, 2 slaves for example) in ClickHouse without sharding?
All the examples I can find always include sharding:
- [Altinity Presentation](https://www.slideshare.net/Altinity/introduction-to-the-mysteries-of-clickhouse-replication-by-robert-hodges-and-altinity-engineering-team)
- [Docker Compose Example](https://github.com/abraithwaite/clickhouse-replication-example/blob/master/docker-compose.yaml)
- [ProgrammerSought Blog](https://www.programmersought.com/article/9452156798/)
- [QuidQuid Blog](http://blog.quidquid.fr/2020/06/clickhouse-multi-master-replication/)
- [FatalErrors Blog](https://www.fatalerrors.org/a/clickhouse-replicas-and-shards.html)
- [zergon321 article on dev.to](https://dev.to/zergon321/creating-a-clickhouse-cluster-part-i-sharding-4j20)
- [Clickhouse issue 2161](https://github.com/ClickHouse/ClickHouse/issues/2161) but no example
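For what it's worth, a minimal sketch of the usual single-shard setup: the remote_servers section of the server config declares one <shard> that contains all replicas (the cluster name my_cluster and the table are placeholders), and the table uses ReplicatedMergeTree so each replica holds a full copy — no Distributed table is needed:
```sql
-- Run once; ON CLUSTER creates the table on every replica.
-- {shard} and {replica} come from the macros section of each server's config.
CREATE TABLE db.events ON CLUSTER my_cluster
(
    id UInt64,
    ts DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY id;
```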
Kokizzu
(1403 rep)
May 31, 2021, 08:52 AM
• Last activity: May 31, 2021, 09:44 AM
0 votes • 1 answer • 117 views
Is Azure Managed Disks enough to ensure high-durability for a database?
I want to set up a database in a high durability set-up on Azure. I've previously relied on DB-as-a-service offerings, but can't do that in this case, so I'd like your feedback on the plan below. Is this enough to ensure reliable storage of data?
1) An Azure Web App takes in metric data from the web, does some minor processing and sampling, and sends the data in batches to VM2.
2) VM2 runs the Clickhouse database, and stores data on an Azure Managed Disk
3) Some periodical job takes snapshots of the disk using Clickhouse built-in backup functionality and stores them to cold storage
The periodical backup is meant to mitigate human error, i.e. accidentally running "DROP TABLE xx" on the wrong data.
The big question is if managed disks are an acceptable substitute for database replication, to ensure data durability. Azure Managed Disks are advertised as being very durable forms of storage, with built in triple-redundant replication. They are advertised as good for database use. It seems that this should be enough to take away any concerns of data loss due to hardware failure. Is this correct? Do you see any potential problems with this?
The recovery plan is that if VM2 fails, some monitoring process catches this and spins up a new VM2 instance attached to the same managed disk. The Web App similarly restarts if it fails.
I understand that this setup isn't high-availability, if a VM fails there will be some window of time before it is able to store new data. This is acceptable to me. But I want to ensure that data that gets stored will not be lost, i.e. is durably stored with very high probability. Is this enough to ensure that? Do you see any problems?
ServableSoup
(3 rep)
Apr 5, 2021, 11:50 AM
• Last activity: Apr 5, 2021, 12:27 PM
2 votes • 1 answer • 151 views
In what cases is using ClickHouseDb and the like a necessity?
An open source project for website analytics - https://github.com/plausible/analytics
They use PostgreSQL and ClickHouseDb.
When it comes to web analytics, there are tons of events that need to be tracked. From the point of view of the database, why is using ClickHouseDb in this project a necessity? Why wouldn't PostgreSQL, which is a relational database, alone do?
Yes, ClickHouseDb has been created specifically for analytical processing. But still, why wouldn't PostgreSQL **alone** do? Are PostgreSQL, MySQL and the like incapable of handling lots of inserts that occur simultaneously?
kosmosu05
(23 rep)
Aug 22, 2020, 05:02 AM
• Last activity: Aug 22, 2020, 02:07 PM
0 votes • 1 answer • 2120 views
Clickhouse create database structure for json data
I'm new to ClickHouse and stuck on the database creation structure for importing JSON data which is nested.
Take for example JSON data that looks like the following when there is data populated:
"FirewallMatchesActions": [
"allow"
],
"FirewallMatchesRuleIDs": [
"1234abc"
],
"FirewallMatchesSources": [
"firewallRules"
],
or
"FirewallMatchesActions": [
"allow",
"block"
],
"FirewallMatchesRuleIDs": [
"1234abc",
"1235abb"
],
"FirewallMatchesSources": [
"firewallRules"
],
but there may be JSON data which doesn't have them populated:
"FirewallMatchesActions": [],
"FirewallMatchesRuleIDs": [],
"FirewallMatchesSources": [],
What would the ClickHouse CREATE TABLE structure look like?
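A minimal sketch, assuming these fields arrive as JSON arrays of strings (the table name, the timestamp column and the import command are illustrative); Array(String) covers both the populated and the empty-array cases:
```sql
CREATE TABLE firewall_events
(
    event_time             DateTime,
    FirewallMatchesActions Array(String),
    FirewallMatchesRuleIDs Array(String),
    FirewallMatchesSources Array(String)
)
ENGINE = MergeTree
ORDER BY event_time;

-- Example import of newline-delimited JSON:
--   cat events.json | clickhouse-client --query "INSERT INTO firewall_events FORMAT JSONEachRow"
```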
p4guru
(296 rep)
Jun 12, 2020, 01:52 AM
• Last activity: Jun 12, 2020, 06:14 AM
3 votes • 3 answers • 283 views
Continuously move data from server A to Server B while deleting the data in server A?
I'm developing an ad server that is expected to handle ad impressions/billion clicks per day.
The most difficult challenge I am facing is moving data from one server to another.
Basically, the flow is like this:
1. Multiple front-facing load balancers distribute traffic (HTTP load balancing) to several servers called traffic handler nodes.
2. These traffic handler nodes' job is to store the click logs in a MySQL table (data like geo, device, offer id, user id etc.) and then redirect traffic to the offer landing page.
3. Every minute a cron job runs on all traffic nodes which transfers the click logs to the reporting server (the server where all report generation is done) in batches of 10000 rows per minute, and then deletes the data after confirming that it was successfully received by the reporting server. The reporting server uses the ClickHouse database engine.
I need to replace the MySQL database engine on the traffic nodes as I'm facing a lot of issues with MySQL. Between the heavy inserts and then the heavy deletes it's getting slow. Plus, the data is being transferred via cron job, so there is a 2-minute average delay.
I can't use ClickHouse on these servers as Yandex ClickHouse does not support updates and deletes yet, and the click logs are supposed to be updated many times (how many events happened on the visit).
I'm looking at Kafka but again I'm not sure how to achieve one-way data transfer and then deletion of data.
Maybe my whole approach is wrong. I would be very grateful for any expert to guide me in the right direction.
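Since Kafka was mentioned: as a hedged sketch, ClickHouse can consume a Kafka topic directly with the Kafka table engine plus a materialized view, which gives one-way transfer without any delete step on the producer side (broker, topic and column names below are placeholders):
```sql
-- Queue table: rows are read from Kafka and consumed by the materialized view below
CREATE TABLE click_logs_queue
(
    click_time DateTime,
    offer_id   UInt32,
    user_id    String,
    geo        String,
    device     String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'click_logs',
         kafka_group_name  = 'clickhouse_reporting',
         kafka_format      = 'JSONEachRow';

-- Durable storage on the reporting server
CREATE TABLE click_logs
(
    click_time DateTime,
    offer_id   UInt32,
    user_id    String,
    geo        String,
    device     String
)
ENGINE = MergeTree
ORDER BY (offer_id, click_time);

-- Continuously moves rows from the queue into storage as they arrive
CREATE MATERIALIZED VIEW click_logs_consumer TO click_logs
AS SELECT * FROM click_logs_queue;
```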
Sourabh Swarnkar
(33 rep)
Mar 19, 2018, 05:28 PM
• Last activity: Mar 21, 2018, 07:10 PM