Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0
votes
1
answer
202
views
With Cassandra running into issues with too many tombstones, does BigTable have similar anti-patterns?
I've been using Cassandra and have run into various problems with Tombstones¹. These caused detrimental issues when I later ran queries. For example, if I overwrite the same few rows over and over again, even though I still have 5 **valid** rows in my database, it can take minutes to read them because Cassandra has to read over all the Tombstones to finally reach the useful data.
What I'm wondering is whether Google Bigtable has a similar anti-pattern?
In my situation, I will be using Bigtable to do many writes, up to 10,000 a minute, and once an hour another app reads the data to update its caches with the latest state. Many of the writes are actually updates to existing rows (same key).
¹ _Tombstones: what's left behind when deleting or updating a row in a Cassandra database. These get removed from the database when a compaction of the table occurs, assuming it is given time to run said compaction._
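For the Bigtable side of the comparison: Bigtable keeps multiple timestamped versions of each cell, so overwriting the same row key repeatedly accumulates old cell versions until garbage collection (configured per column family) removes them at compaction time. A minimal sketch with the `google-cloud-bigtable` Python client (project, instance, table and family names are placeholders) that caps retention at one version so overwrites don't pile up:

```python
# Sketch only: assumes a Bigtable instance exists; IDs below are placeholders.
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")
table = instance.table("events")

# Keep at most one version per cell, so repeated overwrites of the same
# row key are garbage-collected instead of accumulating old versions.
if not table.exists():
    table.create(column_families={"stats": column_family.MaxVersionsGCRule(1)})

# 10,000 writes/minute, many to the same key: each commit creates a new
# cell version; the GC rule above bounds how many versions are retained.
row = table.direct_row("sensor#42")
row.set_cell("stats", "value", b"123")
row.commit()
```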
Alexis Wilke
(135 rep)
Feb 7, 2022, 11:17 PM
• Last activity: Aug 2, 2024, 05:37 AM
0
votes
1
answer
920
views
How does Instagram handle top likes when there are millions of likes for a particular post?
Under each Instagram post is a text like this:
> friendA, friendB and 32,312 others liked this post
Saving the count of likes in a column would be an easy way to get that number.
But the fact that Instagram shows results specific to each person's friends makes me wonder how they do that. Retrieving and processing such a many-to-many relationship of a user's follows against potentially millions of likes on a post sounds like a hugely resource-heavy operation that even Facebook might have trouble handling.
And it certainly happens server side for two reasons:
A) When there are millions of likes, it would take minutes before all the like data was downloaded by the user.
B) You can see that Instagram returns the top likers from the backend. For example, for a post on Justin Bieber's page there is a request to an Instagram API endpoint which returns an object including:
like_count 3213371
has_liked false
top_likers [ "friendA" ]
So my question is: what tricks do they use to provide this functionality at scale?
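There is no public confirmation of how Instagram implements this, but a common pattern is to keep the global like count as a plain counter and compute "friends who liked" as a set intersection between the viewer's (bounded) follow list and the post's likers, returning only a couple of names plus the total. A purely hypothetical, in-memory sketch of that idea:

```python
# Hypothetical sketch: in production these sets would live in a store with
# cheap set intersection (e.g. Redis SINTER) or be served from a precomputed cache.
def top_likers_for_viewer(viewer_follows: set, post_likers: set,
                          like_count: int, max_names: int = 2) -> dict:
    """Return an Instagram-style payload: a few friend names plus the total count."""
    friends_who_liked = viewer_follows & post_likers   # bounded by the follow-list size
    top = sorted(friends_who_liked)[:max_names]
    return {"like_count": like_count, "top_likers": top, "has_liked": False}

# The viewer follows a few hundred accounts; the post may have millions of likers,
# but the intersection cost is bounded by the smaller (viewer) set.
print(top_likers_for_viewer({"friendA", "friendB", "someceleb"},
                            {"friendA", "x1", "x2"},
                            like_count=3_213_371))
# {'like_count': 3213371, 'top_likers': ['friendA'], 'has_liked': False}
```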
Pooya Estakhri
(149 rep)
Jun 27, 2022, 06:56 AM
• Last activity: Jun 27, 2022, 12:22 PM
-1
votes
2
answers
1429
views
How to speed up an insertion from a huge table with postgres?
I have 5 tables in my database with sizes of, respectively, 70 GB, 500 MB, 400 MB, 110 MB and 20 MB.
I want to create a new table that contains all columns of all tables, so I tried two queries. The first one is:
select into new_table as select .. from t1 , t2 , t3 , t4 ,t5 where t1.id2 = t2.id and t1.id3 = t3.id and t1.id4 = t4.id and t1.id5 = t5.id
and the second one is:
insert into new_table select .. from t1 , t2 , t3 , t4 ,t5 where t1.id2 = t2.id and t1.id3 = t3.id and t1.id4 = t4.id and t1.id5 = t5.id
Before executing these two queries on my big tables, I tried both on a database totalling about 1 GB; the first one took only 7 s and the second one approximately 10 min.
Now, executing the first one on my huge database filled my disk even though I had 250 GB of free space before running the query, and the query never finished, so I got the following error:
ERROR: could not write to temporary file: No space left on device
The second one takes a lot of time, slowly consumes my free disk space and, like the first one, never returns a result.
What is the difference between these two queries? Is there a way to make the insert non-transactional so that I can follow the progress of the insert? I guess Postgres uses logging (journaling), so is there a way to deactivate that in order to speed up the insertion, or should I follow another method to get the desired result without filling up the whole disk?
PS: no triggers, only a primary key on each table.
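One way to cut WAL volume for a one-off copy like this, sketched below with psycopg2 (the connection string, join columns and selected columns are placeholders, not the asker's real schema), is to build the target as an UNLOGGED table in a single statement:

```python
# Sketch with psycopg2; connection details, table and column names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
conn.autocommit = True  # avoid holding one huge transaction open
with conn.cursor() as cur:
    # UNLOGGED skips WAL for the new table (it is not crash-safe and is not
    # replicated, but for a rebuildable reporting table that is often acceptable).
    cur.execute("""
        CREATE UNLOGGED TABLE new_table AS
        SELECT t1.*, t2.col_a, t3.col_b, t4.col_c, t5.col_d
        FROM t1
        JOIN t2 ON t1.id2 = t2.id
        JOIN t3 ON t1.id3 = t3.id
        JOIN t4 ON t1.id4 = t4.id
        JOIN t5 ON t1.id5 = t5.id
    """)
conn.close()
```

If the result must survive a crash, `ALTER TABLE new_table SET LOGGED` can be run once the load is done.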
Islacine
(35 rep)
Mar 2, 2020, 12:47 AM
• Last activity: Apr 3, 2022, 12:04 PM
-1
votes
1
answer
43
views
Suggestion needed for big data development
I am trying to find out what the state of the art is for databases, Python, and big data.
My starting point was SQL Server, multiprocessing pandas, and Dask. Imagine I need to maintain a database with more than 1 billion rows, keep inserting into it, and run complex, larger-than-memory, multiprocessing analysis on it.
One drawback is that SQL Server is very slow at inserting and extracting data.
Inserting 100k rows takes 1 second, and reading the first 1M rows takes 5 s or more. The speed is very unsatisfactory compared with Dask on Parquet. However, with Dask and Parquet I cannot keep inserting into this "more than 1 billion rows" dataset, and multi-index/non-clustered indexes are not supported, which even makes some previously fast SQL joins slower...
I looked around and found Apache Spark SQL and PySpark, but I'm a bit unsure whether that is the correct step forward. Any suggestions? Thanks!
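Since PySpark is one of the options being considered, here is a minimal sketch of the append-then-query workflow it enables; the paths, the `event_date` column, and the cluster setup are assumptions for illustration, not a recommendation of a specific design:

```python
# Sketch: assumes a local or cluster Spark installation; paths and columns are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("billion-row-store")
         .getOrCreate())

# New batches can be appended to a partitioned Parquet dataset instead of
# being re-inserted into a SQL Server table.
new_batch = spark.read.parquet("/data/incoming/batch_2021_08_01.parquet")
(new_batch.write
 .mode("append")
 .partitionBy("event_date")   # assumes the batch carries an event_date column
 .parquet("/data/events"))

# Larger-than-memory analysis then runs as distributed SQL over the same files.
events = spark.read.parquet("/data/events")
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_date, COUNT(*) AS n
    FROM events
    GROUP BY event_date
""").show()
```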
thinker
(121 rep)
Aug 1, 2021, 04:23 AM
• Last activity: Aug 1, 2021, 04:59 AM
0
votes
0
answers
374
views
Choosing the right database for storing bank transactions
I am starting a new project within GCP and I am trying to choose the right tool for storing bank transactions:
1. I don't need transactions; the data is basically write-only, no updates
2. I don't need joins and relations
3. I need grouping and sums
4. Ideally I would like to use a managed tool within GCP
5. I need to be able to have unique records (identified by ID or by the uniqueness of the record)
GCP solutions:
1. Firestore
- I really like the way of authorisation via service accounts, no need for users/passwords
- Fulfils 1, 2, 4, 5
- Completely unsuitable to fulfil point 3
2. CloudSQL
- fulfils 3 and 5, but I find it too complex for my needs
- old-style MySQL authorisation; one cannot use GCP authorisation
3. Elasticsearch
- fulfils 3 and 5 but also too complex
- Not within GCP, not managed, cannot use GCP auth
4. BigQuery
- Seems to fulfil all 1, 2, 3, 4
- Unfortunately it seems it cannot prevent duplicate records (a dedup sketch follows below)
- Data only becomes available after a long time
5. BigTable
- Seems too complex for such simple requirements
Could you think of another approach?
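On point 5 and BigQuery's duplicate problem: streaming inserts accept a best-effort `insertId` for short-window deduplication, and hard uniqueness is usually enforced at query time (or by a scheduled MERGE). A sketch with the `google-cloud-bigquery` client; project, dataset, table and column names are all invented:

```python
# Sketch with google-cloud-bigquery; dataset/table/column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.payments.transactions"

rows = [
    {"tx_id": "abc-123", "amount": "19.99", "booked_at": "2020-07-25T15:01:00"},
]
# row_ids become BigQuery insertIds: best-effort, short-window deduplication
# on the streaming path (not a hard uniqueness guarantee).
errors = client.insert_rows_json(table_id, rows, row_ids=[r["tx_id"] for r in rows])
assert not errors, errors

# Hard deduplication is usually done at read time (or in a scheduled MERGE):
query = """
    SELECT * EXCEPT(rn) FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY tx_id ORDER BY booked_at) AS rn
      FROM `my-project.payments.transactions`
    ) WHERE rn = 1
"""
for row in client.query(query).result():
    print(row["tx_id"], row["amount"])
```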
Vojtěch
(89 rep)
Jul 25, 2020, 03:01 PM
• Last activity: Jul 29, 2020, 01:20 PM
2
votes
1
answer
1253
views
Can you use BigQuery to run on top of Bigtable?
I need to run BigQuery on top of Bigtable live, not as an export. I have found information stating that this was in beta, but only as an export function. I would like to run BigQuery against Bigtable data without exporting. Has anyone done this, or is it possible? The important point is that, by running it in real time, no data is lost the way it would be with a point-in-time export.
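For reference, BigQuery can read Bigtable data in place through an external (federated) table definition, so queries run against live data rather than an export. A minimal sketch with the BigQuery Python client, assuming an external table named `mydataset.bigtable_events` has already been defined over the Bigtable table (the name is illustrative):

```python
# Sketch: assumes `mydataset.bigtable_events` is an external table already
# defined over the Bigtable table, so the query reads live data, not an export.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT COUNT(*) AS row_count
    FROM `mydataset.bigtable_events`
"""
row = list(client.query(query).result())[0]
print(row["row_count"])
```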
cmydata
(21 rep)
Nov 7, 2018, 10:51 PM
• Last activity: Jul 28, 2019, 06:57 PM
3
votes
1
answer
835
views
Replacing an aggregation system of statistics
We currently have a database under MySQL, storing aggregated statistics in different tables (recent hours, hours, days, months).
The tables are updated by workers running at different rates depending on the freshness required for the data.
Those aggregates are then queried by the applications, usually with queries involving even more aggregation.
This solution is showing its limits in performance, scaling and flexibility when it comes to querying the data.
Our goal is to replace it with a system based on event sourcing.
Our first prototype uses Dataflow (a bit like MapReduce, but streaming) to pre-compute aggregates for part of the data and store them in Bigtable, and puts the raw events (partitioned) in BigQuery for the aggregations we can't pre-compute.
The system mostly works, but its cost is prohibitive, with an estimate of $25K/month for BigQuery alone.
The cost is mainly due to the high number of queries for which we cannot pre-compute an efficient aggregate (usually because we need an earlier event that Dataflow doesn't have at the time it processes the current event).
As an alternative back end to BigQuery we tested a few other options like [Kudu](https://kudu.apache.org/), [ClickHouse](https://clickhouse.yandex/), Spanner…
So far only ClickHouse actually performed very well, with sub-second response times, but maintaining a ClickHouse cluster seems a bit hazardous.
From there, where else could we look? Did we make a fundamental error in the design that would explain the poor performance? Is it even possible to balance the cost and usability of such a system?
I have the feeling that the most-used solution is still big Hadoop clusters.
A few technical details under normal load:
- ~300 events/s (with regular peaks at 800 events/s)
- ~34,000,000 events/day
- an event encoded in protobuf weighs ~200 B
- an event has 20 properties
- computing the latest state can require 3-10 events spread over a time range of 3 months
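For readers unfamiliar with the Dataflow part, the pre-aggregation described above can be expressed as a windowed combine in Apache Beam; the sketch below is schematic only (the event parsing, key choice, and sink are placeholders, not the actual pipeline):

```python
# Schematic Apache Beam (Dataflow) pipeline: hourly pre-aggregates per key.
# Event parsing, key choice, and the sink are placeholders.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

def parse_event(raw):
    # Placeholder for protobuf decoding of a ~200 B event with 20 properties.
    key, value = raw.split(",", 1)
    return key, float(value)

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create(["user1,3.0", "user1,4.5", "user2,1.0"])
     | "Parse" >> beam.Map(parse_event)
     | "HourlyWindows" >> beam.WindowInto(FixedWindows(3600))
     | "SumPerKey" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))   # a real pipeline would write to Bigtable instead
```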
Dysosmus
(131 rep)
Mar 1, 2017, 09:59 PM
• Last activity: Jul 1, 2019, 02:34 PM
12
votes
1
answer
287
views
What are the performance implications of polymodels versus normal Bigtable models in Google App Engine?
What produces the best performance in regular Google App Engine use, a polymodel or a normal "Bigtable" model? The polymodel, effectively, creates a column in the parent table called "class" which provides the inheritance tracking. Whereas a normal Bigtable, inherited from a parent class, creates a...
What produces the best performance in regular Google App Engine use, a polymodel or a normal "Bigtable" model?
The polymodel effectively creates a column in the parent table called "class", which provides the inheritance tracking, whereas a normal Bigtable model inherited from a parent class creates a new and separate data structure, without the ability to query the parent and find all children of all subtyped classes.
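For readers unfamiliar with the two options being compared, here is a minimal sketch of the legacy App Engine Python datastore APIs (the class names are illustrative only, and it runs inside the App Engine Python 2 runtime/SDK):

```python
# Legacy Google App Engine (Python 2) datastore APIs; class names are examples.
from google.appengine.ext import db
from google.appengine.ext.db import polymodel

# PolyModel: subclasses share one kind, tracked via an implicit class property,
# so a query on the parent also returns all subclass entities.
class Animal(polymodel.PolyModel):
    name = db.StringProperty()

class Dog(Animal):
    breed = db.StringProperty()

# Plain Model inheritance: each class becomes its own kind.
class PlainAnimal(db.Model):
    name = db.StringProperty()

class PlainDog(PlainAnimal):
    breed = db.StringProperty()

all_animals = Animal.all()        # includes Dog entities (extra class-filter cost)
only_plain = PlainAnimal.all()    # does NOT include PlainDog entities
```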
Brian Ballsun-Stanton
(4731 rep)
Jan 3, 2011, 10:41 PM
• Last activity: Jun 11, 2019, 02:36 PM
9
votes
1
answer
2810
views
Why can't we perform joins in a distributed database like Bigtable?
From [this answer](https://stackoverflow.com/questions/362956/what-database-does-google-use/362970) to "What database does Google use?":
> Bigtable is not a relational database. **It does not support joins nor does it support rich SQL-like queries.**
Is this obvious? I do not understand why.
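One way to see why: Bigtable only offers key-based reads and scans over a single table, so anything join-like has to be assembled by the client (or by a batch layer such as MapReduce/Dataflow on top). A hypothetical sketch of such a client-side join with the Python client; the table names, families, and qualifiers are invented:

```python
# Hypothetical client-side "join": Bigtable itself only does per-table reads/scans.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
orders = instance.table("orders")
customers = instance.table("customers")

def cell(row, family, qualifier):
    return row.cells[family][qualifier][0].value.decode()

# Build a lookup from one table, then probe it while scanning the other:
# the join logic, data movement, and consistency are all the client's problem.
customer_names = {row.row_key.decode(): cell(row, "info", b"name")
                  for row in customers.read_rows()}

for row in orders.read_rows():
    customer_id = cell(row, "order", b"customer_id")
    print(row.row_key.decode(), customer_names.get(customer_id, "<unknown>"))
```

Pushing that logic into the server would require shuffling data across tablets and nodes, which is exactly the cross-machine coordination Bigtable's design avoids in order to keep single-row operations fast and predictable.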
Lazer
(3361 rep)
Aug 12, 2011, 10:16 PM
• Last activity: Jul 17, 2017, 02:38 PM