
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

-1 votes
1 answer
61 views
Best Database for Bulk Reads and Fast Key Lookups for Genomic Data
I'm developing a system to store and query genomic data using a relational database (PostgreSQL being my primary candidate). The dataset is quite large: around 9 billion records with over 300 columns, totaling more than 30 TB of data. Here's an example of the data structure in Go:

type Variant struct {
    VariantVcf pgtype.Text `json:"variant_vcf"`
    Chromosome pgtype.Text `json:"chromosome"`
    Position   pgtype.Text `json:"position"`
    BravoAn    pgtype.Int8 `json:"bravo_an"`
    // ... many additional fields
}

Here are some sample SQL queries:

Single ID lookup:

SELECT * FROM variants WHERE variant_vcf = '1-14809-A-T';

Bulk ID lookup:

SELECT * FROM variants WHERE variant_vcf IN ('1-14809-A-T', '2-23456-G-C', 'X-78901-T-G');

Questions for the community: Although I plan to stick with relational databases, are there scenarios where a key-value store or another specialized database might outperform PostgreSQL here, especially for bulk key-based retrievals? We expect around 2,000 concurrent users (a conservative estimate), each performing 1-4 lookups on average. The majority of these lookups are bulk key-based lookups (e.g., 1k–10M IDs per request).
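For bulk key lookups at this scale, one PostgreSQL pattern worth benchmarking is passing the IDs as a single array (or loading them into a temporary table and joining) instead of building an enormous IN list. A minimal sketch, assuming the variants table and variant_vcf column from the question with an index on variant_vcf:

```
-- Minimal sketch; table and column names are taken from the question.
SELECT v.*
FROM variants AS v
WHERE v.variant_vcf = ANY(ARRAY['1-14809-A-T', '2-23456-G-C', 'X-78901-T-G']);

-- For very large batches (hundreds of thousands to millions of IDs),
-- loading the keys into a temp table and joining often plans better:
CREATE TEMP TABLE requested_ids (variant_vcf text PRIMARY KEY);
-- COPY requested_ids FROM STDIN;   -- stream the IDs from the client
SELECT v.*
FROM variants AS v
JOIN requested_ids AS r USING (variant_vcf);
```

Whether a key-value store wins depends largely on whether all 300 columns are always fetched together; measuring the array and temp-table variants first gives a baseline before switching engines.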
mad (1 rep)
Mar 27, 2025, 10:29 PM • Last activity: Mar 28, 2025, 08:58 PM
0 votes
3 answers
291 views
How to Insert 150,000 Rows in SQL Server Efficiently? (Error: Max 1000 Row Limit)
I am working with a large dataset of 150,000 rows that needs to be inserted into a SQL Server table. However, I am encountering the following error when trying to insert all values at once using a single **INSERT INTO ... VALUES** statement:

Msg 10738, Level 15, State 1, Line 1007
*The number of row value expressions in the INSERT statement exceeds the maximum allowed number of 1000 row values.*

**Questions**:
1. What is the best way to efficiently insert 150,000 rows into SQL Server?
2. Should I use BULK INSERT, bcp, or an alternative approach?
3. Are there any performance considerations or best practices for handling large inserts like this?

I would appreciate any insights from experts who have dealt with large-scale data inserts in SQL Server. I am thinking of batching it like this... Any suggestions are welcome!

INSERT INTO tableName (Column1, Column2, ...) VALUES
(value1, value2, ...),
(value3, value4, ...),
...
(value1000, value1001, ...); -- First batch

INSERT INTO tableName (Column1, Column2, ...) VALUES
(value1001, value1002, ...),
(value1003, value1004, ...),
...
(value2000, value2001, ...); -- Second batch

Thanks in advance!
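For a one-off load of this size, BULK INSERT (or bcp, or SqlBulkCopy from client code) usually beats hand-split multi-row INSERT batches. A minimal sketch, in which the target table dbo.TargetTable and the file path are placeholders rather than names from the question:

```
-- Minimal sketch; table name and file path are placeholders.
BULK INSERT dbo.TargetTable
FROM 'C:\staging\rows.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,      -- skip a header row if the file has one
    BATCHSIZE       = 10000,  -- commit in batches to keep the transaction log manageable
    TABLOCK                   -- enables minimal logging under the right recovery model
);
```

If the rows come from application code rather than a file, batching the multi-row INSERTs at or below 1,000 rows per statement (as sketched in the question) also works; it is simply slower than the bulk APIs.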
Jamal Ashraf (101 rep)
Mar 24, 2025, 11:37 AM • Last activity: Mar 25, 2025, 12:24 PM
1 vote
2 answers
308 views
Efficient way to store and retrieve big amount of data
I need to store a big amount of data (about 2.5 billion new data rows per month), but I also need a very fast way to retrieve the latest value per group at a specific point in time. The data looks very simple:

| ParameterId | Value | DateTime |
|-------------|-------|---------------------|
| 1 | 12.5 | 2023-04-21 14:35:03 |
| 2 | 56.81 | 2024-03-01 16:21:17 |
| 1 | 12.5 | 2024-05-22 14:35:03 |
| 1 | 71.4 | 2024-05-31 18:27:03 |

For example, we need the latest value for each parameter as of 2024-04-30 17:40. The result would be as follows:

| ParameterId | Value | DateTime |
|-------------|-------|---------------------|
| 1 | 12.5 | 2023-04-21 14:35:03 |
| 2 | 56.81 | 2024-03-01 16:21:17 |

This looks like it could be solved by simple database storage, but I have some restrictions:

1. Disk storage is limited. That's why indexes cannot be used, as they would take almost 2x the data space.
2. The maximum query time for any request is 5 seconds.
3. We have only one server.

I've already tried TimescaleDB (table partitioning), but because I only have an upper-bound datetime condition (<= dt), it's very inefficient to search for the value in old chunks starting from the first one. Technically it's possible, because we have an old piece of software developed by a third-party company 10 years ago and it still works, but nobody knows how...
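For this "latest value per parameter as of a timestamp" pattern, a single composite index on the parameter and timestamp columns is usually the deciding factor; it adds storage, but far less than 2x if it covers only those two columns. A minimal sketch in PostgreSQL/TimescaleDB terms, where the table and column names (samples, parameter_id, value, dt) are assumptions rather than the poster's schema:

```
-- Minimal sketch; table and column names are assumptions.
CREATE INDEX samples_param_dt_idx ON samples (parameter_id, dt DESC);

-- Latest row per parameter as of a given point in time:
SELECT DISTINCT ON (parameter_id) parameter_id, value, dt
FROM samples
WHERE dt <= TIMESTAMP '2024-04-30 17:40'
ORDER BY parameter_id, dt DESC;
```

If a separate list of parameter IDs exists, a LATERAL join doing ORDER BY dt DESC LIMIT 1 per parameter turns this into one short index descent per parameter, which scales better than scanning every qualifying row.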
the_it_guy (11 rep)
Jun 21, 2024, 04:56 PM • Last activity: Aug 21, 2024, 04:22 PM
0 votes
0 answers
37 views
Data management pattern for temporarily storing/retrieving excessive quantities of data
I have been tasked with creating an application to ingest raw binary data from a software-defined radio (SDR) and display it in a web frontend. One of the purposes of this app is to allow a user to take a "capture" from a radio, letting them download a short historical period of raw data. I am struggling with the best way to store and process this data. The goal of this app is to demonstrate frequencies well into the megahertz, requiring sample rates that could approach the gigahertz range. Given that a single complex sample requires eight bytes, the rate of ingest could easily surpass 1 GB/s. This will stretch the memory of my program, and I believe it is far too great a data rate to simply write to disk. I need some advice on database systems to manage this high-volume data pattern. The important part is that no data analysis is necessary (all data is strictly binary), and some delay in data recall can be tolerated.

**Edit:** I realized that I should clarify how I am thinking of storing the data. Here are the preconditions and my thoughts; a storage sketch follows below.

* Binary data is received incrementally from the radio.
* Data needs to be queryable in some way related to time. However, this does not *necessarily* mean that timestamps need to be stored. If there is little to no data loss, then the size of a required return payload can be deduced from the number of samples requested (since each sample is known to be eight bytes).
* Rather than exact timing, the important part is that data is stored in a continuous, non-lossy manner, to produce accurate waveforms. For example, supposing that data is stored in 1 GB incremental records, the system can approximate how many records are needed for one request, and then trim the data server-side.
* The data can be regularly flushed once a certain capacity is reached, if this is beneficial to performance. Captures need not be more than 10 GB each.
* Data will be associated relationally with a radio, whether through a table-based structure or simply an ID (this is the easy part).
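One way to keep this relational while matching the "continuous records, no per-sample timestamps" idea is an append-only chunk table keyed by a per-radio sequence number. A minimal sketch in PostgreSQL terms; every table, column, and size here is an assumption for illustration, not part of the question:

```
-- Minimal sketch; names and chunk size are hypothetical.
CREATE TABLE capture_chunks (
    radio_id   int         NOT NULL,
    chunk_seq  bigint      NOT NULL,               -- monotonically increasing per radio
    started_at timestamptz NOT NULL DEFAULT now(), -- coarse anchor; offsets derive from chunk_seq
    samples    bytea       NOT NULL,               -- e.g. 64 MiB of 8-byte complex samples
    PRIMARY KEY (radio_id, chunk_seq)
);

-- Fetch roughly the last 10 GB for a capture request and trim server-side:
SELECT chunk_seq, samples
FROM capture_chunks
WHERE radio_id = 7
ORDER BY chunk_seq DESC
LIMIT 160;  -- ~10 GB at 64 MiB per chunk
```

Whether any SQL database keeps up with a sustained 1 GB/s ingest is a separate question; the same chunked layout also maps cleanly onto flat files or object storage, with only the chunk metadata kept in the database.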
moltarze (101 rep)
May 21, 2024, 06:12 PM • Last activity: May 22, 2024, 03:56 PM
0 votes
1 answer
321 views
Optimizing Performance for Large PostgreSQL Database with Parent and Child Tables
# Description

I am currently facing performance challenges in a PostgreSQL database scenario involving a parent table and 30 child tables. Notably, none of these tables are partitioned, and some of the child tables are individually quite large, with a cumulative size across all tables of 5 TB. Despite having indexes on both the parent and child tables, executing queries such as the one below takes an extended period, often several hours.

# Question

I am seeking guidance on optimizing performance in this context. Are there specific configurations, aside from indexes, that could significantly improve query speed for such a large and complex database structure? Moreover, I am curious whether PostgreSQL might have inherent limitations in efficiently handling databases of this size. If so, are there alternative strategies that could be considered for better performance? There has been speculation about PostgreSQL facing challenges with disk I/O, particularly when compared to other databases like Oracle or NoSQL. Is this speculation accurate?

## Infos

### Version

Running on Google Cloud SQL: PostgreSQL 13.12 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit

### Table
CREATE TABLE mytable (
    id_pos int8 NOT NULL PRIMARY KEY,
    date_insert DATE DEFAULT NOW()
);
CREATE TABLE mytable_child1 (
    id_pos int8 NOT NULL PRIMARY KEY,
    date_insert DATE DEFAULT NOW(),
    other_field varchar(10) NOT NULL
) INHERITS (mytable);
### Indexes

On each table (parent and child) I have these indexes:
CREATE INDEX IF NOT EXISTS mytable_date_insert_idx ON mytable USING btree (date_insert);
CREATE INDEX IF NOT EXISTS mytable_child1_date_insert_idx ON mytable_child1 USING btree (date_insert);
### Query
SELECT * FROM mytable WHERE date_insert >= CURRENT_DATE - INTERVAL '1 month';
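Since these tables use old-style inheritance rather than partitioning, one option worth evaluating is migrating to declarative range partitioning on date_insert, so that a one-month predicate only touches one or two partitions instead of all 30 children. A minimal sketch under that assumption (not the poster's full schema; the primary key is omitted because on a partitioned table it would have to include the partition key):

```
-- Minimal sketch: declarative monthly partitioning on date_insert (PostgreSQL 13).
CREATE TABLE mytable_part (
    id_pos      int8 NOT NULL,
    date_insert date DEFAULT now()
) PARTITION BY RANGE (date_insert);

CREATE TABLE mytable_part_2024_01 PARTITION OF mytable_part
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE mytable_part_2024_02 PARTITION OF mytable_part
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- One partitioned index on the parent cascades to every partition.
CREATE INDEX mytable_part_date_insert_idx ON mytable_part (date_insert);
```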
Marco Cesarato (3 rep)
Feb 2, 2024, 04:09 PM • Last activity: Feb 2, 2024, 05:51 PM
2 votes
1 answer
241 views
Should I use this partitioning/indexing strategy for a table that grows by about a billion rows per year?
I'm a software engineer with about 10 years of experience building Laravel web applications for small businesses. I'm pretty comfortable with database design, but it's not my greatest strength. One of the applications I've built is a third-party tool for monitoring email newsletter opens and clicks. This app stores open and click events, and users of the tool query these events to see how their email newsletters are doing. The current row count for the analytics events table is about 1.5 billion, which is about a year and a half of data. I plan to migrate to a new database design and would like to see if my understanding of database design is on track. These are the tables for the new database:
Table contacts {
  id integer
  email_address string
  created_at datetime
  updated_at datetime
  indexes {
    id [pk]
    email_address
    created_at
  }
}

Table contact_opens {
  id unsignedBigInteger
  contact_id integer
  sent_at datetime
  opened_at datetime
  indexes {
    id [pk]
    (contact_id, sent_at)
    sent_at
    opened_at
  }
}

Table contact_clicks {
  id unsignedBigInteger
  contact_id integer
  sent_at datetime
  clicked_at datetime
  indexes {
    id [pk]
    (contact_id, sent_at)
    sent_at
    clicked_at
  }
}
I plan to use MariaDB on a dedicated Linode server with 64 GB of RAM, partition the database on a calendar-month basis, and tune InnoDB as follows:
innodb_buffer_pool_size = 48GB
innodb_file_per_table = ON
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_log_file_size = 8GB
table_open_cache = 20
The queries against this database will segment Contacts based on Opens and Clicks. For example:

- Contacts with 5 or more opens in the last 30 days
- Contacts with at least 1 click in the last 5 days
- Contacts with 1 click on each of the last 30 days
- etc., standard email newsletter segmenting queries (as seen in Mailchimp)

The queries will return anywhere from 10k-300k contacts at a time. I've done a fair amount of research, and it appears that indexing, partitioning, and/or sharding are the best options to help speed up the queries. It seems that a combination of indexing and partitioning would be best suited for this particular use case.

**Does this sound like a sound database design where the opens table grows by about a billion rows per year, and the clicks table grows by about 300 million rows per year?**

**Update 1:** The schema will not change. It is changing slightly from how it was designed years ago, but only to accommodate the changes outlined in this question to make the queries faster.

**Update 2:** The 300k~ contact queries are for mass mailings (email newsletters). Updated the data types in the MySQL tables above.
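As a quick sanity check of the proposed indexes, here is a minimal sketch of one of the segmenting queries against the contact_opens table above ("5 or more opens in the last 30 days"). It assumes the (contact_id, sent_at) index from the design and monthly partitioning on sent_at; the extra sent_at filter for partition pruning is my assumption, not part of the question:

```
-- Minimal sketch: "contacts with 5 or more opens in the last 30 days".
SELECT contact_id
FROM contact_opens
WHERE sent_at   >= NOW() - INTERVAL 30 DAY   -- lets MariaDB prune monthly partitions
  AND opened_at >= NOW() - INTERVAL 30 DAY
GROUP BY contact_id
HAVING COUNT(*) >= 5;
```

One caveat with the design as drawn: MariaDB requires every unique key on a partitioned table, including the primary key, to contain the partitioning column, so the id-only primary keys would need to become composite keys such as (id, sent_at).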
Ken Schnetz (23 rep)
Nov 14, 2023, 11:32 PM • Last activity: Nov 22, 2023, 01:36 AM
0 votes
1 answer
623 views
What Databases would be recommended for historical data with over 500M rows and data ingested daily?
I'm wondering which databases would be best suited for storing historical data for multiple clients, with over 500M rows and incremental data ingested daily per client. Since the data would be used for analysis and reporting, quick query performance would also need to be taken into consideration. Thank you for any suggestions.
izuu (1 rep)
Oct 4, 2023, 03:58 PM • Last activity: Oct 4, 2023, 05:46 PM
1 vote
2 answers
771 views
SQL Server database with 100 terabyte of data
We have been assigned a new application where the vendor expects the database to grow to around 100 TB. They have mentioned that SQL Server Standard Edition is compatible. My question is whether SQL Server Standard Edition can adequately handle a database of that size. The Enterprise Edition can handle online index rebuilds, but overall, can SQL Server handle such a big database? The biggest database I have managed is 3 TB, and I am worried about how a 100 TB database will perform with SQL Server 2019. Any advice from experts will be greatly appreciated.
SQL_NoExpert (1117 rep)
Sep 28, 2023, 02:54 PM • Last activity: Sep 29, 2023, 10:03 AM
0 votes
1 answer
1239 views
How should I configure my postgresql to create an Index for a big table (655GB) as fast as possible?
I'm importing a big dataset of medication usage (14 years of public health system data from Brazil) for a data science project. When I first started, I created one of the tables with all the indexes (one int4, one int8, one bpchar(2) and one bpchar(4)), but this caused the COPY transactions to become very slow, so I dropped the indexes to speed up data ingestion. It worked, and I finished importing the 655 GB (roughly 2.5 billion rows) last week. However, when trying to create the indexes, the code ran for three days straight and never finished, and I ended up losing the progress on the indexes. _I'm working on getting a UPS for this._

I tried creating all the indexes at the same time, in parallel, to finish as soon as possible, but I believe the concurrency between the different index builds created an I/O bottleneck (using 100% of my HDD's I/O capacity the whole time). I have followed some tips I found in this link to try to speed up creation, as follows:

- I've set the table's parallel_workers setting to 6, as well as the max_parallel_maintenance_workers variable to 6.
- I have set maintenance_work_mem to 1.5 GB. For some reason my instance only lets me raise it to just below 2 GB, and I haven't found out how to increase this limit yet.
- I haven't experimented with checkpoint_timeout, max_wal_size or min_wal_size, because I didn't really understand what they mean, so I decided to leave them alone for now.

I'm running this on my personal computer: 32 GB RAM, Ryzen 5, a dedicated 4 TB HDD for the database, on Windows 10 Home. I intend to acquire one or two additional SSDs to house the database, since I/O has been such a bottleneck. I believe I should probably stick to one index at a time, focusing all resources on its creation, but I am not sure, and I don't want to waste another week just to find out. This is the first time I'm creating and managing a database by myself, and most things I know are from Stack Exchange and postgresql.org/docs.

----

Was I right to drop the indexes before COPY? How should I approach the creation of the indexes, and what PostgreSQL settings should I customize (and how) to do it as fast as possible? And lastly, should I get two SSDs for partitions? Or would an SSD for the data and the existing HDD for the WAL be enough? Or something different? I really don't know how to deal with this, and would prefer to know more before buying more hardware.
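A minimal sketch of building one index at a time with more generous settings; the index and column names are placeholders, and the just-under-2 GB ceiling on maintenance_work_mem is, as far as I know, a limitation of PostgreSQL builds on Windows, so the value below stays beneath it:

```
-- Minimal sketch; index and column names are placeholders for the real ones.
-- Session-level settings for a single, serial index build:
SET maintenance_work_mem = '1900MB';        -- close to the ~2 GB cap on Windows
SET max_parallel_maintenance_workers = 6;   -- parallel workers for one CREATE INDEX

-- Fewer checkpoints during the build means less WAL-related I/O stalling:
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
SELECT pg_reload_conf();

-- Build the indexes one at a time so they do not compete for the same HDD:
CREATE INDEX medication_year_idx ON medication_usage (dispense_year);
```

Dropping the indexes before COPY was the right call; building them serially afterwards keeps the single HDD doing mostly sequential I/O, which is usually the limiting factor on spinning disks.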
Jaejatae (103 rep)
Oct 29, 2022, 09:04 PM • Last activity: Oct 30, 2022, 02:29 PM
0 votes
0 answers
118 views
Mysql UNION ALL query - Too slow using longtext
I have this query
select t.field1 from (
select t1.field1, t1.field2 from db1.table t1
UNION ALL 
select t2.field1, t2.field2 from db2.table t2
UNION ALL 
select t3.field1, t3.field2 from db3.table t3
) as t
field1 is a normal varchar(255) field, but field2 is a longtext field with more than 3,000 characters of data. If I run the query with field2, it takes more than 20 seconds, but without field2 it takes less than 2 seconds. These tables have 100,000 rows each, but they will grow. What is the solution to this problem?
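Since the outer query only reads field1, one fix worth trying is simply not selecting field2 inside the UNION ALL branches, so MySQL never has to copy the longtext values into the derived (temporary) table. A minimal sketch against the same placeholder table names from the question (quoted because TABLE is a reserved word):

```
-- Minimal sketch: drop the unused longtext column from every branch.
select t.field1 from (
  select t1.field1 from db1.`table` t1
  union all
  select t2.field1 from db2.`table` t2
  union all
  select t3.field1 from db3.`table` t3
) as t;
```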
Eric H (1 rep)
Oct 15, 2022, 09:57 AM • Last activity: Oct 15, 2022, 10:05 AM
8 votes
3 answers
4962 views
Database design to handle millions of rows in MySQL
We are running an application that is collecting data much faster than we anticipated. Trying to adapt to that, we are doing a redesign of the database. After reading this, this and this, I am not sure what the best approach for our design is... considering our HW is very humble.

There are three main tables that are causing problems:

- SCANS
- DOMAINS
- DOCUMENTS
- VALUES

Currently we have one single table to store data. The relation between them is:

- 1 **SCAN** -> (avg 4x) **DOMAINS** -> (avg 3000) MANY **DOCUMENTS** -> (avg 51000) MANY **VALUES**
- 1 SCAN points to an average of 4 entries in DOMAINS.
- 4 entries in DOMAINS point to an average of 12,000 entries in DOCUMENTS.
- 12,000 entries in DOCUMENTS point to an average of 204,000 entries in VALUES.

We are currently performing around 100 scans/day. That means inserting around 20,400,000 items per day into VALUES. We are considering splitting the VALUES table into one "VALUES_table_per_month":

- **VALUES_year_month**, with the intention of distributing the load between them. But if we multiply the number of scanners, this mechanism is not scalable.
- **VALUES_year_month_day**, but then we will end up with very many tables in the same DB.

In both cases, if we increase the number of scans per day, neither solution seems scalable. At this point, keeping all the data in a centralized DB does not seem the best option for scalability reasons... but at the same time, a distributed system will increase the load time significantly. What would be a reasonable approach? I am sure we are not the first team to hit this issue! :P

**EDIT**

*How much data do we read per query?* That depends on the SCAN. Not all scans have the same amount of data. The range varies between:

- 1 SCAN --> 200 VALUES
- 1 SCAN --> 200,000 VALUES

The information is presented on a front end to the end user. So we have split how the queries are requested to the backend to avoid overloading the server, but in some cases it is not enough due to the high number of VALUES.

*When is the data read?* That entirely depends on the end users. Some days they read tens of SCANS a day, others none, and others hundreds.

**EDIT II**

EXPLAIN ANALYZE results from two queries. The first one is quick and the second one is slow.
EXPLAIN ANALYZE
SELECT value, url, filetype, severity, COUNT(id_value) AS data_count
FROM VALUES
WHERE (weigth = 150 OR weigth = 100)
  AND id_analysis = 23
  AND is_hidden = 0
  AND is_hidden_by_user = 0
GROUP BY value
ORDER BY data_count DESC

**Result 1:**

-> Sort row IDs: data_count DESC (actual time=34.016..34.016 rows=0 loops=1)
    -> Table scan on (actual time=34.006..34.006 rows=0 loops=1)
        -> Aggregate using temporary table (actual time=34.005..34.005 rows=0 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and ((VALUES.weigth = 150) or (VALUES.weigth = 100))) (cost=1.00 rows=0.05) (actual time=0.024..0.024 rows=0 loops=1)
                -> Index lookup on VALUES using id_analysis (id_analysis=23) (cost=1.00 rows=1) (actual time=0.024..0.024 rows=0 loops=1)

**Result 2:**

-> Sort row IDs: data_count DESC (actual time=187172.159..187172.173 rows=136 loops=1)
    -> Table scan on (actual time=187172.079..187172.111 rows=136 loops=1)
        -> Aggregate using temporary table (actual time=187172.077..187172.077 rows=136 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and ((VALUES.weigth = 150) or (VALUES.weigth = 100))) (cost=264956.35 rows=695) (actual time=249.030..186775.012 rows=52289 loops=1)
                -> Index lookup on VALUES using id_analysis (id_analysis=8950) (cost=264956.35 rows=265154) (actual time=248.979..186696.529 rows=134236 loops=1)

**EDIT III**

> Consider PARTITIONing

This is a great suggestion, kudos! From what I have read now, that is the native equivalent of splitting tables in the way we were considering.

> (weigth = 150 OR weigth = 100) is a rather strange test.

Removing the OR clause improves the timing:

-> Sort row IDs: data_count DESC (actual time=101261.260..101261.271 rows=113 loops=1)
    -> Table scan on (actual time=101261.187..101261.216 rows=113 loops=1)
        -> Aggregate using temporary table (actual time=101261.185..101261.185 rows=113 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and (VALUES.id_analysis = 8950) and (VALUES.weigth = 150)) (cost=79965.29 rows=623) (actual time=83848.835..100942.179 rows=52259 loops=1)
                -> Intersect rows sorted by row ID (cost=79965.29 rows=62292) (actual time=83848.830..100908.758 rows=52259 loops=1)
                    -> Index range scan on VALUES using id_analysis over (id_analysis = 8950) (cost=291.66 rows=265154) (actual time=0.100..443.145 rows=134236 loops=1)
                    -> Index range scan on VALUES using weigth over (weigth = 150) (cost=13492.63 rows=12380386) (actual time=0.043..83511.686 rows=7822871 loops=1)

> Please elaborate on value versus id_value

I believe it might be just "bad naming".
+-------------------+-------------+------+-----+---------+----------------+
| Field             | Type        | Null | Key | Default | Extra          |
+-------------------+-------------+------+-----+---------+----------------+
| id_value          | int         | NO   | PRI | NULL    | auto_increment |
| id_document       | int         | NO   | MUL | NULL    |                |
| id_tag            | int         | YES  | MUL | NULL    |                |
| value             | mediumtext  | YES  |     | NULL    |                |
| weigth            | int         | YES  | MUL | NULL    |                |
| id_analysis       | int         | YES  | MUL | NULL    |                |
| url               | text        | YES  |     | NULL    |                |
| domain            | varchar(64) | YES  |     | NULL    |                |
| filetype          | varchar(16) | YES  |     | NULL    |                |
| severity_name     | varchar(16) | YES  |     | NULL    |                |
| id_domain         | int         | YES  | MUL | NULL    |                |
| id_city           | int         | YES  | MUL | NULL    |                |
| city_name         | varchar(32) | YES  |     | NULL    |                |
| is_hidden         | tinyint     | NO   |     | 0       |                |
| id_company        | int         | YES  |     | NULL    |                |
| is_hidden_by_user | tinyint(1)  | NO   |     | 0       |                |
+-------------------+-------------+------+-----+---------+----------------+
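Given that the slow plans intersect separate single-column indexes on id_analysis and weigth, one concrete thing to test is a composite index covering the whole filter, so the lookup never has to walk the huge weigth index on its own. A minimal sketch (the table name is quoted because VALUES is a reserved word); whether it helps should be confirmed with EXPLAIN ANALYZE:

```
-- Minimal sketch: one composite index matching the slow query's filter columns.
ALTER TABLE `VALUES`
  ADD INDEX idx_analysis_weigth_hidden (id_analysis, weigth, is_hidden, is_hidden_by_user);
```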
Javi M (61 rep)
Aug 23, 2022, 07:08 PM • Last activity: Aug 24, 2022, 07:36 PM
0 votes
1 answer
374 views
Indexing first and last item of a series only
Does PostgreSQL support any kind of sparse index that would be useful for indexing only the first and last value in the table, based on a certain key? Consider the following data:
CREATE TABLE samples (
   device_id int not null,
   ts timestamp not null,
   value real not null
);
Now, let's assume I have millions of devices and each one can have billions of samples. I want to access all of this data; if that access is slow I don't mind. However, I am especially interested in accessing the first and the last value by ts timestamp, as (device_id, value), very fast. This is to know the range of samples for a specific device. I could build trigger/application logic that manages a separate table where this information is stored:
CREATE TABLE first_last_samples (
   device_id int not null,
   first_ts timestamp not null,
   first_value real not null,
   last_ts timestamp not null,
   last_value real not null
);
However, managing the caching of the first/last item yourself feels a bit clunky. So I am asking: does PostgreSQL offer any special index types, maybe via extensions, that would let me solve this problem at the table/view/index definition level? Because there are billions of data points, putting all of them in an index does not seem to make sense when you are only interested in the first and last.
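As far as I know there is no index type that keeps only the first and last entry per key, but it may be worth noting that an ordinary composite B-tree index answers both lookups with a handful of page reads per device, regardless of table size; it does index every row, which is the storage cost paid for the fast lookups. A minimal sketch against the samples table above:

```
-- Minimal sketch: a plain composite B-tree index on (device_id, ts).
CREATE INDEX samples_device_ts_idx ON samples (device_id, ts);

-- Last sample for one device: a short backward scan of the index.
SELECT ts, value
FROM samples
WHERE device_id = 42
ORDER BY ts DESC
LIMIT 1;

-- First sample: the same index scanned forward.
SELECT ts, value
FROM samples
WHERE device_id = 42
ORDER BY ts ASC
LIMIT 1;
```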
Mikko Ohtamaa (331 rep)
May 29, 2022, 05:40 PM • Last activity: Jun 3, 2022, 03:58 PM
2 votes
1 answer
897 views
Data lake or data warehouse first?
I have been confused about whether to create a data lake or a data warehouse, and I hope some experienced real-world professionals can give me some enlightenment. I would like to store, visualise and perform machine learning with the data that I ingest from multiple sources (IoT devices, APIs etc.). I read that a business will require *both* a data lake and a warehouse in the current environment that we are in. My questions are:

1. Should I create a data lake first, then transform/process the raw data from the lake and ingest it into a data warehouse?
2. Or is the data lake a separate data processing pipeline on its own?
3. Or does this depend on the use case?

This is what I have been thinking of (diagram omitted).

PS: If this is the wrong Stack Exchange, do let me know. Thanks :)
SunnyBoiz (153 rep)
May 13, 2022, 12:30 PM • Last activity: May 13, 2022, 12:51 PM
0 votes
0 answers
1745 views
PostgreSQL too slow selecting 1 billion records in a table
I created a PostgreSQL db in Google Cloud SQL and have a single table that stores 1 billion records:

CREATE TABLE a (
    user_id INT NOT NULL,
    dt DATE NOT NULL DEFAULT CURRENT_DATE,
    id VARCHAR(255) NOT NULL,
    cnt SMALLINT NOT NULL DEFAULT 1
);
CREATE INDEX user_id_idx ON a(user_id);
CREATE INDEX dt_idx ON a(dt);
CREATE INDEX user_id_dt_idx ON a(user_id, dt);

The result of SELECT COUNT(*) FROM a WHERE user_id = 1 AND dt BETWEEN '2021-08-01' AND '2021-12-31' is 10 million rows, which is pretty big. What I want is to count id with some conditions, like below:

SELECT COUNT(*) FROM (
    SELECT t.id, SUM(t.count) cnt_sum
    FROM (
        SELECT id, count
        FROM a
        WHERE user_id = 1 AND dt BETWEEN '2021-08-01' AND '2021-12-31'
    ) t
    GROUP BY t.id
) t2
WHERE t2.cnt_sum > 1 AND t2.cnt_sum < ...;

This is the EXPLAIN (ANALYZE, BUFFERS) output:

Aggregate  (cost=432974.05..432974.06 rows=1 width=8) (actual time=76873.827..76873.829 rows=1 loops=1)
  Buffers: shared hit=199160, temp read=172244 written=172504
  ->  GroupAggregate  (cost=430508.09..432967.22 rows=546 width=41) (actual time=58606.309..76050.907 rows=9089942 loops=1)
        Group Key: a.id
        Filter: ((sum(a.count) > 1) AND (sum(a.count) < ...))
        ->  Sort  (cost=430508.09..430781.32 rows=109295 width=35) (actual time=58606.287..70580.733 rows=10100259 loops=1)
              Sort Key: a.id
              Sort Method: external merge  Disk: 454704kB
              Buffers: shared hit=199160, temp read=172244 written=172504
              ->  Index Scan using user_id_idx on a  (cost=0.57..418372.26 rows=109295 width=35) (actual time=0.025..3334.426 rows=10100259 loops=1)
                    Index Cond: (user_id = 778)
                    Filter: ((dt >= '2021-06-01'::date) AND (dt <= '2022-01-20'::date))
                    Buffers: shared hit=199160
Planning Time: 0.184 ms
Execution Time: 76930.444 ms

However, this query took over 80 seconds to complete. Is there anything I can do to make it perform better?
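One thing worth testing for this query shape is widening the composite index so the inner scan can be an index-only scan on (user_id, dt) with the aggregated columns included, instead of the single-column user_id_idx the planner chose (which applies the dt range as a post-filter). A minimal sketch; note the query text references a count column while the DDL defines cnt, which would need to be reconciled:

```
-- Minimal sketch: a covering index for the inner scan (PostgreSQL 11+ syntax).
CREATE INDEX CONCURRENTLY user_id_dt_covering_idx
    ON a (user_id, dt) INCLUDE (id, cnt);
```

With the dt range applied inside the index and id/cnt available without heap visits (when the visibility map is current), the remaining costs are the 10-million-row scan and the external merge sort; raising work_mem for the session can keep more of that sort in memory.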
Eric Lee (121 rep)
Jan 20, 2022, 08:33 AM • Last activity: Jan 21, 2022, 06:23 AM
0 votes
1 answer
138 views
Can I try SQL Server BIG Data Cluster for free?
I cannot find information about testing SQL Server Big Data Clusters in a development environment. We have some AKS clusters and would like to try installing a SQL Server Big Data Cluster. Is it possible to do this for free? Can the master instance be SQL Server 2019 Developer Edition?
grochu (45 rep)
Oct 25, 2021, 11:54 AM • Last activity: Oct 26, 2021, 01:58 AM