
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

3 votes
2 answers
428 views
Can I use a timeseries DBMS to store video?
From a research study involving participants with a camera, I have chunks of video recording. As part of the preprocessing I have to split them into _short_ sequences of around 5 minutes each. Each file will be named after its timestamp in the sequence. I can store them as binary blobs in a database. In this study I also have other sensors, and it's clear that I should use a time-series DBMS to store those data (IoT sensing data), since that will facilitate time-based queries. Do you see a problem with storing the file paths and indexing them in a time-series DB (like InfluxDB) keyed on each file blob's sequential timestamp, to facilitate time-driven queries? The Q&A https://dba.stackexchange.com/questions/2445/files-in-the-database-or-not partially answers my question - from that answer it sounds like storing video as a blob is a good solution. But my question is more about indexing video by the files' timestamps. The video is recorded in _large_ (time) chunks at a time; as part of the cleaning/pre-processing we split them into 5-minute segments, and each segment is saved to a file that retains a date-time stamp (a 50-minute video is split into 10 segments). Put this way, it seems natural to store these file paths keyed by their date-time stamps. I just want to run my logic by others.
sAguinaga (131 rep)
Jan 2, 2019, 07:56 PM • Last activity: Aug 6, 2025, 06:01 AM
0 votes
2 answers
156 views
MariaDB with more than one node, where each node holds one year of data
I am creating a SaaS that will collect time-based data, for example every second. I want horizontal scaling once the data reaches the limit of one year: after a year, older data should be shifted to another server, so that only the most recent year is available on the 1st node, the 2nd node holds data that is one year old, the 3rd node holds data that is two years old, and so on. Which DB tool should I use to achieve this, and how? Or let me know a better solution if there is one.
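One building block that often comes up for this kind of layout is yearly RANGE partitioning. The sketch below assumes MariaDB 10.x and a hypothetical readings table, and only shows how a full year can be detached into a plain table that could then be dumped and restored on an archive node; it is not a complete multi-node solution by itself.

```sql
-- Sketch only: a hypothetical per-second readings table, partitioned by year
-- so that an entire year can be detached and moved to another server.
CREATE TABLE readings (
    reading_ts DATETIME NOT NULL,
    device_id  INT      NOT NULL,
    value      DOUBLE   NOT NULL,
    PRIMARY KEY (reading_ts, device_id)
)
PARTITION BY RANGE (YEAR(reading_ts)) (
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Detach last year's rows into a standalone table that can be dumped,
-- restored on the archive node, and then dropped locally.
CREATE TABLE readings_2023 LIKE readings;
ALTER TABLE readings_2023 REMOVE PARTITIONING;
ALTER TABLE readings EXCHANGE PARTITION p2023 WITH TABLE readings_2023;
```

Routing queries to the right node (or federating tables with something like MariaDB's Spider engine) would still have to be handled outside this sketch.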
Deven Sitapara (11 rep)
Apr 26, 2020, 05:33 PM • Last activity: Jul 14, 2025, 10:18 PM
0 votes
1 answer
160 views
Selecting columns based on time ranges
I have a table which looks like this:
Table t1

timestamp (pkey)     | A  | B | C
----------------------------------
2024-01-01 12:00:00  | 1  | 2 | 2
2024-01-01 12:00:01  | 1  | 2 | 2
2024-01-01 12:00:03  | 15 | 4 | 2
Table creation SQL:
CREATE TABLE IF NOT EXISTS t1 ("id" serial NOT NULL, "timestamp" timestamp PRIMARY KEY, "A" real, "B" real, "C" real)
Some tables might have 100+ columns. I'd like to select several columns while omitting time ranges on a per-column basis. The way I do it today is the following:
SELECT 
(CASE WHEN timestamp NOT BETWEEN '2017-06-12T17:01' AND '2018-06-12T21:57'
                    AND timestamp NOT BETWEEN '2020-06-12T17:01' AND '2021-02-15T21:57' THEN A ELSE NULL END) "A",

(CASE WHEN timestamp NOT BETWEEN '2018-07-14' AND '2019-01-01'
                    AND timestamp NOT BETWEEN '2020-06-12T17:01' AND '2021-02-15T21:57' THEN B ELSE NULL END) "B",

C
FROM t1
WHERE timestamp > '2014-01-01'
timestamp is indexed, obviously. Is there a better / more efficient way to do it?
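Not necessarily faster, but one hedged alternative sketch: keep the per-column exclusion windows in a (hypothetical) helper table, so the literal BETWEEN pairs are not repeated inside every CASE expression as the column count grows.

```sql
-- Hypothetical helper table: one row per excluded window per column.
CREATE TABLE excluded_ranges (
    column_name text    NOT NULL,
    excluded    tsrange NOT NULL
);

INSERT INTO excluded_ranges VALUES
    ('A', '[2017-06-12 17:01, 2018-06-12 21:57]'),
    ('A', '[2020-06-12 17:01, 2021-02-15 21:57]'),
    ('B', '[2018-07-14, 2019-01-01]'),
    ('B', '[2020-06-12 17:01, 2021-02-15 21:57]');

SELECT t1."timestamp",
       CASE WHEN NOT EXISTS (SELECT 1 FROM excluded_ranges e
                             WHERE e.column_name = 'A'
                               AND e.excluded @> t1."timestamp") THEN "A" END AS "A",
       CASE WHEN NOT EXISTS (SELECT 1 FROM excluded_ranges e
                             WHERE e.column_name = 'B'
                               AND e.excluded @> t1."timestamp") THEN "B" END AS "B",
       "C"
FROM t1
WHERE "timestamp" > '2014-01-01';
```

The query text then stays the same shape per column; whether it beats the plain CASE version depends on the number of ranges and would need an EXPLAIN comparison.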
Miro (113 rep)
Mar 25, 2024, 07:22 PM • Last activity: Jul 14, 2025, 10:04 AM
0 votes
2 answers
573 views
Storing arrays of data in a time-series database
I'm building a low-utilization time-series database to capture yearly data points for a set of fewer than 100,000 items. My question has to do with storing arrays of data in a way that is easily queried later. Right now the yearly_visits table looks something like:

    visitID   MEDIUMINT   primary key
    userID    SMALLINT    id of individual submitting yearly data
    weight    SMALLINT    weight of individual

The intake form also contains a checkbox list of favorite colors (as numeric values) from a lookup table. Users can select one or more colors. Should colors be stored in a separate visit_colors table that looks something like:

    visitID   MEDIUMINT
    colorID   SMALLINT

Or is there a better way of storing arrays of data in a time series? I haven't written any code yet, so I want to design this in a way that doesn't bite me later when I'm asked to query against the color data.
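For what it's worth, a minimal sketch of the junction-table variant described in the question, assuming MySQL; the colors lookup table is hypothetical:

```sql
-- Sketch: one row per (visit, color) selection keeps colors queryable later.
CREATE TABLE visit_colors (
    visitID MEDIUMINT NOT NULL,
    colorID SMALLINT  NOT NULL,
    PRIMARY KEY (visitID, colorID),
    FOREIGN KEY (visitID) REFERENCES yearly_visits (visitID),
    FOREIGN KEY (colorID) REFERENCES colors (colorID)   -- hypothetical lookup table
);

-- Example query: how many visits selected each color.
SELECT vc.colorID, COUNT(*) AS visit_count
FROM visit_colors vc
GROUP BY vc.colorID;
```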
a coder (208 rep)
Aug 21, 2018, 06:09 PM • Last activity: Jul 10, 2025, 05:06 AM
0 votes
1 answer
519 views
select time intervals from a nested query
I have a table with the following columns: id, start_timestamp, device_id. I am trying to get all start_timestamps that fall within 5 seconds of a range of generated periodic timestamps. For example, generating a series of timestamps every 80 seconds, I need all start_timestamps that fall within 5 seconds of each of those timestamps. So something like the query below, but I can't figure out the actual query. CTEs might be too expensive as the table has millions of records. Appreciate any help, thanks!

    SELECT * FROM time_intersections
    WHERE start_timestamp **within 5 seconds of each of these:**
    (
      SELECT * FROM generate_series(timestamp '2021-03-07', timestamp '2021-08-16', interval '80 sec')
    )
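A hedged sketch of one way to express this in Postgres, using the names from the question: join the generated series to the table on a ±5-second window, which an index on start_timestamp can serve as a set of small range scans.

```sql
-- Sketch: each generated tick matches rows whose start_timestamp lies
-- within 5 seconds of it (the 80 s step keeps the windows from overlapping).
SELECT t.*
FROM generate_series(timestamp '2021-03-07',
                     timestamp '2021-08-16',
                     interval '80 sec') AS g(tick)
JOIN time_intersections t
  ON t.start_timestamp BETWEEN g.tick - interval '5 sec'
                           AND g.tick + interval '5 sec';
```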
roy naufal (187 rep)
Sep 23, 2021, 04:13 PM • Last activity: Mar 16, 2025, 06:04 PM
0 votes
0 answers
61 views
Chronological timestamps of transactions in a financial system
I have a high-volume, concurrent application with business transactions updating several tables via a series of DB calls over a network. While the DB ensures ACIDity, the transaction timestamps are not guaranteed to be chronological out of the box. For example, transaction A can start **before** transaction B but commit **after** it. Our business requires that financial ledgers follow strict chronological order; in this case, transaction A's timestamp(6) should be after transaction B's, because transaction A finished (committed) last. Naively, one could set the transaction time at the end of a transaction, as a final step before the commit; however, due to the variability of network latency this is obviously not guaranteed to work. Commits can take unpredictably long.

**Question: Is serialization of transactions at the application/system level the only way to deal with this?**

**Proposal 1:** I can run transactions for different (business) accounts concurrently, but within a single business account I might have to serialize globally. Unfortunately some accounts are large, with transaction volumes around 4000/min, so even per-account serialization will add significant delays. If there is a better solution to this, please let me know.

**Proposal 2:** Every app node generates a unique ID (GUID) prior to starting the transaction. This represents a transaction ID which does not carry time but accompanies every modification made as part of the transaction. As the last step of our transaction, we insert a row into a table like the one below, then commit:

    INSERT INTO transaction_sequence (transaction_guid) VALUES ('GUID');

    Table transaction_sequence:
      transaction_guid  varbinary(16) NOT NULL,
      transaction_time  datetime(6)   NULL

The row we insert as the last step of a transaction won't have a timestamp either. Later, a separate backend process backfills **transaction_time** with timestamps* according to the transactions' relative order of commits *as determined from the binary log of the MariaDB/MySQL system files*. This solution is database-engine specific and therefore not ideal, but backfilling of timestamps will not inhibit concurrency in any way.

**Proposal 3:** Use the MariaDB-specific **transaction-precise history** extension on the **transaction_sequence** table, documented here. This extension will (hopefully) eliminate my need to process binary-log files, although it remains DB-engine specific.

*) The timestamps are interpolated with approximate transaction completion times and are guaranteed chronological. My challenge is described by this paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/temporaltime.pdf
Slawomir (473 rep)
Feb 6, 2025, 03:57 PM • Last activity: Feb 6, 2025, 06:03 PM
1 vote
1 answer
948 views
Transition from MongoDB Time Series Collections to InfluxDB
With version 5.0, MongoDB's specialized Time Series Collections were introduced to deal with such data. As I had already stored some sensor metadata (configuration, specification, ...) in MongoDB, I decided to make use of these special collections to store sensor readings next to the sensor metadata. According to the docs I used a single document for each sensor reading, like this (pseudo code):

    {
        "timestamp": timestamp,
        "value": value,
        "metadata": {
            "sensorId": sensor_uid,
            "unit": sensor_unit,
            "type": sensor_type,
            "fromFile": reading_imported_from_file,
        },
    }

Around 50 different sensors are read at the same time, which results in 50 documents with equal timestamp but varying value and metadata. I am currently working on migrating our time-series data storage from MongoDB to InfluxDB, as this seems to provide a sleeker API and already includes some basic data visualization. As described above, in MongoDB I used a single document per sensor, which might be considered bad practice when using InfluxDB:

> A measurement per sensor seems unnecessary and can add a significant
> amount of system overhead depending on the number of sensors you have.
> I’d suggest storing all sensor data in a single measurement with
> multiple fields, [...]

Based on this I came up with the following data structure to be passed to InfluxDB (Python dictionary pseudo code for influxdb-client):

    {
        "time": 1,
        "measurement": measurement_name,
        "tags": {
            "location": location,
            "from_file": reading_imported_from_file,
        },
        "fields": {
            "sensor_1": reading_from_sensor_1,
            "sensor_2": reading_from_sensor_2,
            "sensor_3": reading_from_sensor_3,
        },
    }

However, I have not figured out how to store the other metadata like sensorId, unit, or type. On the one hand, I could easily solve this by violating the aforementioned suggestion and using a single measurement per sensor. On the other hand, from a relational perspective this meta information should be tied to the sensorId and therefore be accessible from a sensor configuration/specification database using the sensorId as a key. Unfortunately, these values can change during a single measurement or experiment due to changing device configurations on-site, which are not reflected in the configuration database. How could I solve this issue? Am I missing something, or do I simply have to deal with this design/performance vs. ease-of-use tradeoff?
albert (113 rep)
Feb 18, 2022, 10:24 PM • Last activity: Jan 6, 2025, 10:04 AM
1 vote
3 answers
939 views
Designing Tables for Large Amounts of Time-Series Data
I'm designing a Postgres table to store large amounts of time series data and I'm trying to figure out the best way to structure the columns. I've looked at answers such as [this one](https://dba.stackexchange.com/questions/107207/how-to-store-time-series-data), but since it's almost 10 years old I wanted to see if there are any new things I should be aware of. The time series data comes from many sources (that's what src_id in the examples refers to). Each source will have one data point per minute, and each data point has many different measurements. The measurements represent things such as temperature, humidity, etc. for that specific minute. I've abstracted them to just be "measurement A", "measurement B", and so on for the examples though. There are currently 20 measurement types that need to be supported, with more being added in the future. The amount of data is in the billions of rows. The vast majority of writes will be adding new rows for the current minute. Typical read queries will be for a specific source, time window, and measurement type. I'm also planning on partitioning whatever table I choose, perhaps into month-long partitions.
### Option 1) Flat table
I could implement a simple flat table. One drawback is that as I add more measurement types over time, I will have to update the table with new columns. It's also starting with 23 columns which seems like I'm going down the wrong path.
TABLE data_points (id, src_id, timestamp          , measurement_a, measurement_b, ...)
                  (1 , 1     , 2024-01-01 00:00:00, 100          , 6.8          , ...)
                  (2 , 2     , 2024-01-01 00:00:00, 55           , 0.1          , ...)
### Option 2) Key-value pairs
This reduces the number of columns to a set amount, so I will not have to update the table with new columns as new measurements are added. However there will be many more rows (20x to start with, since I'm starting with 20 measurement types).
TABLE data_points (id, src_id, timestamp          , meas_type, meas_value)
                  (1 , 1     , 2024-01-01 00:00:00, A        , 100       )
                  (2 , 1     , 2024-01-01 00:00:00, B        , 6.8       )
                  ... 

                  (3 , 2     , 2024-01-01 00:00:00, A        , 55        )
                  (4 , 2     , 2024-01-01 00:00:00, B        , 0.1       )
                  ...
### Option 3) Two tables
I could have one table store the src_id and timestamp, and the other table store the measurement data. This is similar to the key-value pairs, just split over two tables so I don't have to repeat the src_id and timestamp columns as much. This might make partitioning a bit trickier though. Also all reads would have to do a join, and I care more about performance than overall db size, so maybe it's not worth the overhead?
TABLE data_point_times (id, src_id, timestamp          )
                       (1 , 1     , 2024-01-01 00:00:00)
                       (2 , 1     , 2024-01-01 00:00:00)
                  
TABLE data_point_values (id, data_point_time_id, meas_type, meas_value)
                        (1 , 1                 , A        , 100       )
                        (2 , 1                 , B        , 6.8       )
                        ...
                        
                        (3 , 2                 , A        , 55       )
                        (4 , 2                 , B        , 0.1       )
                        ...
### Option 4) jsonb
I could get "the best of both worlds" using jsonb; a static number of columns with fewer rows. But maybe this has drawbacks I'm not aware of?
TABLE data_points (id, src_id, timestamp          , data                                 )
                  (1 , 1     , 2024-01-01 00:00:00, {"measurement_a": 100, "measurement_b": 6.8, ... })
                  (2 , 2     , 2024-01-01 00:00:00, {"measurement_a": 55 , "measurement_b": 0.1, ... })
Any help is greatly appreciated!
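Not an answer to which option is best, but for concreteness here is a minimal sketch of what Option 2 could look like with native partitioning, assuming Postgres 12+ and illustrative names:

```sql
-- Sketch: narrow key-value layout, range-partitioned by month.
CREATE TABLE data_points (
    src_id     int         NOT NULL,
    ts         timestamptz NOT NULL,
    meas_type  smallint    NOT NULL,   -- id of the measurement type (lookup table not shown)
    meas_value real        NOT NULL,
    PRIMARY KEY (src_id, ts, meas_type)
) PARTITION BY RANGE (ts);

CREATE TABLE data_points_2024_01 PARTITION OF data_points
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Typical read: one source, one measurement type, one time window.
SELECT ts, meas_value
FROM data_points
WHERE src_id = 1
  AND meas_type = 3
  AND ts >= '2024-01-01' AND ts < '2024-01-08'
ORDER BY ts;
```

Note that on a partitioned table the primary key must include the partition column (ts); the column order shown is just one choice for the read pattern described.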
Robert (133 rep)
Jun 13, 2024, 02:57 AM • Last activity: Jun 13, 2024, 12:59 PM
2 votes
1 answer
398 views
Storing Large Quantities of Time Series
What is the recommended type of database design for storing large amounts of time-series data? The dataset:

- 2 million time series;
- each time series contains around 500 timestamps;
- at each timestamp there are multiple variables (metadata);
- it would be optimal if metadata could later be easily ingested into the database (preferably with a Python API).

I have looked into PostgreSQL, but so far I have been struggling to set it up for this many time series.
sir_olf (21 rep)
Feb 26, 2020, 05:52 PM • Last activity: Apr 9, 2024, 05:01 AM
1 vote
1 answer
283 views
Postgres BRIN index for time-series table
I have multiple time-series tables with the primary key on the timestamp column. I'd like to test the performance improvement if I change the index to BRIN instead of BTREE. Yet I'd also like to keep a UNIQUE constraint on the timestamp column (which, as far as I know, is backed by a BTREE). What are my options?

* Should I add a BRIN index in addition to the BTREE? In that case, which index will be chosen during searches / joins?
* Should I drop the pkey and add just the UNIQUE constraint? In that case, how is it going to work?

My usage scenario is that I need to run FULL OUTER JOINs on a timestamp column to query multiple tables together, and those queries can sometimes run for 30 minutes or more.
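A small sketch of the "keep both" option, assuming Postgres and a hypothetical table: the primary key's BTREE keeps enforcing uniqueness, the BRIN index stays tiny, and the planner chooses between them per query (EXPLAIN shows which one wins).

```sql
-- Sketch: keep uniqueness via the primary key (BTREE) and add a BRIN
-- index on the same column for cheap large range scans.
CREATE TABLE readings (
    "timestamp" timestamp PRIMARY KEY,   -- unique BTREE under the hood
    value       real
);

CREATE INDEX readings_ts_brin ON readings USING brin ("timestamp");

-- EXPLAIN shows which index the planner actually chooses for a range scan.
EXPLAIN
SELECT * FROM readings
WHERE "timestamp" BETWEEN '2024-01-01' AND '2024-02-01';
```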
Miro (113 rep)
Mar 20, 2024, 04:09 PM • Last activity: Mar 21, 2024, 09:37 AM
3 votes
1 answer
1342 views
Efficient downsampling of a selected timeseries to equidistant samples
We have time-series data from several sensors in a table, and I am interested in getting N equally spaced samples from the full range, as the full data set would be too much to display in a web app. I am currently using the following query for N = 500:
SELECT sensor_id, sensor_data_id, val1, val2, datetime
FROM (
   SELECT sensor_id, sensor_data_id, val1, val2, datetime
        , ROW_NUMBER() OVER (ORDER BY datetime)
   FROM sensordata
   WHERE sensor_id = 22
  ) x
WHERE mod(ROW_NUMBER, (SELECT COUNT(1) / 500 FROM sensordata WHERE sensor_id = 22)) = 0;
The table contains val1 and val2 for several timestamps per sensor_id (roughly 2M rows for sensor_id = 22). The table has a primary key (sensor_data_id) and the following other indices: (sensor_id) and a composite index (sensor_id, date_time). The above query takes roughly 36 s, and the result size for sensor_id = 22 is around 278 MB, found using (SELECT pg_size_pretty(sum(pg_column_size(sensordata))) FROM sensordata WHERE sensor_id = 22;). While the above query does not consider a time range, I eventually want to add another condition in the WHERE clause to downsample the data for a selected datetime range for a particular sensor_id. I have tried EXPLAIN ANALYZE a couple of times, and the only hint that seemed to help a little was increasing work_mem, which is 300 MB now. The database does not need to serve a lot of connections simultaneously, so this high a number is not an issue currently. Are there more efficient ways to achieve the above? The table sensordata is likely to grow over time; it is currently around 7 GB for 121 distinct sensor_id values. About PG: "PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit". Running on an AWS RDS db.t3.micro instance. Here and here are execution plans for my query.
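One hedged tweak worth comparing with EXPLAIN ANALYZE: computing the bucket size with a window COUNT(*) in the same subquery avoids the second scan of sensordata done by the inner (SELECT COUNT(1) ...). A sketch, not a tested replacement:

```sql
SELECT sensor_id, sensor_data_id, val1, val2, datetime
FROM (
    SELECT sensor_id, sensor_data_id, val1, val2, datetime,
           ROW_NUMBER() OVER (ORDER BY datetime) AS rn,
           COUNT(*)     OVER ()                  AS total   -- row count from the same scan
    FROM sensordata
    WHERE sensor_id = 22
      -- AND datetime BETWEEN '2022-01-01' AND '2022-02-01'  -- optional range filter
) x
WHERE rn % GREATEST(total / 500, 1) = 0;
```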
Chintan Pathak (143 rep)
Dec 13, 2022, 11:23 PM • Last activity: Aug 20, 2023, 03:14 AM
2 votes
2 answers
6783 views
Add cumulative sum to time-series query PostgreSQL 9.5
I wrote a query that gives me a time series over some date range and interval, showing the revenue for each time interval:

    SELECT interval_date,
           coalesce(campaign_revenue, 0) AS campaign_revenue
    FROM
      -- generate_series helps fill the empty gaps in the following JOIN
      generate_series($2::timestamp, $3::timestamp, $4) AS interval_date  -- $4 could be '1 day', '1 hour' or '1 minute'
    LEFT OUTER JOIN
      -- This SELECT gets all time-series rows that have data
      (SELECT date_trunc($4, s.created) AS interval,
              SUM(s.revenue) campaign_revenue
       FROM sale_event AS s
       WHERE s.campaignid = $1
         AND s.created BETWEEN $2 AND $3
         AND s.event_type = 'session_closed'
       GROUP BY interval) results
    ON (results.interval = interval_date);

The query takes every row of the sale_event table, truncates the created date to some interval (aligning the created timestamp with the wanted time-series granularity), groups by this time interval, and sums up the revenue columns on the rows where event_type is session_closed. This works very well and gives me the revenue in the specified interval. The result may look like:

    interval_date | campaign_revenue
    ------------------------------------
    2018-08-05    | 0.0
    2018-08-06    | 1.5
    2018-08-07    | 0.0
    2018-08-08    | 0.5
    2018-08-09    | 1.0

when the provided range is 2018-08-05 - 2018-08-09 and interval = '1 day'. I want to add to the result the sum of revenue up to that date. So if before 2018-08-05 there is a total revenue of 10.0, the result would be:

    interval_date | campaign_revenue | total_campaign_revenue
    -----------------------------------------------------------------
    2018-08-05    | 0.0              | 10.0
    2018-08-06    | 1.5              | 11.5
    2018-08-07    | 0.0              | 11.5
    2018-08-08    | 0.5              | 12.0
    2018-08-09    | 1.0              | 13.0
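A hedged sketch built directly on the query above (the window function used here exists in PostgreSQL 9.5): a scalar subquery supplies the revenue accumulated before $2, and a running SUM ... OVER (ORDER BY interval_date) adds the in-range revenue on top. Parameter numbering follows the original query.

```sql
SELECT interval_date,
       coalesce(campaign_revenue, 0) AS campaign_revenue,
       (SELECT coalesce(SUM(revenue), 0)
        FROM sale_event
        WHERE campaignid = $1
          AND event_type = 'session_closed'
          AND created < $2)                              -- revenue before the range
       + SUM(coalesce(campaign_revenue, 0))
           OVER (ORDER BY interval_date)                 -- running total inside the range
       AS total_campaign_revenue
FROM generate_series($2::timestamp, $3::timestamp, $4) AS interval_date
LEFT OUTER JOIN
     (SELECT date_trunc($4, s.created) AS interval,
             SUM(s.revenue) campaign_revenue
      FROM sale_event AS s
      WHERE s.campaignid = $1
        AND s.created BETWEEN $2 AND $3
        AND s.event_type = 'session_closed'
      GROUP BY interval) results
  ON (results.interval = interval_date)
ORDER BY interval_date;
```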
Alechko (229 rep)
Aug 10, 2018, 09:48 AM • Last activity: Apr 22, 2023, 08:24 AM
1 vote
0 answers
27 views
How to create a count for each entry in a Postgres database?
So I have a table that looks something like:

| Name | timestamp |
|---------|-------------------|
| abc | 2022-03-08 06:15:11 |
| bcd | 2022-03-08 08:15:11 |
| cvd | 2022-04-10 06:15:11 |

Each Name is unique in this table and has a timestamp associated with it. I want to create a new table from this which will be something like a counter. So my updated table would look something like this:

| Name | timestamp |
|---------|-------------------|
| 1 | 2022-03-08 06:15:11 |
| 2 | 2022-03-08 08:15:11 |
| 3 | 2022-04-10 06:15:11 |

If anybody can help me achieve this, that would be really great :)
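A minimal sketch of one way to do this in Postgres, assuming the source table is named names (hypothetical) and the counter should follow timestamp order:

```sql
-- Sketch: number the rows by timestamp and materialize the result.
CREATE TABLE name_counters AS
SELECT row_number() OVER (ORDER BY "timestamp") AS counter,
       "timestamp"
FROM names;
```

row_number() is assigned per query; if a permanently stable counter is needed, an identity/serial column on the new table would serve the same purpose at insert time.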
tushar_ecmc (11 rep)
Mar 8, 2023, 04:37 AM
0 votes
0 answers
22 views
Separating a single database into one user/transaction database and one time series database
My project involves storing sensor data (time-series data) in a DB. There are about 5,000 sensors, each sending data at a 15-minute to 1-hour interval. The existing service uses a single DB, storing all frontend user, transaction and sensor data. The table holding time-series data now has over 10 million rows; other tables such as the user table have a few thousand rows. If I separate the sensor data into a dedicated DB (say Postgres with the TimescaleDB extension), will there be significant performance benefits that outweigh the complexity introduced?
Oliver Lu (1 rep)
Feb 11, 2023, 03:25 PM
1 vote
0 answers
338 views
Database schema design of company time series data
**Purpose**: To monitor the changes in total capital and paid-in capital of various companies, and identify trends among the different company categories. By determining whether the sum of total capital or paid-in capital is increasing or decreasing for a given category, we can understand which categories of companies are growing or not.

**Data structure**: I have data for several companies with the following structure:

* company_name: string
* company_id: string
* company_category: string
* date: timestamp
* total_capital: numeric (allows null values)
* paidin_capital: numeric (allows null values)

The "date" field represents the date on which the total capital and paid-in capital were recorded for this company. Different companies may not have the same recording dates.

**Data challenge**: If I want to compare the trends of categories by month, I will need to create many aggregated records with the same month in order to sum the total capital for that month. For example:

* Raw record 1: company_name: coop a, company_category: category A, total_capital: 1000, paidin_capital: null, date: 2022-03-28
* Raw record 2: company_name: coop a, company_category: category A, total_capital: 2000, paidin_capital: null, date: 2022-05-12

(After)

* New record 1 (edited from Raw record 1): company_name: coop a, company_category: category A, total_capital: 1000, paidin_capital: null, month: 2022-03
* New record 2 (edited from Raw record 1): company_name: coop a, company_category: category A, total_capital: 1000, paidin_capital: null, month: 2022-04
* New record 3 (edited from Raw record 2): company_name: coop a, company_category: category A, total_capital: 2000, paidin_capital: null, month: 2022-05

Aggregated table:

* Aggregated record 1 (aggregated from New record 1): company_category: category A, total_capital: 1000, paidin_capital: null, month: 2022-03
* Aggregated record 2 (aggregated from New record 2): company_category: category A, total_capital: 1000, paidin_capital: null, month: 2022-04
* Aggregated record 3 (aggregated from New record 3): company_category: category A, total_capital: 2000, paidin_capital: null, month: 2022-05

The above arrangement of data seems very inefficient and will also result in a large number of duplicated records. Is there a better schema structure I should use if I continue to use a relational database management system (RDBMS), or should I switch to a time-series database engine?
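A hedged sketch (Postgres) of how the per-month comparison could be computed at query time instead of materializing the duplicated records; the table name capital_records and the month bounds are assumptions:

```sql
-- Sketch: for every month, take each company's latest record up to the end
-- of that month, then aggregate by category.
SELECT m.month,
       r.company_category,
       SUM(r.total_capital)  AS total_capital,
       SUM(r.paidin_capital) AS paidin_capital
FROM generate_series('2022-01-01'::date, '2022-12-01'::date,
                     interval '1 month') AS m(month)
JOIN LATERAL (
    SELECT DISTINCT ON (c.company_id)
           c.company_id, c.company_category, c.total_capital, c.paidin_capital
    FROM   capital_records c
    WHERE  c.date < m.month + interval '1 month'
    ORDER  BY c.company_id, c.date DESC   -- latest record per company so far
) r ON true
GROUP BY m.month, r.company_category
ORDER BY m.month, r.company_category;
```

Whether this is fast enough depends mostly on an index over (company_id, date); the duplication concern then becomes a query-performance question rather than a schema one.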
Planetoid Hsu (13 rep)
Jan 9, 2023, 03:41 AM • Last activity: Jan 31, 2023, 09:46 AM
3 votes
3 answers
1714 views
Recommendation for storage of series of time series
Just a few words describing the data: in my application there are acceleration measurements (for example at 25 kHz) for an exemplary duration of one second. These measurements are repeated at not necessarily equidistant time steps for that measurement point (maybe every five or ten minutes). This is a kind of interrupted permanent monitoring, somehow two-periodic:

- the short time scale is the 25,000 Hz sampling rate (the resolution of the measurements)
- the long time scale (not periodic in a strong sense, it might differ) is every 5 minutes

There are 20 or more of those points. Since this is time-series data, the first idea might be to use a time-series DB. On the other hand, it seems to me that the main purpose of a time-series DB is the storage of scalar values. Of course, my measurements are scalar values, but I'm not sure it would be a good idea to store every scalar value as a (time/value/measpos_id) triple, leading to an enormous number of entries; I think single entries of that kind would never be evaluated on their own. Another idea could be storing the measurement vector (all values from that second) together with the starting time and the measpos_id. But how to do that? Taking all values as a blob? Not every system is capable of dealing with vectors, and maybe they differ in length. Are there concepts in time-series DBs for such problems that I'm not aware of? For evaluation (extraction), I think extraction of the complete vector would be the most common case. Please feel free to ask if my description is incomplete or more details could help in finding a good solution. What are your recommendations? NoSQL or relational SQL? Further ideas? Every hint is welcome. Thanks in advance.

Additions:

- A rough idea of the volume: steady growth of about 1 TB a year.
- Giving a sample is not that easy; I'll try to describe it: think of one column with 25,000 float values for each measurement (roughly each minute and for each measurement position), each of these columns timestamped (at the beginning).
- Usage is big-data evaluation (testing many kinds of algorithms): windowing data, FFT (spectral analysis), comparison, aggregations like energetic sum, value of the max amplitude, position (frequency) of the max amplitude, and many more.
- Purpose (focus) of evaluation: wear detection for condition monitoring of, for example, rolling devices (gears, generator sets, turbines, shafts, bearings).
- Evaluation would (from today's view) focus on each separate column and maybe compare it to others, but not combine (stack) columns together.
- Data size example: 25,000 float values in each column for 20 measured engines every 5 minutes (12 per hour) results in 6e6 floats per hour or 5.25e10 floats per year.
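A hedged sketch of the "one row per burst" idea in Postgres, assuming a hypothetical table name; a real[] column keeps the 25,000 samples of one second together with the start time and measurement position, without forcing a fixed vector length:

```sql
-- Sketch: one row per one-second acceleration burst.
CREATE TABLE acceleration_bursts (
    measpos_id  int         NOT NULL,
    started_at  timestamptz NOT NULL,
    sample_rate int         NOT NULL DEFAULT 25000,  -- Hz
    samples     real[]      NOT NULL,                -- the whole burst as a vector
    PRIMARY KEY (measpos_id, started_at)
);

-- Typical extraction: fetch the complete vectors for one point and time range.
SELECT started_at, samples
FROM acceleration_bursts
WHERE measpos_id = 7
  AND started_at BETWEEN '2023-01-01' AND '2023-01-02'
ORDER BY started_at;
```

Arrays of this size are stored out of line (TOAST), which suits the "read the whole vector" access pattern better than per-sample access.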
Bastian Ebeling (175 rep)
Jul 3, 2017, 06:45 AM • Last activity: Jan 29, 2023, 10:17 AM
0 votes
1 answer
153 views
Database to save measurements
I am creating an infrastructure to save measurements coming from a fleet of around 2000 cars. Each car contains about 60 sensors (depending on the car), producing a total of about 800 values per second across all the sensors. Each sensor reads from 2 to 50 values of different types (boolean, integer and comma-separated). I would like to save all these values in a database (in the cloud) so that we can read them back in case of error and for future reports. After a study of the possible databases we have to choose between:

* Postgres with auto-partitioning
* TimescaleDB
* InfluxDB

Knowing the scenario, my engineering side leans towards InfluxDB, since the use case better fits a schemaless option. However, my conservative side says to use a database with a 25-year history; in this latter case, from your experience, is it better to adopt approach 1 or 2?

Approach 1: each row is the reading of one value from one sensor -> [timestamp, sensor_id, measure_title, measure_value] (so 800 * 2000 rows every second).

Approach 2: each row is the reading of one sensor -> [timestamp, sensor_id, measure_value_1, …, measure_value_50] (so 60 * 2000 rows every second), where potentially 49 columns can be null, and we have another table that maps each measure_value_n position to its measure title.

Or do you know of other approaches?

Edit 1: Data must be maintained indefinitely; no deleting or cancelling. Approach 1 will store around 138 billion rows per day; approach 2 around 10 billion rows per day.
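For concreteness only, a sketch of what approach 1 (the narrow layout) might look like in plain Postgres with declarative partitioning; all names are illustrative, and this is not a recommendation between the two approaches or the three engines:

```sql
-- Sketch of approach 1: one row per (timestamp, sensor, measure).
CREATE TABLE measurements (
    ts            timestamptz NOT NULL,
    sensor_id     int         NOT NULL,
    measure_title smallint    NOT NULL,   -- id into a lookup table of measure names
    measure_value text        NOT NULL,   -- mixed value types kept as text in this sketch
    PRIMARY KEY (sensor_id, measure_title, ts)
) PARTITION BY RANGE (ts);

CREATE TABLE measurements_2022_11 PARTITION OF measurements
    FOR VALUES FROM ('2022-11-01') TO ('2022-12-01');
```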
Jam. G. (1 rep)
Nov 13, 2022, 01:35 PM • Last activity: Nov 13, 2022, 11:12 PM
2 votes
1 answer
129 views
Are there better alternatives to a time series databases for managing financial transactions?
We are considering using AWS Timestream to store financial transactions from various sources for users. Any solution needs to be able to support ~4k transactions per user per year for 1 million users. Some of the important requirements are:

- Be able to update transaction amounts up to three months in the past
- Calculate daily, weekly and monthly balances by tallying up amounts in real time
- Balances need to reflect edited transaction amounts, so we'd need to recalculate these in real time often.

Is a time-series database a good fit for something like this in terms of real-time AND historical record availability? A few things that worry me about Timestream:

- Records can't be deleted at all. If bad data enters your system, it's going to be extremely difficult to clean up.
- No backups are offered by AWS; we'd have to write something custom for this.

Are there perhaps better time-series databases I should consider? Thanks!
Moin (21 rep)
Sep 2, 2022, 06:28 PM • Last activity: Sep 4, 2022, 09:40 PM
0 votes
2 answers
203 views
Best way to get aggregates from ranged data
The problem I'm having is as follows: I have a 'shifts' table with the following ranged data I want to aggregate on:

    ID | started_at       | ended_at         | costs
    1  | 2022-08-23 08:00 | 2022-08-23 11:00 | 150
    2  | 2022-08-23 08:30 | 2022-08-23 09:30 | 50

Currently using MySQL, how do I get the following data:

    time             | total_costs | total_count
    2022-08-23 08:00 | 75          | 2
    2022-08-23 09:00 | 75          | 2
    2022-08-23 10:00 | 50          | 1

Or do you recommend using another DB specially built for this kind of data?

**Edit:** sorry for the inconvenience. Time is an interval: 2022-08-23 08:00-09:00, 2022-08-23 09:00-10:00, and so on. So total_costs is the sum(calculated_costs) of the shifts that fall into the interval, where calculated_costs = (cost / total_duration) * duration_in_interval, e.g. (150 / 3 hours) * 1 hour = 50. total_count is the count(*) of shifts whose started_at to ended_at range overlaps the current time interval. Hope someone can point me in the right direction. Thanks!
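A hedged sketch for MySQL 8+ using the question's table and column names and hard-coded bucket bounds; it reproduces the expected output above by prorating each shift's cost over the hourly buckets it overlaps:

```sql
-- Sketch: hourly buckets via a recursive CTE, then prorate each shift's
-- cost by the fraction of its duration that falls inside each bucket.
WITH RECURSIVE hours AS (
    SELECT TIMESTAMP('2022-08-23 08:00:00') AS bucket_start
    UNION ALL
    SELECT bucket_start + INTERVAL 1 HOUR
    FROM hours
    WHERE bucket_start < TIMESTAMP('2022-08-23 10:00:00')
)
SELECT h.bucket_start AS time,
       SUM(s.costs
           * TIMESTAMPDIFF(SECOND,
                           GREATEST(s.started_at, h.bucket_start),
                           LEAST(s.ended_at, h.bucket_start + INTERVAL 1 HOUR))
           / TIMESTAMPDIFF(SECOND, s.started_at, s.ended_at)) AS total_costs,
       COUNT(*) AS total_count
FROM hours h
JOIN shifts s
  ON s.started_at < h.bucket_start + INTERVAL 1 HOUR
 AND s.ended_at   > h.bucket_start
GROUP BY h.bucket_start
ORDER BY h.bucket_start;
```

In practice the bucket series would be generated from parameters or a calendar table rather than literals.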
Matt_45 (1 rep)
Aug 23, 2022, 08:18 AM • Last activity: Aug 24, 2022, 07:02 AM
0 votes
0 answers
386 views
TimescaleDB: How to store long-running events?
I'm trying to use TimescaleDB to store long-running events (events that started, for example, 2 days ago and ended 1 hour ago). There are two types of events: current events without an end date and historical events with an end date. Events (or chunks) should be deleted after one month (based on the event's end date). My questions are:

1. Is there a way to store events without an end date and keep them in a specific chunk that will never be deleted? For current events, I don't want them to be deleted even if the event started more than 1 month ago.
2. When I request data, I want to see only events that were active during the selected date range. For example, retrieve all events that were active between '2022/07/24 00:00:00' and '2022/07/24 23:59:59'. By active, I mean an event that potentially:
   - starts before the date range and ends after it
   - starts before the date range and ends inside it
   - starts inside the date range and ends after it
   - starts and ends inside the date range.

I found the OVERLAPS Postgres operator, but the problem is that Timescale still has to scan all chunks to find matching events. Indeed, as the end date is potentially anywhere between the start of the date range and now, the older the requested date range, the longer the request will take. For now, the only solution I found to limit the number of chunks to scan is to add a WHERE clause with end_date > start_date_range. Thanks for your time, hope you can give me some good advice :)
Anthony Le Gal (1 rep)
Aug 1, 2022, 08:10 AM • Last activity: Aug 1, 2022, 04:06 PM