Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
-4
votes
1
answers
470
views
choosing oldest valued cell when 2 columns have same value but third column has differing values (SQL)
In SQL, I have three columns, the first two are values to create distinction for a person. The third column will have different values for each person (dates). How do I select the row for each person with the oldest valued date?...
In SQL, I have three columns, the first two are values to create distinction for a person. The third column will have different values for each person (dates). How do I select the row for each person with the oldest valued date?
should output:
John Plumber 1/3/2020
Elizabeth Teacher 12/1/2020
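A minimal sketch of one common approach, using hypothetical names (people for the table; person_name, occupation, event_date for the columns, since the post does not show the real ones): group by the identifying columns and take the minimum date.
```
-- Hypothetical table/column names; one row per person, keeping the oldest date.
SELECT person_name, occupation, MIN(event_date) AS oldest_date
FROM people
GROUP BY person_name, occupation;
```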

mcsmithums
(9 rep)
Jun 3, 2022, 07:27 PM
• Last activity: Jul 9, 2025, 02:40 PM
17
votes
1
answers
1711
views
Union does not always eliminate duplicates
I have the following query and expect that as a result I will have a list of IDs without duplicates. But sometimes it produces duplicates (1-2 in 4 million rows). Why can this happen? I run it at the default (read committed) isolation level. I can't use TABLOCK/serializable because these are OLTP ta...
I have the following query and expect that as a result I will have a list of IDs without duplicates. But sometimes it produces duplicates (1-2 in 4 million rows).
Why can this happen? I run it at the default (read committed) isolation level. I can't use TABLOCK/serializable because these are OLTP tables with hundreds of changes.
```
CREATE TABLE #characterId (
CharacterId BIGINT
)
DECLARE @dateTimeFrom DATETIME = '2025-05-04'
, @dateTimeTo DATETIME = '2025-05-07'
INSERT INTO #characterId (CharacterId)
SELECT Id
FROM table1 u
WHERE u.DateUpdated >= @dateTimeFrom
AND u.DateUpdated < @dateTimeTo
UNION
...
WHERE usi.DateUpdated >= @dateTimeFrom
AND usi.DateUpdated < @dateTimeTo
UNION
...
WHERE ust.DateCreated >= @dateTimeFrom
AND ust.DateCreated < @dateTimeTo
```
Plan with UNION ALL instead of the first UNION:

Novitskiy Denis
(331 rep)
May 6, 2025, 07:33 AM
• Last activity: May 8, 2025, 06:09 PM
1
votes
1
answers
63
views
Poor performance bucketing historical data with DISTINCT
I'm running into performance problems with a query that buckets a set of historical records of daily counts into a form that can be rendered easily as a daily graph. ## Context Essentially, the table in question is a record of daily audit reports where the number of problems of varying seriousness a...
I'm running into performance problems with a query that buckets a set of historical records of daily counts into a form that can be rendered easily as a daily graph.
## Context
Essentially, the table in question is a record of daily audit reports where the number of problems of varying seriousness are reported for a set of location IDs. Each record in the table contains the counts of problems found during the audit at one location on one day, of several severities.
CREATE TABLE IF NOT EXISTS problem_reports (
internal_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
day timestamp NOT NULL,
location_id varchar(16) NOT NULL,
bad integer DEFAULT 0 NOT NULL,
awful integer DEFAULT 0 NOT NULL,
catastrophic integer DEFAULT 0 NOT NULL
);
Since audits are not performed at every location on every day, I saved space by only recording rows for days that an audit has been performed. If the number of problems found was zero, I record a row with all zero counts for that location, to indicate that the number of problems of each severity is now zero as of that date.
The output I need from the query is a running daily total of all problems found at all locations, by severity. So, if an audit wasn't performed for a location on a given day, the query needs to "look back" to find the most recent audit for that location, and include that in the aggregation. I achieved this using a DISTINCT clause:
WITH days AS (
SELECT GENERATE_SERIES(
DATE_TRUNC('day', '2025-03-01'::date),
DATE_TRUNC('day', '2025-03-07'::date),
'1 day'::interval
) AS day
),
counts AS (
SELECT DISTINCT ON (days.day, pr.location_id)
days.day, pr.location_id, pr.bad, pr.awful, pr.catastrophic
FROM problem_reports pr
RIGHT JOIN days ON pr.day <= days.day
ORDER BY days.day, pr.location_id, pr.day DESC, pr.internal_id DESC
)
...
|Unique (cost=46930.59..48855.59 rows=40000 width=86) (actual time=0.109..0.120 rows=18 loops=1) |
| Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Buffers: shared hit=1 |
| -> Sort (cost=46930.59..47572.26 rows=256667 width=86) (actual time=0.108..0.111 rows=32 loops=1) |
| Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Sort Key: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.day DESC, pr.internal_id DESC |
| Sort Method: quicksort Memory: 27kB |
| Buffers: shared hit=1 |
| -> Nested Loop Left Join (cost=0.00..11584.65 rows=256667 width=86) (actual time=0.048..0.077 rows=32 loops=1) |
| Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id|
| Join Filter: (pr.day <= ...) |
| -> ProjectSet (cost=0.00..5.03 rows=1000 width=8) (actual time=0.024..0.027 rows=7 loops=1) |
| Output: generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval) |
| -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1) |
| -> Materialize (cost=0.00..21.55 rows=770 width=78) (actual time=0.003..0.004 rows=8 loops=7) |
| Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Buffers: shared hit=1 |
| -> Seq Scan on sto.problem_reports pr (cost=0.00..17.70 rows=770 width=78) (actual time=0.016..0.019 rows=8 loops=1) |
| Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Buffers: shared hit=1 |
|Planning Time: 0.186 ms |
|Execution Time: 0.188 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
## The Problem
This works fine for moderate numbers of days and locations, but as the number of those grow, the DISTINCT clause in the CTE starts to perform very poorly. I considered recording a row for every location on every day, to eliminate the need for the DISTINCT "look-back", but that will quickly result in more rows than I'd like to deal with, as the number of locations is expected to grow exponentially.
In some cases, I need to query for only a subset of locations, and render a graph of that, so recording a preprocessed aggregate of just the daily counts to the DB wouldn't work.
Is there a more efficient way of storing and querying this type of historical data to produce the graph data? Any suggestions appreciated!
hampercm
(111 rep)
Apr 1, 2025, 06:49 PM
• Last activity: Apr 4, 2025, 09:45 PM
0
votes
1
answers
672
views
Select only those records that have same multiple values for a particular column if it's existing?
Below is an example of my `pricing_supports` table with more than 3 million rows. Some rows have the same values in the `first_origin_id`, `delivery_id` fields. If there are records with the same `first_origin_id`, `delivery_id` but different `source`, I only want to select those records where `sou...
Below is an example of my pricing_supports table with more than 3 million rows.
Some rows have the same values in the first_origin_id and delivery_id fields.
If there are records with the same first_origin_id and delivery_id but a different source, I only want to select those records where source = 0.
ID code first_origin_id delivery_id source
1 A 10 20 0
2 B 10 20 1
3 C 11 21 1
4 D 12 22 0
5 E 12 22 1
I would like result like that:
ID code first_origin_id delivery_id source
1 A 10 20 0
3 C 11 21 1
4 D 12 22 0
How can I do this with good performance?
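A minimal sketch of one way to express the "prefer source = 0" rule, assuming the table and column names shown above; performance would likely depend on an index such as (first_origin_id, delivery_id, source).
```
-- Keep a row if it is the source = 0 row for its pair, or if no source = 0
-- row exists for that (first_origin_id, delivery_id) pair at all.
SELECT p.*
FROM pricing_supports p
WHERE p.source = 0
   OR NOT EXISTS (
        SELECT 1
        FROM pricing_supports p0
        WHERE p0.first_origin_id = p.first_origin_id
          AND p0.delivery_id     = p.delivery_id
          AND p0.source          = 0
      );
```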
Dương Khoa
(1 rep)
Sep 20, 2018, 04:24 AM
• Last activity: Jan 27, 2025, 05:04 PM
-1
votes
1
answers
155
views
What happened to Oracle's COUNTDISTINCT() function?
I need the count of unique values, and am trying to use Oracle's [COUNTDISTINCT()](https://docs.oracle.com/cd/E93962_01/bigData.Doc/eql_onPrem/src/reql_aggregation_countdistinct.html) function to obtain that: ```sql select COUNTDISTINCT(a.m_label) from user_rep a, user_group_rep b, trn_grp_rep c whe...
I need the count of unique values, and am trying to use Oracle's [COUNTDISTINCT()](https://docs.oracle.com/cd/E93962_01/bigData.Doc/eql_onPrem/src/reql_aggregation_countdistinct.html) function to obtain that:
select COUNTDISTINCT(a.m_label)
from user_rep a, user_group_rep b, trn_grp_rep c
where b.m_user_id = a.m_reference
...
This results in ORA-00904: "COUNTDISTINCT": invalid identifier
Using plain COUNT() works, but is not returning correct results.
What am I doing wrong? What'd be an efficient work-around?
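For what it's worth, the linked COUNTDISTINCT() page appears to document the Endeca Query Language rather than Oracle Database SQL; a minimal sketch of the usual Oracle SQL spelling, keeping only the join condition shown in the question (the rest of the WHERE clause was elided in the post):
```
-- COUNT(DISTINCT ...) is the standard form in Oracle Database SQL.
-- The remaining join/filter conditions elided ("...") in the question
-- would still be needed here.
SELECT COUNT(DISTINCT a.m_label)
FROM user_rep a, user_group_rep b, trn_grp_rep c
WHERE b.m_user_id = a.m_reference;
```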
Mikhail T.
(166 rep)
Nov 12, 2024, 05:43 PM
• Last activity: Nov 12, 2024, 06:07 PM
1
votes
1
answers
67
views
Finding the distinct values in an array of documents
I've got a database full of documents which each contain a collection of transactions: ``` [ { "key": 1, "data": [ { "trans": 1, "uid": 1 }, { "trans": 2, "uid": 1 } ] }, { "key": 2, "data": [ { "trans": 3, "uid": 1 }, { "trans": 4, "uid": 2 } ] } ] ``` I want to create a new field in each of the ma...
I've got a database full of documents which each contain a collection of transactions:
[
{
"key": 1,
"data": [
{
"trans": 1,
"uid": 1
},
{
"trans": 2,
"uid": 1
}
]
},
{
"key": 2,
"data": [
{
"trans": 3,
"uid": 1
},
{
"trans": 4,
"uid": 2
}
]
}
]
I want to create a new field in each of the main documents, which has the unique values of the uid
field. I can get partway there using $map
:
db.collection.aggregate([
{
"$set": {
"uid": {
"$map": {
"input": "$data",
"as": "trans",
"in": "$$trans.uid"
}
}
}
}
])
This gives me:
[
{
"key": 1,
"uid": [1,1],
"data": [
{
"trans": 1,
"uid": 1
},
{
"trans": 2,
"uid": 1
}
]
},
{
"key": 2,
"uid": [1, 2],
"data": [
{
"trans": 3,
"uid": 1
},
{
"trans": 4,
"uid": 2
}
]
}
]
This is close, but I can't seem to figure out the last step. I want to use only the unique values, so the uid for the first document should be [1], not [1, 1].
The distinct()
function works across collections, not single documents.
I would think that $addToSet
would work, but it doesn't operate on arrays, only on the output of $group
. I also looked at trying to create a $reduce
specification using $setUnion
, but I don't know how to promote my numeric value into an array.
I can use the $unwind
stage with grouping by _id
to get the right values for the new field, but I can't figure out how to attach them back to the original objects.
ralmond
(13 rep)
Nov 8, 2024, 11:26 PM
• Last activity: Nov 9, 2024, 01:10 AM
0
votes
2
answers
335
views
Selecting a count of distinct logins per day
I have an activity_log table from which I need to derive a basic output of how many distinct users logged in each day, summed up day by day. The table logs all actions of a user and will have several entries per day for each user. Here is the table - This is what I have so fa...
I have an activity_log table from which I need to derive a basic output of how many distinct users logged in each day, summed up day by day. The table logs all actions of a user and will have several entries per day for each user.
Here is the table -
This is what I have so far -
select DISTINCT CAST(test.activitytime AS date) Date, username
from [dbo].[activity_log] as test
where test.activitytime between '2024-03-01' and '2024-03-08'
order by Date
Here is the output for that -
I need it to display every day once, and sum up the total of distinct users per day (does not matter if they logged in yesterday as well) just total distinct users a day. I am lost on where to go from here
ie
2024-03-01 24
2024-03-02 39
2024-03-03 26
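A sketch of the usual next step, keeping the table and column names from the query above: group by the calendar day and count distinct usernames (a half-open date range is used here instead of BETWEEN so midnight of the end date isn't included).
```
SELECT CAST(test.activitytime AS date) AS [Date],
       COUNT(DISTINCT test.username)   AS DistinctUsers
FROM [dbo].[activity_log] AS test
WHERE test.activitytime >= '2024-03-01'
  AND test.activitytime <  '2024-03-08'
GROUP BY CAST(test.activitytime AS date)
ORDER BY [Date];
```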


JDBA
(3 rep)
Apr 17, 2024, 10:20 PM
• Last activity: Aug 19, 2024, 07:56 AM
2
votes
1
answers
442
views
Emulate Loose Index Scan for multiple columns with alternating sort direction
A while back I asked [this question](https://dba.stackexchange.com/questions/320064/use-skip-scan-index-to-efficiently-select-unique-permutations-of-columns-in-post) about efficiently selecting unique permutations of columns in Postgres. Now I have a follow-up question regarding how to do so, with t...
A while back I asked [this question](https://dba.stackexchange.com/questions/320064/use-skip-scan-index-to-efficiently-select-unique-permutations-of-columns-in-post) about efficiently selecting unique permutations of columns in Postgres. Now I have a follow-up question regarding how to do so, with the addition of being able to order any of the columns with any combination of
ASC
/DESC
across the columns.
The table contains hundreds of millions of rows, and while the accepted answer to my previous question is orders of magnitude faster than traditional approaches, not being able to order the columns in an ad-hoc way prevents me from putting this query to good use (I really need it to 'paginate', with LIMIT
/OFFSET
in small chunks). Is there a way to do this? The author of the previous answer kindly suggested a workaround (changing the row comparison for an explicit where clause), which I tried, but it doesn't seem to work (or I misunderstand it).
Given the following generic query:
WITH RECURSIVE cte AS (
(
SELECT col1, col2, col3, col4
FROM tbl
ORDER BY 1,2,3,4
LIMIT 1
)
UNION ALL
SELECT l.*
FROM cte c
CROSS JOIN LATERAL (
SELECT t.col1, t.col2, t.col3, t.col4
FROM tbl t
WHERE (t.col1, t.col2, t.col3, t.col4) > (c.col1, c.col2, c.col3, c.col4)
ORDER BY 1,2,3,4
LIMIT 1
) l
)
SELECT * FROM cte
Is there a way to order the columns in an ad-hoc way, whilst maintaining the performance? For example:
ORDER BY col1 DESC, col2 ASC, col3 ASC, col4 DESC
Assume an index on each column, as well as a combined index across all 4 columns.
Postgres version is 15.4.
The table is read-only in the sense that the data can't / won't be modified, however it will be added to. Following is a CREATE TABLE
script to replicate my problematic table (more or less):
CREATE TABLE tbl (id SERIAL primary key, col1 integer NOT NULL, col2 integer NOT NULL, col3 integer NOT NULL, col4 integer NOT NULL);
INSERT INTO tbl (col1, col2, col3, col4) SELECT (random()*1000)::int AS col1, (random()*1000)::int AS col2, (random()*1000)::int AS col3, (random()*1000)::int AS col4 FROM generate_series(1,10000000);
CREATE INDEX ON tbl (col1);
CREATE INDEX ON tbl (col2);
CREATE INDEX ON tbl (col3);
CREATE INDEX ON tbl (col4);
CREATE INDEX ON tbl (col1, col2, col3, col4);
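A sketch of the explicit-comparison workaround mentioned above, assuming the desired order is col1 DESC, col2 ASC, col3 ASC, col4 DESC; the row-value comparison only works for a uniform direction, so it is spelled out column by column, and for the skip-scan to stay fast an index matching those directions (e.g. ON tbl (col1 DESC, col2 ASC, col3 ASC, col4 DESC)) would likely be needed.
```
WITH RECURSIVE cte AS (
  (
    SELECT col1, col2, col3, col4
    FROM tbl
    ORDER BY col1 DESC, col2 ASC, col3 ASC, col4 DESC
    LIMIT 1
  )
  UNION ALL
  SELECT l.*
  FROM cte c
  CROSS JOIN LATERAL (
    SELECT t.col1, t.col2, t.col3, t.col4
    FROM tbl t
    -- "strictly after" the current row in the mixed sort order
    WHERE t.col1 < c.col1
       OR (t.col1 = c.col1 AND (t.col2 > c.col2
       OR (t.col2 = c.col2 AND (t.col3 > c.col3
       OR (t.col3 = c.col3 AND t.col4 < c.col4)))))
    ORDER BY t.col1 DESC, t.col2 ASC, t.col3 ASC, t.col4 DESC
    LIMIT 1
  ) l
)
SELECT * FROM cte;
```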
hunter
(217 rep)
May 2, 2024, 06:52 PM
• Last activity: May 4, 2024, 03:49 AM
2
votes
1
answers
1053
views
DISTINCT gives duplicate values
Can it be considered a bug? This query gives duplicate values despite DISTINCT: select distinct '1' from ( select * from dual connect by level <= 10 ) order by dbms_random.value fetch first 10 rows only I understand it creates additional column with random values and uses distinct with two columns,...
Can it be considered a bug? This query gives duplicate values despite DISTINCT:
select distinct '1'
from
(
select *
from dual
connect by level <= 10
)
order by dbms_random.value
fetch first 10 rows only
I understand that it creates an additional column with random values and applies DISTINCT over the two columns, but I still expect to get unique values.
Andy DB Analyst
(110 rep)
Jun 2, 2023, 02:47 PM
• Last activity: Mar 13, 2024, 03:01 PM
0
votes
1
answers
84
views
Distinct based on window function (sort curiosity or wrong problem approach)
A lot of values are stored in `values`. The data is in a hierarchical order, so there could be a probe, on which some measurements have been performed. Many points have been archived with strain and stress for example. So in the `values` table all the strain/stress values belongs to different object...
A lot of values are stored in values. The data is in a hierarchical order: there could be a probe on which some measurements have been performed, and many points have been archived with strain and stress, for example. So in the values table all the strain/stress values belong to different objects.
The relation to the probe is stored in another table. In this example the hierarchical order is neglected.
I would like to calculate something on the probe level, based on the measurements. So I want to group all the stress/strain values in a JSON, so it is clear they belong together.
Now there should be a second measurement with strain and stress, and I want to calculate values based on all values, but grouped by strain. So there should be a value calculated from the stress where strain = 2 (for example) of both measurements (the additional level in the hierarchy coming from multiple measurements is neglected too). And that's not enough: I want to be able to group by multiple values.
In multiple rounds I calculate values based on the values of previous rounds.
I made a little example:
- complex is a table containing the model info for grouping the data. complex_in contains the inputs for the aggregated datatype (-> json). complex_groupcond contains the information by which subvalues (in the json) the calculation should be grouped. values contains the values in json format.
- The second block is just for inserting initial values and grouping models.
- The third block simulates the first round, building the first complex data type.
- In the fourth block I have the basics of my query, where the values get joined with the inputs for the complex data type and get grouped by them. Important is the window function for getting the group-by condition out of the json, and the DISTINCT on the same partition.
So I have some questions (now that there is more data, the performance should be improved):
Sort Key: "values".name, "values".value_id, complex_in.complex_id
Presorted Key: "values".name, "values".value_id
1. The system uses the index complex_in_pkey for the join of values and complex_in. Why does the system not recognize that everything should already be sorted with complex_in.complex_id in the next place (because complex_id is in the index complex_in_pkey)?
2. Is an approach better where I join the (pre-joined) values and complex_in with a pre-grouped table of complex_cond, so the rows of the pre-join can't get multiplied by this join? But then it is more difficult to extract the grouping condition out of the complex value (json).
3. Or would it be better to leave all the values single and track the hierarchical order the values come from, so that in the calculation one could order by this hierarchical knowledge? (Not finished thinking about the grouping condition in that case.)
If you are a German speaker, perhaps this description will be clearer.
To make the question more clear:
1. Why does the system not recognize that the data is already sorted by "values".name, "values".value_id, complex_in.complex_id? (Or isn't that true, and why?)
2. Is a join (with complex_groupcond) on a unique key better, with the usage of a new special function (for extracting the data)?
3. Do you have a better idea to approach a problem like that?
lrahlff
(1 rep)
Mar 6, 2024, 09:01 AM
• Last activity: Mar 7, 2024, 04:39 AM
1
votes
2
answers
846
views
Window functions and distinct
I have the following table. Column | Type | Collation | Nullable | Default ----------+-----------------------------+-----------+----------+--------- code | text | | not null | price_at | timestamp without time zone | | not null | price | double precision | | not null | I need to use the data in this...
I have the following table.
Column | Type | Collation | Nullable | Default
----------+-----------------------------+-----------+----------+---------
code | text | | not null |
price_at | timestamp without time zone | | not null |
price | double precision | | not null |
I need to use the data in this table to create a candlestick chart that will have 6-hour candlesticks. Each candlestick is represented by the fields
{ x, open, close, low, high }
where x is the UNIX timestamp of the beginning of the period.
The following query works, but using distinct
causes the query to take longer. Usually, when people don't want to use distinct, they use group by
instead, but I can't use that with the window functions and I'm not sure it would help anyway. Is there a way to eliminate the use of distinct in this query to make it faster and still return the same results?
with price_quotes as (
select
extract (epoch from price_at) - (extract (epoch from price_at) % extract (epoch from '6 hours'::interval)) as period_begin,
extract (epoch from price_at) as quote_time,
price
from quote)
select distinct
period_begin as x,
first_value (price) over (partition by period_begin order by quote_time asc) as open,
last_value (price) over (partition by period_begin order by quote_time asc rows between current row and unbounded following) as close,
min (price) over (partition by period_begin) as low,
max (price) over (partition by period_begin) as high
from price_quotes
order by x asc
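A sketch of one way this is often rewritten without DISTINCT, assuming the same quote table: ordered array_agg picks the first and last price in each bucket, so a plain GROUP BY can replace the window functions (epoch values are cast to bigint here just so the modulo arithmetic works the same way).
```
WITH price_quotes AS (
  SELECT
    extract(epoch FROM price_at)::bigint
      - (extract(epoch FROM price_at)::bigint % (6 * 3600)) AS period_begin,  -- 6-hour bucket
    price_at,
    price
  FROM quote
)
SELECT
  period_begin                                  AS x,
  (array_agg(price ORDER BY price_at ASC))[1]   AS open,   -- first price in the bucket
  (array_agg(price ORDER BY price_at DESC))[1]  AS close,  -- last price in the bucket
  min(price)                                    AS low,
  max(price)                                    AS high
FROM price_quotes
GROUP BY period_begin
ORDER BY x;
```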
Zephyrus
(283 rep)
Nov 14, 2023, 04:55 PM
• Last activity: Nov 15, 2023, 12:55 PM
0
votes
0
answers
293
views
Optimizing SELECT COUNT(DISTINCT) on a table increasing daily
Let's say we have a table ```Daily_users``` which has the columns ```student_id, school_id, grade, timestamp```. We collect usage data of students daily and so the table grows daily (note that there could be multiple rows corresponding to the same student - the purpose is to track usages, so the stu...
Let's say we have a table Daily_users which has the columns student_id, school_id, grade, timestamp. We collect usage data of students daily and so the table grows daily (note that there could be multiple rows corresponding to the same student - the purpose is to track usages, so the student could be logging on and off, or using different apps etc). Using this table, we maintain another table called unique_students_to_date, and it has the columns school_id, grade, count.
The count is the number of unique students we have recorded for a pair of (school_id, grade), so far.
Then, once yesterday passes, we run the nightly query to create the table unique_students_to_date:
SELECT COUNT(DISTINCT(student_id)) AS count,
grade,
school_id
FROM Daily_users
WHERE timestamp < '"today's date"'
GROUP BY school_id, grade
to get the unique students up to yesterday inclusive. This is a simple enough query and it does the job. However, I can't help thinking there are lots of redundancies here. The table Daily_users today differs from yesterday's version by only one day's worth of data, so when we do COUNT(DISTINCT(student_id)) we are re-doing a lot of the calculations we did yesterday.
So my question is: can we optimize this to at least minimize the redundant calculations?
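A sketch of one incremental approach, assuming a new helper table (called seen_students here, not part of the original schema) is acceptable: record each (student_id, school_id, grade) only the first time it appears, so the nightly job scans only one day of data and the per-(school_id, grade) counts become a cheap aggregate.
```
-- Helper table: one row per student ever seen (names/types are assumptions).
CREATE TABLE seen_students (
    student_id BIGINT NOT NULL,
    school_id  BIGINT NOT NULL,
    grade      INT    NOT NULL,
    PRIMARY KEY (student_id, school_id, grade)
);

-- Nightly: insert only students that appeared yesterday and are not yet known
-- ('2023-10-13'/'2023-10-14' are placeholder dates for "yesterday"/"today").
INSERT INTO seen_students (student_id, school_id, grade)
SELECT DISTINCT d.student_id, d.school_id, d.grade
FROM Daily_users d
WHERE d.timestamp >= '2023-10-13'
  AND d.timestamp <  '2023-10-14'
  AND NOT EXISTS (SELECT 1
                  FROM seen_students s
                  WHERE s.student_id = d.student_id
                    AND s.school_id  = d.school_id
                    AND s.grade      = d.grade);

-- unique_students_to_date is then a plain aggregate over the helper table.
SELECT school_id, grade, COUNT(*) AS count
FROM seen_students
GROUP BY school_id, grade;
```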
dezdichado
(101 rep)
Oct 13, 2023, 09:59 PM
• Last activity: Oct 14, 2023, 06:15 AM
0
votes
1
answers
132
views
The most recent rows in for each type/agent
I have a table (still on MySQL 5.7) with lots of data. This table is written by some scripts. It works like a "checkin", where each agent is checking in in some period of time. It looks like below. | id | agent | status | timestamp | |----|-------|--------|----------|-----------| | 1 | tom | ok | 20...
I have a table (still on MySQL 5.7) with lots of data. This table is written by some scripts. It works like a "check-in", where each agent checks in every so often. It looks like below.
| id  | agent | status   | timestamp           |
|-----|-------|----------|---------------------|
| 1   | tom   | ok       | 2023-03-16 12:27:03 |
| 2   | jeff  | degraded | 2023-08-31 00:01:13 |
| 100 | tom   | ok       | 2023-10-03 12:00:00 |
| 101 | jeff  | ok       | 2023-10-03 11:59:00 |
I'd like to pick the freshest row for each agent, so in the above example I'd like to get rows 100 and 101.
I use the following query, but it returns only the first two rows. I tried changing the ORDER BY to id or timestamp, ASC or DESC.
SELECT * FROM (SELECT * FROM db_table ORDER BY timestamp ASC) as agnt GROUP BY agent
How can I make it work?
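A sketch of the usual MySQL 5.7 pattern (no window functions yet), assuming the db_table name from the query above: compute each agent's latest timestamp in a derived table and join it back.
```
-- One row per agent: the row whose timestamp equals that agent's maximum.
SELECT t.*
FROM db_table t
JOIN (
    SELECT agent, MAX(`timestamp`) AS max_ts
    FROM db_table
    GROUP BY agent
) latest
  ON latest.agent  = t.agent
 AND latest.max_ts = t.`timestamp`;
```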
Marek
(13 rep)
Oct 3, 2023, 11:49 AM
• Last activity: Oct 3, 2023, 08:01 PM
0
votes
1
answers
57
views
Select distinct
I had some confusion regarding ```distinct``` keyword... For example Table ```test``` has 5 columns ```a``` ```b``` ```c``` ```d``` ```e```, and column ```a``` has repeated value, and I want to use ```distinct``` keyword to get distinct values while also fetching data from other columns eg. ```selec...
I had some confusion regarding the distinct keyword...
For example, table test has 5 columns a, b, c, d, e, and column a has repeated values, and I want to use the distinct keyword to get distinct values while also fetching data from the other columns,
e.g. select distinct * from test
(I know the syntax is wrong here, I have written the query to help people get the idea of what I'm trying to do here)
Can someone help here?
///edit
ID DEGREE AWARDED ACADEMIC_PERIOD
147 GR MS 2022
147 UG BS 2020
170 GR MA 2023
170 UG BA 2018
Expected output is
ID DEGREE AWARDED ACADEMIC_PERIOD
147 GR MS 2022
170 GR MA 2023
found this data on dbastackexchange itself
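DISTINCT alone cannot do this, because the other columns differ per row. A sketch of the usual "top row per group" pattern, assuming a hypothetical table name degrees and that "most recent ACADEMIC_PERIOD per ID" is the rule behind the expected output:
```
-- One row per ID, keeping the row with the largest ACADEMIC_PERIOD.
SELECT ID, DEGREE, AWARDED, ACADEMIC_PERIOD
FROM (
    SELECT ID, DEGREE, AWARDED, ACADEMIC_PERIOD,
           ROW_NUMBER() OVER (PARTITION BY ID
                              ORDER BY ACADEMIC_PERIOD DESC) AS rn
    FROM degrees
) ranked
WHERE rn = 1;
```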
datascinalyst
(105 rep)
Oct 3, 2023, 11:25 AM
• Last activity: Oct 3, 2023, 01:20 PM
19
votes
1
answers
1223
views
DISTINCT not reducing two equal values to one
Can anyone explain the below situation, where two seemingly equal values are not reduced by `DISTINCT`? [![screenshot showing two values for 'SBS_UCS_DISPATCH'][1]][1] The query above is `SELECT DISTINCT name FROM master.sys.dm_os_spinlock_stats where name = 'SBS_UCS_DISPATCH';` The equivalent `SELE...
Can anyone explain the below situation, where two seemingly equal values are not reduced by DISTINCT?
(screenshot showing two values for 'SBS_UCS_DISPATCH')
The query above is SELECT DISTINCT name FROM master.sys.dm_os_spinlock_stats where name = 'SBS_UCS_DISPATCH';
The equivalent SELECT name FROM master.sys.dm_os_spinlock_stats where name = 'SBS_UCS_DISPATCH' GROUP BY name;
also does the same, and adding HAVING COUNT(1) > 1
does not yield the rows.
@@VERSION
is *Microsoft SQL Server 2019 (RTM-CU13) (KB5005679) - 15.0.4178.1 (X64) Sep 23 2021 16:47:49 Copyright (C) 2019 Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: )*
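Not an answer to the "why", but one way to check whether the two name values really are byte-for-byte identical; trailing spaces or non-printing characters often explain cases like this:
```
-- Compare trimmed length, byte length, and the raw bytes of the "duplicates";
-- any difference in DATALENGTH or in the varbinary dump reveals hidden characters.
SELECT name,
       LEN(name)                     AS len_trimmed,
       DATALENGTH(name)              AS byte_length,
       CONVERT(varbinary(256), name) AS raw_bytes
FROM master.sys.dm_os_spinlock_stats
WHERE name = 'SBS_UCS_DISPATCH';
```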
jimbobmcgee
(529 rep)
Aug 3, 2023, 02:55 PM
• Last activity: Aug 4, 2023, 03:14 PM
1
votes
1
answers
2216
views
Insert rows into a (postgres) table where the first col is a static value and the second col is the result of a SELECT DISTINCT from another table
I haven't had to use SQL in years and I'm clearly rusty. I have a situation where I'm trying to insert multiple rows into table A. There are two columns of data to be inserted, the first is a "static" value (it is derived but will be the same for all new rows) and the second is the result of a SELEC...
I haven't had to use SQL in years and I'm clearly rusty.
I have a situation where I'm trying to insert multiple rows into table A.
There are two columns of data to be inserted, the first is a "static" value (it is derived but will be the same for all new rows) and the second is the result of a SELECT DISTINCT on table B. There are no relevant constraints on either of the tables that should come into play here.
FWIW, this is a sql script in Retool going against a postgres db.
_Apologies for the formatting. Table markdown doesn't seem to be working for me_
__Table A__ should look like this at the end:
**id** | **source_file**
foo | apple.csv
foo | banana.csv
foo | orange.csv
Based on these values in __Table B__
**source_file**
apple.csv
apple.csv
apple.csv
banana.csv
banana.csv
orange.csv
orange.csv
I've tried a few variations all along the lines of:
INSERT INTO table_A (id, source_file)
{{id}}, SELECT DISTINCT source_file
FROM table_B WHERE source_file IS NOT NULL AND source_file <> ''
Which throws
syntax error at or near "$1"
And
INSERT INTO table_A (id, source_file)
SELECT DISTINCT {{id}}, source_file
FROM table_B WHERE source_file IS NOT NULL AND source_file <> ''
Which throws
insert or update on table "dg_client_data_sources" violates
foreign key constraint "fk_data_grid"
That second error has me wondering if I need to loop through (gasp!) the results of the SELECT DISTINCT to insert the multiple rows? I may be overthinking that bit....
I can provide other context/examples if needed. Any direction is greatly appreciated!
Amy Lee
(13 rep)
May 30, 2023, 05:32 PM
• Last activity: May 31, 2023, 06:20 AM
3
votes
2
answers
2437
views
Improving distinct values estimates in Postgres
Full counts in Postgres can be slow, for reasons that are well-understood and much discussed. So, I've been using estimation techniques instead, where possible. For rows, pg_stats seems fine, for views, extracting an estimate returned by `EXPLAIN` works okay. https://www.cybertec-postgresql.com/en/c...
Full counts in Postgres can be slow, for reasons that are well-understood and much discussed. So, I've been using estimation techniques instead, where possible. For rows, pg_stats seems fine, for views, extracting an estimate returned by
EXPLAIN
works okay.
https://www.cybertec-postgresql.com/en/count-made-fast/
But what about distinct values? Here, I have had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by factors of 2 or 20. Truncated tables seem to have badly stale estimates in particular (?).
I just ran this test and have provided some results:
analyze assembly_prods; -- Doing an ANALYZE to give pg_stats every help.
select 'count(*) distinct' as method,
count(*) as count
from (select distinct assembly_id
from assembly_prods) d
union all
select 'n_distinct from pg_stats' as method,
n_distinct as count
from pg_stats
where tablename = 'assembly_prods' and
attname = 'assembly_id';
The results:
method count
count(*) distinct 28088
n_distinct from pg_stats 13805
That's only off by a factor of 2, but I've seen _much_ worse in my data, to the point where I won't use estimates. Is there something else that I can try? Is this something that PG 12 improves?
# Follow-up #
I hadn't ever experimented with SET STATISTICS before, because there are only so many hours in a day. Inspired by Laurenz' answer, I've taken a quick look. Here's a useful comment from the documentation:
https://www.postgresql.org/docs/current/planner-stats.html
> The amount of information stored in pg_statistic
by ANALYZE
, in
> particular the maximum number of entries in the most_common_vals
and
> histogram_bounds arrays for each column, can be set on a
> column-by-column basis using the ALTER TABLE SET STATISTICS
command,
> or globally by setting the default_statistics_target
configuration
> variable. The default limit is presently 100 entries. Raising the
> limit might allow more accurate planner estimates to be made,
> particularly for columns with irregular data distributions, at the
> price of consuming more space in pg_statistic
and slightly more time
> to compute the estimates. Conversely, a lower limit might be
> sufficient for columns with simple data distributions.
I have often got tables with a few common values and a lot of rare values. Or the other way around, so the right threshold will depend. For those who haven't used SET STATISTICS
, it lets you set the sampling rate as a target number of entries. The default is 100, so 1000 should be higher fidelity. Here's what that looks like:
ALTER TABLE assembly_prods
ALTER COLUMN assembly_id
SET STATISTICS 1000;
You can use SET STATISTICS
on a table or index. Here's an interesting piece on indexes:
https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/
Note that the current documentation *does* list SET STATISTICS
on indexes.
So I tried out thresholds of 1, 10, 100, 1000, and 10,000 and got these results out of a table with 467,767 rows and 28,088 distinct values:
Target Estimate Difference Missing
1 13,657 14,431 51%
10 13,867 14,221 51%
100 13,759 14,329 51%
1,000 24,746 3,342 12%
10,000 28,088 0 0%
Obviously you can't draw any general conclusions from one case, but SET STATISTICS
looks pretty darn useful and I'll be glad to have it in the back of my mind. I'm tempted to raise the target a bit in general as I suspect it would help in many of the cases in our system.
Morris de Oryx
(939 rep)
Oct 1, 2019, 12:04 AM
• Last activity: Apr 27, 2023, 09:44 PM
16
votes
2
answers
25767
views
SELECT DISTINCT ON, ordered by another column
Please consider the following table `test`: ``` CREATE TABLE test(col1 int, col2 varchar, col3 date); INSERT INTO test VALUES (1,'abc','2015-09-10') , (1,'abc','2015-09-11') , (2,'xyz','2015-09-12') , (2,'xyz','2015-09-13') , (3,'tcs','2015-01-15') , (3,'tcs','2015-01-18'); ``` postgres=# select * f...
Please consider the following table
test
:
CREATE TABLE test(col1 int, col2 varchar, col3 date);
INSERT INTO test VALUES
(1,'abc','2015-09-10')
, (1,'abc','2015-09-11')
, (2,'xyz','2015-09-12')
, (2,'xyz','2015-09-13')
, (3,'tcs','2015-01-15')
, (3,'tcs','2015-01-18');
postgres=# select * from test;
col1 | col2 | col3
------+------+------------
1 | abc | 2015-09-10
1 | abc | 2015-09-11
2 | xyz | 2015-09-12
2 | xyz | 2015-09-13
3 | tcs | 2015-01-15
3 | tcs | 2015-01-18
I'd like to have a returned set ordered by date desc:
col1 | col2 | col3
------+------+------------
2 | xyz | 2015-09-13
1 | abc | 2015-09-11
3 | tcs | 2015-01-18
What I've managed to accomplish with distinct on
:
select distinct on (col1) col1, col2, col3 from test order by col1, col3 desc;
col1 | col2 | col3
------+------+------------
1 | abc | 2015-09-11
2 | xyz | 2015-09-13
3 | tcs | 2015-01-18
And not what I need with having
:
select distinct on (col1) col1, col2, col3 from test group by col1, col2, col3 having col3 = max(col3)
col1 | col2 | col3
------+------+------------
1 | abc | 2015-09-10
2 | xyz | 2015-09-13
3 | tcs | 2015-01-18
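A sketch of the usual workaround for the test table above: keep DISTINCT ON (col1) with its required ORDER BY in a subquery, then re-order the outer result by col3 DESC.
```
SELECT *
FROM (
    SELECT DISTINCT ON (col1) col1, col2, col3
    FROM test
    ORDER BY col1, col3 DESC   -- pick the latest row per col1
) sub
ORDER BY col3 DESC;            -- then order the result as desired
```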
Luis
(347 rep)
May 5, 2020, 08:43 PM
• Last activity: Apr 2, 2023, 09:35 AM
0
votes
1
answers
121
views
Fetching 88 records via DISTINCT from a table of 1.5 million rows is taking 3 seconds
There is a table which are populated from an online service and updated every week. This goes on a separate (second) database. The table is like 300MB in size (1.5 million rows) processed by a cron that downloads the CSV data every week. [This CSV file](https://pricing.us-east-1.amazonaws.com/offers...
There is a table which is populated from an online service and updated every week.
This goes on a separate (second) database.
The table is like 300MB in size (1.5 million rows) processed by a cron that downloads the CSV data every week. [This CSV file](https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/index.csv) is 3GB in size (4 millions rows) where rows and columns are filtered to retain only a subset of the data. And the order of the columns *may* change. It has in the past. So we can't even predict based on the first row. Even row header names may change.
So it's a total dump, filtered by some hard-set row names.
In my Django view, which is a DRF-powered one, my API URL takes 3 seconds (0:00:03.263031) to execute.
SELECT DISTINCT `Region` AS Region
FROM `my-table-name`
WHERE `flag` = 'condition'
LIMIT 0, 100
The thing is, if it were just this endpoint being fetched on the page, it wouldn't hurt. But there are many endpoints on the single page which are fetched when the user clicks on an INPUT element, and 3 seconds for each trigger is way too long.
What else can I do to optimize the table / query ?
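One thing that may help, sketched under the assumption that this is MySQL (the identifier quoting and LIMIT 0, 100 suggest so) and that the query always filters on flag: a composite index covering the query, so the DISTINCT can be answered from the index instead of scanning 1.5 million rows.
```
-- Composite covering index on (flag, Region); the index name is an arbitrary choice.
CREATE INDEX idx_flag_region ON `my-table-name` (`flag`, `Region`);
```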

anjanesh
(279 rep)
Mar 23, 2023, 11:42 AM
• Last activity: Mar 23, 2023, 12:01 PM
0
votes
2
answers
1160
views
PostgreSQL, INSERT INTO SELECT DISTINCT generates more rows?
I have a large PostgreSQL table (NOT indexed) with duplicates so I ran the following query to eventually end up with a unique rows table: INSERT INTO newTable(Field1, Field2, Field3) SELECT DISTINCT Field1, Field2, Field3 FROM oldTable ; It ran until I ran out of disk space which is more than the si...
I have a large PostgreSQL table (NOT indexed) with duplicates so I ran the following query to eventually end up with a unique rows table:
INSERT INTO newTable(Field1, Field2, Field3)
SELECT DISTINCT Field1, Field2, Field3
FROM oldTable ;
It ran until I ran out of disk space, which is more than the size of oldTable, and I'm positive that at least a third of it is duplicates.
Does anyone know why it behaves like this? I couldn't run EXPLAIN ANALYZE because the table is over a TB in size.
I can't remove duplicates using SELECT with functions min() or max() as I don't have an ID column.
I expected a smaller table with unique rows.
Edit :
Does running a slightly different query change the in-depth mechanism of the process? I know that the following creates a new table, unlike the first query that inserts into an existing one:
SELECT DISTINCT Field1, Field2, Field3
INTO newTable
FROM oldTable;
Anvil
(11 rep)
Mar 7, 2023, 07:01 PM
• Last activity: Mar 8, 2023, 07:29 PM
Showing page 1 of 20 total questions