Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
-4
votes
1
answers
470
views
choosing oldest valued cell when 2 columns have same value but third column has differing values (SQL)
In SQL, I have three columns, the first two are values to create distinction for a person. The third column will have different values for each person (dates). How do I select the row for each person with the oldest valued date?...
In SQL, I have three columns, the first two are values to create distinction for a person. The third column will have different values for each person (dates). How do I select the row for each person with the oldest valued date?
should output:
John Plumber 1/3/2020
Elizabeth Teacher 12/1/2020
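A minimal sketch of one common approach, using hypothetical names (people for the table; person_name, occupation, event_date for the columns, since the post does not show the real ones): group by the identifying columns and take the minimum date.
```
-- Hypothetical table/column names; one row per person, keeping the oldest date.
SELECT person_name, occupation, MIN(event_date) AS oldest_date
FROM people
GROUP BY person_name, occupation;
```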

mcsmithums
(9 rep)
Jun 3, 2022, 07:27 PM
• Last activity: Jul 9, 2025, 02:40 PM
17
votes
1
answers
1711
views
Union does not always eliminate duplicates
I have the following query and expect that as a result I will have a list of IDs without duplicates. But sometimes it produces duplicates (1-2 in 4 million rows). Why can this happen? I run it at the default (read committed) isolation level. I can't use TABLOCK/serializable because these are OLTP ta...
I have the following query and expect that as a result I will have a list of IDs without duplicates. But sometimes it produces duplicates (1-2 in 4 million rows).
Why can this happen? I run it at the default (read committed) isolation level. I can't use TABLOCK/serializable because these are OLTP tables with hundreds of changes.
```
CREATE TABLE #characterId (
CharacterId BIGINT
)
DECLARE @dateTimeFrom DATETIME = '2025-05-04'
, @dateTimeTo DATETIME = '2025-05-07'
INSERT INTO #characterId (CharacterId)
SELECT Id
FROM table1 u
WHERE u.DateUpdated >= @dateTimeFrom
AND u.DateUpdated < @dateTimeTo
UNION
...
WHERE usi.DateUpdated >= @dateTimeFrom
AND usi.DateUpdated < @dateTimeTo
UNION
...
WHERE ust.DateCreated >= @dateTimeFrom
AND ust.DateCreated < @dateTimeTo
```
Plan with UNION ALL instead of the first UNION:

Novitskiy Denis
(331 rep)
May 6, 2025, 07:33 AM
• Last activity: May 8, 2025, 06:09 PM
1
votes
1
answers
63
views
Poor performance bucketing historical data with DISTINCT
I'm running into performance problems with a query that buckets a set of historical records of daily counts into a form that can be rendered easily as a daily graph. ## Context Essentially, the table in question is a record of daily audit reports where the number of problems of varying seriousness a...
I'm running into performance problems with a query that buckets a set of historical records of daily counts into a form that can be rendered easily as a daily graph.
## Context
Essentially, the table in question is a record of daily audit reports where the number of problems of varying seriousness are reported for a set of location IDs. Each record in the table contains the counts of problems found during the audit at one location on one day, of several severities.
CREATE TABLE IF NOT EXISTS problem_reports (
internal_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
day timestamp NOT NULL,
location_id varchar(16) NOT NULL,
bad integer DEFAULT 0 NOT NULL,
awful integer DEFAULT 0 NOT NULL,
catastrophic integer DEFAULT 0 NOT NULL
);
Since audits are not performed at every location on every day, I saved space by only recording rows for days that an audit has been performed. If the number of problems found was zero, I record a row with all zero counts for that location, to indicate that the number of problems of each severity is now zero as of that date.
The output I need from the query is a running daily total of all problems found at all locations, by severity. So, if an audit wasn't performed for a location on a given day, the query needs to "look back" to find the most recent audit for that location, and include that in the aggregation. I achieved this using a DISTINCT clause:
WITH days AS (
SELECT GENERATE_SERIES(
DATE_TRUNC('day', '2025-03-01'::date),
DATE_TRUNC('day', '2025-03-07'::date),
'1 day'::interval
) AS day
),
counts AS (
SELECT DISTINCT ON (days.day, pr.location_id)
days.day, pr.location_id, pr.bad, pr.awful, pr.catastrophic
FROM problem_reports pr
RIGHT JOIN days ON pr.day <= days.day
ORDER BY days.day, pr.location_id, pr.day DESC, pr.internal_id DESC
)
...
|Unique (cost=46930.59..48855.59 rows=40000 width=86) (actual time=0.109..0.120 rows=18 loops=1) |
| Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Buffers: shared hit=1 |
| -> Sort (cost=46930.59..47572.26 rows=256667 width=86) (actual time=0.108..0.111 rows=32 loops=1) |
| Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Sort Key: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.day DESC, pr.internal_id DESC |
| Sort Method: quicksort Memory: 27kB |
| Buffers: shared hit=1 |
| -> Nested Loop Left Join (cost=0.00..11584.65 rows=256667 width=86) (actual time=0.048..0.077 rows=32 loops=1) |
| Output: (generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval)), pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id|
| Join Filter: (pr.day <= ...) |
| -> ProjectSet (cost=0.00..5.03 rows=1000 width=8) (actual time=0.024..0.027 rows=7 loops=1) |
| Output: generate_series(date_trunc('day'::text, ('2025-03-01'::date)::timestamp with time zone), date_trunc('day'::text, ('2025-03-07'::date)::timestamp with time zone), '1 day'::interval) |
| -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1) |
| -> Materialize (cost=0.00..21.55 rows=770 width=78) (actual time=0.003..0.004 rows=8 loops=7) |
| Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Buffers: shared hit=1 |
| -> Seq Scan on sto.problem_reports pr (cost=0.00..17.70 rows=770 width=78) (actual time=0.016..0.019 rows=8 loops=1) |
| Output: pr.location_id, pr.bad, pr.awful, pr.catastrophic, pr.day, pr.internal_id |
| Buffers: shared hit=1 |
|Planning Time: 0.186 ms |
|Execution Time: 0.188 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
## The Problem
This works fine for moderate numbers of days and locations, but as the number of those grow, the DISTINCT clause in the CTE starts to perform very poorly. I considered recording a row for every location on every day, to eliminate the need for the DISTINCT "look-back", but that will quickly result in more rows than I'd like to deal with, as the number of locations is expected to grow exponentially.
In some cases, I need to query for only a subset of locations, and render a graph of that, so recording a preprocessed aggregate of just the daily counts to the DB wouldn't work.
Is there a more efficient way of storing and querying this type of historical data to produce the graph data? Any suggestions appreciated!
hampercm
(111 rep)
Apr 1, 2025, 06:49 PM
• Last activity: Apr 4, 2025, 09:45 PM
0
votes
1
answers
672
views
Select only those records that have same multiple values for a particular column if it's existing?
Below is an example of my `pricing_supports` table with more than 3 million rows. Some rows have the same values in the `first_origin_id`, `delivery_id` fields. If there are records with the same `first_origin_id`, `delivery_id` but different `source`, I only want to select those records where `sou...
Below is an example of my pricing_supports table with more than 3 million rows.
Some rows have the same values in the first_origin_id and delivery_id fields.
If there are records with the same first_origin_id and delivery_id but a different source, I only want to select those records where source = 0.
ID code first_origin_id delivery_id source
1 A 10 20 0
2 B 10 20 1
3 C 11 21 1
4 D 12 22 0
5 E 12 22 1
I would like result like that:
ID code first_origin_id delivery_id source
1 A 10 20 0
3 C 11 21 1
4 D 12 22 0
How can I do this with good performance?
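A minimal sketch of one way to express the "prefer source = 0" rule, assuming the table and column names shown above; performance would likely depend on an index such as (first_origin_id, delivery_id, source).
```
-- Keep a row if it is the source = 0 row for its pair, or if no source = 0
-- row exists for that (first_origin_id, delivery_id) pair at all.
SELECT p.*
FROM pricing_supports p
WHERE p.source = 0
   OR NOT EXISTS (
        SELECT 1
        FROM pricing_supports p0
        WHERE p0.first_origin_id = p.first_origin_id
          AND p0.delivery_id     = p.delivery_id
          AND p0.source          = 0
      );
```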
Dương Khoa
(1 rep)
Sep 20, 2018, 04:24 AM
• Last activity: Jan 27, 2025, 05:04 PM
-1
votes
1
answers
155
views
What happened to Oracle's COUNTDISTINCT() function?
I need the count of unique values, and am trying to use Oracle's [COUNTDISTINCT()](https://docs.oracle.com/cd/E93962_01/bigData.Doc/eql_onPrem/src/reql_aggregation_countdistinct.html) function to obtain that: ```sql select COUNTDISTINCT(a.m_label) from user_rep a, user_group_rep b, trn_grp_rep c whe...
I need the count of unique values, and am trying to use Oracle's [COUNTDISTINCT()](https://docs.oracle.com/cd/E93962_01/bigData.Doc/eql_onPrem/src/reql_aggregation_countdistinct.html) function to obtain that:
select COUNTDISTINCT(a.m_label)
from user_rep a, user_group_rep b, trn_grp_rep c
where b.m_user_id = a.m_reference
...
This results in ORA-00904: "COUNTDISTINCT": invalid identifier
Using plain COUNT() works, but is not returning correct results.
What am I doing wrong? What'd be an efficient work-around?
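For what it's worth, the linked COUNTDISTINCT() page appears to document the Endeca Query Language rather than Oracle Database SQL; a minimal sketch of the usual Oracle SQL spelling, keeping only the join condition shown in the question (the rest of the WHERE clause was elided in the post):
```
-- COUNT(DISTINCT ...) is the standard form in Oracle Database SQL.
-- The remaining join/filter conditions elided ("...") in the question
-- would still be needed here.
SELECT COUNT(DISTINCT a.m_label)
FROM user_rep a, user_group_rep b, trn_grp_rep c
WHERE b.m_user_id = a.m_reference;
```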
Mikhail T.
(166 rep)
Nov 12, 2024, 05:43 PM
• Last activity: Nov 12, 2024, 06:07 PM
1
votes
1
answers
67
views
Finding the distinct values in an array of documents
I've got a database full of documents which each contain a collection of transactions: ``` [ { "key": 1, "data": [ { "trans": 1, "uid": 1 }, { "trans": 2, "uid": 1 } ] }, { "key": 2, "data": [ { "trans": 3, "uid": 1 }, { "trans": 4, "uid": 2 } ] } ] ``` I want to create a new field in each of the ma...
I've got a database full of documents which each contain a collection of transactions:
[
{
"key": 1,
"data": [
{
"trans": 1,
"uid": 1
},
{
"trans": 2,
"uid": 1
}
]
},
{
"key": 2,
"data": [
{
"trans": 3,
"uid": 1
},
{
"trans": 4,
"uid": 2
}
]
}
]
I want to create a new field in each of the main documents, which has the unique values of the uid
field. I can get partway there using $map
:
db.collection.aggregate([
{
"$set": {
"uid": {
"$map": {
"input": "$data",
"as": "trans",
"in": "$$trans.uid"
}
}
}
}
])
This gives me:
[
{
"key": 1,
"uid": [1,1],
"data": [
{
"trans": 1,
"uid": 1
},
{
"trans": 2,
"uid": 1
}
]
},
{
"key": 2,
"uid": [1, 2],
"data": [
{
"trans": 3,
"uid": 1
},
{
"trans": 4,
"uid": 2
}
]
}
]
This is close, but I can't seem to figure out the last step. I want to use only the unique values, so the uid for the first document should be [1], not [1, 1].
The distinct()
function works across collections, not single documents.
I would think that $addToSet
would work, but it doesn't operate on arrays, only on the output of $group
. I also looked at trying to create a $reduce
specification using $setUnion
, but I don't know how to promote my numeric value into an array.
I can use the $unwind
stage with grouping by _id
to get the right values for the new field, but I can't figure out how to attach them back to the original objects.
ralmond
(13 rep)
Nov 8, 2024, 11:26 PM
• Last activity: Nov 9, 2024, 01:10 AM
0
votes
2
answers
335
views
Selecting a count of distinct logins per day
I have an activity_log table from which I need to derive a basic output of how many distinct users logged in each day, summed up day by day. The table logs all actions of a user and will have several entries per day for each user. Here is the table - This is what I have so fa...
I have an activity_log table from which I need to derive a basic output of how many distinct users logged in each day, summed up day by day. The table logs all actions of a user and will have several entries per day for each user.
Here is the table -
This is what I have so far -
select DISTINCT CAST(test.activitytime AS date) Date, username
from [dbo].[activity_log] as test
where test.activitytime between '2024-03-01' and '2024-03-08'
order by Date
Here is the output for that -
I need it to display every day once, and sum up the total of distinct users per day (does not matter if they logged in yesterday as well) just total distinct users a day. I am lost on where to go from here
ie
2024-03-01 24
2024-03-02 39
2024-03-03 26
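A sketch of the usual next step, keeping the table and column names from the query above: group by the calendar day and count distinct usernames (a half-open date range is used here instead of BETWEEN so midnight of the end date isn't included).
```
SELECT CAST(test.activitytime AS date) AS [Date],
       COUNT(DISTINCT test.username)   AS DistinctUsers
FROM [dbo].[activity_log] AS test
WHERE test.activitytime >= '2024-03-01'
  AND test.activitytime <  '2024-03-08'
GROUP BY CAST(test.activitytime AS date)
ORDER BY [Date];
```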


JDBA
(3 rep)
Apr 17, 2024, 10:20 PM
• Last activity: Aug 19, 2024, 07:56 AM
2
votes
1
answers
442
views
Emulate Loose Index Scan for multiple columns with alternating sort direction
A while back I asked [this question](https://dba.stackexchange.com/questions/320064/use-skip-scan-index-to-efficiently-select-unique-permutations-of-columns-in-post) about efficiently selecting unique permutations of columns in Postgres. Now I have a follow-up question regarding how to do so, with t...
A while back I asked [this question](https://dba.stackexchange.com/questions/320064/use-skip-scan-index-to-efficiently-select-unique-permutations-of-columns-in-post) about efficiently selecting unique permutations of columns in Postgres. Now I have a follow-up question regarding how to do so, with the addition of being able to order any of the columns with any combination of
ASC
/DESC
across the columns.
The table contains hundreds of millions of rows, and while the accepted answer to my previous question is orders of magnitude faster than traditional approaches, not being able to order the columns in an ad-hoc way prevents me from putting this query to good use (I really need it to 'paginate', with LIMIT
/OFFSET
in small chunks). Is there a way to do this? The author of the previous answer kindly suggested a workaround (changing the row comparison for an explicit where clause), which I tried, but it doesn't seem to work (or I misunderstand it).
Given the following generic query:
WITH RECURSIVE cte AS (
(
SELECT col1, col2, col3, col4
FROM tbl
ORDER BY 1,2,3,4
LIMIT 1
)
UNION ALL
SELECT l.*
FROM cte c
CROSS JOIN LATERAL (
SELECT t.col1, t.col2, t.col3, t.col4
FROM tbl t
WHERE (t.col1, t.col2, t.col3, t.col4) > (c.col1, c.col2, c.col3, c.col4)
ORDER BY 1,2,3,4
LIMIT 1
) l
)
SELECT * FROM cte
Is there a way to order the columns in an ad-hoc way, whilst maintaining the performance? For example:
ORDER BY col1 DESC, col2 ASC, col3 ASC, col4 DESC
Assume an index on each column, as well as a combined index across all 4 columns.
Postgres version is 15.4.
The table is read-only in the sense that the data can't / won't be modified, however it will be added to. Following is a CREATE TABLE
script to replicate my problematic table (more or less):
CREATE TABLE tbl (id SERIAL primary key, col1 integer NOT NULL, col2 integer NOT NULL, col3 integer NOT NULL, col4 integer NOT NULL);
INSERT INTO tbl (col1, col2, col3, col4) SELECT (random()*1000)::int AS col1, (random()*1000)::int AS col2, (random()*1000)::int AS col3, (random()*1000)::int AS col4 FROM generate_series(1,10000000);
CREATE INDEX ON tbl (col1);
CREATE INDEX ON tbl (col2);
CREATE INDEX ON tbl (col3);
CREATE INDEX ON tbl (col4);
CREATE INDEX ON tbl (col1, col2, col3, col4);
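A sketch of the explicit-comparison workaround mentioned above, assuming the desired order is col1 DESC, col2 ASC, col3 ASC, col4 DESC; the row-value comparison only works for a uniform direction, so it is spelled out column by column, and for the skip-scan to stay fast an index matching those directions (e.g. ON tbl (col1 DESC, col2 ASC, col3 ASC, col4 DESC)) would likely be needed.
```
WITH RECURSIVE cte AS (
  (
    SELECT col1, col2, col3, col4
    FROM tbl
    ORDER BY col1 DESC, col2 ASC, col3 ASC, col4 DESC
    LIMIT 1
  )
  UNION ALL
  SELECT l.*
  FROM cte c
  CROSS JOIN LATERAL (
    SELECT t.col1, t.col2, t.col3, t.col4
    FROM tbl t
    -- "strictly after" the current row in the mixed sort order
    WHERE t.col1 < c.col1
       OR (t.col1 = c.col1 AND (t.col2 > c.col2
       OR (t.col2 = c.col2 AND (t.col3 > c.col3
       OR (t.col3 = c.col3 AND t.col4 < c.col4)))))
    ORDER BY t.col1 DESC, t.col2 ASC, t.col3 ASC, t.col4 DESC
    LIMIT 1
  ) l
)
SELECT * FROM cte;
```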
hunter
(217 rep)
May 2, 2024, 06:52 PM
• Last activity: May 4, 2024, 03:49 AM
2
votes
1
answers
1053
views
DISTINCT gives duplicate values
Can it be considered a bug? This query gives duplicate values despite DISTINCT: select distinct '1' from ( select * from dual connect by level <= 10 ) order by dbms_random.value fetch first 10 rows only I understand it creates additional column with random values and uses distinct with two columns,...
Can it be considered a bug? This query gives duplicate values despite DISTINCT:
select distinct '1'
from
(
select *
from dual
connect by level <= 10
)
order by dbms_random.value
fetch first 10 rows only
I understand that it creates an additional column with random values and applies DISTINCT over the two columns, but I still expect to get unique values.
Andy DB Analyst
(110 rep)
Jun 2, 2023, 02:47 PM
• Last activity: Mar 13, 2024, 03:01 PM
0
votes
1
answers
84
views
Distinct based on window function (sort curiosity or wrong problem approach)
A lot of values are stored in `values`. The data is in a hierarchical order, so there could be a probe, on which some measurements have been performed. Many points have been archived with strain and stress for example. So in the `values` table all the strain/stress values belongs to different object...
A lot of values are stored in values. The data is in a hierarchical order: there could be a probe on which some measurements have been performed, and many points have been archived with strain and stress, for example. So in the values table all the strain/stress values belong to different objects.
The relation to the probe is stored in another table. In this example the hierarchical order is neglected.
I would like to calculate something on the probe level, based on the measurements. So I want to group all the stress/strain values in a JSON, so it is clear they belong together.
Now there should be a second measurement with strain and stress, and I want to calculate values based on all values, but grouped by strain. So there should be a value calculated from the stress where strain = 2 (for example) of both measurements (the additional level in the hierarchy coming from multiple measurements is neglected too). And that's not enough: I want to be able to group by multiple values.
In multiple rounds I calculate values based on the values of previous rounds.
I made a little example:
- complex is a table containing the model info for grouping the data. complex_in contains the inputs for the aggregated datatype (-> json). complex_groupcond contains the information by which subvalues (in the json) the calculation should be grouped. values contains the values in json format.
- The second block is just for inserting initial values and grouping models.
- The third block simulates the first round, building the first complex data type.
- In the fourth block I have the basics of my query, where the values get joined with the inputs for the complex data type and get grouped by them. Important is the window function for getting the group-by condition out of the json, and the DISTINCT on the same partition.
So I have some questions (now that there is more data, the performance should be improved):
Sort Key: "values".name, "values".value_id, complex_in.complex_id
Presorted Key: "values".name, "values".value_id
1. The system uses the index complex_in_pkey for the join of values and complex_in. Why does the system not recognize that everything should already be sorted with complex_in.complex_id in the next place (because complex_id is in the index complex_in_pkey)?
2. Is an approach better where I join the (pre-joined) values and complex_in with a pre-grouped table of complex_cond, so the rows of the pre-join can't get multiplied by this join? But then it is more difficult to extract the grouping condition out of the complex value (json).
3. Or would it be better to leave all the values single and track the hierarchical order the values come from, so that in the calculation one could order by this hierarchical knowledge? (Not finished thinking about the grouping condition in that case.)
If you are a German speaker, perhaps this description will be clearer.
To make the question more clear:
1. Why does the system not recognize that the data is already sorted by "values".name, "values".value_id, complex_in.complex_id? (Or isn't that true, and why?)
2. Is a join (with complex_groupcond) on a unique key better, with the usage of a new special function (for extracting the data)?
3. Do you have a better idea to approach a problem like that?
lrahlff
(1 rep)
Mar 6, 2024, 09:01 AM
• Last activity: Mar 7, 2024, 04:39 AM
1
votes
2
answers
846
views
Window functions and distinct
I have the following table. Column | Type | Collation | Nullable | Default ----------+-----------------------------+-----------+----------+--------- code | text | | not null | price_at | timestamp without time zone | | not null | price | double precision | | not null | I need to use the data in this...
I have the following table.
Column | Type | Collation | Nullable | Default
----------+-----------------------------+-----------+----------+---------
code | text | | not null |
price_at | timestamp without time zone | | not null |
price | double precision | | not null |
I need to use the data in this table to create a candlestick chart that will have 6-hour candlesticks. Each candlestick is represented by the fields
{ x, open, close, low, high }
where x is the UNIX timestamp of the beginning of the period.
The following query works, but using distinct
causes the query to take longer. Usually, when people don't want to use distinct, they use group by
instead, but I can't use that with the window functions and I'm not sure it would help anyway. Is there a way to eliminate the use of distinct in this query to make it faster and still return the same results?
with price_quotes as (
select
extract (epoch from price_at) - (extract (epoch from price_at) % extract (epoch from '6 hours'::interval)) as period_begin,
extract (epoch from price_at) as quote_time,
price
from quote)
select distinct
period_begin as x,
first_value (price) over (partition by period_begin order by quote_time asc) as open,
last_value (price) over (partition by period_begin order by quote_time asc rows between current row and unbounded following) as close,
min (price) over (partition by period_begin) as low,
max (price) over (partition by period_begin) as high
from price_quotes
order by x asc
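A sketch of one way this is often rewritten without DISTINCT, assuming the same quote table: ordered array_agg picks the first and last price in each bucket, so a plain GROUP BY can replace the window functions (epoch values are cast to bigint here just so the modulo arithmetic works the same way).
```
WITH price_quotes AS (
  SELECT
    extract(epoch FROM price_at)::bigint
      - (extract(epoch FROM price_at)::bigint % (6 * 3600)) AS period_begin,  -- 6-hour bucket
    price_at,
    price
  FROM quote
)
SELECT
  period_begin                                  AS x,
  (array_agg(price ORDER BY price_at ASC))[1]   AS open,   -- first price in the bucket
  (array_agg(price ORDER BY price_at DESC))[1]  AS close,  -- last price in the bucket
  min(price)                                    AS low,
  max(price)                                    AS high
FROM price_quotes
GROUP BY period_begin
ORDER BY x;
```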
Zephyrus
(283 rep)
Nov 14, 2023, 04:55 PM
• Last activity: Nov 15, 2023, 12:55 PM
0
votes
0
answers
293
views
Optimizing SELECT COUNT(DISTINCT) on a table increasing daily
Let's say we have a table ```Daily_users``` which has the columns ```student_id, school_id, grade, timestamp```. We collect usage data of students daily and so the table grows daily (note that there could be multiple rows corresponding to the same student - the purpose is to track usages, so the stu...
Let's say we have a table Daily_users which has the columns student_id, school_id, grade, timestamp. We collect usage data of students daily and so the table grows daily (note that there could be multiple rows corresponding to the same student - the purpose is to track usages, so the student could be logging on and off, or using different apps etc). Using this table, we maintain another table called unique_students_to_date, and it has the columns school_id, grade, count.
The count is the number of unique students we have recorded for a pair of (school_id, grade), so far.
Then, once yesterday passes, we run the nightly query to create the table unique_students_to_date:
SELECT COUNT(DISTINCT(student_id)) AS count,
grade,
school_id
FROM Daily_users
WHERE timestamp < '"today's date"'
GROUP BY school_id, grade
to get the unique students up to yesterday inclusive. This is a simple enough query and it does the job. However, I can't help thinking there are lots of redundancies here. The table Daily_users today differs from yesterday's version by only one day's worth of data, so when we do COUNT(DISTINCT(student_id)) we are re-doing a lot of the calculations we did yesterday.
So my question is: can we optimize this to at least minimize the redundant calculations?
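A sketch of one incremental approach, assuming a new helper table (called seen_students here, not part of the original schema) is acceptable: record each (student_id, school_id, grade) only the first time it appears, so the nightly job scans only one day of data and the per-(school_id, grade) counts become a cheap aggregate.
```
-- Helper table: one row per student ever seen (names/types are assumptions).
CREATE TABLE seen_students (
    student_id BIGINT NOT NULL,
    school_id  BIGINT NOT NULL,
    grade      INT    NOT NULL,
    PRIMARY KEY (student_id, school_id, grade)
);

-- Nightly: insert only students that appeared yesterday and are not yet known
-- ('2023-10-13'/'2023-10-14' are placeholder dates for "yesterday"/"today").
INSERT INTO seen_students (student_id, school_id, grade)
SELECT DISTINCT d.student_id, d.school_id, d.grade
FROM Daily_users d
WHERE d.timestamp >= '2023-10-13'
  AND d.timestamp <  '2023-10-14'
  AND NOT EXISTS (SELECT 1
                  FROM seen_students s
                  WHERE s.student_id = d.student_id
                    AND s.school_id  = d.school_id
                    AND s.grade      = d.grade);

-- unique_students_to_date is then a plain aggregate over the helper table.
SELECT school_id, grade, COUNT(*) AS count
FROM seen_students
GROUP BY school_id, grade;
```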
dezdichado
(101 rep)
Oct 13, 2023, 09:59 PM
• Last activity: Oct 14, 2023, 06:15 AM
0
votes
1
answers
132
views
The most recent rows in for each type/agent
I have a table (still on MySQL 5.7) with lots of data. This table is written by some scripts. It works like a "checkin", where each agent is checking in in some period of time. It looks like below. | id | agent | status | timestamp | |----|-------|--------|----------|-----------| | 1 | tom | ok | 20...
I have a table (still on MySQL 5.7) with lots of data. This table is written by some scripts. It works like a "check-in", where each agent checks in every so often. It looks like below.
| id  | agent | status   | timestamp           |
|-----|-------|----------|---------------------|
| 1   | tom   | ok       | 2023-03-16 12:27:03 |
| 2   | jeff  | degraded | 2023-08-31 00:01:13 |
| 100 | tom   | ok       | 2023-10-03 12:00:00 |
| 101 | jeff  | ok       | 2023-10-03 11:59:00 |
I'd like to pick the freshest row for each agent, so in the above example I'd like to get rows 100 and 101.
I use the following query, but it returns only the first two rows. I tried changing the ORDER BY to id or timestamp, ASC or DESC.
SELECT * FROM (SELECT * FROM db_table ORDER BY timestamp ASC) as agnt GROUP BY agent
How can I make it work?
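A sketch of the usual MySQL 5.7 pattern (no window functions yet), assuming the db_table name from the query above: compute each agent's latest timestamp in a derived table and join it back.
```
-- One row per agent: the row whose timestamp equals that agent's maximum.
SELECT t.*
FROM db_table t
JOIN (
    SELECT agent, MAX(`timestamp`) AS max_ts
    FROM db_table
    GROUP BY agent
) latest
  ON latest.agent  = t.agent
 AND latest.max_ts = t.`timestamp`;
```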
Marek
(13 rep)
Oct 3, 2023, 11:49 AM
• Last activity: Oct 3, 2023, 08:01 PM
0
votes
1
answers
57
views
Select distinct
I had some confusion regarding ```distinct``` keyword... For example Table ```test``` has 5 columns ```a``` ```b``` ```c``` ```d``` ```e```, and column ```a``` has repeated value, and I want to use ```distinct``` keyword to get distinct values while also fetching data from other columns eg. ```selec...
I had some confusion regarding the distinct keyword...
For example, table test has 5 columns a, b, c, d, e, and column a has repeated values, and I want to use the distinct keyword to get distinct values while also fetching data from the other columns,
e.g. select distinct * from test
(I know the syntax is wrong here, I have written the query to help people get the idea of what I'm trying to do here)
Can someone help here?
///edit
ID DEGREE AWARDED ACADEMIC_PERIOD
147 GR MS 2022
147 UG BS 2020
170 GR MA 2023
170 UG BA 2018
Expected output is
ID DEGREE AWARDED ACADEMIC_PERIOD
147 GR MS 2022
170 GR MA 2023
found this data on dbastackexchange itself
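DISTINCT alone cannot do this, because the other columns differ per row. A sketch of the usual "top row per group" pattern, assuming a hypothetical table name degrees and that "most recent ACADEMIC_PERIOD per ID" is the rule behind the expected output:
```
-- One row per ID, keeping the row with the largest ACADEMIC_PERIOD.
SELECT ID, DEGREE, AWARDED, ACADEMIC_PERIOD
FROM (
    SELECT ID, DEGREE, AWARDED, ACADEMIC_PERIOD,
           ROW_NUMBER() OVER (PARTITION BY ID
                              ORDER BY ACADEMIC_PERIOD DESC) AS rn
    FROM degrees
) ranked
WHERE rn = 1;
```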
datascinalyst
(105 rep)
Oct 3, 2023, 11:25 AM
• Last activity: Oct 3, 2023, 01:20 PM
19
votes
1
answers
1223
views
DISTINCT not reducing two equal values to one
Can anyone explain the below situation, where two seemingly equal values are not reduced by `DISTINCT`? [![screenshot showing two values for 'SBS_UCS_DISPATCH'][1]][1] The query above is `SELECT DISTINCT name FROM master.sys.dm_os_spinlock_stats where name = 'SBS_UCS_DISPATCH';` The equivalent `SELE...
Can anyone explain the below situation, where two seemingly equal values are not reduced by DISTINCT?
(screenshot showing two values for 'SBS_UCS_DISPATCH')
The query above is SELECT DISTINCT name FROM master.sys.dm_os_spinlock_stats where name = 'SBS_UCS_DISPATCH';
The equivalent SELECT name FROM master.sys.dm_os_spinlock_stats where name = 'SBS_UCS_DISPATCH' GROUP BY name;
also does the same, and adding HAVING COUNT(1) > 1
does not yield the rows.
@@VERSION
is *Microsoft SQL Server 2019 (RTM-CU13) (KB5005679) - 15.0.4178.1 (X64) Sep 23 2021 16:47:49 Copyright (C) 2019 Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: )*
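Not an answer to the "why", but one way to check whether the two name values really are byte-for-byte identical; trailing spaces or non-printing characters often explain cases like this:
```
-- Compare trimmed length, byte length, and the raw bytes of the "duplicates";
-- any difference in DATALENGTH or in the varbinary dump reveals hidden characters.
SELECT name,
       LEN(name)                     AS len_trimmed,
       DATALENGTH(name)              AS byte_length,
       CONVERT(varbinary(256), name) AS raw_bytes
FROM master.sys.dm_os_spinlock_stats
WHERE name = 'SBS_UCS_DISPATCH';
```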
jimbobmcgee
(529 rep)
Aug 3, 2023, 02:55 PM
• Last activity: Aug 4, 2023, 03:14 PM
1
votes
1
answers
2216
views
Insert rows into a (postgres) table where the first col is a static value and the second col is the result of a SELECT DISTINCT from another table
I haven't had to use SQL in years and I'm clearly rusty. I have a situation where I'm trying to insert multiple rows into table A. There are two columns of data to be inserted, the first is a "static" value (it is derived but will be the same for all new rows) and the second is the result of a SELEC...
I haven't had to use SQL in years and I'm clearly rusty.
I have a situation where I'm trying to insert multiple rows into table A.
There are two columns of data to be inserted, the first is a "static" value (it is derived but will be the same for all new rows) and the second is the result of a SELECT DISTINCT on table B. There are no relevant constraints on either of the tables that should come into play here.
FWIW, this is a sql script in Retool going against a postgres db.
_Apologies for the formatting. Table markdown doesn't seem to be working for me_
__Table A__ should look like this at the end:
**id** | **source_file**
foo | apple.csv
foo | banana.csv
foo | orange.csv
Based on these values in __Table B__
**source_file**
apple.csv
apple.csv
apple.csv
banana.csv
banana.csv
orange.csv
orange.csv
I've tried a few variations all along the lines of:
INSERT INTO table_A (id, source_file)
{{id}}, SELECT DISTINCT source_file
FROM table_B WHERE source_file IS NOT NULL AND source_file <> ''
Which throws
syntax error at or near "$1"
And
INSERT INTO table_A (id, source_file)
SELECT DISTINCT {{id}}, source_file
FROM table_B WHERE source_file IS NOT NULL AND source_file <> ''
Which throws
insert or update on table "dg_client_data_sources" violates
foreign key constraint "fk_data_grid"
That second error has me wondering if I need to loop through (gasp!) the results of the SELECT DISTINCT to insert the multiple rows? I may be overthinking that bit....
I can provide other context/examples if needed. Any direction is greatly appreciated!
Amy Lee
(13 rep)
May 30, 2023, 05:32 PM
• Last activity: May 31, 2023, 06:20 AM
3
votes
2
answers
2437
views
Improving distinct values estimates in Postgres
Full counts in Postgres can be slow, for reasons that are well-understood and much discussed. So, I've been using estimation techniques instead, where possible. For rows, pg_stats seems fine, for views, extracting an estimate returned by `EXPLAIN` works okay. https://www.cybertec-postgresql.com/en/c...
Full counts in Postgres can be slow, for reasons that are well-understood and much discussed. So, I've been using estimation techniques instead, where possible. For rows, pg_stats seems fine, for views, extracting an estimate returned by
EXPLAIN
works okay.
https://www.cybertec-postgresql.com/en/count-made-fast/
But what about distinct values? Here, I have had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by factors of 2 or 20. Truncated tables seem to have badly stale estimates in particular (?).
I just ran this test and have provided some results:
analyze assembly_prods; -- Doing an ANALYZE to give pg_stats every help.
select 'count(*) distinct' as method,
count(*) as count
from (select distinct assembly_id
from assembly_prods) d
union all
select 'n_distinct from pg_stats' as method,
n_distinct as count
from pg_stats
where tablename = 'assembly_prods' and
attname = 'assembly_id';
The results:
method count
count(*) distinct 28088
n_distinct from pg_stats 13805
That's only off by a factor of 2, but I've seen _much_ worse in my data, to the point where I won't use estimates. Is there something else that I can try? Is this something that PG 12 improves?
# Follow-up #
I hadn't ever experimented with SET STATISTICS before, because there are only so many hours in a day. Inspired by Laurenz' answer, I've taken a quick look. Here's a useful comment from the documentation:
https://www.postgresql.org/docs/current/planner-stats.html
> The amount of information stored in pg_statistic
by ANALYZE
, in
> particular the maximum number of entries in the most_common_vals
and
> histogram_bounds arrays for each column, can be set on a
> column-by-column basis using the ALTER TABLE SET STATISTICS
command,
> or globally by setting the default_statistics_target
configuration
> variable. The default limit is presently 100 entries. Raising the
> limit might allow more accurate planner estimates to be made,
> particularly for columns with irregular data distributions, at the
> price of consuming more space in pg_statistic
and slightly more time
> to compute the estimates. Conversely, a lower limit might be
> sufficient for columns with simple data distributions.
I have often got tables with a few common values and a lot of rare values. Or the other way around, so the right threshold will depend. For those who haven't used SET STATISTICS
, it lets you set the sampling rate as a target number of entries. The default is 100, so 1000 should be higher fidelity. Here's what that looks like:
ALTER TABLE assembly_prods
ALTER COLUMN assembly_id
SET STATISTICS 1000;
You can use SET STATISTICS
on a table or index. Here's an interesting piece on indexes:
https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/
Note that the current documentation *does* list SET STATISTICS
on indexes.
So I tried out thresholds of 1, 10, 100, 1000, and 10,000 and got these results out of a table with 467,767 rows and 28,088 distinct values:
Target Estimate Difference Missing
1 13,657 14,431 51%
10 13,867 14,221 51%
100 13,759 14,329 51%
1,000 24,746 3,342 12%
10,000 28,088 0 0%
Obviously you can't draw any general conclusions from one case, but SET STATISTICS
looks pretty darn useful and I'll be glad to have it in the back of my mind. I'm tempted to raise the target a bit in general as I suspect it would help in many of the cases in our system.
Morris de Oryx
(939 rep)
Oct 1, 2019, 12:04 AM
• Last activity: Apr 27, 2023, 09:44 PM
16
votes
2
answers
25767
views
SELECT DISTINCT ON, ordered by another column
Please consider the following table `test`: ``` CREATE TABLE test(col1 int, col2 varchar, col3 date); INSERT INTO test VALUES (1,'abc','2015-09-10') , (1,'abc','2015-09-11') , (2,'xyz','2015-09-12') , (2,'xyz','2015-09-13') , (3,'tcs','2015-01-15') , (3,'tcs','2015-01-18'); ``` postgres=# select * f...
Please consider the following table
test
:
CREATE TABLE test(col1 int, col2 varchar, col3 date);
INSERT INTO test VALUES
(1,'abc','2015-09-10')
, (1,'abc','2015-09-11')
, (2,'xyz','2015-09-12')
, (2,'xyz','2015-09-13')
, (3,'tcs','2015-01-15')
, (3,'tcs','2015-01-18');
postgres=# select * from test;
col1 | col2 | col3
------+------+------------
1 | abc | 2015-09-10
1 | abc | 2015-09-11
2 | xyz | 2015-09-12
2 | xyz | 2015-09-13
3 | tcs | 2015-01-15
3 | tcs | 2015-01-18
I'd like to have a returned set ordered by date desc:
col1 | col2 | col3
------+------+------------
2 | xyz | 2015-09-13
1 | abc | 2015-09-11
3 | tcs | 2015-01-18
What I've managed to accomplish with distinct on
:
select distinct on (col1) col1, col2, col3 from test order by col1, col3 desc;
col1 | col2 | col3
------+------+------------
1 | abc | 2015-09-11
2 | xyz | 2015-09-13
3 | tcs | 2015-01-18
And not what I need with having
:
select distinct on (col1) col1, col2, col3 from test group by col1, col2, col3 having col3 = max(col3)
col1 | col2 | col3
------+------+------------
1 | abc | 2015-09-10
2 | xyz | 2015-09-13
3 | tcs | 2015-01-18
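A sketch of the usual workaround for the test table above: keep DISTINCT ON (col1) with its required ORDER BY in a subquery, then re-order the outer result by col3 DESC.
```
SELECT *
FROM (
    SELECT DISTINCT ON (col1) col1, col2, col3
    FROM test
    ORDER BY col1, col3 DESC   -- pick the latest row per col1
) sub
ORDER BY col3 DESC;            -- then order the result as desired
```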
Luis
(347 rep)
May 5, 2020, 08:43 PM
• Last activity: Apr 2, 2023, 09:35 AM
0
votes
1
answers
121
views
Fetching 88 records via DISTINCT from a table of 1.5 million rows is taking 3 seconds
There is a table which are populated from an online service and updated every week. This goes on a separate (second) database. The table is like 300MB in size (1.5 million rows) processed by a cron that downloads the CSV data every week. [This CSV file](https://pricing.us-east-1.amazonaws.com/offers...
There is a table which is populated from an online service and updated every week.
This goes on a separate (second) database.
The table is like 300MB in size (1.5 million rows) processed by a cron that downloads the CSV data every week. [This CSV file](https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/index.csv) is 3GB in size (4 millions rows) where rows and columns are filtered to retain only a subset of the data. And the order of the columns *may* change. It has in the past. So we can't even predict based on the first row. Even row header names may change.
So it's a total dump, filtered by some hard-set row names.
In my Django view, which is a DRF-powered one, my API URL takes 3 seconds (0:00:03.263031) to execute.
SELECT DISTINCT `Region` AS Region
FROM `my-table-name`
WHERE `flag` = 'condition'
LIMIT 0, 100
The thing is, if it were just this endpoint being fetched on the page, it wouldn't hurt. But there are many endpoints on the single page which are fetched when the user clicks on an INPUT element, and 3 seconds for each trigger is way too long.
What else can I do to optimize the table / query ?
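One thing that may help, sketched under the assumption that this is MySQL (the identifier quoting and LIMIT 0, 100 suggest so) and that the query always filters on flag: a composite index covering the query, so the DISTINCT can be answered from the index instead of scanning 1.5 million rows.
```
-- Composite covering index on (flag, Region); the index name is an arbitrary choice.
CREATE INDEX idx_flag_region ON `my-table-name` (`flag`, `Region`);
```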

anjanesh
(279 rep)
Mar 23, 2023, 11:42 AM
• Last activity: Mar 23, 2023, 12:01 PM
0
votes
2
answers
1160
views
PostgreSQL, INSERT INTO SELECT DISTINCT generates more rows?
I have a large PostgreSQL table (NOT indexed) with duplicates so I ran the following query to eventually end up with a unique rows table: INSERT INTO newTable(Field1, Field2, Field3) SELECT DISTINCT Field1, Field2, Field3 FROM oldTable ; It ran until I ran out of disk space which is more than the si...
I have a large PostgreSQL table (NOT indexed) with duplicates so I ran the following query to eventually end up with a unique rows table:
INSERT INTO newTable(Field1, Field2, Field3)
SELECT DISTINCT Field1, Field2, Field3
FROM oldTable ;
It ran until I ran out of disk space, which is more than the size of oldTable, and I'm positive that at least a third of it is duplicates.
Does anyone know why it behaves like this? I couldn't run EXPLAIN ANALYZE because the table is over a TB in size.
I can't remove duplicates using SELECT with functions min() or max() as I don't have an ID column.
I expected a smaller table with unique rows.
Edit :
Does running a slightly different query change the in-depth mechanism of the process? I know that the following creates a new table, unlike the first query that inserts into an existing one:
SELECT DISTINCT Field1, Field2, Field3
INTO newTable
FROM oldTable;
Anvil
(11 rep)
Mar 7, 2023, 07:01 PM
• Last activity: Mar 8, 2023, 07:29 PM
Showing page 1 of 20 total questions