Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
1 vote · 2 answers · 6902 views
Why is an aggregate query with $lookup extremely slow?
I have a mongodb query that works but takes too long to execute and causes my CPU to spike to 100% while it is executing. It is this query here:
db.logs.aggregate([
    {
        $lookup: {
            from: 'graphs',
            let: { logId: '$_id' },
            as: 'matched_docs',
            pipeline: [
                {
                    $match: {
                        $expr: {
                            $and: [
                                { $eq: ['$$logId', '$lId'] },
                                { $gte: ['$d', new Date('2020-12-21T00:00:00.000Z')] },
                                { $lt: ['$d', new Date('2020-12-23T00:00:00.000Z')] }
                            ]
                        }
                    }
                }
            ]
        }
    },
    {
        $match: {
            $expr: {
                $and: [
                    { $eq: ['$matched_docs', []] },
                    { $gte: ['$createDate', new Date('2020-12-21T00:00:00.000Z')] },
                    { $lt: ['$createDate', new Date('2020-12-23T00:00:00.000Z')] }
                ]
            }
        }
    },
    { $limit: 5 }
]);
This query looks for all records in the db.logs collection that have not been transformed and loaded into db.graphs. It's analogous to this SQL approach:
WHERE db.logs._id NOT IN (
    SELECT lId FROM db.graphs
    WHERE db.graphs.d >= @startTime
    AND db.graphs.d < @endTime
)
AND db.logs.createDate >= @startTime
AND db.logs.createDate < @endTime
The db.logs collection has over 1 million records, and here are the indexes:
db.logs.getIndexes();
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"createDate" : 1
},
"name" : "createDate_1"
}
]
And db.reportgraphs has fewer than 100 records, with indexes on every property/column.
In my attempt to analyze why the mongo query is so slow and CPU intensive, I suffixed my mongo query with a .explain(). But mongo gave me the error saying db.logs.aggregate(...).explain() is not a function. I also tried adding , {$explain: 1} immediately after { $limit: 5 } and got an error saying Unrecognized pipeline stage name $explain.
So I guess I have two questions:
1. Can someone give feedback on why my mongo query is so slow or possible solutions?
2. Is there a way to see the execution plan of my mongo query so that I can review where the performance bottlenecks are?
---
UPDATE
A possible solution I'm considering is to have a property db.logs.isGraphed:boolean, then use a simple db.logs.find({isGraphed:false, createDate:{...date filter...}}).limit(5). Wasn't sure if this is the approach most people would have considered in the first place?
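For question 2, there are two documented ways to request a plan for an aggregation; a minimal sketch of both (the pipeline body is elided here, it is the same one shown above):

// explain() wraps the collection, so it goes before aggregate(),
// not after the cursor that aggregate() returns
db.logs.explain("executionStats").aggregate([ /* ...pipeline as above... */ ]);

// the options-document form is equivalent
db.logs.aggregate([ /* ...pipeline as above... */ ], { explain: true });

The {$explain: 1} attempt failed because explain is not a pipeline stage; it has to be requested outside the pipeline.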
learningtech
(189 rep)
Dec 23, 2020, 04:12 PM
• Last activity: Jul 23, 2025, 03:01 PM
0 votes · 2 answers · 177 views
Understanding SQL aggregation efficiency
How would the first query compete in terms of performance against the second alternative:
Query 1:
select count(*) page_views,
count(distinct(session_id)) sessions,
(count(*) / count(distinct(session_id))) pages_per_session
from page_views_table
Query 2:
select page_views, sessions, (page_views / sessions) pages_per_session from (
select count(*) page_views, count(distinct(session_id)) sessions
from page_views_table
)
I basically want to know if, by using the first query, the database would need to calculate page_views (count(*)) and sessions (count(distinct(session_id))) twice in order to get the pages_per_session field, because if that's the case then the second query should be faster. So is the second query in fact the better choice?
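As a side note, Query 2 as written will fail on engines that require an alias on a derived table (PostgreSQL, MySQL, and SQL Server among them); a corrected sketch, with the hypothetical alias t:

select page_views, sessions, (page_views / sessions) pages_per_session from (
    select count(*) page_views, count(distinct(session_id)) sessions
    from page_views_table
) as t  -- most engines require this alias

Either way, many optimizers evaluate each distinct aggregate expression once per grouping and reuse the result, so the only reliable answer is to compare the two plans on the engine in question.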
user1432193
(163 rep)
Nov 16, 2021, 02:07 PM
• Last activity: Jul 22, 2025, 08:08 AM
0 votes · 2 answers · 157 views
Check if values in a column of one set match values in another set
Given two results consisting of single key|value pair (coming from CTEs), I want to join and group them by key, aggregate their values and return two different things:
a) those keys where aggregated list of values in first resultset exactly matches aggregated list of values in second resultset
b) those keys where aggregated list of values in first resultset matches second resultset independent of order
I know of string_agg(), but it seems I can use it only in the SELECT list, and it's inefficient anyway. Is there something more efficient?
Set 1
|key | value |
|-----|-------|
| 1 | 1 |
| 1 | 2 |
| 3 | 4 |
| 2 | 5 |
| 2 | 7 |
| 1 | 3 |
Set 2
|key | value |
|-----|-------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 7 |
| 2 | 5 |
| 4 | 6 |
Desired result:
a) key 1
(1,2,3 = 1,2,3)
b) key 1 and key 2
(5,7 = 7,5)
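A hedged sketch, assuming PostgreSQL and hypothetical source names set1 and set2 for the two CTE results: ordering inside array_agg() gives the order-independent comparison (b) directly, since arrays compare element by element:

SELECT s1.key
FROM (SELECT key, array_agg(value ORDER BY value) AS vals
      FROM set1 GROUP BY key) AS s1
JOIN (SELECT key, array_agg(value ORDER BY value) AS vals
      FROM set2 GROUP BY key) AS s2 USING (key)
WHERE s1.vals = s2.vals;  -- sorted arrays are equal iff the value multisets match

For the order-sensitive variant (a), replace ORDER BY value with whatever column defines the original row order; the sample data does not show one.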
gadelat
(101 rep)
May 10, 2018, 12:23 AM
• Last activity: Jul 19, 2025, 11:48 PM
1 vote · 2 answers · 865 views
MongoDB - Slow aggregate query
It takes around **6 min** to get the result from MongoDB when I use the following aggregate query.
db.barcodes.aggregate([
{
$lookup: {
from: 'company',
localField: 'company',
foreignField: '_id',
as: 'company'
}
},
{
$match: {
'company.name': 'ABCd'
}
}
]);
I have two collections in my DB, company and barcode. If I search with the text **'ABC'** *(instead of **'ABCd'**; the company name 'ABC' already exists in the DB)*, it takes only **0.05 sec** to complete.
There are **4,214,301** documents in the barcode collection and 2 documents in the company collection.
**Sample documents**
Company
{
"_id" : ObjectId("615dd7873c4f710b71438772"),
"name" : "ABC",
"isActive" : true
}
Barcode
{
"_id" : ObjectId("615dd8ff3c4f710b71438773"),
"barcode" : "1",
"company" : ObjectId("615dd7873c4f710b71438772"),
"comment" : "text 1"
}
**Indexed fields**
- company._id
- company.name
- company.isActive
- barcode.company
- barcode._id
Mongo clients used: Studio 3t and MongoDB CLI
**Output of explain**
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"queryPlanner" : {
"plannerVersion" : 1.0,
"namespace" : "diet.barcodes",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [
]
}
}
},
{
"$lookup" : {
"from" : "company",
"as" : "company",
"localField" : "company",
"foreignField" : "_id"
}
},
{
"$match" : {
"company.name" : {
"$eq" : "ABCd"
}
}
}
],
"ok" : 1.0
}
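The explain output shows the core problem: the $match on company.name can only be applied after the $lookup, so the $cursor stage is a full COLLSCAN over all ~4.2 million barcodes, with the lookup running for each of them. With only a couple of company documents, a hedged two-step sketch avoids the $lookup entirely and lets the indexed barcode.company field do the work:

// resolve the company first, then hit the indexed barcodes.company field
var c = db.company.findOne({ name: 'ABCd' });
var result = c ? db.barcodes.find({ company: c._id }).toArray() : [];

Alternatively, reversing the aggregation (starting from company, with the $lookup going into barcodes on the indexed field) achieves the same effect inside one pipeline.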
Albert
(11 rep)
Oct 10, 2021, 03:20 PM
• Last activity: Jul 6, 2025, 08:04 AM
1 vote · 1 answer · 226 views
Equivalent to BIT_OR function in SQLite?
I have code that runs against a MySQL database but the unit tests for the code use SQLite.
The issue is that one code method uses a SQL query which aggregates using the BIT_OR function of MySQL. Is there an equivalent, or a way to replicate its functionality, in SQLite?
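SQLite has no built-in bitwise aggregate, but if the values are bounded, a pure-SQL emulation is possible; a sketch, assuming a hypothetical table t whose integer column x fits in 8 bits. It relies on SUM(DISTINCT x & bit) being either 0 or exactly that bit:

SELECT SUM(DISTINCT x & 1)  + SUM(DISTINCT x & 2)  + SUM(DISTINCT x & 4)
     + SUM(DISTINCT x & 8)  + SUM(DISTINCT x & 16) + SUM(DISTINCT x & 32)
     + SUM(DISTINCT x & 64) + SUM(DISTINCT x & 128) AS bit_or_x
FROM t;  -- add more power-of-two terms for wider values

Since the unit tests presumably open SQLite through a driver, registering a custom aggregate function with that driver is the other common route.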
Nick Gotch
(163 rep)
Oct 7, 2020, 04:33 PM
• Last activity: Jun 13, 2025, 06:11 PM
1 vote · 1 answer · 238 views
How to get aggregate data from a dynamic number of related rows in adjacent table
EDIT: Unknowing of the rule that prohibits cross-posting, I also asked this on Stackoverflow and chose an answer over there. Since there's another (fully working) answer in this thread though, I won't delete it. But for the solution I chose, check out this thread - https://stackoverflow.com/questions/52024244/how-to-get-aggregate-data-from-a-dynamic-number-of-related-rows-in-adjacent-tabl
I have a table of matches played, roughly looking like this:
player_id | match_id | result | opponent_rank
----------------------------------------------
82 | 2847 | w | 42
82 | 3733 | w | 185
82 | 4348 | l | 10
82 | 5237 | w | 732
82 | 5363 | w | 83
82 | 7274 | w | 6
51 | 2347 | w | 39
51 | 3746 | w | 394
51 | 5037 | l | 90
... | ... | ... | ...
To get all the winning streaks (not just top streak by any player), I use this query:
SELECT player.tag, s.streak, match.date, s.player_id, s.match_id FROM (
SELECT streaks.streak, streaks.player_id, streaks.match_id FROM (
SELECT w1.player_id, max(w1.match_id) AS match_id, count(*) AS streak FROM (
SELECT w2.player_id, w2.match_id, w2.win, w2.date, sum(w2.grp) OVER w AS grp FROM (
SELECT m.player_id, m.match_id, m.win, m.date, (m.win = false AND LAG(m.win, 1, true) OVER w = true)::integer AS grp FROM matches_m AS m
WHERE m.opponent_position < '100'
WINDOW w AS (PARTITION BY m.player_id ORDER BY m.date, m.match_id)
) AS w2
WINDOW w AS (PARTITION BY w2.player_id ORDER BY w2.date, w2.match_id)
) AS w1
WHERE w1.win = true
GROUP BY w1.player_id, w1.grp
ORDER BY w1.player_id DESC, count(*) DESC
) AS streaks
ORDER BY streaks.streak DESC
LIMIT 100
) AS s
LEFT JOIN player ON player.id = s.player_id
LEFT JOIN match ON match.id = s.match_id
And the result looks like this (note that this is not a fixed table/view, as the query above can be extended by certain parameters such as nationality, date range, ranking of players, etc):
player_id | match_id | streak
-------------------------------
82 | 3733 | 2
82 | 7274 | 3
51 | 3746 | 2
... | ... | ...
What I want to add now is a bunch of aggregate data to provide details about the winning streaks. For starters, **I'd like to know the average rank of the opponents during each of those streaks**. Other data are the duration of the streak in time, first and last date, the name of the opponent who ended the streak or whether it's still ongoing, and so on. I've tried various things - CTEs, some elaborate joins, unions, or adding them in as lag functions in the existing code. But I'm completely stuck on how to solve this.
As is obvious from the code, my SQL skills are very basic, so please excuse any mistakes or inefficient statements. Also new to DBA so let me know if my question can be phrased better. For complete context, I'm using Postgres 9.4 on Debian, the matches_m table is a materialized view with 550k lines (query takes 2.5s right now). The data comes from http://aligulac.com/about/db/ , I just mirror it to create the aforementioned view.
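As a self-contained sketch of the idea (using a hypothetical matches table shaped like the sample at the top): once each streak is identified by the usual gaps-and-islands grp value, any per-streak statistic, the average opponent rank included, is just one more aggregate in the same GROUP BY:

SELECT player_id,
       max(match_id)      AS last_match,
       count(*)           AS streak,
       avg(opponent_rank) AS avg_opponent_rank  -- the new per-streak statistic
FROM (
    SELECT player_id, match_id, result, opponent_rank,
           sum((result = 'l')::integer)
               OVER (PARTITION BY player_id ORDER BY match_id) AS grp
    FROM matches
) AS marked
WHERE result = 'w'
GROUP BY player_id, grp;

In the full query above, that means carrying opponent_rank through the w2 and w1 subqueries and adding avg(w1.opponent_rank) next to the existing count(*).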
hcac
(11 rep)
Aug 25, 2018, 10:18 PM
• Last activity: Jun 6, 2025, 07:12 AM
1 vote · 2 answers · 412 views
Index conditional aggregate query on PostgreSQL
How does one optimize a conditional aggregate query on PostgreSQL?
An example of such a query would be:
SELECT SUM(score) FROM "games" WHERE "PlayerId" = 23 AND "status" = 'FINAL';
Indexing (PlayerId, status) would help the filter; any tricks or ideas to optimize the aggregation after filtering?
What would be ideal is no external caching mechanism involved, and a mechanism to keep a running SUM every time the "games" table is updated – similar to how a cache works, but a more reliable one.
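On PostgreSQL 11 or later, a hedged sketch of the standard trick is a covering index: the filter columns form the key and the aggregated column rides along in the leaf pages, so the SUM can be answered by an index-only scan (provided the table is vacuumed often enough for the visibility map to stay fresh):

CREATE INDEX games_player_status_score
    ON "games" ("PlayerId", "status") INCLUDE ("score");

On older versions, a plain three-column index on ("PlayerId", "status", "score") enables the same index-only scan. A maintained running sum would need a trigger-updated summary table, which is essentially the "save mechanism" described above.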
uzyn
(111 rep)
Mar 11, 2020, 07:24 AM
• Last activity: Jun 3, 2025, 10:04 PM
0 votes · 1 answer · 683 views
Postgresql: partition by number of rows after the current one
I need to calculate correlation for each say 3 consecutive rows.
For example, assume there is a table with the following data.
create table a (
asset_id int,
time int not null,
value decimal not null
);
insert into a values
(1,1,30),(1,2,45),(1,3,30),(1,4,30),(1,5,30),
(2,1,30),(2,2,40),(2,3,30),(2,4,25),(2,5,25);
select row_number() over() as r, a.asset_id, b.asset_id, a.time, a.value, b.value
from a join a b on a.time = b.time and a.asset_id < b.asset_id;
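For the "each 3 consecutive rows" part: in PostgreSQL, any aggregate, corr() included, can be used as a window function with a frame clause. A hedged sketch for one fixed asset pair (assets 1 and 2, picked arbitrarily):

-- rolling correlation over the current row and the two preceding rows;
-- the first rows of the series see a shorter frame
SELECT a.time,
       corr(a.value, b.value) OVER (
           ORDER BY a.time
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS rolling_corr
FROM a
JOIN a AS b ON b.time = a.time AND b.asset_id = 2
WHERE a.asset_id = 1;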
yaugenka
(455 rep)
Feb 5, 2022, 12:32 PM
• Last activity: Jun 2, 2025, 08:05 PM
1 vote · 1 answer · 5700 views
Postgres HAVING ilike any wildcard in array_agg
I need a filter where the input value is similar to any value in an aggregate, using ilike and wildcards. My attempt at a query so far is:
SELECT p.id, p.name, array_agg(vo.name)
FROM product p
LEFT JOIN product_variation pv
ON p.id = pv.product_id
LEFT JOIN variation_option vo
ON pv.variation_option_id = vo.id
GROUP BY p.id
HAVING $1 ilike any(array_agg('%' || vo.name || '%'));
But it doesn't work. It gets even more complicated: ideally, I'd be able to input an array of strings, where any of them are similar to the aggregate. Any tips?
I realize this kind of works, just not as I need it to. The input value should be able to be a partial match for any of the array elements, but it is the other way around: the wildcard would need to be in the input value.
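A hedged alternative, assuming PostgreSQL: aggregate the per-row comparison with bool_or() instead of building a pattern array; the pattern is built around the input, so the input can be a partial match of any element:

SELECT p.id, p.name, array_agg(vo.name)
FROM product p
LEFT JOIN product_variation pv ON p.id = pv.product_id
LEFT JOIN variation_option vo ON pv.variation_option_id = vo.id
GROUP BY p.id
HAVING bool_or(vo.name ILIKE '%' || $1 || '%');  -- any option name containing the input

For an array of inputs, wrap each input in % beforehand and use bool_or(vo.name ILIKE ANY($1)) with $1 as a text[] parameter.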
koi
(23 rep)
Jun 10, 2020, 02:46 PM
• Last activity: May 3, 2025, 07:07 AM
0 votes · 1 answer · 1045 views
Oracle equivalent of PostgreSQL ARRAY_AGG
I'm looking for an aggregate function supported by Oracle SQL to collect values into arrays, something similar to PostgreSQL ARRAY_AGG. I already use listagg for strings, and I use cast() with collect() and multiset(), but that requires declaring a table type. Is there a more straightforward way of emulating ARRAY_AGG with Oracle SQL?
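If the Oracle version is 12.2 or newer, one hedged alternative that avoids declaring a collection type is JSON_ARRAYAGG, at the cost of producing a JSON array rather than a SQL collection (hypothetical table t(grp, val)):

SELECT grp, JSON_ARRAYAGG(val ORDER BY val) AS vals
FROM t
GROUP BY grp;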
gouessej
(111 rep)
Jun 16, 2023, 01:00 PM
• Last activity: May 1, 2025, 02:04 AM
0 votes · 2 answers · 316 views
Aggregate sampler function for programmer express predictions
Are there any SQL-standard or DBMS implementations that offer a "sample aggregator" for predictably repeated columns?
I will explain, see bold text at the end.
PS: this question is about "little performance gain" in big tables, and about "semantic enhancements" in the *SQL language*, in general.
----
As a typical table, we can imagine a SQL view v1 of some piece of data:
Organization | City | Country | Info1 | Info2
--- | --- | --- | --- | ---
LocalOrg1 | San Francisco | US | 10 | 23
LocalOrg1 | San Francisco | US | 2 | 24
Armarinhos Fer | São Paulo | BR | 11 | 55
Armarinhos Fer | São Paulo | BR | 12 | 56
My (programmer's) prediction is that all organizations of v1 exist in only one city (e.g. there is only one "San Francisco City Hall"). So, on the basis of my prediction, the query
SELECT organization, city, country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country -- city,country predicted repeat
has some redundancy and some semantic limitation: the grouping criterion is organization; there is no need to spend CPU checking city and country. And I can say "Hey John, look at that query grouped by organization", because it does not make sense to say "organization, city and country" to another human when he has the same prediction in mind. It was a SQL obligation, but it is not the semantic essence of the query.
Of course, the semantics may be better with GROUP BY organization, 2, 3 syntax sugar, or using max(),
SELECT organization, max(city) as city, max(country) as country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization -- better semantic, worse performance
but max() loses time "comparing nothing" (because, as I predicted, there are only repeated things), and perhaps the query cost will be bigger than that of the first query.
Some DBMSs also offer first() and last(), which might be more efficient (!) than max(), and I guess that this is the simplest solution today: use some kind of first() function to optimize performance and semantics, on the basis of my prediction that city and country repeat when grouping by organization.
But it does not matter whether the sample of a repeated column comes from the first or the last sampled row. First/last can also be an internal optimization choice, so the real need in this context is a kind of aggsample() function:
SELECT organization,
   aggsample(city), aggsample(country), -- better semantic and performance
   sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization -- better performance
So, putting it in this detailed context: **Is there any *SQL language variation* where this kind of function (aggsample) has been defined?**
-----
EDITED after comments and some homework...
## NOTES
Candidates and limitations:
* [PostgreSQL's DISTINCT ON](http://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT). Seems to address a similar problem, but does not solve the use in usual GROUP BY summarizations.
* [MySQL's ANY_VALUE](https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html), as suggested by @AndriyM. Seems perfect (!), but I have never used it... Will try it later: install MySQL and do some tests.
Similar discussions:
* https://stackoverflow.com/a/8373384/287948
* https://stackoverflow.com/q/36134657/287948
* ... https://stackoverflow.com/a/20347763/287948
### Trying
CREATE TABLE v1 (
Organization text, City text,
Country text, Info1 int, Info2 int
);
INSERT INTO v1 VALUES
('LocalOrg1', 'San Francisco', ' US', '10 ', '23'),
('LocalOrg1', 'San Francisco', ' US', '2 ', '24'),
('Armarinhos Fer', 'São Paulo', ' BR', '11 ', '55'),
('Armarinhos Fer', 'São Paulo', ' BR', '12 ', '56');
Strangely, PostgreSQL accepts "DISTINCT ON + GROUP BY" as valid syntax,
SELECT DISTINCT ON (organization) organization, city, country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country
but it does not work better (worse performance), and the reduction to grouping with only GROUP BY organization is not valid.
... no other test.
PS: the only valid syntax (and result) with DISTINCT ON of PostgreSQL is
SELECT DISTINCT ON (organization) organization, City, Country,
sum(Info1) OVER w AS tot1,
avg(info2) OVER w AS avg2,
count(*) OVER w as n
FROM v1
WINDOW w AS (PARTITION BY organization);
as we see, ugly syntax, and with worse performance.
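For PostgreSQL specifically, the closest approximation of aggsample() I know of is the classic first() user-defined aggregate from the PostgreSQL wiki, reproduced here as a sketch:

CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $1 $$;

CREATE AGGREGATE first (anyelement) (
    SFUNC = first_agg,
    STYPE = anyelement
);

-- usage, matching the aggsample() example above:
SELECT organization, first(city) AS city, first(country) AS country,
       sum(info1) AS tot1, avg(info2) AS avg2, count(*) AS n
FROM v1
GROUP BY organization;

Unlike max(), its transition function just keeps the first value seen per group, so there is no comparison work; whether that shows up in the query cost is another matter.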
Peter Krauss
(476 rep)
Mar 28, 2016, 09:01 AM
• Last activity: Apr 28, 2025, 01:00 AM
1 vote · 2 answers · 3510 views
Converting results to array
I have many entities like:
user
id | name
-----------
1 | Joe
2 | David
3 | Jane
cars
id| name
------------
1 | cars1
2 | cars2
3 | cars3
4 | cars4
5 | cars5
6 | cars6
7 | cars7
8 | cars8
9 | cars9
cars_data
id | price | category | uid | car_id
---+--------+----------+-----+-------
1 | 225.00 | p1 | 1 | 1
2 | 451.00 | p2 | 1 | 1
3 | 324.00 | p2 | 1 | 2
4 | 784.00 | p2 | 1 | 3
5 | 724.00 | p3 | 1 | 2
6 | 214.00 | p1 | 2 | 1
7 | 451.00 | p1 | 2 | 2
8 | 926.00 | p1 | 2 | 3
9 | 271.00 | p2 | 2 | 3
10 | 421.00 | p2 | 2 | 4
11 | 684.00 | p2 | 2 | 2
12 | 124.00 | p3 | 2 | 5
13 | 128.00 | p3 | 2 | 1
14 | 741.00 | p1 | 3 | 1
15 | 965.00 | p1 | 3 | 3
16 | 124.00 | p2 | 3 | 4
17 | 415.00 | p2 | 3 | 1
18 | 51.00 | p2 | 3 | 2
19 | 965.00 | p2 | 3 | 6
filters
id | name | filter | uid
----+ --------+------------------+-----
1 | filter1 | string filters 1 | 1
2 | filter2 | string filters 2 | 1
3 | filter3 | string filters 3 | 1
4 | filter3 | string filters 3 | 1
5 | filter3 | string filters 3 | 1
6 | filter3 | string filters 3 | 1
7 | filter3 | string filters 3 | 1
8 | filter | string filters 1 | 2
9 | filter5 | string filters 5 | 2
10 | filter6 | string filters 6 | 2
11 | filter6 | string filters 6 | 2
12 | filter6 | string filters 6 | 2
13 | filter6 | string filters 6 | 2
14 | filter7 | string filters 7 | 3
15 | filter8 | string filters 8 | 3
16 | filter8 | string filters 8 | 3
17 | filter8 | string filters 8 | 3
18 | filter8 | string filters 8 | 3
19 | filter9 | string filters 9 | 3
assign_filters
uid | category | filter_id
----+ ----------+-----------
1 | p1 |1
1 | p2 |1
1 | p2 |2
1 | p2 |3
1 | p3 |4
2 | p1 |9
2 | p1 |8
2 | p1 |13
3 | p2 |14
3 | p2 |16
3 | p2 |17
3 | p3 |19
3 | p3 |18
3 | p1 |14
3 | p1 |18
What I want is results like this:
uid | category | filter_id |car_id
----+ ----------+-----------+--------
1 | p1 |1 |[1]
1 | p2 |1 |[1,2,3]
1 | p2 |2 |[1,2,3]
1 | p2 |3 |[1,2,3]
1 | p3 |4 |
2 | p1 |9 |[1,2,3]
2 | p1 |8 |[1,2,3]
2 | p1 |13 |[1,2,3]
3 | p2 |14 |[1,2,3,4,6]
3 | p2 |16 |[1,2,3,4,6]
3 | p2 |17 |[1,2,3,4,6]
3 | p3 |19 |[6,7]
3 | p3 |18 |[6,7]
3 | p1 |14 |[1]
3 | p1 |18 |[1]
How can I change the original SQL query to give me the result above?
dbfiddle
**update:**
- Each user can define a filter for himself and each filter is specific to one user.
- Each user can place cars in specific categories in the cars_data table (the categories are [p1,p2,p3]).
- Each user can assign a number of filters to each of their categories.
I need to know which filters each user has used, and on which cars the filter has been applied.
For example, user number 1 has placed cars1, cars2, cars3 in category p1. Also, filter1, filter2, filter3 have been assigned to this category.
The result I need is this:
uid | category | filter_id |car_id
----+ ----------+-----------+--------
1 | p2 |1 |[1,2,3]
1 | p2 |2 |[1,2,3]
1 | p2 |3 |[1,2,3]
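A hedged sketch of just the array-building step, assuming PostgreSQL: aggregate the car ids once per (uid, category) and attach that array to every assigned filter:

SELECT af.uid, af.category, af.filter_id, cd.car_ids AS car_id
FROM assign_filters af
LEFT JOIN (
    SELECT uid, category,
           array_agg(DISTINCT car_id ORDER BY car_id) AS car_ids
    FROM cars_data
    GROUP BY uid, category
) AS cd ON cd.uid = af.uid AND cd.category = af.category
ORDER BY af.uid, af.category, af.filter_id;

Note that the expected output above (e.g. the empty array for uid 1, category p3) implies business rules beyond a plain per-category aggregation, so this only covers the aggregation mechanics.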
majid
(15 rep)
Nov 1, 2021, 06:35 AM
• Last activity: Apr 20, 2025, 04:56 AM
0 votes · 1 answer · 1398 views
MongoDB Aggregate Poor Index Usage
I've been trying to understand the MongoDB aggregate process so I can better optimize my queries, and I'm confused by the usage of $match and $sort together.
The sample DB has only one collection, people:
[{
"name": "Joe Smith",
"age": 40,
"admin": false
},
{
"name": "Jen Ford",
"age": 45,
"admin": true
},
{
"name": "Steve Nash",
"age": 45,
"admin": true
},
{
"name": "Ben Simmons",
"age": 45,
"admin": true
}]
I've multiplied this data x1000 just as a POC.
The DB above has one index, name_1. The following query
db.people.find({"name": "Jen Ford"}).sort({"_id": -1}).explain()
has the following output:
{ queryPlanner:
{ plannerVersion: 1,
namespace: 'db.people',
indexFilterSet: false,
parsedQuery: { name: { '$eq': 'Jen Ford' } },
queryHash: '3AE4BDA3',
planCacheKey: '2A9CC473',
winningPlan:
{ stage: 'SORT',
sortPattern: { _id: -1 },
inputStage:
{ stage: 'SORT_KEY_GENERATOR',
inputStage:
{ stage: 'FETCH',
inputStage:
{ stage: 'IXSCAN',
keyPattern: { name: 1 },
indexName: 'name_1',
isMultiKey: false,
multiKeyPaths: { name: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { name: [ '["Jen Ford", "Jen Ford"]' ] } } } } },
rejectedPlans:
[ { stage: 'FETCH',
filter: { name: { '$eq': 'Jen Ford' } },
inputStage:
{ stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
multiKeyPaths: { _id: [] },
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'backward',
indexBounds: { _id: [ '[MaxKey, MinKey]' ] } } } ] },
serverInfo:
{ host: '373ea645996b',
port: 27017,
version: '4.2.0',
gitVersion: 'a4b751dcf51dd249c5865812b390cfd1c0129c30' },
ok: 1 }
This makes total sense.
**However**, the following query results in the same set but uses the aggregate pipeline:
db.people.aggregate([ { $match: { $and: [{ name: "Jen Ford" }]}}, { $sort: {"_id": -1}}], {"explain": true})
It has the following output:
{ queryPlanner:
{ plannerVersion: 1,
namespace: 'db.people',
indexFilterSet: false,
parsedQuery: { name: { '$eq': 'Jen Ford' } },
queryHash: '3AE4BDA3',
planCacheKey: '2A9CC473',
optimizedPipeline: true,
winningPlan:
{ stage: 'FETCH',
filter: { name: { '$eq': 'Jen Ford' } },
inputStage:
{ stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
multiKeyPaths: { _id: [] },
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'backward',
indexBounds: { _id: [ '[MaxKey, MinKey]' ] } } },
rejectedPlans: [] },
serverInfo:
{ host: '373ea645996b',
port: 27017,
version: '4.2.0',
gitVersion: 'a4b751dcf51dd249c5865812b390cfd1c0129c30' },
ok: 1 }
**Notice how the aggregate query is unable to recognize that it should utilize the name index for the $match. This has massive implications as the size of the collection grows.**
I've seen this behavior now in Mongo 3.4, 3.6, and 4.2.
https://docs.mongodb.com/v4.2/core/aggregation-pipeline-optimization/ provides this blurb
> $sort + $match Sequence Optimization:
When you have a sequence with $sort followed by a $match, the $match moves before the $sort to minimize the number of objects to sort.
From all this, I think I'm fundamentally misunderstanding something about the Mongo aggregate command.
I already understand that if I create a composite index name,_id then it will work, as it includes the fields used in my $match and my $sort clauses.
But **why** must an index include a field from the $sort clause to be utilized to restrict my $match set? It seems obvious that we would prefer to $sort on the smallest set possible.
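For reference, a sketch of the composite index mentioned above: the leading key serves the $match equality and the trailing key lets the sort walk the index instead of sorting in memory:

// either direction works for _id, since MongoDB can traverse the
// index backwards once the name equality pins the leading key
db.people.createIndex({ name: 1, _id: -1 });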
Stevie H
(1 rep)
Mar 10, 2021, 11:58 PM
• Last activity: Apr 14, 2025, 05:06 AM
0 votes · 1 answer · 216 views
Why do I get an index spool on a heap in scalar subquery?
I was reading [this](https://learn.microsoft.com/en-us/archive/blogs/craigfr/scalar-subqueries)
And I did the following:
go
create table fiirst (
col1 int,
col2 int
);
create table seecond(
col1 int,
col2 int
);
with
n1(c) as (select 0 union all select 0 ),
n2(c) as ( select 0 from n1 as t1 cross join n1 as t2),
n3(c) as ( select 0 from n2 as t1 cross join n2 as t2),
n4(c) as (select 0 from n3 as t1 cross join n3 as t2),
ids(id) as (select ROW_NUMBER() over (order by (select null)) from n4)
insert into fiirst(col1,col2)
select id,id
from ids;
with
n1(c) as (select 0 union all select 0 ),
n2(c) as ( select 0 from n1 as t1 cross join n1 as t2),
n3(c) as ( select 0 from n2 as t1 cross join n2 as t2),
n4(c) as (select 0 from n3 as t1 cross join n3 as t2),
ids(id) as (select ROW_NUMBER() over (order by (select null)) from n4)
insert into seecond(col1,col2)
select id,id
from ids;
----Craig Freedman's query
select *
from fiirst
where fiirst.col1 > (
select min(seecond.col1)
from seecond
where seecond.col2 < fiirst.col2
);
And I got an [index spool](https://www.brentozar.com/pastetheplan/?id=QpaT80Kj8T), even though the table is a heap. The question is, how did this happen? Why do I get an index spool on a heap? In the example mentioned in the above link, there were no rows, so no spools, but here I see them?
Suleyman Essa
(167 rep)
Apr 13, 2025, 12:55 PM
• Last activity: Apr 13, 2025, 02:08 PM
1 vote · 1 answer · 2674 views
Rolling sum with a floor in Athena/Presto
In Athena, I want to calculate a rolling sum, over a window, that can't go below 0. For example, if summing up a column that has values (1, 2, -1, -2, -1, -2, 1, 2), I should get (1, 3, 2, 0, 0, 0, 1, 3).
Without the floor constraint, it's easy - SUM(X) OVER (PARTITION BY some_group ORDER BY ordering_col). The same goes if we just wanted to clip all the values at 0 after doing an actual cumulative sum. But I cannot figure out how to use window functions to achieve my desired result in general, though I can make it work for some special cases.
Is this even possible?
Max Gibiansky
(111 rep)
Mar 17, 2022, 06:15 PM
• Last activity: Apr 7, 2025, 01:09 PM
2 votes · 1 answer · 136 views
UNION Two Queries and then Aggregate
I have two different queries that I need to UNION together and then aggregate into a final table, summing each rate group (Rate_001mo, Rate_003mo, etc.):
SELECT EffectiveDate,
SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_001mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_003mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_006mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR012M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_012mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR024M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_024mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR036M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_036mo
FROM dbo.STAGING_FhlbRates WHERE EffectiveDate = '20250131'
group by EffectiveDate
UNION
SELECT EffectiveDate,
SUM(CASE WHEN RateName = 'USOSFRA' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_001mo,
SUM(CASE WHEN RateName = 'USOSFRC' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_003mo,
SUM(CASE WHEN RateName = 'USOSFRF' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_006mo,
SUM(CASE WHEN RateName = 'USOSFR1' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_012mo,
SUM(CASE WHEN RateName = 'USOSFR2' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_024mo,
SUM(CASE WHEN RateName = 'USOSFR3' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_036mo
FROM dbo.STAGING_SofrOisRates WHERE EffectiveDate = '20250131'
GROUP BY EffectiveDate
The columns for the final results table should include Effective Date and then each rate group (Rate_001mo, Rate_003mo, etc.)
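A hedged sketch of that final step: use UNION ALL (plain UNION would deduplicate identical rows) and wrap the combined rows in a derived table so an outer SUM folds the two source rows per EffectiveDate into one:

SELECT EffectiveDate,
       SUM(Rate_001mo) AS Rate_001mo, SUM(Rate_003mo) AS Rate_003mo,
       SUM(Rate_006mo) AS Rate_006mo, SUM(Rate_012mo) AS Rate_012mo,
       SUM(Rate_024mo) AS Rate_024mo, SUM(Rate_036mo) AS Rate_036mo
FROM (
    -- RateName values copied verbatim from the question above
    SELECT EffectiveDate,
           SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_001mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_003mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_006mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR012M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_012mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR024M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_024mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR036M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_036mo
    FROM dbo.STAGING_FhlbRates WHERE EffectiveDate = '20250131'
    GROUP BY EffectiveDate
    UNION ALL
    SELECT EffectiveDate,
           SUM(CASE WHEN RateName = 'USOSFRA' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFRC' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFRF' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFR1' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFR2' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFR3' THEN InterestRate ELSE 0 END) / 100.0
    FROM dbo.STAGING_SofrOisRates WHERE EffectiveDate = '20250131'
    GROUP BY EffectiveDate
) AS combined
GROUP BY EffectiveDate;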
Daylon Hunt
(27 rep)
Apr 4, 2025, 06:56 PM
• Last activity: Apr 4, 2025, 07:17 PM
0 votes · 1 answer · 974 views
Slow performance of query with $and & $or operators
My query with $and, $or operators is performing poorly. Even though I've built a compound index on all three fields that are specified in my query criteria, the query does a complete collection scan. I have tried both a simple find and an aggregation, and the winning plan for both is COLLSCAN.
### Simple find
Query:
db.test.find({ $or: [ {id1: "900004"}, {relatedid: "900004"}], $and: [ { isValid: "Y" } ] } )
Execution stats:
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "XXX",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"direction" : "forward"
},
"rejectedPlans" : []
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 61,
"totalKeysExamined" : 0,
"totalDocsExamined" : 100006,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"nReturned" : 1,
"executionTimeMillisEstimate" : 60,
"works" : 100008,
"advanced" : 1,
"needTime" : 100006,
"needYield" : 0,
"saveState" : 781,
"restoreState" : 781,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 100006
}
},
"serverInfo" : {
"host" : "XXXX",
"port" : ,
"version" : "4.0.9",
"gitVersion" : "XXX"
},
"ok" : 1.0
}
### Aggregation
Query:
db.test.aggregate( [{ "$match":{ "$and": [ {"isValid": "Y"}, { "$or": [ {"Id1": "900004"}, {"relatedID": "900004"} ]}] }} ] )
Execution Stats:
{
"stages" : [
{
"$cursor" : {
"query" : {
"$and" : [
{
"isValid" : "Y"
},
{
"$or" : [
{
"Id1" : "900004"
},
{
"relatedID" : "900004"
}
]
}
]
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "XXXXX",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"$or" : [
{
"Id" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"direction" : "forward"
},
"rejectedPlans" : []
}
}
}
],
"ok" : 1.0
}
How can I get fast results for my query?
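A hedged sketch of the usual fix: for MongoDB to avoid a COLLSCAN on an $or, every branch of the $or needs its own usable index, so a single compound index across all three fields is not enough. With one index per branch, the planner can take an index-union plan:

// one index per $or branch; isValid is appended so each branch's
// full predicate (branch field + isValid) is covered
db.test.createIndex({ Id1: 1, isValid: 1 });
db.test.createIndex({ relatedID: 1, isValid: 1 });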
Anban
(45 rep)
Jun 6, 2019, 01:01 PM
• Last activity: Apr 3, 2025, 05:09 AM
0 votes · 3 answers · 3750 views
Is it a good idea to join data from two different databases?
I am building an application that requires a lot of tables in one database and while the joining and aggregation of data is really nice and seamless, I am starting to wonder if I am creating too many tables in one database rather than organizing better by creating multiple databases.
1. Is this standard practice in enterprise level companies?
2. How do you usually join data from two different databases if its normal to do so?
3. Does this cause latency issues?
Any help or guidance would be appreciated.
jaffer_syed
(13 rep)
Sep 21, 2023, 11:22 PM
• Last activity: Mar 25, 2025, 05:30 PM
-1 votes · 1 answer · 334 views
How does MySQL handles GROUP BY?
I'm on Windows 10, MySQL Workbench. I'm on MySQL 8.0.
Here's the dataset (northwind.sql): https://pastebin.com/bMgjXvfT
Objective:
**Write a query to get the order ID, customer's name, grand total of each order, and the name of the employee who handled each order. See below the expected first four rows.**
Output should look like this.
This is the database schema diagram.
https://brucebauer.info/assets/ITEC3610/Northwind/Northwind-Sample-Database-Diagram.pdf
It's northwind.sql database.
This is my query.

SELECT
o.orderid,
c.contactname,
SUM(od.unitprice * od.quantity),
CONCAT(e.lastname, ' ', e.firstname) AS emp_name
FROM
orders o
INNER JOIN
customers c ON o.customerid = c.customerid
INNER JOIN
orderdetails od ON o.orderid = od.OrderID
INNER JOIN
employees e ON o.EmployeeID = e.EmployeeID
GROUP BY emp_name
ORDER BY orderid
LIMIT 4;
But it was producing an error.
**Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'northwind.o.OrderID' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by**
But if I order by orderid alone, it'll work. Why does it work? Shouldn't it produce the same error as above?
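Assuming the intent was GROUP BY orderid, a sketch of the conventional fix: group by the primary key of orders. Under only_full_group_by, MySQL 5.7+ accepts the remaining selected columns because it can prove they are functionally dependent on o.orderid through the primary and join keys:

SELECT o.orderid,
       c.contactname,
       SUM(od.unitprice * od.quantity) AS grand_total,
       CONCAT(e.lastname, ' ', e.firstname) AS emp_name
FROM orders o
INNER JOIN customers c ON o.customerid = c.customerid
INNER JOIN orderdetails od ON o.orderid = od.OrderID
INNER JOIN employees e ON o.EmployeeID = e.EmployeeID
GROUP BY o.orderid        -- PK of orders determines the other columns
ORDER BY o.orderid
LIMIT 4;

GROUP BY emp_name fails precisely because orderid is not functionally dependent on an employee's name: one employee handles many orders.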
achhainsan
(159 rep)
Jun 9, 2023, 05:55 AM
• Last activity: Mar 22, 2025, 06:03 PM
1 vote · 1 answer · 337 views
How to use reference to another table in aggregate function
I have the following easy-to-explain task: we have a table, and we should calculate total counts of rows grouped by column values. The number of statuses is fixed, so we write:
SELECT [GroupId] = cr.[GroupId],
[OnCreateCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 THEN 1 ELSE 0 END),
[NewCount] = SUM(CASE WHEN cr.StatusOfRequest = 2 THEN 1 ELSE 0 END),
[CreatedCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 THEN 1 ELSE 0 END),
[ReviewCount] = SUM(CASE WHEN cr.StatusOfRequest = 3 THEN 1 ELSE 0 END),
[RejectedCount] = SUM(CASE WHEN cr.StatusOfRequest = 4 THEN 1 ELSE 0 END),
[TotalCount] = COUNT(*)
FROM [FK].[CertificateRequest] cr
GROUP BY cr.[GroupId]
Here OnCreateCount and CreatedCount are the same, but it's wrong. CreatedCount is the number of rows with status 1 which are referenced from table [Certificate]. OnCreateCount is the number of rows with status 1 which are **not** referenced from table [Certificate].
Here is how I solved it:
SELECT [GroupId] = cr.[GroupId],
[OnCreateCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.ID IS NULL THEN 1 ELSE 0 END),
[NewCount] = SUM(CASE WHEN cr.StatusOfRequest = 2 THEN 1 ELSE 0 END),
[CreatedCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.ID IS NOT NULL THEN 1 ELSE 0 END),
[ReviewCount] = SUM(CASE WHEN cr.StatusOfRequest = 3 THEN 1 ELSE 0 END),
[RejectedCount] = SUM(CASE WHEN cr.StatusOfRequest = 4 THEN 1 ELSE 0 END),
[TotalCount] = COUNT(*)
FROM [FK].[CertificateRequest] cr
OUTER APPLY
(
SELECT TOP 1 *
FROM [FK].[Certificate] c
WHERE c.CertificateRequestID = cr.ID
) c
GROUP BY cr.[GroupId]
But I really don't like the OUTER APPLY here, which is used just to determine whether there is any row in the referenced table.
What is the easiest way to get the required result?
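One hedged alternative that drops the OUTER APPLY: collapse [FK].[Certificate] to its distinct request ids once, LEFT JOIN it (the DISTINCT guarantees no row multiplication), and test for NULL inside the CASE:

SELECT [GroupId]       = cr.[GroupId],
       [OnCreateCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.CertificateRequestID IS NULL     THEN 1 ELSE 0 END),
       [NewCount]      = SUM(CASE WHEN cr.StatusOfRequest = 2 THEN 1 ELSE 0 END),
       [CreatedCount]  = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.CertificateRequestID IS NOT NULL THEN 1 ELSE 0 END),
       [ReviewCount]   = SUM(CASE WHEN cr.StatusOfRequest = 3 THEN 1 ELSE 0 END),
       [RejectedCount] = SUM(CASE WHEN cr.StatusOfRequest = 4 THEN 1 ELSE 0 END),
       [TotalCount]    = COUNT(*)
FROM [FK].[CertificateRequest] cr
LEFT JOIN (SELECT DISTINCT CertificateRequestID
           FROM [FK].[Certificate]) AS c
       ON c.CertificateRequestID = cr.ID
GROUP BY cr.[GroupId];

Note that SUM(CASE WHEN ... NOT EXISTS (...) ...) is not an option here: SQL Server rejects subqueries inside aggregate expressions.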
Alex Zhukovskiy
(343 rep)
Nov 2, 2016, 10:28 AM
• Last activity: Mar 21, 2025, 05:04 AM
Showing page 1 of 20 total questions