Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
1 vote · 2 answers · 6902 views
Why is an aggregate query with $lookup extremely slow?
I have a mongodb query that works but takes too long to execute and causes my CPU to spike to 100% while it is executing. It is this query here:
db.logs.aggregate([
    {
        $lookup: {
            from: 'graphs',
            let: { logId: '$_id' },
            as: 'matched_docs',
            pipeline: [
                {
                    $match: {
                        $expr: {
                            $and: [
                                { $eq: ['$$logId', '$lId'] },
                                { $gte: ['$d', new Date('2020-12-21T00:00:00.000Z')] },
                                { $lt: ['$d', new Date('2020-12-23T00:00:00.000Z')] }
                            ]
                        }
                    }
                }
            ]
        }
    },
    {
        $match: {
            $expr: {
                $and: [
                    { $eq: ['$matched_docs', []] },
                    { $gte: ['$createDate', new Date('2020-12-21T00:00:00.000Z')] },
                    { $lt: ['$createDate', new Date('2020-12-23T00:00:00.000Z')] }
                ]
            }
        }
    },
    { $limit: 5 }
]);
This query looks for all records in the db.logs collection that have not been transformed and loaded into db.graphs. It's analogous to this SQL approach:
WHERE db.logs._id NOT IN (
    SELECT lId FROM db.graphs
    WHERE db.graphs.d >= @startTime
    AND db.graphs.d < @endTime
)
AND db.logs.createDate >= @startTime
AND db.logs.createDate < @endTime
The db.logs collection has over 1 million records, and here are the indexes:
db.logs.getIndexes();
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"createDate" : 1
},
"name" : "createDate_1"
}
]
And db.reportgraphs has fewer than 100 records, with indexes on every property/column.
In my attempt to analyze why the mongo query is so slow and CPU intensive, I suffixed my mongo query with a .explain(). But mongo gave me the error saying db.logs.aggregate(...).explain() is not a function. I also tried adding , {$explain: 1} immediately after { $limit: 5 } and got an error saying Unrecognized pipeline stage name $explain.
So I guess I have two questions:
1. Can someone give feedback on why my mongo query is so slow or possible solutions?
2. Is there a way to see the execution plan of my mongo query so that I can review where the performance bottlenecks are?
---
UPDATE
A possible solution I'm considering is to have a property db.logs.isGraphed:boolean, then use a simple db.logs.find({isGraphed:false, createDate:{...date filter...}}).limit(5). Wasn't sure if this is the approach most people would have considered in the first place?
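For question 2, there are two documented ways to request a plan for an aggregation; a minimal sketch of both (the pipeline body is elided here, it is the same one shown above):

// explain() wraps the collection, so it goes before aggregate(),
// not after the cursor that aggregate() returns
db.logs.explain("executionStats").aggregate([ /* ...pipeline as above... */ ]);

// the options-document form is equivalent
db.logs.aggregate([ /* ...pipeline as above... */ ], { explain: true });

The {$explain: 1} attempt failed because explain is not a pipeline stage; it has to be requested outside the pipeline.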
learningtech
(189 rep)
Dec 23, 2020, 04:12 PM
• Last activity: Jul 23, 2025, 03:01 PM
0 votes · 2 answers · 177 views
Understanding SQL aggregation efficiency
How would the first query compete in terms of performance against the second alternative:
Query 1:
select count(*) page_views,
count(distinct(session_id)) sessions,
(count(*) / count(distinct(session_id))) pages_per_session
from page_views_table
Query 2:
select page_views, sessions, (page_views / sessions) pages_per_session from (
select count(*) page_views, count(distinct(session_id)) sessions
from page_views_table
)
I basically want to know if, by using the first query, the database would need to calculate page_views (count(*)) and sessions (count(distinct(session_id))) twice in order to get the pages_per_session field, because if that's the case then the second query should be faster. So is the second query in fact the better choice?
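As a side note, Query 2 as written will fail on engines that require an alias on a derived table (PostgreSQL, MySQL, and SQL Server among them); a corrected sketch, with the hypothetical alias t:

select page_views, sessions, (page_views / sessions) pages_per_session from (
    select count(*) page_views, count(distinct(session_id)) sessions
    from page_views_table
) as t  -- most engines require this alias

Either way, many optimizers evaluate each distinct aggregate expression once per grouping and reuse the result, so the only reliable answer is to compare the two plans on the engine in question.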
user1432193
(163 rep)
Nov 16, 2021, 02:07 PM
• Last activity: Jul 22, 2025, 08:08 AM
0 votes · 2 answers · 157 views
Check if values in a column of one set match values in another set
Given two results consisting of single key|value pair (coming from CTEs), I want to join and group them by key, aggregate their values and return two different things:
a) those keys where aggregated list of values in first resultset exactly matches aggregated list of values in second resultset
b) those keys where aggregated list of values in first resultset matches second resultset independent of order
I know of string_agg(), but it seems I can use it only in the SELECT list, and it's inefficient anyway. Is there something more efficient?
Set 1
|key | value |
|-----|-------|
| 1 | 1 |
| 1 | 2 |
| 3 | 4 |
| 2 | 5 |
| 2 | 7 |
| 1 | 3 |
Set 2
|key | value |
|-----|-------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 7 |
| 2 | 5 |
| 4 | 6 |
Desired result:
a) key 1
(1,2,3 = 1,2,3)
b) key 1 and key 2
(5,7 = 7,5)
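A hedged sketch, assuming PostgreSQL and hypothetical source names set1 and set2 for the two CTE results: ordering inside array_agg() gives the order-independent comparison (b) directly, since arrays compare element by element:

SELECT s1.key
FROM (SELECT key, array_agg(value ORDER BY value) AS vals
      FROM set1 GROUP BY key) AS s1
JOIN (SELECT key, array_agg(value ORDER BY value) AS vals
      FROM set2 GROUP BY key) AS s2 USING (key)
WHERE s1.vals = s2.vals;  -- sorted arrays are equal iff the value multisets match

For the order-sensitive variant (a), replace ORDER BY value with whatever column defines the original row order; the sample data does not show one.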
gadelat
(101 rep)
May 10, 2018, 12:23 AM
• Last activity: Jul 19, 2025, 11:48 PM
1 vote · 2 answers · 865 views
MongoDB - Slow aggregate query
It takes around **6 min** to get the result from MongoDB when I use the following aggregate query.
db.barcodes.aggregate([
{
$lookup: {
from: 'company',
localField: 'company',
foreignField: '_id',
as: 'company'
}
},
{
$match: {
'company.name': 'ABCd'
}
}
]);
I have two collections in my DB, company and barcode. If I search with the text **'ABC'** *(instead of **'ABCd'**; the company name 'ABC' already exists in the DB)*, it takes only **0.05 sec** to complete.
There are **4,214,301** documents in the barcode collection and 2 documents in the company collection.
**Sample documents**
Company
{
"_id" : ObjectId("615dd7873c4f710b71438772"),
"name" : "ABC",
"isActive" : true
}
Barcode
{
"_id" : ObjectId("615dd8ff3c4f710b71438773"),
"barcode" : "1",
"company" : ObjectId("615dd7873c4f710b71438772"),
"comment" : "text 1"
}
**Indexed fields**
- company._id
- company.name
- company.isActive
- barcode.company
- barcode._id
Mongo clients used: Studio 3t and MongoDB CLI
**Output of explain**
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"queryPlanner" : {
"plannerVersion" : 1.0,
"namespace" : "diet.barcodes",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [
]
}
}
},
{
"$lookup" : {
"from" : "company",
"as" : "company",
"localField" : "company",
"foreignField" : "_id"
}
},
{
"$match" : {
"company.name" : {
"$eq" : "ABCd"
}
}
}
],
"ok" : 1.0
}
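The explain output shows the core problem: the $match on company.name can only be applied after the $lookup, so the $cursor stage is a full COLLSCAN over all ~4.2 million barcodes, with the lookup running for each of them. With only a couple of company documents, a hedged two-step sketch avoids the $lookup entirely and lets the indexed barcode.company field do the work:

// resolve the company first, then hit the indexed barcodes.company field
var c = db.company.findOne({ name: 'ABCd' });
var result = c ? db.barcodes.find({ company: c._id }).toArray() : [];

Alternatively, reversing the aggregation (starting from company, with the $lookup going into barcodes on the indexed field) achieves the same effect inside one pipeline.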
Albert
(11 rep)
Oct 10, 2021, 03:20 PM
• Last activity: Jul 6, 2025, 08:04 AM
1 vote · 1 answer · 226 views
Equivalent to BIT_OR function in SQLite?
I have code that runs against a MySQL database but the unit tests for the code use SQLite.
The issue is that one code method uses a SQL query which aggregates using the BIT_OR function of MySQL. Is there an equivalent, or a way to replicate its functionality, in SQLite?
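SQLite has no built-in bitwise aggregate, but if the values are bounded, a pure-SQL emulation is possible; a sketch, assuming a hypothetical table t whose integer column x fits in 8 bits. It relies on SUM(DISTINCT x & bit) being either 0 or exactly that bit:

SELECT SUM(DISTINCT x & 1)  + SUM(DISTINCT x & 2)  + SUM(DISTINCT x & 4)
     + SUM(DISTINCT x & 8)  + SUM(DISTINCT x & 16) + SUM(DISTINCT x & 32)
     + SUM(DISTINCT x & 64) + SUM(DISTINCT x & 128) AS bit_or_x
FROM t;  -- add more power-of-two terms for wider values

Since the unit tests presumably open SQLite through a driver, registering a custom aggregate function with that driver is the other common route.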
Nick Gotch
(163 rep)
Oct 7, 2020, 04:33 PM
• Last activity: Jun 13, 2025, 06:11 PM
1 vote · 1 answer · 238 views
How to get aggregate data from a dynamic number of related rows in adjacent table
EDIT: Unknowing of the rule that prohibits cross-posting, I also asked this on Stackoverflow and chose an answer over there. Since there's another (fully working) answer in this thread though, I won't delete it. But for the solution I chose, check out this thread - https://stackoverflow.com/questions/52024244/how-to-get-aggregate-data-from-a-dynamic-number-of-related-rows-in-adjacent-tabl
I have a table of matches played, roughly looking like this:
player_id | match_id | result | opponent_rank
----------------------------------------------
82 | 2847 | w | 42
82 | 3733 | w | 185
82 | 4348 | l | 10
82 | 5237 | w | 732
82 | 5363 | w | 83
82 | 7274 | w | 6
51 | 2347 | w | 39
51 | 3746 | w | 394
51 | 5037 | l | 90
... | ... | ... | ...
To get all the winning streaks (not just top streak by any player), I use this query:
SELECT player.tag, s.streak, match.date, s.player_id, s.match_id FROM (
SELECT streaks.streak, streaks.player_id, streaks.match_id FROM (
SELECT w1.player_id, max(w1.match_id) AS match_id, count(*) AS streak FROM (
SELECT w2.player_id, w2.match_id, w2.win, w2.date, sum(w2.grp) OVER w AS grp FROM (
SELECT m.player_id, m.match_id, m.win, m.date, (m.win = false AND LAG(m.win, 1, true) OVER w = true)::integer AS grp FROM matches_m AS m
WHERE m.opponent_position < '100'
WINDOW w AS (PARTITION BY m.player_id ORDER BY m.date, m.match_id)
) AS w2
WINDOW w AS (PARTITION BY w2.player_id ORDER BY w2.date, w2.match_id)
) AS w1
WHERE w1.win = true
GROUP BY w1.player_id, w1.grp
ORDER BY w1.player_id DESC, count(*) DESC
) AS streaks
ORDER BY streaks.streak DESC
LIMIT 100
) AS s
LEFT JOIN player ON player.id = s.player_id
LEFT JOIN match ON match.id = s.match_id
And the result looks like this (note that this is not a fixed table/view, as the query above can be extended by certain parameters such as nationality, date range, ranking of players, etc):
player_id | match_id | streak
-------------------------------
82 | 3733 | 2
82 | 7274 | 3
51 | 3746 | 2
... | ... | ...
What I want to add now is a bunch of aggregate data to provide details about the winning streaks. For starters, **I'd like to know the average rank of the opponents during each of those streaks**. Other data are the duration of the streak in time, first and last date, the name of the opponent who ended the streak or whether it's still ongoing, and so on. I've tried various things - CTEs, some elaborate joins, unions, or adding them in as lag functions in the existing code. But I'm completely stuck on how to solve this.
As is obvious from the code, my SQL skills are very basic, so please excuse any mistakes or inefficient statements. Also new to DBA so let me know if my question can be phrased better. For complete context, I'm using Postgres 9.4 on Debian, the matches_m table is a materialized view with 550k lines (query takes 2.5s right now). The data comes from http://aligulac.com/about/db/ , I just mirror it to create the aforementioned view.
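As a self-contained sketch of the idea (using a hypothetical matches table shaped like the sample at the top): once each streak is identified by the usual gaps-and-islands grp value, any per-streak statistic, the average opponent rank included, is just one more aggregate in the same GROUP BY:

SELECT player_id,
       max(match_id)      AS last_match,
       count(*)           AS streak,
       avg(opponent_rank) AS avg_opponent_rank  -- the new per-streak statistic
FROM (
    SELECT player_id, match_id, result, opponent_rank,
           sum((result = 'l')::integer)
               OVER (PARTITION BY player_id ORDER BY match_id) AS grp
    FROM matches
) AS marked
WHERE result = 'w'
GROUP BY player_id, grp;

In the full query above, that means carrying opponent_rank through the w2 and w1 subqueries and adding avg(w1.opponent_rank) next to the existing count(*).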
hcac
(11 rep)
Aug 25, 2018, 10:18 PM
• Last activity: Jun 6, 2025, 07:12 AM
1 vote · 2 answers · 412 views
Index conditional aggregate query on PostgreSQL
How does one optimize a conditional aggregate query on PostgreSQL?
An example of such a query would be:
SELECT SUM(score) FROM "games" WHERE "PlayerId" = 23 AND "status" = 'FINAL';
Indexing (PlayerId, status) would help the filter; any tricks or ideas to optimize the aggregation after filtering?
What would be ideal is no external caching mechanism involved, and a mechanism to keep a running SUM every time the "games" table is updated – similar to how a cache works, but a more reliable one.
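On PostgreSQL 11 or later, a hedged sketch of the standard trick is a covering index: the filter columns form the key and the aggregated column rides along in the leaf pages, so the SUM can be answered by an index-only scan (provided the table is vacuumed often enough for the visibility map to stay fresh):

CREATE INDEX games_player_status_score
    ON "games" ("PlayerId", "status") INCLUDE ("score");

On older versions, a plain three-column index on ("PlayerId", "status", "score") enables the same index-only scan. A maintained running sum would need a trigger-updated summary table, which is essentially the "save mechanism" described above.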
uzyn
(111 rep)
Mar 11, 2020, 07:24 AM
• Last activity: Jun 3, 2025, 10:04 PM
0 votes · 1 answer · 683 views
Postgresql: partition by number of rows after the current one
I need to calculate correlation for each say 3 consecutive rows.
For example, assume there is a table with the following data.
create table a (
asset_id int,
time int not null,
value decimal not null
);
insert into a values
(1,1,30),(1,2,45),(1,3,30),(1,4,30),(1,5,30),
(2,1,30),(2,2,40),(2,3,30),(2,4,25),(2,5,25);
select row_number() over() as r, a.asset_id, b.asset_id, a.time, a.value, b.value
from a join a b on a.time = b.time and a.asset_id < b.asset_id;
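For the "each 3 consecutive rows" part: in PostgreSQL, any aggregate, corr() included, can be used as a window function with a frame clause. A hedged sketch for one fixed asset pair (assets 1 and 2, picked arbitrarily):

-- rolling correlation over the current row and the two preceding rows;
-- the first rows of the series see a shorter frame
SELECT a.time,
       corr(a.value, b.value) OVER (
           ORDER BY a.time
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS rolling_corr
FROM a
JOIN a AS b ON b.time = a.time AND b.asset_id = 2
WHERE a.asset_id = 1;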
yaugenka
(455 rep)
Feb 5, 2022, 12:32 PM
• Last activity: Jun 2, 2025, 08:05 PM
1 vote · 1 answer · 5700 views
Postgres HAVING ilike any wildcard in array_agg
I need a filter where the input value is similar to any value in an aggregate, using ilike and wildcards. My attempt at a query so far is:
SELECT p.id, p.name, array_agg(vo.name)
FROM product p
LEFT JOIN product_variation pv
ON p.id = pv.product_id
LEFT JOIN variation_option vo
ON pv.variation_option_id = vo.id
GROUP BY p.id
HAVING $1 ilike any(array_agg('%' || vo.name || '%'));
But it doesn't work. It gets even more complicated: ideally, I'd be able to input an array of strings, where any of them are similar to the aggregate. Any tips?
I realize this kind of works, just not as I need it to. The input value should be able to be a partial match for any of the array elements, but it is the other way around: the wildcard would need to be in the input value.
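A hedged alternative, assuming PostgreSQL: aggregate the per-row comparison with bool_or() instead of building a pattern array; the pattern is built around the input, so the input can be a partial match of any element:

SELECT p.id, p.name, array_agg(vo.name)
FROM product p
LEFT JOIN product_variation pv ON p.id = pv.product_id
LEFT JOIN variation_option vo ON pv.variation_option_id = vo.id
GROUP BY p.id
HAVING bool_or(vo.name ILIKE '%' || $1 || '%');  -- any option name containing the input

For an array of inputs, wrap each input in % beforehand and use bool_or(vo.name ILIKE ANY($1)) with $1 as a text[] parameter.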
koi
(23 rep)
Jun 10, 2020, 02:46 PM
• Last activity: May 3, 2025, 07:07 AM
0 votes · 1 answer · 1045 views
Oracle equivalent of PostgreSQL ARRAY_AGG
I'm looking for an aggregate function supported by Oracle SQL to collect values into arrays, something similar to PostgreSQL ARRAY_AGG. I already use listagg for strings, and I use cast() with collect() and multiset(), but that requires declaring a table type. Is there a more straightforward way of emulating ARRAY_AGG with Oracle SQL?
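If the Oracle version is 12.2 or newer, one hedged alternative that avoids declaring a collection type is JSON_ARRAYAGG, at the cost of producing a JSON array rather than a SQL collection (hypothetical table t(grp, val)):

SELECT grp, JSON_ARRAYAGG(val ORDER BY val) AS vals
FROM t
GROUP BY grp;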
gouessej
(111 rep)
Jun 16, 2023, 01:00 PM
• Last activity: May 1, 2025, 02:04 AM
0 votes · 2 answers · 316 views
Aggregate sampler function for programmer express predictions
Are there any SQL-standard or DBMS implementations that offer a "sample aggregator" for predictably repeated columns?
I will explain, see bold text at the end.
PS: this question is about "little performance gain" in big tables, and about "semantic enhancements" in the *SQL language*, in general.
----
As a typical table, we can imagine a SQL view v1 of some piece of data:
Organization | City | Country | Info1 | Info2
--- | --- | --- | --- | ---
LocalOrg1 | San Francisco | US | 10 | 23
LocalOrg1 | San Francisco | US | 2 | 24
Armarinhos Fer | São Paulo | BR | 11 | 55
Armarinhos Fer | São Paulo | BR | 12 | 56
My (programmer's) prediction is that all organizations of v1 exist in only one city (e.g. there is only one "San Francisco City Hall"). So, on the basis of my prediction, the query
SELECT organization, city, country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country -- city,country predicted repeat
has some redundancy and some semantic limitation: the grouping criterion is organization; there is no need to spend CPU checking city and country. And I can say "Hey John, look at that query grouped by organization", because it does not make sense to say "organization, city and country" to another human when he has the same prediction in mind. It was a SQL obligation, but it is not the semantic essence of the query.
Of course, the semantics may be better with GROUP BY organization, 2, 3 syntax sugar, or using max(),
SELECT organization, max(city) as city, max(country) as country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization -- better semantic, worse performance
but max() loses time "comparing nothing" (because, as I predicted, there are only repeated things), and perhaps the query cost will be bigger than that of the first query.
Some DBMSs also offer first() and last(), which might be more efficient (!) than max(), and I guess that this is the simplest solution today: use some kind of first() function to optimize performance and semantics, on the basis of my prediction that city and country repeat when grouping by organization.
But it does not matter whether the sample of a repeated column comes from the first or the last sampled row. First/last can also be an internal optimization choice, so the real need in this context is a kind of aggsample() function:
SELECT organization,
   aggsample(city), aggsample(country), -- better semantic and performance
   sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization -- better performance
So, putting it in this detailed context: **Is there any *SQL language variation* where this kind of function (aggsample) has been defined?**
-----
EDITED after comments and some homework...
## NOTES
Candidates and limitations:
* [PostgreSQL's DISTINCT ON](http://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT). Seems to address a similar problem, but does not solve the use in usual GROUP BY summarizations.
* [MySQL's ANY_VALUE](https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html), as suggested by @AndriyM. Seems perfect (!), but I have never used it... Will try it later: install MySQL and do some tests.
Similar discussions:
* https://stackoverflow.com/a/8373384/287948
* https://stackoverflow.com/q/36134657/287948
* ... https://stackoverflow.com/a/20347763/287948
### Trying
CREATE TABLE v1 (
Organization text, City text,
Country text, Info1 int, Info2 int
);
INSERT INTO v1 VALUES
('LocalOrg1', 'San Francisco', ' US', '10 ', '23'),
('LocalOrg1', 'San Francisco', ' US', '2 ', '24'),
('Armarinhos Fer', 'São Paulo', ' BR', '11 ', '55'),
('Armarinhos Fer', 'São Paulo', ' BR', '12 ', '56');
Strangely, PostgreSQL accepts "DISTINCT ON + GROUP BY" as valid syntax,
SELECT DISTINCT ON (organization) organization, city, country,
sum(info1) as tot1, avg(info2) as avg2, count(*) as n
FROM v1
GROUP BY organization, city, country
but it does not work better (worse performance), and the reduction to grouping with only GROUP BY organization is not valid.
... no other test.
PS: the only valid syntax (and result) with DISTINCT ON of PostgreSQL is
SELECT DISTINCT ON (organization) organization, City, Country,
sum(Info1) OVER w AS tot1,
avg(info2) OVER w AS avg2,
count(*) OVER w as n
FROM v1
WINDOW w AS (PARTITION BY organization);
as we see, ugly syntax, and with worse performance.
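For PostgreSQL specifically, the closest approximation of aggsample() I know of is the classic first() user-defined aggregate from the PostgreSQL wiki, reproduced here as a sketch:

CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $1 $$;

CREATE AGGREGATE first (anyelement) (
    SFUNC = first_agg,
    STYPE = anyelement
);

-- usage, matching the aggsample() example above:
SELECT organization, first(city) AS city, first(country) AS country,
       sum(info1) AS tot1, avg(info2) AS avg2, count(*) AS n
FROM v1
GROUP BY organization;

Unlike max(), its transition function just keeps the first value seen per group, so there is no comparison work; whether that shows up in the query cost is another matter.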
Peter Krauss
(476 rep)
Mar 28, 2016, 09:01 AM
• Last activity: Apr 28, 2025, 01:00 AM
1 vote · 2 answers · 3510 views
Converting results to array
I have many entities like:
user
id | name
-----------
1 | Joe
2 | David
3 | Jane
cars
id| name
------------
1 | cars1
2 | cars2
3 | cars3
4 | cars4
5 | cars5
6 | cars6
7 | cars7
8 | cars8
9 | cars9
cars_data
id | price | category | uid | car_id
---+--------+----------+-----+-------
1 | 225.00 | p1 | 1 | 1
2 | 451.00 | p2 | 1 | 1
3 | 324.00 | p2 | 1 | 2
4 | 784.00 | p2 | 1 | 3
5 | 724.00 | p3 | 1 | 2
6 | 214.00 | p1 | 2 | 1
7 | 451.00 | p1 | 2 | 2
8 | 926.00 | p1 | 2 | 3
9 | 271.00 | p2 | 2 | 3
10 | 421.00 | p2 | 2 | 4
11 | 684.00 | p2 | 2 | 2
12 | 124.00 | p3 | 2 | 5
13 | 128.00 | p3 | 2 | 1
14 | 741.00 | p1 | 3 | 1
15 | 965.00 | p1 | 3 | 3
16 | 124.00 | p2 | 3 | 4
17 | 415.00 | p2 | 3 | 1
18 | 51.00 | p2 | 3 | 2
19 | 965.00 | p2 | 3 | 6
filters
id | name | filter | uid
----+ --------+------------------+-----
1 | filter1 | string filters 1 | 1
2 | filter2 | string filters 2 | 1
3 | filter3 | string filters 3 | 1
4 | filter3 | string filters 3 | 1
5 | filter3 | string filters 3 | 1
6 | filter3 | string filters 3 | 1
7 | filter3 | string filters 3 | 1
8 | filter | string filters 1 | 2
9 | filter5 | string filters 5 | 2
10 | filter6 | string filters 6 | 2
11 | filter6 | string filters 6 | 2
12 | filter6 | string filters 6 | 2
13 | filter6 | string filters 6 | 2
14 | filter7 | string filters 7 | 3
15 | filter8 | string filters 8 | 3
16 | filter8 | string filters 8 | 3
17 | filter8 | string filters 8 | 3
18 | filter8 | string filters 8 | 3
19 | filter9 | string filters 9 | 3
assign_filters
uid | category | filter_id
----+ ----------+-----------
1 | p1 |1
1 | p2 |1
1 | p2 |2
1 | p2 |3
1 | p3 |4
2 | p1 |9
2 | p1 |8
2 | p1 |13
3 | p2 |14
3 | p2 |16
3 | p2 |17
3 | p3 |19
3 | p3 |18
3 | p1 |14
3 | p1 |18
What I want is results like this:
uid | category | filter_id |car_id
----+ ----------+-----------+--------
1 | p1 |1 |[1]
1 | p2 |1 |[1,2,3]
1 | p2 |2 |[1,2,3]
1 | p2 |3 |[1,2,3]
1 | p3 |4 |
2 | p1 |9 |[1,2,3]
2 | p1 |8 |[1,2,3]
2 | p1 |13 |[1,2,3]
3 | p2 |14 |[1,2,3,4,6]
3 | p2 |16 |[1,2,3,4,6]
3 | p2 |17 |[1,2,3,4,6]
3 | p3 |19 |[6,7]
3 | p3 |18 |[6,7]
3 | p1 |14 |[1]
3 | p1 |18 |[1]
How can I change the original SQL query to give me the result above?
dbfiddle
**update:**
- Each user can define a filter for himself and each filter is specific to one user.
- Each user can place cars in specific categories in the cars_data table (the categories are [p1,p2,p3]).
- Each user can assign a number of filters to each of their categories.
I need to know which filters each user has used, and on which cars the filter has been applied.
For example, user number 1 has placed cars1, cars2, cars3 in category p1. Also, filter1, filter2, filter3 have been assigned to this category.
The result I need is this:
uid | category | filter_id |car_id
----+ ----------+-----------+--------
1 | p2 |1 |[1,2,3]
1 | p2 |2 |[1,2,3]
1 | p2 |3 |[1,2,3]
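A hedged sketch of just the array-building step, assuming PostgreSQL: aggregate the car ids once per (uid, category) and attach that array to every assigned filter:

SELECT af.uid, af.category, af.filter_id, cd.car_ids AS car_id
FROM assign_filters af
LEFT JOIN (
    SELECT uid, category,
           array_agg(DISTINCT car_id ORDER BY car_id) AS car_ids
    FROM cars_data
    GROUP BY uid, category
) AS cd ON cd.uid = af.uid AND cd.category = af.category
ORDER BY af.uid, af.category, af.filter_id;

Note that the expected output above (e.g. the empty array for uid 1, category p3) implies business rules beyond a plain per-category aggregation, so this only covers the aggregation mechanics.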
majid
(15 rep)
Nov 1, 2021, 06:35 AM
• Last activity: Apr 20, 2025, 04:56 AM
0 votes · 1 answer · 1398 views
MongoDB Aggregate Poor Index Usage
I've been trying to understand the MongoDB aggregate process so I can better optimize my queries, and I'm confused by the usage of $match and $sort together.
The sample DB has only one collection, people:
[{
"name": "Joe Smith",
"age": 40,
"admin": false
},
{
"name": "Jen Ford",
"age": 45,
"admin": true
},
{
"name": "Steve Nash",
"age": 45,
"admin": true
},
{
"name": "Ben Simmons",
"age": 45,
"admin": true
}]
I've multiplied this data x1000 just as a POC.
The DB above has one index, name_1. The following query
db.people.find({"name": "Jen Ford"}).sort({"_id": -1}).explain()
has the following output:
{ queryPlanner:
{ plannerVersion: 1,
namespace: 'db.people',
indexFilterSet: false,
parsedQuery: { name: { '$eq': 'Jen Ford' } },
queryHash: '3AE4BDA3',
planCacheKey: '2A9CC473',
winningPlan:
{ stage: 'SORT',
sortPattern: { _id: -1 },
inputStage:
{ stage: 'SORT_KEY_GENERATOR',
inputStage:
{ stage: 'FETCH',
inputStage:
{ stage: 'IXSCAN',
keyPattern: { name: 1 },
indexName: 'name_1',
isMultiKey: false,
multiKeyPaths: { name: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { name: [ '["Jen Ford", "Jen Ford"]' ] } } } } },
rejectedPlans:
[ { stage: 'FETCH',
filter: { name: { '$eq': 'Jen Ford' } },
inputStage:
{ stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
multiKeyPaths: { _id: [] },
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'backward',
indexBounds: { _id: [ '[MaxKey, MinKey]' ] } } } ] },
serverInfo:
{ host: '373ea645996b',
port: 27017,
version: '4.2.0',
gitVersion: 'a4b751dcf51dd249c5865812b390cfd1c0129c30' },
ok: 1 }
This makes total sense.
**However**, the following query results in the same set but uses the aggregate pipeline:
db.people.aggregate([ { $match: { $and: [{ name: "Jen Ford" }]}}, { $sort: {"_id": -1}}], {"explain": true})
It has the following output:
{ queryPlanner:
{ plannerVersion: 1,
namespace: 'db.people',
indexFilterSet: false,
parsedQuery: { name: { '$eq': 'Jen Ford' } },
queryHash: '3AE4BDA3',
planCacheKey: '2A9CC473',
optimizedPipeline: true,
winningPlan:
{ stage: 'FETCH',
filter: { name: { '$eq': 'Jen Ford' } },
inputStage:
{ stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
multiKeyPaths: { _id: [] },
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'backward',
indexBounds: { _id: [ '[MaxKey, MinKey]' ] } } },
rejectedPlans: [] },
serverInfo:
{ host: '373ea645996b',
port: 27017,
version: '4.2.0',
gitVersion: 'a4b751dcf51dd249c5865812b390cfd1c0129c30' },
ok: 1 }
**Notice how the aggregate query is unable to recognize that it should utilize the name index for the $match. This has massive implications as the size of the collection grows.**
I've seen this behavior now in Mongo 3.4, 3.6, and 4.2.
https://docs.mongodb.com/v4.2/core/aggregation-pipeline-optimization/ provides this blurb
> $sort + $match Sequence Optimization:
When you have a sequence with $sort followed by a $match, the $match moves before the $sort to minimize the number of objects to sort.
From all this, I think I'm fundamentally misunderstanding something about the Mongo aggregate command.
I already understand that if I create a composite index name,_id then it will work, as it includes the fields used in my $match and my $sort clauses.
But **why** must an index include a field from the $sort clause to be utilized to restrict my $match set? It seems obvious that we would prefer to $sort on the smallest set possible.
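For reference, a sketch of the composite index mentioned above: the leading key serves the $match equality and the trailing key lets the sort walk the index instead of sorting in memory:

// either direction works for _id, since MongoDB can traverse the
// index backwards once the name equality pins the leading key
db.people.createIndex({ name: 1, _id: -1 });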
Stevie H
(1 rep)
Mar 10, 2021, 11:58 PM
• Last activity: Apr 14, 2025, 05:06 AM
0 votes · 1 answer · 216 views
Why do I get an index spool on a heap in scalar subquery?
I was reading [this](https://learn.microsoft.com/en-us/archive/blogs/craigfr/scalar-subqueries)
And I did the following:
go
create table fiirst (
col1 int,
col2 int
);
create table seecond(
col1 int,
col2 int
);
with
n1(c) as (select 0 union all select 0 ),
n2(c) as ( select 0 from n1 as t1 cross join n1 as t2),
n3(c) as ( select 0 from n2 as t1 cross join n2 as t2),
n4(c) as (select 0 from n3 as t1 cross join n3 as t2),
ids(id) as (select ROW_NUMBER() over (order by (select null)) from n4)
insert into fiirst(col1,col2)
select id,id
from ids;
with
n1(c) as (select 0 union all select 0 ),
n2(c) as ( select 0 from n1 as t1 cross join n1 as t2),
n3(c) as ( select 0 from n2 as t1 cross join n2 as t2),
n4(c) as (select 0 from n3 as t1 cross join n3 as t2),
ids(id) as (select ROW_NUMBER() over (order by (select null)) from n4)
insert into seecond(col1,col2)
select id,id
from ids;
----Craig Freedman's query
select *
from fiirst
where fiirst.col1 > (
select min(seecond.col1)
from seecond
where seecond.col2 < fiirst.col2
);
And I got an [index spool](https://www.brentozar.com/pastetheplan/?id=QpaT80Kj8T), even though the table is a heap. The question is, how did this happen? Why do I get an index spool on a heap? In the example mentioned in the above link, there were no rows, so no spools, but here I see them?
Suleyman Essa
(167 rep)
Apr 13, 2025, 12:55 PM
• Last activity: Apr 13, 2025, 02:08 PM
1 vote · 1 answer · 2674 views
Rolling sum with a floor in Athena/Presto
In Athena, I want to calculate a rolling sum, over a window, that can't go below 0. For example, if summing up a column that has values (1, 2, -1, -2, -1, -2, 1, 2), I should get (1, 3, 2, 0, 0, 0, 1, 3).
Without the floor constraint, it's easy - SUM(X) OVER (PARTITION BY some_group ORDER BY ordering_col). The same goes if we just wanted to clip all the values at 0 after doing an actual cumulative sum. But I cannot figure out how to use window functions to achieve my desired result in general, though I can make it work for some special cases.
Is this even possible?
Max Gibiansky
(111 rep)
Mar 17, 2022, 06:15 PM
• Last activity: Apr 7, 2025, 01:09 PM
2 votes · 1 answer · 136 views
UNION Two Queries and then Aggregate
I have two different queries that I need to UNION together and then aggregate into a final table, summing each rate group (Rate_001mo, Rate_003mo, etc.):
SELECT EffectiveDate,
SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_001mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_003mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_006mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR012M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_012mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR024M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_024mo,
SUM(CASE WHEN RateName = 'FHLBCSOFR036M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_036mo
FROM dbo.STAGING_FhlbRates WHERE EffectiveDate = '20250131'
group by EffectiveDate
UNION
SELECT EffectiveDate,
SUM(CASE WHEN RateName = 'USOSFRA' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_001mo,
SUM(CASE WHEN RateName = 'USOSFRC' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_003mo,
SUM(CASE WHEN RateName = 'USOSFRF' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_006mo,
SUM(CASE WHEN RateName = 'USOSFR1' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_012mo,
SUM(CASE WHEN RateName = 'USOSFR2' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_024mo,
SUM(CASE WHEN RateName = 'USOSFR3' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_036mo
FROM dbo.STAGING_SofrOisRates WHERE EffectiveDate = '20250131'
GROUP BY EffectiveDate
The columns for the final results table should include Effective Date and then each rate group (Rate_001mo, Rate_003mo, etc.)
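A hedged sketch of that final step: use UNION ALL (plain UNION would deduplicate identical rows) and wrap the combined rows in a derived table so an outer SUM folds the two source rows per EffectiveDate into one:

SELECT EffectiveDate,
       SUM(Rate_001mo) AS Rate_001mo, SUM(Rate_003mo) AS Rate_003mo,
       SUM(Rate_006mo) AS Rate_006mo, SUM(Rate_012mo) AS Rate_012mo,
       SUM(Rate_024mo) AS Rate_024mo, SUM(Rate_036mo) AS Rate_036mo
FROM (
    -- RateName values copied verbatim from the question above
    SELECT EffectiveDate,
           SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_001mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_003mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR006M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_006mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR012M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_012mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR024M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_024mo,
           SUM(CASE WHEN RateName = 'FHLBCSOFR036M' THEN InterestRate ELSE 0 END) / 100.0 AS Rate_036mo
    FROM dbo.STAGING_FhlbRates WHERE EffectiveDate = '20250131'
    GROUP BY EffectiveDate
    UNION ALL
    SELECT EffectiveDate,
           SUM(CASE WHEN RateName = 'USOSFRA' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFRC' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFRF' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFR1' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFR2' THEN InterestRate ELSE 0 END) / 100.0,
           SUM(CASE WHEN RateName = 'USOSFR3' THEN InterestRate ELSE 0 END) / 100.0
    FROM dbo.STAGING_SofrOisRates WHERE EffectiveDate = '20250131'
    GROUP BY EffectiveDate
) AS combined
GROUP BY EffectiveDate;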
Daylon Hunt
(27 rep)
Apr 4, 2025, 06:56 PM
• Last activity: Apr 4, 2025, 07:17 PM
0 votes · 1 answer · 974 views
Slow performance of query with $and & $or operators
My query with $and, $or operators is performing poorly. Even though I've built a compound index on all three fields that are specified in my query criteria, the query does a complete collection scan. I have tried both a simple find and an aggregation, and the winning plan for both is COLLSCAN.
### Simple find
Query:
db.test.find({ $or: [ {id1: "900004"}, {relatedid: "900004"}], $and: [ { isValid: "Y" } ] } )
Execution stats:
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "XXX",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"direction" : "forward"
},
"rejectedPlans" : []
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 61,
"totalKeysExamined" : 0,
"totalDocsExamined" : 100006,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"nReturned" : 1,
"executionTimeMillisEstimate" : 60,
"works" : 100008,
"advanced" : 1,
"needTime" : 100006,
"needYield" : 0,
"saveState" : 781,
"restoreState" : 781,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 100006
}
},
"serverInfo" : {
"host" : "XXXX",
"port" : ,
"version" : "4.0.9",
"gitVersion" : "XXX"
},
"ok" : 1.0
}
### Aggregation
Query:
db.test.aggregate( [{ "$match":{ "$and": [ {"isValid": "Y"}, { "$or": [ {"Id1": "900004"}, {"relatedID": "900004"} ]}] }} ] )
Execution Stats:
{
"stages" : [
{
"$cursor" : {
"query" : {
"$and" : [
{
"isValid" : "Y"
},
{
"$or" : [
{
"Id1" : "900004"
},
{
"relatedID" : "900004"
}
]
}
]
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "XXXXX",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"$or" : [
{
"Id1" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"$or" : [
{
"Id" : {
"$eq" : "900004"
}
},
{
"relatedID" : {
"$eq" : "900004"
}
}
]
},
{
"isValid" : {
"$eq" : "Y"
}
}
]
},
"direction" : "forward"
},
"rejectedPlans" : []
}
}
}
],
"ok" : 1.0
}
How can I get fast results for my query?
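A hedged sketch of the usual fix: for MongoDB to avoid a COLLSCAN on an $or, every branch of the $or needs its own usable index, so a single compound index across all three fields is not enough. With one index per branch, the planner can take an index-union plan:

// one index per $or branch; isValid is appended so each branch's
// full predicate (branch field + isValid) is covered
db.test.createIndex({ Id1: 1, isValid: 1 });
db.test.createIndex({ relatedID: 1, isValid: 1 });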
Anban
(45 rep)
Jun 6, 2019, 01:01 PM
• Last activity: Apr 3, 2025, 05:09 AM
0 votes · 3 answers · 3750 views
Is it a good idea to join data from two different databases?
I am building an application that requires a lot of tables in one database and while the joining and aggregation of data is really nice and seamless, I am starting to wonder if I am creating too many tables in one database rather than organizing better by creating multiple databases.
1. Is this standard practice in enterprise level companies?
2. How do you usually join data from two different databases if its normal to do so?
3. Does this cause latency issues?
Any help or guidance would be appreciated.
jaffer_syed
(13 rep)
Sep 21, 2023, 11:22 PM
• Last activity: Mar 25, 2025, 05:30 PM
-1 votes · 1 answer · 334 views
How does MySQL handles GROUP BY?
I'm on Windows 10, MySQL Workbench. I'm on MySQL 8.0.
Here's the dataset (northwind.sql): https://pastebin.com/bMgjXvfT
Objective:
**Write a query to get the order ID, customer's name, grand total of each order, and the name of the employee who handled each order. See below the expected first four rows.**
Output should look like this.
This is the database schema diagram.
https://brucebauer.info/assets/ITEC3610/Northwind/Northwind-Sample-Database-Diagram.pdf
It's northwind.sql database.
This is my query.

SELECT
o.orderid,
c.contactname,
SUM(od.unitprice * od.quantity),
CONCAT(e.lastname, ' ', e.firstname) AS emp_name
FROM
orders o
INNER JOIN
customers c ON o.customerid = c.customerid
INNER JOIN
orderdetails od ON o.orderid = od.OrderID
INNER JOIN
employees e ON o.EmployeeID = e.EmployeeID
GROUP BY emp_name
ORDER BY orderid
LIMIT 4;
But it was producing an error.
**Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'northwind.o.OrderID' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by**
But if I order by orderid alone, it'll work. Why does it work? Shouldn't it produce the same error as above?
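Assuming the intent was GROUP BY orderid, a sketch of the conventional fix: group by the primary key of orders. Under only_full_group_by, MySQL 5.7+ accepts the remaining selected columns because it can prove they are functionally dependent on o.orderid through the primary and join keys:

SELECT o.orderid,
       c.contactname,
       SUM(od.unitprice * od.quantity) AS grand_total,
       CONCAT(e.lastname, ' ', e.firstname) AS emp_name
FROM orders o
INNER JOIN customers c ON o.customerid = c.customerid
INNER JOIN orderdetails od ON o.orderid = od.OrderID
INNER JOIN employees e ON o.EmployeeID = e.EmployeeID
GROUP BY o.orderid        -- PK of orders determines the other columns
ORDER BY o.orderid
LIMIT 4;

GROUP BY emp_name fails precisely because orderid is not functionally dependent on an employee's name: one employee handles many orders.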
achhainsan
(159 rep)
Jun 9, 2023, 05:55 AM
• Last activity: Mar 22, 2025, 06:03 PM
1 vote · 1 answer · 337 views
How to use reference to another table in aggregate function
I have the following easy-to-explain task: we have a table, and we should calculate total counts of rows grouped by column values. The number of statuses is fixed, so we write:
SELECT [GroupId] = cr.[GroupId],
[OnCreateCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 THEN 1 ELSE 0 END),
[NewCount] = SUM(CASE WHEN cr.StatusOfRequest = 2 THEN 1 ELSE 0 END),
[CreatedCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 THEN 1 ELSE 0 END),
[ReviewCount] = SUM(CASE WHEN cr.StatusOfRequest = 3 THEN 1 ELSE 0 END),
[RejectedCount] = SUM(CASE WHEN cr.StatusOfRequest = 4 THEN 1 ELSE 0 END),
[TotalCount] = COUNT(*)
FROM [FK].[CertificateRequest] cr
GROUP BY cr.[GroupId]
Here OnCreateCount and CreatedCount are the same, but it's wrong. CreatedCount is the number of rows with status 1 which are referenced from table [Certificate]. OnCreateCount is the number of rows with status 1 which are **not** referenced from table [Certificate].
Here is how I solved it:
SELECT [GroupId] = cr.[GroupId],
[OnCreateCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.ID IS NULL THEN 1 ELSE 0 END),
[NewCount] = SUM(CASE WHEN cr.StatusOfRequest = 2 THEN 1 ELSE 0 END),
[CreatedCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.ID IS NOT NULL THEN 1 ELSE 0 END),
[ReviewCount] = SUM(CASE WHEN cr.StatusOfRequest = 3 THEN 1 ELSE 0 END),
[RejectedCount] = SUM(CASE WHEN cr.StatusOfRequest = 4 THEN 1 ELSE 0 END),
[TotalCount] = COUNT(*)
FROM [FK].[CertificateRequest] cr
OUTER APPLY
(
SELECT TOP 1 *
FROM [FK].[Certificate] c
WHERE c.CertificateRequestID = cr.ID
) c
GROUP BY cr.[GroupId]
But I really don't like the OUTER APPLY here, which is used just to determine whether there is any row in the referenced table.
What is the easiest way to get the required result?
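One hedged alternative that drops the OUTER APPLY: collapse [FK].[Certificate] to its distinct request ids once, LEFT JOIN it (the DISTINCT guarantees no row multiplication), and test for NULL inside the CASE:

SELECT [GroupId]       = cr.[GroupId],
       [OnCreateCount] = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.CertificateRequestID IS NULL     THEN 1 ELSE 0 END),
       [NewCount]      = SUM(CASE WHEN cr.StatusOfRequest = 2 THEN 1 ELSE 0 END),
       [CreatedCount]  = SUM(CASE WHEN cr.StatusOfRequest = 1 AND c.CertificateRequestID IS NOT NULL THEN 1 ELSE 0 END),
       [ReviewCount]   = SUM(CASE WHEN cr.StatusOfRequest = 3 THEN 1 ELSE 0 END),
       [RejectedCount] = SUM(CASE WHEN cr.StatusOfRequest = 4 THEN 1 ELSE 0 END),
       [TotalCount]    = COUNT(*)
FROM [FK].[CertificateRequest] cr
LEFT JOIN (SELECT DISTINCT CertificateRequestID
           FROM [FK].[Certificate]) AS c
       ON c.CertificateRequestID = cr.ID
GROUP BY cr.[GroupId];

Note that SUM(CASE WHEN ... NOT EXISTS (...) ...) is not an option here: SQL Server rejects subqueries inside aggregate expressions.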
Alex Zhukovskiy
(343 rep)
Nov 2, 2016, 10:28 AM
• Last activity: Mar 21, 2025, 05:04 AM
Showing page 1 of 20 total questions