Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
1
votes
1
answers
2674
views
Rolling sum with a floor in Athena/Presto
In Athena, I want to calculate a rolling sum, over a window, that can't go below 0. For example, if summing up a column that has values ```(1, 2, -1, -2, -1, -2, 1, 2)``` I should get ```(1, 3, 2, 0, 0, 0, 1, 3)```. Without the floor constraint, it's easy - ```SUM(X) OVER (PARTITION BY some_group OR...
In Athena, I want to calculate a rolling sum, over a window, that can't go below 0. For example, if summing up a column that has values
(1, 2, -1, -2, -1, -2, 1, 2)
I should get (1, 3, 2, 0, 0, 0, 1, 3)
.
Without the floor constraint, it's easy - (X) OVER (PARTITION BY some_group ORDER BY ordering_col)
. Or if we just wanted to clip all the values at 0 after doing an actual cumulative sum. But I cannot figure out how to use window functions to achieve my desired result in general, though I can make it work for some special cases.
Is this even possible?
Max Gibiansky
(111 rep)
Mar 17, 2022, 06:15 PM
• Last activity: Apr 7, 2025, 01:09 PM
1
votes
1
answers
3363
views
(Athena/Presto) Window function (lag) is not reading the previous lag row
I have two tables, ```credit_memos``` and ```credit_memo_items``` and want to show each event individually and will thus do a insert (doing union in this case which is logically the same). I know the key for both tables and am doing a union with them while sharing the same partition key ```credit_me...
I have two tables,
The code for the second column looks as follows:
Rows 3 and 12 should have the invoice_nk based on the credit_memo
cm is the subquery to grab the credit memos
cmi is the subquery to grab the credit memo items
combined is just to precompute the order by key in the window functions
Note that both are pulled from a database that has the full change log (CDC).
The end goal is to make one row for each event, I'm defining an event for the final table to be any time either credit memo or credit memo item undergoes a change. I would like to be able to snapshot the state of a credit memo combined with the credit memo item by windowing at the
and
and want to show each event individually and will thus do a insert (doing union in this case which is logically the same).
I know the key for both tables and am doing a union with them while sharing the same partition key
that I'll be using for the window.
I'm trying to use lag to fill up the rest of the null rows as shown.

(invoice_nk, lag(invoice_nk, 1, '') over (partition by cm_dd order by event_at)) as invoice_nk
For some reason the third row shown in the picture is not able to interpret the previous lagged row and insert a value its row as a result. I presume it may have to do something where Athena does not actually execute the lag function recursively and takes the previous row in each partition literally and doesn't wait for the previous rows to get executed.
Is there a way to grab values recursively as I'm trying to do?
Thanks.
Update:
Here's the full query:
with
cme as(
with
cm as(
select
-- change_log
row_number() over (partition by id order by updated_at desc) r
, cast(cm.invoice_id as varchar(99)) as invoice_nk
, cast(cm.id as varchar(99)) as cm_dd
, cast(cm.created_at as timestamp) as cm_created_at
, cast(cm.updated_at as timestamp) as cm_updated_at
from "postgres-replica-parquet".credit_memos cm
)
select
case when r = 1 then cm_created_at
else cm_updated_at end as cm_begin_at
, null as cmi_begin_at
, invoice_nk
, cm_dd
, cm_created_at
, cm_updated_at
from cm
)
, cmi as(
with
ci as(
select
-- ci.change_log
row_number() over (partition by id order by updated_at desc) r
, cast(ci.credit_memo_id as varchar(99)) as cm_id
, cast(ci.created_at as timestamp) as cmi_created_at
, cast(ci.updated_at as timestamp) as cmi_updated_at
from "postgres-replica-parquet".credit_memo_items ci
)
select -- row 1 is the snapshot when cdc first started
null as cm_begin_at
, case when r = 1 then cmi_created_at
else cmi_updated_at end as cmi_begin_at
, null as invoice_nk
, cm_id as cm_dd
, null as cm_created_at
, null as cm_updated_at
from ci
)
, combined as(
with
all as(
select * from cme
union select * from cmi
)
select
greatest(
coalesce(cm_begin_at, cast(date_parse('1970-01-01', '%Y-%m-%d') as timestamp))
, coalesce(cmi_begin_at, cast(date_parse('1970-01-01', '%Y-%m-%d') as timestamp))
) as event_at
, *
from all
)
select
event_at
, cm_begin_at as event_credit_memo_last_at
, cmi_begin_at as event_credit_memo_item_last_at
, coalesce(invoice_nk, lag(invoice_nk, 1, '') over (partition by cm_dd order by event_at)) as invoice_nk
, cm_dd as credit_memo_dd
from combined
And a snapshot of the full result:

for reporting.
Eugene Choi
(11 rep)
Mar 11, 2022, 03:11 PM
• Last activity: Nov 13, 2024, 02:01 PM
1
votes
3
answers
158
views
Presto/Mysql join validation
I have 2 tables, and the CASE statement below should only evaluate to 1/true if the record with the same lead_id exists in both tables. Here I have 2 ways of achieving this but just want to verify if its in fact the most optimal way of achieving this and which one is the better option here? The lead...
I have 2 tables, and the CASE statement below should only evaluate to 1/true if the record with the same lead_id exists in both tables. Here I have 2 ways of achieving this but just want to verify if its in fact the most optimal way of achieving this and which one is the better option here? The lead_id in t1 is the primary key and will always exist. I need to verify if its moved into the 2nd table
**Query 1**
SELECT
CASE
WHEN
t1.lead_id = t2.lead_id
THEN
1
ELSE
0
END
AS valid
FROM
table1 t1
LEFT JOIN
table2 t2
ON t1.lead_id = t2.lead_id
**Query 2**
SELECT
CASE
WHEN
t2.lead_id IS NOT NULL
AND t2.lead_id != 0
THEN
1
ELSE
0
END
AS valid
FROM
table1 t1
LEFT JOIN
table2 t2
ON t1.lead_id = t2.lead_id
code-is-life
(117 rep)
Apr 28, 2022, 01:13 PM
• Last activity: Apr 28, 2022, 06:09 PM
0
votes
2
answers
3170
views
PRESTO Left join using multiple operators
- I have 2 tables, one listing all our customers and then a 2nd listing every clients list of creditors/obligations - Firstly, I need to return all clients in t1 and then be able to determine if the client owns a vehicle or not by looking at t2 - This query seems to do the trick and works as expecte...
- I have 2 tables, one listing all our customers and then a 2nd listing every clients list of creditors/obligations
- Firstly, I need to return all clients in t1 and then be able to determine if the client owns a vehicle or not by looking at t2
- This query seems to do the trick and works as expected
SELECT t1.first_name, t1.last_name, t1.id_number, CASE
WHEN
t2.uid IS NOT NULL
THEN
1
ELSE
0 END AS vehicle_finance FROM db.client t1 LEFT JOIN
db.obligation t2
ON t1.uid = t2.uid
AND t2.creditor_name LIKE '%Vehicle Finance%'
- But recently I discovered additional variants/keywords to include for the creditor name and this is the updated query I came up with
SELECT
t1.first_name,
t1.last_name,
t1.id_number,
CASE
WHEN
t2.uid IS NOT NULL
THEN
1
ELSE
0
END
AS vehicle_finance
FROM
db.client t1
LEFT JOIN
db.obligation t2
ON t1.uid = t2.uid
AND lower(t2.creditor_name) LIKE '%vehicle finance%'
OR lower(t2.creditor_name) LIKE '%vehicle and asset finance%'
OR lower(t2.creditor_name) LIKE '%vehicle + asset finance%'
OR lower(t2.creditor_name) LIKE '%motor finance%'
- Can I use the operators(AND & OR) like this in a LEFT JOIN after ON?
- I am not worried about duplicate rows being returned as the client will have several obligations, I have another process that takes care of the duplicates
- Is this query correct and most efficient way of achieving this?
code-is-life
(117 rep)
Feb 28, 2022, 06:05 PM
• Last activity: Feb 28, 2022, 09:54 PM
3
votes
1
answers
3862
views
Getting the row before a row with a certain value in SQL
I have a table like below where user actions are stored with a timestamp. My goal is to identify the action that happened before a specific action (named reference_action) and count the number of those actions to see which actions happens before the specific actions and how they are distributed. I a...
I have a table like below where user actions are stored with a timestamp. My goal is to identify the action that happened before a specific action (named reference_action) and count the number of those actions to see which actions happens before the specific actions and how they are distributed.
I am aware of window functions like LAG() where I can get the row before a certain row but can't figure out how to include a constraint like
WHERE action_name = "reference_action"
.
The query engine is Presto and the tables are Hive tables but I'm mostly interested in the general SQL approach, therefore that shouldn't matter much.
| session | action_name | timestamp |
| ------- | ----------- | --------- |
| 1 | "some_action" | 1970-01-01 00:01:00 |
| 1 | "some_action" | 1970-01-01 00:02:00 |
| 1 | "some_action" | 1970-01-01 00:03:00 |
| 1 | "desired_action1" | 1970-01-01 00:04:00 |
| 1 | "reference_action" | 1970-01-01 00:05:00 |
| 1 | "some_action" | 1970-01-01 00:06:00 |
| 1 | "some_action" | 1970-01-01 00:07:00 |
| 2 | "some_action" | 1970-01-01 01:23:00 |
| 2 | "some_action" | 1970-01-01 02:34:00 |
| 2 | "desired_action1" | 1970-01-01 03:45:00 |
| 2 | "reference_action" | 1970-01-01 04:56:00 |
| 2 | "some_action" | 1970-01-01 05:58:00 |
| 3 | "some_action" | 1970-01-01 01:23:00 |
| 3 | "some_action" | 1970-01-01 02:34:00 |
| 3 | "desired_action2" | 1970-01-01 03:45:00 |
| 3 | "reference_action" | 1970-01-01 04:56:00 |
| 3 | "some_action" | 1970-01-01 05:58:00 |
The result should look like:
| action | count |
| ------ | ----- |
| "desired_action1" | 2 |
| "desired_action2" | 1 |
There are two rows where "desired_action1" is directly followed by a row with "reference_action", when ordered by timestamp
, hence the count being 2. The same logic applies for why the count is 1 for "desired_action2".
The goal is to know what a user did before he made a purchase (purchase = reference_action). To understand what he did before, I want to look up the action that happened before a purchase. Therefore I need to know the action_name in the row before a reference_action. desired_actions have to be counted, reference_actions are just the rows after the actions I want to count and used to determine which values should be counted.
Daniel Müller
(133 rep)
May 18, 2021, 12:35 PM
• Last activity: May 19, 2021, 10:08 AM
1
votes
1
answers
293
views
How to query from the result of a changed column of a table in postgresql
I am working with Presto on Amazon Athena. I have a string time column in a table and now I want to change that time to date time type and then query data for selected dates. Is there a direct way to do so? One way I could think of is 1) add a new column 2) insert values into it with converted date...
I am working with Presto on Amazon Athena. I have a string time column in a table and now I want to change that time to date time type and then query data for selected dates.
Is there a direct way to do so? One way I could think of is
1) add a new column
2) insert values into it with converted date
3) Query using the new column
Here I am stuck with the 2nd step with INSERT so need help with that
ALTER TABLE "db"."table"
ADD COLUMN new_date timestamp
-- Need correction in select statement that I don't understand
INSERT INTO "db"."table" (new_date)
(SELECT new_date from_iso8601_date(substr(timestamp,1,10))
Could some one help me with correction and if possible a better way of doing it?
Atihska
(129 rep)
Sep 20, 2018, 07:34 PM
• Last activity: Oct 5, 2020, 09:08 PM
1
votes
1
answers
2621
views
AWS Athena: syntax for grouping data for week ending on a Saturday
I'm having some issues with the Presto syntax in grouping data by separate weeks, ending on Saturdays. I have this field called "timestamp," and I've been using the the following to group by weeks. However, I noticed that date_truncs weeks ending on a Monday and using a date_add doesn't necessary gr...
I'm having some issues with the Presto syntax in grouping data by separate weeks, ending on Saturdays. I have this field called "timestamp," and I've been using the the following to group by weeks. However, I noticed that date_truncs weeks ending on a Monday and using a date_add doesn't necessary group all rows correctly.
select date_add('day', 5, date_trunc('week', date(substr(timestamp, 1, 10)))) as week_ending
Any suggestions on how I can group weeks ending on Saturdays? Thanks!
Jamal
(11 rep)
Aug 27, 2019, 06:17 AM
• Last activity: Oct 3, 2020, 09:06 AM
1
votes
1
answers
1052
views
SQL query - Do I need a full join to produce this combined result?
With two tables structured like this: Table1 Table2 ---------------- ---------------- id, time, x id, time, y, z 1 1 1 1 2 2 2 1 2 2 1 3 3 3 I am trying to join the tables to get a result that looks like this: Result -------------------------------- id, time, x, y, z 1 1 1 1 2 2 2 2 1 3 3 3 I was ab...
With two tables structured like this:
Table1 Table2
---------------- ----------------
id, time, x id, time, y, z
1 1 1 1 2 2 2
1 2 2 1 3 3 3
I am trying to join the tables to get a result that looks like this:
Result
--------------------------------
id, time, x, y, z
1 1 1
1 2 2 2 2
1 3 3 3
I was able to get a partial result with a query like this, but it is returning all of the values from table2 (including undesired ids) along with the joined values of table1.
select *
from Table1
full outer join Table2 on Table1.time=Table2.time and Table1.id ='1' and Table2.id ='1'
order by Table2.time
Does anyone know how to fix? I'm trying to run this query in AWS Athena/Prestodb if that matters.
Thanks for the help!
Jordan
(13 rep)
Jan 4, 2019, 05:28 AM
• Last activity: Jan 4, 2019, 06:09 PM
Showing page 1 of 8 total questions