
AWS RDS Postgres: How to diagnose CheckpointLag and potential slowdowns using AWS's monitoring suite?

1 vote
1 answer
5047 views
We are currently hosting a Postgres RDS database, and our team is noticing a slowdown in our querying service. I'm seeing a spike in the CheckpointLag metric, and I've been tasked with finding where the slowdown occurs, specifically on the AWS side of things.

In the detailed performance monitoring, our load sits well below (around 20% of) the average active sessions (AAS) level we are told to expect. I also ran the queries individually with EXPLAIN ANALYZE, and the most extreme query takes 0.5s to compute. This leads me to believe something else is taking too long.

The other metrics I checked (CPU, BurstBalance, etc.) all appear normal. The one exception is CheckpointLag, which spikes under use and which I can't find documentation on. I can't work out what it means or what an *acceptable* value would be on a db.m4.xlarge. With little to no usage, it sits at ~140 seconds. Under normal, expected usage it jumps to ~400 seconds.

I'm asking what this metric really means, whether these values are *expected* or *normal*, and whether there are any other ways to tell if my RDS instance is the cause of the slowdown.

**EDIT:** CheckpointLag is defined as a metric here: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html with the description "The amount of time since the most recent checkpoint." That is fairly vague, and the true meaning is hard to decipher. My dashboards appear to pull from this predefined metric, but if there's a way to dig deeper into how it queries the instance, please let me know.

**Follow-Up** I ended up editing the queries to group results and reduce the number of rows exported at one time, as our team was querying far too many rows to begin with. With this change, CheckpointLag went down, and I associated it with the time taken to either reach or perform queries on RDS (duh!), but I still have not pinpointed its exact meaning.

There must have been some bottleneck in outputting all of the rows that caused the "lag" to rise...
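One way to look at the same quantity from inside the database, rather than through the predefined metric, is to ask PostgreSQL directly how long ago its last checkpoint was recorded and compare that with the configured `checkpoint_timeout`. The sketch below is a minimal illustration under several assumptions: that the engine version exposes `pg_control_checkpoint()` (PostgreSQL 9.6 or later), that the connected role is allowed to call it on RDS, and that this value roughly corresponds to what CheckpointLag reports, which the AWS documentation does not confirm. The helper name `checkpoint_snapshot` and the `ratio` field are hypothetical, introduced only for illustration.

```python
# Minimal sketch: read "time since last checkpoint" straight from PostgreSQL.
# Assumes pg_control_checkpoint() is available (PostgreSQL 9.6+) and callable
# by the connected role; whether this matches CheckpointLag is an assumption.

CHECKPOINT_AGE_SQL = (
    "SELECT EXTRACT(EPOCH FROM (now() - checkpoint_time)) "
    "FROM pg_control_checkpoint()"
)

# pg_settings reports checkpoint_timeout in seconds.
CHECKPOINT_TIMEOUT_SQL = (
    "SELECT setting::int FROM pg_settings WHERE name = 'checkpoint_timeout'"
)


def checkpoint_snapshot(cursor):
    """Return checkpoint age, checkpoint_timeout, and their ratio.

    `cursor` is any DB-API 2.0 cursor connected to the instance
    (for example one created by psycopg2; the driver is an assumption).
    """
    cursor.execute(CHECKPOINT_AGE_SQL)
    seconds_since_checkpoint = float(cursor.fetchone()[0])

    cursor.execute(CHECKPOINT_TIMEOUT_SQL)
    checkpoint_timeout_s = int(cursor.fetchone()[0])

    return {
        "seconds_since_checkpoint": seconds_since_checkpoint,
        "checkpoint_timeout_s": checkpoint_timeout_s,
        # Hypothetical convenience value: how the observed age compares
        # with the configured interval. Not an AWS-documented threshold.
        "ratio": seconds_since_checkpoint / checkpoint_timeout_s,
    }
```

Sampling this during idle periods and under normal load, alongside the CloudWatch graph, could show whether the ~140 second and ~400 second readings track the database's own checkpoint timing or something else.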
Asked by Andrew Narvaez (11 rep)
Dec 15, 2023, 09:42 PM
Last activity: Jul 16, 2025, 03:35 PM