A few days ago, our database server (see specs below) spiked to 4800% CPU usage. The database is about 100 GB. We spent 3 hours trying to figure out what the issue was. We finally traced it to a feature in our software, so we turned the feature off and rebooted the server, and everything was fine. The next day we turned the feature back on, and the server went straight back to 4800% CPU usage. We turned part of the feature off to see whether the usage would subside, but it didn’t drop. After a total of 30 minutes of this, we ran the following commands:
- systemctl reload mariadb.server -- Got error that it could not stop
- systemctl restart mariadb.server -- Got error that it could not stop
- systemctl restart mysql -- It tried to restart but then the database was corrupted.
Here is a sample output from the sql logs. There are thousands of entries with similar statements:
2021-09-01 18:57:00 140014482597632 [ERROR] InnoDB: Page
[page id: space=4, page number=119737] log sequence number
568999191060 is in the future! Current system log sequence number
568998232259.
2021-09-01 18:57:00 140014482597632 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB
tablespace but not the InnoDB log files. Please refer to
https://mariadb.com/kb/en/library/innodb-recovery-modes/ for
information about forcing recovery.
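Since there are thousands of these entries, here is a minimal sketch of how they could be tallied to see which tablespaces and pages are affected and how far each page's LSN is ahead of the system LSN. The function name `summarize_future_lsn` and the regular expression are my own assumptions for illustration, based only on the message format in the sample above; other MariaDB versions may word the message differently.

```python
import re

# Matches the InnoDB "is in the future" error once the wrapped log lines
# have been joined with single spaces. The pattern is an assumption based
# on the sample entry quoted above.
FUTURE_LSN = re.compile(
    r"\[page id: space=(\d+), page number=(\d+)\] log sequence number "
    r"(\d+) is in the future! Current system log sequence number (\d+)"
)


def summarize_future_lsn(log_text):
    """Return (space, page, page_lsn, system_lsn, gap) for each matching entry."""
    # Collapse line wraps so an entry split over several lines still matches.
    flat = " ".join(log_text.split())
    rows = []
    for m in FUTURE_LSN.finditer(flat):
        space, page, page_lsn, system_lsn = map(int, m.groups())
        rows.append((space, page, page_lsn, system_lsn, page_lsn - system_lsn))
    return rows


sample = """2021-09-01 18:57:00 140014482597632 [ERROR] InnoDB: Page
[page id: space=4, page number=119737] log sequence number
568999191060 is in the future! Current system log sequence number
568998232259."""

for space, page, page_lsn, system_lsn, gap in summarize_future_lsn(sample):
    print(f"space={space} page={page} is {gap} LSN units ahead of the system LSN")
```

For the entry above, the page LSN is 958801 ahead of the system LSN. Run against the full error log, the same function would show whether the affected pages are confined to a few tablespaces or spread across the whole database.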
After the restart, we couldn’t get the database back up again. We tried every recovery option available.
We then restored from a backup taken the day before. That backup was made after the first 4800% CPU usage spike. We ran a health test against the restored database, and it came back 100% healthy. So the unqualified consensus now is that it was the restart that crashed the database.
Here are the server specifications:
- AWS EC2 Instance c5.12xlarge
- 48 vCPU
- 96 GB RAM
- 12 Gbps Network Bandwidth
- 9.5K Mbps EBS Bandwidth
- MariaDB 10.2.39
- CentOS 7
Any thoughts on whether or not the service restart could have caused this catastrophic failure?
I know this is a wide-open question, but I have been looking through many articles and have found nothing that states a service restart would do this. However, we need to rule things out, so any input would be helpful.
Asked by guidamedia
(11 rep)
Sep 3, 2021, 09:20 PM
Last activity: Sep 3, 2021, 09:44 PM