SQL Server Assertion error On Availability Groups

sql-server availability-groups sql-server-2019 distributed-availability-groups
We are getting constant memory dumps from our SQL Server 2019 CU21 instance. We upgraded the instance to CU26 (the latest patch) hoping that would resolve the issue, but it did not.

The error log is filled with the error below. We have Availability Groups configured on this server.

*SQL Server Assertion: File: , line=373 Failed Assertion = 'cbDecoded == cbDecodedData'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.*

Any idea what could be causing this and how to address it?

History
-------
This happened in our staging environment. We are migrating from an old staging server to a new staging server. The old staging server runs Windows Server 2016 and SQL Server 2019; the new staging server runs Windows Server 2022 and SQL Server 2022. There are 15 AGs on the old staging server (each database is in its own AG) and 15 AGs on the new staging server. We created a distributed AG (DAG) between the old and new AGs. All of this work was done a few weeks ago and had been working fine, which we verified by comparing the LSNs between the old and new AGs. We also verified the DAG status before initiating the failover, and it was fine.
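
A check of this kind could look roughly like the query below. It is a minimal sketch, not the exact query used here: it assumes it is run on the primary replica of both the old and the new AG, and that the `last_hardened_lsn` values per database are then compared by hand across the two result sets. The DMVs and catalog views are standard; the column aliases are only illustrative.

```sql
-- Sketch: list synchronization state and LSNs for every AG database on this instance.
-- Run on the old and the new side, then compare last_hardened_lsn per database.
SELECT
    ag.name                          AS ag_name,
    ag.is_distributed,               -- 1 when the row belongs to a distributed AG
    ar.replica_server_name,          -- for a DAG, this is the name of the member AG
    DB_NAME(drs.database_id)         AS database_name,
    drs.is_local,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.last_hardened_lsn,
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
JOIN sys.availability_replicas AS ar
    ON ar.group_id = drs.group_id
   AND ar.replica_id = drs.replica_id
ORDER BY ag.name, database_name, ar.replica_server_name;
```

Matching `last_hardened_lsn` values on both sides, together with a `HEALTHY` synchronization health, would be consistent with the DAG having been in sync before the failover.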

After we started the migration, the first memory dump on the global primary (old AG) reported "Access Violation occurred writing address" and the old AGs rolled over. The instance never recovered from that event, and the AGs were stuck in a resolving state (they would go offline, into a not synchronizing state, etc.). We could not even connect to either of the old staging instances. The new staging instance is fine. Based on the above suggestion, I removed a few AGs from the WSFC, and the instance stabilized after that. There are 5 AGs left on the old infrastructure, and they are stable now.


Other Observations
------------------
We have about 15 AGs on these servers (2 replicas). After we dropped a few AGs, the memory dumps stopped and the instance stabilized. We dropped the AGs more or less at random, so I assume we must have dropped the AG, or AGs, that had the corrupted registry entries.

The old staging servers were on SQL Server 2019 CU21. The first memory dump was "Access Violation occurred writing address 0000000000000000". The command in the input buffer that generated the access violation was "Drop Availability Group". I noticed a fix in CU22 that talks about an access violation when dropping a DAG if the AG is in a suspect state. I am wondering if something like that might have happened.

New Questions
-------------
I reviewed the registry settings for these AGs; they have entries in the configuration folder. I am not sure whether they are correct or expected.

1. Is it possible for the DAG migration to have caused this?
2. Is it possible for something like this to happen even on a regular AG (no DAG)?
3. Can we remove an AG from the WSFC if we cannot access the instance from SQL?
                        
Asked by SqlData (39 rep)
Apr 25, 2024, 04:32 AM
Last activity: Apr 29, 2024, 11:09 AM