Sample Header Ad - 728x90

Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

3 votes
1 answers
103 views
What does it mean if all of my wait_resources in the System Health Extended Event are the same pointer?
It's notoriously impossible to decode [what the wait_resource pointer in the System Health Extended Event means](https://dba.stackexchange.com/questions/320594/identifying-wait-resource-for-wait-info-extended-events). But if all of my long waits in that Extended Event for a certain time period are r...
It's notoriously impossible to decode [what the wait_resource pointer in the System Health Extended Event means](https://dba.stackexchange.com/questions/320594/identifying-wait-resource-for-wait-info-extended-events) . But if all of my long waits in that Extended Event for a certain time period are reporting the same pointer, then does that indicate anything at all? I'm experiencing high PAGELATCH_SH waits for this time period and expect that the reason why all of pointers are the same will be related to that. I've not included the XML here because I am absolutely certain that it won't help answer the question. I'm just trying to fathom if there is ever a case where I should pay attention to the wait_resource value in this Extended Event when it is obviously a pointer.
J. Mini (1225 rep)
Mar 5, 2025, 06:28 PM • Last activity: Mar 11, 2025, 03:04 AM
1 votes
1 answers
118 views
What is an example to understand that having multiple data files helps with latch pages in parallel?
> You will always benefit from multiple data files, because then SQL > Server is able to latch multiple system pages in parallel in the > Buffer Pool. Link: https://www.sqlpassion.at/archive/2016/08/29/files-and-file-groups-in-sql-server/ I currently have one data file. I'm exploring the use case of...
> You will always benefit from multiple data files, because then SQL > Server is able to latch multiple system pages in parallel in the > Buffer Pool. Link: https://www.sqlpassion.at/archive/2016/08/29/files-and-file-groups-in-sql-server/ I currently have one data file. I'm exploring the use case of adding a new data file, or increasing the size of an existing data file. Given that my IO subsystem (Data drive) can withstand high IOPS, I understand that there is no performance improvement whether I have one or more data files. And because there is no limit to how many threads can use the data file for read/write in parallel, SQL server must allow parallel latch of pages. So then, why does the author say that - having multiple data files gives the Server ability latch multiple system pages in parallel in the Buffer Pool. Why would this not be possible with single data file?
variable (3590 rep)
Jan 17, 2024, 07:07 PM • Last activity: Feb 1, 2024, 10:32 AM
2 votes
1 answers
13075 views
Non-yielding Scheduler Error with dump caused by heavy queries on SQL 2016
After bringing our databases to new servers with new hardware and new os (2012 server) and new SQL Server version (2016 sp1) we face the following issue: During nightly reporting database update runs, the SQL Server service suddenly gets into trouble. Those report DB updates are performed using `mer...
After bringing our databases to new servers with new hardware and new os (2012 server) and new SQL Server version (2016 sp1) we face the following issue: During nightly reporting database update runs, the SQL Server service suddenly gets into trouble. Those report DB updates are performed using merge statements that collect, update and insert delta within multiple steps. The first step of the job started at 01:00 AM and caused around 500 Mio IO READS, but it succeeded. The second step started at 01:31 am but failed. It's all not highly efficient, I know and I also know there are still lots of indexes that have to be created to make those queries and the whole process more efficient. But this is not the point. Nobody cares if it would take some hours. The point is that so far on the old SQL Server systems, those same procedures took long, but they never made the SQL Server stumble. Now, we observe the following: The jobs start and run for a while, around 30 minutes and complete some job steps. During 01:35 and 01:55 AM the CPU load gets close to or to 100%. Between 01:55 AM and 02:11 AM the load varies btw 70% an 95%. In the SQL logs we see those entries btw 01.35 and 01:53.: > **Unable to get thread context for spid 0 > BEGIN STACK DUMP: > 05/18/17 01:35:02 spid 5416 > Non-yielding Scheduler > Stack Signature for the dump is 0x0000000000000266 > External dump process return code 0x20000001. > External dump process returned no errors. > > Process 62:0:7 (0x3ab4) Worker 0x00000066A0FD8160 appears to be non-yielding on Scheduler 0. Thread creation time: 13139533134164. > Approx Thread CPU Used: kernel 0 ms, user 50156 ms. Process > Utilization 97%. System Idle 1%. Interval: 73484 ms. > > Process 62:0:6 (0x3b64) Worker 0x00000066A0FBC160 appears to be non-yielding on Scheduler 1. Thread creation time: 13139503419122. > Approx Thread CPU Used: kernel 0 ms, user 57343 ms. Process > Utilization 95%. System Idle 3%. Interval: 78640 ms. > > Process 0:0:0 (0x2464) Worker 0x00000066A0E42160 appears to be > non-yielding on Scheduler 2. Thread creation time: 13139456436014. > Approx Thread CPU Used: kernel 15 ms, user 62515 ms. Process > Utilization 94%. System Idle 4%. Interval: 71437 ms. > > Process 62:0:5 (0x1688) Worker 0x00000066A0E46160 appears to be > non-yielding on Scheduler 3. Thread creation time: 13139225220768. > Approx Thread CPU Used: kernel 0 ms, user 53156 ms. Process > Utilization 95%. System Idle 3%. Interval: 73437 ms. > > SQL Trace ID 2 was started by login "mydomain\mySentryOneServiceUser". > > I/O is frozen on database SDB. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup. > > A time-out occurred while waiting for buffer latch -- type 3, bp 000000751B746240, page 4:218376, stat 0x10f, database id: 2, > allocation unit Id: 6488064, task 0x00000066A0FEE8C8 : 5, waittime 300 > seconds, flags 0x1000000019, owning task 0x00000066A0DF5848. Not > continuing to wait. At 01:53 the netapp SnapManager tasks start to fail taking snapshots: > Error: 3041, Severity: 16, State: 1. BACKUP failed to complete the > command BACKUP DATABASE DB1. Check the backup application log for > detailed messages. Between the same timeslots, Job Agent log shows this: 01:36 AM - 02:08 AM > SQLServer Error: 65535, SQL Server Network Interfaces: Error > Locating Server/Instance Specified [xFFFFFFFF]. [SQLSTATE 08001] > ODBC Error: 0, Login timeout expired [SQLSTATE HYT00] > SQLServer Error: 65535, A network-related or instance-specific error has > occurred while establishing a connection to SQL Server. Server is not > found or not accessible. Check if instance name is correct and if SQL > Server is configured to allow remote connections. For more information > see SQL Server Books Online. [SQLSTATE 08001] > Logon to server 'xxx\xxx' failed (ConnUpdateStartExecutionDate) Most of them repeat multiple times per minute. This is the CPU spec: INTEL XEON E5-3690 v3 30M 4x 2,6GHz 4 virtual CPUs Software: Microsoft SQL Server Enterprise (64-bit) 13.0.4001.0 Instance Settings: All settings are default, MAXDOP = 4, CTFP = 50, system memory=64 GB, SQL MAX_MEMORY = 58982 CONTENT OF DUMP.txt: Current time is 01:35:02 05/18/17. BugCheck Dump This file is generated by Microsoft SQL Server version 13.0.4001.0 upon detection of fatal unexpected error. Please return this file, the query or program that produced the bugcheck, the database and the error log, and any other pertinent information with a Service Request. Computer type is Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz. Bios Version is INTEL - 6040000 PhoenixBIOS 4.0 Release 6.0 4 X64 level 8664, 2 Mhz processor (s). Windows NT 6.2 Build 9200 CSD . Memory MemoryLoad = 94% Total Physical = 65535 MB Available Physical = 3317 MB Total Page File = 73535 MB Available Page File = 10601 MB Total Virtual = 134217727 MB Available Virtual = 134041378 MB ***Unable to get thread context for spid 0 * ******************************************************************************* * * BEGIN STACK DUMP: * 05/18/17 01:35:02 spid 5416 * * Non-yielding Scheduler * * ******************************************************************************* What can I do to prevent the system from running into trouble and getting all those error log entries? Is there configuration to check?
Magier (4827 rep)
May 18, 2017, 09:55 AM • Last activity: Mar 3, 2023, 01:02 PM
5 votes
1 answers
454 views
NOLOCK or Read Uncommitted locking / latching behaviours
I've read many, many different takes on how NOLOCK or the read uncommitted isolation levels function in terms of locks / latches taken. When using a SELECT with NOLOCK or in the Read Uncommitted Isolation level, is the only lock taken out a Schema Stability Lock, or do shared locks get taken on a ro...
I've read many, many different takes on how NOLOCK or the read uncommitted isolation levels function in terms of locks / latches taken. When using a SELECT with NOLOCK or in the Read Uncommitted Isolation level, is the only lock taken out a Schema Stability Lock, or do shared locks get taken on a rolling basis as the query works through the rows? (Obviously these locks would need to be dropped straight away) What about latches? How are the pages in memory dealt with as I assume referencing an in-memory object that's in the middle of being modified isn't allowed?
George.Palacios (5591 rep)
Jul 5, 2019, 12:30 PM • Last activity: Feb 12, 2022, 05:38 PM
6 votes
1 answers
4493 views
DBCC CHECKDB on a user database: Time-out occurred while waiting for buffer latch type 2 for page (X:XXX), database ID 2
Our Ola Hallengren IntegrityCheck job failed due to a buffer latch timeout while running DBCC CHECKDB on a user database. However, the buffer latch timeout reported was in TempDB (database ID 2). Output from the job: ``` Date and time: 2022-01-22 09:04:15 [SQLSTATE 01000] Database context: [master]...
Our Ola Hallengren IntegrityCheck job failed due to a buffer latch timeout while running DBCC CHECKDB on a user database. However, the buffer latch timeout reported was in TempDB (database ID 2). Output from the job:
Date and time: 2022-01-22 09:04:15 [SQLSTATE 01000]
Database context: [master] [SQLSTATE 01000]
Command: SET LOCK_TIMEOUT 600000; DBCC CHECKDB ([SentryOne]) WITH NO_INFOMSGS, ALL_ERRORMSGS, MAXDOP = 4 [SQLSTATE 01000]
Msg 845, Sev 17, State 1, Line 1 : Time-out occurred while waiting for buffer latch type 2 for page (6:222), database ID 2. [SQLSTATE 42000]
Outcome: Failed [SQLSTATE 01000]
Duration: 12:40:32 [SQLSTATE 01000]
Date and time: 2022-01-22 21:44:47 [SQLSTATE 01000]
Messages in the SQL Error Log:
Date 1/22/2022 9:35:22 PM
Log SQL Server (Archive #1 - 1/23/2022 12:00:00 AM)

Source spid777

Message
A time-out occurred while waiting for buffer latch -- type 2, bp 0000016D7A1DE340, page 6:222, stat 0x40d, database id: 2, allocation unit Id: 536870912/281475513581568, task 0x0000016CD624E4E8 : 2, waittime 300 seconds, flags 0x1a, owning task 0x0000016CD624E4E8. Not continuing to wait.

BobMgr::GetBuf: Sort Big Output Buffer write not complete after 60 seconds.

DBCC CHECKDB (SentryOne) WITH all_errormsgs, no_infomsgs, maxdop = 4 executed by NT SERVICE\SQLSERVERAGENT terminated abnormally due to error state 6. Elapsed time: 12 hours 39 minutes 57 seconds.

[INFO] Database ID: . Cleaning up StorageArray. LastClosedCheckpointEndTs: '1225176490'

[WARNING] ALTER or DROP TABLE could not clean up root row within 10 seconds.
The page in question is an IAM page in TempDB. DBCC PAGE output for it:
PAGE: (6:222)

BUFFER:

BUF @0x0000016D6B622C40

bpage = 0x00000161A6C52000 bPmmpage = 0x0000000000000000 bsort_r_nextbP = 0x0000016D6B622AD0
bsort_r_prevbP = 0x0000016D6B622B80 bhash = 0x0000000000000000 bpageno = (6:222)
bpart = 4 ckptGen = 0x0000000000000000 bDirtyRefCount = 0
bstat = 0x9 breferences = 3 berrcode = 0
bUse1 = 19321 bstat2 = 0x0 blog = 0x215a215a
bsampleCount = 1 bIoCount = 0 resPoolId = 0
bcputicks = 330 bReadMicroSec = 848 bDirtyContext = 0x0000000000000000
bDbPageBroker = 0x0000000000000000 bdbid = 2 bpru = 0x0000016D3A170040

PAGE HEADER:

Page @0x00000161A6C52000

m_pageId = (6:222) m_headerVersion = 1 m_type = 10
m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x200
m_objId (AllocUnitId.idObj) = 1837007664 m_indexId (AllocUnitId.idInd) = 28675
Metadata: AllocUnitId = 8071415347312328704
Metadata: PartitionId = 2162691495132069888 Metadata: IndexId = 0
Metadata: ObjectId = -1083263203 m_prevPage = (0:0) m_nextPage = (0:0)
pminlen = 90 m_slotCnt = 2 m_freeCnt = 6
m_freeData = 8182 m_reservedCnt = 0 m_lsn = (324:1435184:42)
m_xactReserved = 0 m_xdesId = (0:0) m_ghostRecCnt = 0
m_tornBits = -551517712 DB Frag ID = 1

Allocation Status

GAM (6:2) = ALLOCATED SGAM (6:3) = ALLOCATED
PFS (6:1) = 0x70 IAM_PG MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (6:6) = NOT CHANGED
ML (6:7) = NOT MIN_LOGGED

IAM: Header @0x00000036C4278064 Slot 0, Offset 96

sequenceNumber = 0 status = 0x0 objectId = 0
indexId = 0 page_count = 0 start_pg = (8:511232)

IAM: Single Page Allocations @0x00000036C427808E

Slot 0 = (0:0) Slot 1 = (0:0) Slot 2 = (0:0)
Slot 3 = (0:0) Slot 4 = (0:0) Slot 5 = (0:0)
Slot 6 = (0:0) Slot 7 = (0:0)

IAM: Extent Alloc Status Slot 1 @0x00000036C42780C2

(8:511232) - (8:592656) = NOT ALLOCATED
(8:592664) - = ALLOCATED
(8:592672) - (8:770424) = NOT ALLOCATED
I found this DBA StackExchange question which seems to be most closely relevant: https://dba.stackexchange.com/questions/117864/getting-error-time-out-occurred-while-waiting-for-buffer-latch-type-2-for-page However, in that scenario, the timed-out page reported is in a user database and not TempDB. Also, we're on the latest SQL Server version (2019-CU14 Enterprise) so this should not be happening anyway. I did notice that there was very high latency on TempDB data files around 6 PM and 9:40 PM. I'm unsure whether it's related, though, because the latency spike at 6 PM is higher and yet the job failed only around 9:44 PM. TempDB Data Files Latency These latency spikes do not correlate with any spikes in the Azure Monitor metrics VM Cached IOPS Consumed Percentage and VM Cached Bandwidth Consumed Percentage. Additional information: - The server is installed on an Azure VM: Standard E8s v3 (8 vcpus, 64 GiB memory) - The user database in question (SentryOne) has memory-optimized tables. - TempDB Memory-Optimized metadata is DISABLED. - Instant File Initialization is ENABLED. - Lock Pages in Memory is DISABLED. Any ideas why this happened and how to prevent it in the future? Thanks!
Eitan Blumin (483 rep)
Jan 23, 2022, 11:44 AM • Last activity: Jan 23, 2022, 03:51 PM
2 votes
1 answers
297 views
LATCH_EX and CX_PACKETS for a stored proc seen when timing out
I am trying to troubleshoot an issue with below things noticed:- There is this stored proc which gets timed out on some random days while other days or after retry may complete faster. Time out setting 1 hour. When completes, it does under 15 mins. What we have noticed below for this DB server- - Th...
I am trying to troubleshoot an issue with below things noticed:- There is this stored proc which gets timed out on some random days while other days or after retry may complete faster. Time out setting 1 hour. When completes, it does under 15 mins. What we have noticed below for this DB server- - The database in question is a LS standby readonly database which keeps getting restored every 4 hours from Primary. - Therefore the SP resides on secondary LS read only database and user query once its accessible. - After restore is completed certain mornings we see this SP gets timed out with below as TOP 2 waits LATCH_EX and CXPACKETS (wait time showing accumulating upto 8 hours for that 1 hour run, possibly because MAXDOP is 8) However when we retry the same proc after next cycle of restore of database, same SP just completes under 15 minutes. When it Completes- - We do see CXPACKETS but that wait time is very less - We do not see any LATCH_EX waits What possibly is going on here and how can i troubleshoot this further? We have already shared the code level tuning which can be done but DEVs are more interested why same SP gets completed in under 15 mins sometime and fails may be one random day in a week Server level settings: > MAXDOP=8 CTOP=5 Edit- We found within SP the particular statement which tends to be the slowest amongst all and spending 90% of that 1 hour time is insert into # (temptable) after selecting data from multiple tables via job going parallel. - When the issue is there above statement also waits on LATCH_EX in addition to CXPACKETS - When running faster CXPACKETS are there but no LATCH_EX waits
Newbie-DBA (804 rep)
Aug 10, 2021, 04:25 PM • Last activity: Aug 16, 2021, 03:15 PM
2 votes
1 answers
1810 views
SQL Server - High buffer time
I'm noticing a high number of waitingRequestsCount on SQL server. The latch_class is buffer. See attached screenshots. Going through SQL server dashboard, I see high number (32221334) waits for Buffer Pool. Can someone please explain what could contribute to high number of waits?[![enter image descr...
I'm noticing a high number of waitingRequestsCount on SQL server. The latch_class is buffer. See attached screenshots. Going through SQL server dashboard, I see high number (32221334) waits for Buffer Pool. Can someone please explain what could contribute to high number of waits?enter image description here
Adit Bhatia (21 rep)
Oct 15, 2019, 03:13 PM • Last activity: Feb 20, 2020, 11:00 PM
2 votes
1 answers
1800 views
Differences of latch and lock
Recently when i was trying to find information about latch_ex wait type, i come across one blog as below which stated about latch and lock. [![enter image description here][1]][1] [1]: https://i.sstatic.net/2DdrR.png After reading this blog , i m just curious in one thing. When app submit a request,...
Recently when i was trying to find information about latch_ex wait type, i come across one blog as below which stated about latch and lock. enter image description here After reading this blog , i m just curious in one thing. When app submit a request, SQL server will look for the information in buffer cache first and if the page not in buffer cache ,then only it will read from disk and put it in buffer cache before send info to app. My question based on the screenshot as above where it stated latch and lock need to avoid two thread updating same page. Basically all request come to SQL server will go to buffer cache first, if the page in buffer cache is busy updating, another thread will have to wait. It wont go back to disk because the page already in memory. Then what is the purpose of lock since every request will be done through memory and there is latch to protect the page
jack (79 rep)
Nov 9, 2019, 04:29 PM • Last activity: Nov 9, 2019, 07:00 PM
0 votes
1 answers
50 views
Wait Stats on Microsoft Azure test
On a particular test we're doing with Microsoft Azure, we are seeing a wait_category called: PAGE LATCH (non-I/O) that we're unfamiliar with. In our test, CaptureSeconds = 1595 and for PAGE LATCH (non-I/O) it reports: wait_time_ms = 126666216 wait_time_ms_per_sec = 79364 max_waiting_tasks = 13086110...
On a particular test we're doing with Microsoft Azure, we are seeing a wait_category called: PAGE LATCH (non-I/O) that we're unfamiliar with. In our test, CaptureSeconds = 1595 and for PAGE LATCH (non-I/O) it reports: wait_time_ms = 126666216 wait_time_ms_per_sec = 79364 max_waiting_tasks = 13086110 average_wait_time_ms = 9 max_Wait_time_ms = 2526 Can someone tell me what PAGE LATCH (non-I/O) wait category is and whether this is indicating a problem of some sort? Thanks, RW
Robert West (21 rep)
Oct 3, 2018, 03:06 PM • Last activity: Oct 3, 2018, 03:24 PM
0 votes
1 answers
2641 views
ACCESS_METHODS_DATASET_PARENT Problem
We had an AlwaysOn Async replica we had some problems so we deleted the availability group After that the database and CPU seem to work abnormally It's very slow and we have waittype = LATCH_EX , ACCESS_METHODS_DATASET_PARENT We didn't have such a problem till today Can anybody help ?????
We had an AlwaysOn Async replica we had some problems so we deleted the availability group After that the database and CPU seem to work abnormally It's very slow and we have waittype = LATCH_EX , ACCESS_METHODS_DATASET_PARENT We didn't have such a problem till today Can anybody help ?????
Milad Firouzi (139 rep)
Jul 19, 2016, 11:42 AM • Last activity: Jul 15, 2018, 10:06 AM
5 votes
1 answers
678 views
Database ID of a non-existent database - time-out waiting for buffer latch on page appearing in error log
**Background:** I've recently started at a new shop with lots of databases in various states. I arrived this morning (having just been added to the SQL DBA mailing group) to find an email about the CHECKDB job failing last night on one of the servers (one I've been warned has been problematic). I wo...
**Background:** I've recently started at a new shop with lots of databases in various states. I arrived this morning (having just been added to the SQL DBA mailing group) to find an email about the CHECKDB job failing last night on one of the servers (one I've been warned has been problematic). I won't dwell on the details of this here, it appears to be down to a number of time-outs waiting for buffer latches - by itself it's hard to tell why this is, as there should've been (and appears to have been) very little system usage last night when this was run. There are ongoing problems with I/O requests longer than 15 seconds in the logs, and the vmware guys say that there have been communication problems with storage (on-going network issues), so this goes someway to explain the problem I think. **Important Note:** I reran CHECKDB this morning and it completely quickly, finding 0 errors and as far as I'm aware nothing has happened since the failure last night, so (correct me if I'm wrong), I'm pretty confident there is no database problems, and last night was an anomaly, possibly triggered by network or storage problems (I'll know more about this when the network guys get back to me... if they do). **The actual point of this:** In the logs, right at the point where last night's CHECKDB job failed, there is a time-out waiting for a buffer latch - the strange thing to me is that the database id of the page it's supposedly trying to latch is 11, and there is no database with that id on the instance - ids only go up to 10. This is followed up by the CHECKDB line with 1 errors found, 0 repaired, pointing to an Internal Database snapshot having a split point (see below for error log) **Specific Questions:** 1. Why would there be a reference to a database ID that doesn't exist (do Internal database snapshots get assigned a database ID at some point and that's what it was referring to)? 2. Is there anyway of finding out what the database snapshot in question is (is it something to do with the way CHECKDB works, or is it something else)? 3. Is there a way of finding what happened to the snapshot in question (as it obviously wasn't a problem when I ran CHECKDB again this morning). **Error log:** 01/04/2017 23:50 spid101 Unknown DBCC CHECKDB (zenworks_UAL) WITH no_infomsgs executed by ARTSLOCAL\svc_zcmsql found 1 errors and repaired 0 errors. Elapsed time: 0 hours 28 minutes 51 seconds. Internal database snapshot has split point LSN = 00adf51f:00019bf4:0001 and first LSN = 00adf51f:00019bf3:0001. 01/04/2017 23:50 spid101 Unknown Unable to read and latch page (1:2067184) with latch type SH. Latch failed. 01/04/2017 23:50 spid101 Unknown Error: 8966 Severity: 16 State: 2. 01/04/2017 23:50 spid101 Unknown A time-out occurred while waiting for buffer latch -- type 2 bp 00000004FFFF4A80 page 1:2067184 stat 0xc2040d **database id: 11** allocation unit Id: 0/281474976710656 task 0x000000015BA6A088 : 0 waittime 300 flags 0x100000001a owning task 0x000000015BA6A088. Not continuing to wait. enter image description here **Note:** *As far as a know, no one else was working the database or running anything significant between the failure last night and me running CHECKDB this morning. The system was in use, but not significantly. Nothing is showing in the logs.* *To confirm, there is no DB of ID 11, and I don't think there ever was (and as certain as I can be not between last night and this morning* Query: SELECT db_name(11) Results: NULL Query: SELECT * FROM sys.databases Results: ╔════════════════════╦═════════════╗ ║ name ║ database_id ║ ╠════════════════════╬═════════════╣ ║ master ║ 1 ║ ║ tempdb ║ 2 ║ ║ model ║ 3 ║ ║ msdb ║ 4 ║ ║ ReportServer ║ 5 ║ ║ ReportServerTempDB ║ 6 ║ ║ z_xxxxxx_L ║ 7 ║ ║ z_xxxxxx_p ║ 8 ║ ║ S_xxxxxx_t ║ 9 ║ ║ z_xxxxxx_t ║ 10 ║ ╚════════════════════╩═════════════╝
Ian_H (1674 rep)
Jan 5, 2017, 03:28 PM • Last activity: Mar 29, 2018, 08:18 PM
5 votes
3 answers
17253 views
How to reduce huge CXPACKET & LATCH_EX (ACCESS_METHODS_DATASET_PARENT) wait times?
**Problem** We have been experiencing high levels of user disruption due to SQL timeouts accross our systems since the beginning of the year. The SQL-Server instance in question has very high CPU usage (higher than 90% on all 16 cores all the time) during business hours. We have also noticed very hi...
**Problem** We have been experiencing high levels of user disruption due to SQL timeouts accross our systems since the beginning of the year. The SQL-Server instance in question has very high CPU usage (higher than 90% on all 16 cores all the time) during business hours. We have also noticed very high wait times: The combination of CXPACKET & LATCH_EX accounts for about 97% of all waits. This is split about 50/50 between CXPACKET & LATCH_EX. The non-buffered latch wait accounting for the vast majority (>95%) of LATCH_EX is ACCESS_METHODS_DATASET_PARENT. This suggests the problem is to do with parallelism. An example of the scale of wait times is: CXPACKET : 332,301,799 ms LATCH_EX : 267,955,752 ms PAGEIOLATCH_SH : 2,955,160 ms This was for the period between 08:00-16:24 on Jan 11th. **Options under consideration** 1) Change MAXDOP from 0 to something between 4 and 8 2) Modify the cost threshold of parallelism from 50 to a higher number Suggestions most welcome on how to ease the very high CPU load we are seeing, and reduce timeouts, in particular whether the proposed course of action is wise, and which numbers to change MAXDOP and cost threshold of parallellism to. **Background information** - SQL-Server 2008 R2 running on AMD Opteron 6180 SE, of which 16 cores are given to this instance of SQL-Server. - Type of workload: something of the order of 800 connections at the same time during business hours; majority OLTP type workload with some OLAP mixed in. - Microsoft SQL Server 2008 R2 (SP1) - 10.50.2500.0 (X64) ... Enterprise Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1). Memory is appx 128 Gigs between 24 cores. 16 of the cores are available to this instance
matskm (53 rep)
Jan 13, 2016, 10:59 AM • Last activity: Jun 1, 2017, 10:14 AM
6 votes
1 answers
96 views
SELECT xyz INTO #Temp from SSMS is 5x longer in one environment than another
First time posting, I hope I'm doing it right. I'm a "dba by accident" who is learning a lot but admittedly still has a lot more to learn. Here is my brain dump: Prod server has 5 million rows. I copied those rows to a Test server. Using SNAPSHOT or READ UNCOMMITTED produce different elapsed times,...
First time posting, I hope I'm doing it right. I'm a "dba by accident" who is learning a lot but admittedly still has a lot more to learn. Here is my brain dump:
  1. Prod server has 5 million rows. I copied those rows to a Test server.
  2. Using SNAPSHOT or READ UNCOMMITTED produce different elapsed times, but the ratio between Prod and Test is always ~5:1 (14 minutes:3 minutes with SNAPSHOT, 4 minutes:sub-1 with READ UNCOMMITTED)
  3. Using STATISTICS IO ON reveals that most of the reads on Prod are logical (~90%) while none of the reads on Test are logical. I think that should mean Prod would be faster if anything.
  4. Most of the time appears to be going to waits of type "PAGEIOLATCH_SH."
  5. I don't think different statistics can matter, since no index should be used on an unfiltered, unjoined, unsorted query.
  6. SELECTing the top n rows on both servers reveals a very linear scale... the ratio is always ~5:1.
  7. The comparison is not apples-to-apples, but I think the only relevant differences are activity and storage.
  8. I have performed the comparison when activity was next to nothing, and saw the same results.
  9. I discovered that autogrowth was set to 1MB (ugh), which I have since changed but I wonder if this has led to fragmented storage at the file system level.
  10. The database is on a separate volume from the Windows installation, but it does share a volume with tempdb and with a FILESTREAM store, which I assume could also lead to fragmented storage.
  11. The "optimize drives" windows utility says the volume is "OK (98% space efficiency)" which seems to suggest that maybe file system fragmentation isn't a problem.
Questions:
  • Are my thoughts and assumptions above correct?
  • Is there anything I'm blatantly overlooking in my thought process?
  • What are good candidates for the next steps in this troubleshooting process?
  • What other information can/should I have provided?
  • Thank you for your attention and help! EDIT: Added rest of time information in 2nd statement.
Kyle Chandler (63 rep)
Dec 29, 2015, 10:45 PM • Last activity: Dec 30, 2015, 05:45 AM
2 votes
1 answers
221 views
Non Buffer Latch - Forefront Identity Manager
We are using Forefront Identity Manger and we are doing an initial load to which we are experiencing slow performance. I always start with wait types for such things and can see `%30 CXPACKET, %25 LATCH_EX and %15 ASYNC_NETWORK_IO.` I understand that `LATCH_EX` is a non buffer wait type so I looked...
We are using Forefront Identity Manger and we are doing an initial load to which we are experiencing slow performance. I always start with wait types for such things and can see %30 CXPACKET, %25 LATCH_EX and %15 ASYNC_NETWORK_IO. I understand that LATCH_EX is a non buffer wait type so I looked in sys.dm_os_latch_stats and can see that %87 of the latch waits is down to ACCESS_METHODS_DATASET_PARENT. Is this a problem? Even on Paul Randals blog I am unclear what my issue is if at all. Thanks
Tom (1569 rep)
Aug 14, 2015, 03:04 PM • Last activity: Aug 14, 2015, 04:12 PM
Showing page 1 of 14 total questions