How to troubleshoot disk controller on Illumos based systems?
2 votes · 1 answer · 342 views
I am using OmniOS which is based off of Illumos.
I have a ZFS pool of two mirrored SSDs; the pool, named data, is reporting its %b as 100. Below is the output of iostat -xn:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 8.0 0.0 61.5 8.7 4.5 1092.6 556.8 39 100 data
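For reference, the data line is the pool-level aggregate; running the same command with an interval also prints one line per underlying cXtXdX device, which shows whether one side of the mirror is the slow one (-z just hides idle devices):
iostat -xnz 5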
Unfortunately, there is not actually much throughput going on; iotop reports about 23552 bytes per second.
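Per-vdev bandwidth and operation counts can also be cross-checked at the ZFS level (pool name data taken from above):
zpool iostat -v data 5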
I also ran iostat -E and it reported quite a few transport errors; we changed the port and they went away.
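iostat -En prints the cumulative soft/hard/transport error counters plus vendor and model information per device; a specific device name can be appended (c1t0d0 here is just a placeholder):
iostat -En
iostat -En c1t0d0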
I figured there might be an issue with the drives, but SMART reports no issues; I've run multiple smartctl -t short and smartctl -t long self-tests, and no issues were reported.
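For completeness, the full SMART attribute table and self-test log can be dumped per drive; the device path and the -d sat transport type below are assumptions that depend on the controller:
smartctl -a -d sat /dev/rdsk/c1t0d0   # hypothetical device path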
I ran fmadm faulty and it reported the following:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 01 18:34:01 5fdf0c4c-5627-ccaa-d41e-fc5b2d282ab2 ZFS-8000-D3 Major
Host : sys1
Platform : xxxx-xxxx Chassis_id : xxxxxxx
Product_sn :
Fault class : fault.fs.zfs.device
Affects : zfs://pool=data/vdev=cad34c3e3be42919
faulted but still in service
Problem in : zfs://pool=data/vdev=cad34c3e3be42919
faulted but still in service
Description : A ZFS device failed. Refer to http://illumos.org/msg/ZFS-8000-D3
for more information.
Response : No automated response will occur.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
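The telemetry behind that diagnosis is kept in the FMA error log, which can be dumped for more detail (the -t date below is just an example start time):
fmdump -eV
fmdump -eV -t 01Jun15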
As the fmadm output suggests, I ran zpool status -x and it reports that all pools are healthy.
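Since zpool status -x only lists pools it considers unhealthy, the per-device error counters are worth checking directly:
zpool status -v data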
I ran some DTrace scripts and found that all the I/O activity is from `` (for the file), which is metadata; so there actually isn't any file I/O going on.
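A one-liner along these lines (the io provider, aggregating on pathname) is what produces that per-file breakdown; metadata and raw device I/O typically show up without a real pathname:
dtrace -n 'io:::start { @[args[2]->fi_pathname] = count(); } tick-10s { exit(0); }'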
When I run kstat -p zone_vfs it reports the following:
zone_vfs:0:global:100ms_ops 21412
zone_vfs:0:global:10ms_ops 95554
zone_vfs:0:global:10s_ops 1639
zone_vfs:0:global:1s_ops 20752
zone_vfs:0:global:class zone_vfs
zone_vfs:0:global:crtime 0
zone_vfs:0:global:delay_cnt 0
zone_vfs:0:global:delay_time 0
zone_vfs:0:global:nread 69700628762
zone_vfs:0:global:nwritten 42450222087
zone_vfs:0:global:reads 14837387
zone_vfs:0:global:rlentime 229340224122
zone_vfs:0:global:rtime 202749379182
zone_vfs:0:global:snaptime 168018.106250637
zone_vfs:0:global:wlentime 153502283827640
zone_vfs:0:global:writes 2599025
zone_vfs:0:global:wtime 113171882481275
zone_vfs:0:global:zonename global
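Sampling the slow-operation counters at an interval (every 5 seconds here) shows whether they are still accumulating:
kstat -p zone_vfs:0:global:1s_ops 5
kstat -p zone_vfs:0:global:10s_ops 5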
The high counts of 1s_ops and 10s_ops are very concerning.
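To tell whether the latency is being added at the physical device level rather than in the VFS layer, a quantized histogram of block I/O completion times can be collected; this is a generic sketch, not specific to this pool:
dtrace -n 'io:::start { ts[arg0] = timestamp; } io:::done /ts[arg0]/ { @["I/O latency (ns)"] = quantize(timestamp - ts[arg0]); ts[arg0] = 0; }'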
I'm thinking it's the controller, but I can't be sure. Does anyone have any ideas, or know where I can get more information?
Asked by user26053
Jun 3, 2015, 07:29 PM
Last activity: Feb 4, 2019, 12:50 PM