How to troubleshoot a disk controller on Illumos-based systems?

2 votes
1 answer
342 views
I am using OmniOS, which is based on Illumos. I have a ZFS pool of two mirrored SSDs; the pool, named data, is reporting its %b as 100. Below is the output of iostat -xn:

        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
        0.0    8.0    0.0   61.5  8.7  4.5 1092.6  556.8  39 100 data

Unfortunately, there is not actually much throughput going on; iotop reports about 23552 bytes a second.

I also ran iostat -E and it reported quite a few Transport Errors; we changed the port and they went away.

I figured there might be an issue with the drives, but SMART reports no issues. I have run smartctl -t short and smartctl -t long multiple times; no issues were reported.

I ran fmadm faulty and it reported the following:

    --------------- ------------------------------------  -------------- ---------
    TIME            EVENT-ID                              MSG-ID         SEVERITY
    --------------- ------------------------------------  -------------- ---------
    Jun 01 18:34:01 5fdf0c4c-5627-ccaa-d41e-fc5b2d282ab2  ZFS-8000-D3    Major

    Host        : sys1
    Platform    : xxxx-xxxx   Chassis_id  : xxxxxxx
    Product_sn  :

    Fault class : fault.fs.zfs.device
    Affects     : zfs://pool=data/vdev=cad34c3e3be42919
                      faulted but still in service
    Problem in  : zfs://pool=data/vdev=cad34c3e3be42919
                      faulted but still in service

    Description : A ZFS device failed.  Refer to http://illumos.org/msg/ZFS-8000-D3 for
                  more information.

    Response    : No automated response will occur.

    Impact      : Fault tolerance of the pool may be compromised.

    Action      : Run 'zpool status -x' and replace the bad device.

As it suggests, I ran zpool status -x, and it reports that all pools are healthy.

I ran some DTrace scripts and found that all the IO activity is from `` (for the file), which is metadata; so there actually isn't any file IO going on.
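As a sanity check on the iostat -xn row quoted above, the queue lengths can be compared against Little's law (average number in a stage = arrival rate x average time in that stage). This is only a sketch: the numbers are copied from the data row, and the variable names are made up for illustration. It assumes wsvc_t and asvc_t are average times in milliseconds, as the iostat man page describes them.

```python
# Values copied from the `data` row of the iostat -xn output in the question.
reads_per_s = 0.0     # r/s
writes_per_s = 8.0    # w/s
kw_per_s = 61.5       # kw/s, kilobytes written per second
wsvc_t_ms = 1092.6    # average time a request spends in the wait queue (ms)
asvc_t_ms = 556.8     # average time a request spends being serviced (ms)

iops = reads_per_s + writes_per_s

# Little's law: average occupancy = arrival rate * average time in the stage.
expected_wait = iops * wsvc_t_ms / 1000.0   # compare with the `wait` column
expected_actv = iops * asvc_t_ms / 1000.0   # compare with the `actv` column

# Average size of a single write, in kilobytes.
avg_write_kb = kw_per_s / writes_per_s

print(f"expected wait: {expected_wait:.1f}  (iostat reported 8.7)")
print(f"expected actv: {expected_actv:.1f}  (iostat reported 4.5)")
print(f"average write size: {avg_write_kb:.1f} KB")
```

Both computed values match the reported wait and actv columns, so the row is internally consistent: roughly 8 small writes per second (under 8 KB each on average) are each taking over a second and a half from enqueue to completion. That points at latency per request rather than bandwidth as the reason %b sits at 100.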
When I run kstat -p zone_vfs it reports the following:

    zone_vfs:0:global:100ms_ops     21412
    zone_vfs:0:global:10ms_ops      95554
    zone_vfs:0:global:10s_ops       1639
    zone_vfs:0:global:1s_ops        20752
    zone_vfs:0:global:class         zone_vfs
    zone_vfs:0:global:crtime        0
    zone_vfs:0:global:delay_cnt     0
    zone_vfs:0:global:delay_time    0
    zone_vfs:0:global:nread         69700628762
    zone_vfs:0:global:nwritten      42450222087
    zone_vfs:0:global:reads         14837387
    zone_vfs:0:global:rlentime      229340224122
    zone_vfs:0:global:rtime         202749379182
    zone_vfs:0:global:snaptime      168018.106250637
    zone_vfs:0:global:wlentime      153502283827640
    zone_vfs:0:global:writes        2599025
    zone_vfs:0:global:wtime         113171882481275
    zone_vfs:0:global:zonename      global

The high counts of 1s_ops and 10s_ops are very concerning. I suspect the controller, but I can't be sure. Does anyone have any ideas, or know where I can get more info?
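To put the slow-operation counters in the kstat output above into proportion, here is a rough sketch that parses the kstat -p text and expresses each latency bucket as a share of all reads and writes. The helper name parse_kstat is made up for illustration. The sketch assumes the counters are cumulative totals and makes no claim about whether the buckets overlap.

```python
# kstat -p zone_vfs output, copied from the question.
KSTAT_OUTPUT = """\
zone_vfs:0:global:100ms_ops 21412
zone_vfs:0:global:10ms_ops 95554
zone_vfs:0:global:10s_ops 1639
zone_vfs:0:global:1s_ops 20752
zone_vfs:0:global:class zone_vfs
zone_vfs:0:global:crtime 0
zone_vfs:0:global:delay_cnt 0
zone_vfs:0:global:delay_time 0
zone_vfs:0:global:nread 69700628762
zone_vfs:0:global:nwritten 42450222087
zone_vfs:0:global:reads 14837387
zone_vfs:0:global:rlentime 229340224122
zone_vfs:0:global:rtime 202749379182
zone_vfs:0:global:snaptime 168018.106250637
zone_vfs:0:global:wlentime 153502283827640
zone_vfs:0:global:writes 2599025
zone_vfs:0:global:wtime 113171882481275
zone_vfs:0:global:zonename global
"""


def parse_kstat(text):
    """Map the last component of each module:instance:name:statistic key to its raw string value."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key, then value; works for tabs or spaces
        if len(parts) != 2:
            continue
        key, value = parts
        stats[key.rsplit(":", 1)[-1]] = value.strip()
    return stats


stats = parse_kstat(KSTAT_OUTPUT)
total_ops = int(stats["reads"]) + int(stats["writes"])
slow = {b: int(stats[b]) for b in ("10ms_ops", "100ms_ops", "1s_ops", "10s_ops")}

for bucket, count in slow.items():
    share = 100.0 * count / total_ops
    print(f"{bucket:>9}: {count:>6}  ({share:.3f}% of {total_ops} ops)")
```

Because these are running totals, a single snapshot cannot show whether the slow operations are still happening. Taking two snapshots a known interval apart and subtracting them would show how many operations are landing in the 1s_ops and 10s_ops buckets right now.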
Asked by user26053
Jun 3, 2015, 07:29 PM
Last activity: Feb 4, 2019, 12:50 PM