How to troubleshoot disk controller on Illumos based systems?
2 votes · 1 answer · 342 views
I am using OmniOS which is based off of Illumos.
I have a ZFS pool of two mirrored SSDs; the pool, named data, is reporting its %b as 100. Below is the output of iostat -xn:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 8.0 0.0 61.5 8.7 4.5 1092.6 556.8 39 100 data
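For reference, the data line is the pool-level aggregate; running the same command with an interval also prints one line per underlying cXtXdX device, which shows whether one side of the mirror is the slow one (-z just hides idle devices):
iostat -xnz 5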
Unfortunately, there is not actually much throughput going on; iotop reports about 23552 bytes per second.
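Per-vdev bandwidth and operation counts can also be cross-checked at the ZFS level (pool name data taken from above):
zpool iostat -v data 5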
I also ran iostat -E and it reported quite a few transport errors; we changed the port and they went away.
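iostat -En prints the cumulative soft/hard/transport error counters plus vendor and model information per device; a specific device name can be appended (c1t0d0 here is just a placeholder):
iostat -En
iostat -En c1t0d0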
I figured there might be an issue with the drives, but SMART reports no issues; I've run multiple smartctl -t short and smartctl -t long self-tests, and no issues were reported.
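For completeness, the full SMART attribute table and self-test log can be dumped per drive; the device path and the -d sat transport type below are assumptions that depend on the controller:
smartctl -a -d sat /dev/rdsk/c1t0d0   # hypothetical device path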
I ran fmadm faulty and it reported the following:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 01 18:34:01 5fdf0c4c-5627-ccaa-d41e-fc5b2d282ab2 ZFS-8000-D3 Major
Host : sys1
Platform : xxxx-xxxx Chassis_id : xxxxxxx
Product_sn :
Fault class : fault.fs.zfs.device
Affects : zfs://pool=data/vdev=cad34c3e3be42919
faulted but still in service
Problem in : zfs://pool=data/vdev=cad34c3e3be42919
faulted but still in service
Description : A ZFS device failed. Refer to http://illumos.org/msg/ZFS-8000-D3
for more information.
Response : No automated response will occur.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
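The telemetry behind that diagnosis is kept in the FMA error log, which can be dumped for more detail (the -t date below is just an example start time):
fmdump -eV
fmdump -eV -t 01Jun15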
As the fmadm output suggests, I ran zpool status -x and it reports that all pools are healthy.
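Since zpool status -x only lists pools it considers unhealthy, the per-device error counters are worth checking directly:
zpool status -v data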
I ran some DTrace scripts and found that all the I/O activity is from `` (for the file), which is metadata; so there actually isn't any file I/O going on.
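A one-liner along these lines (the io provider, aggregating on pathname) is what produces that per-file breakdown; metadata and raw device I/O typically show up without a real pathname:
dtrace -n 'io:::start { @[args[2]->fi_pathname] = count(); } tick-10s { exit(0); }'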
When I run kstat -p zone_vfs it reports the following:
zone_vfs:0:global:100ms_ops 21412
zone_vfs:0:global:10ms_ops 95554
zone_vfs:0:global:10s_ops 1639
zone_vfs:0:global:1s_ops 20752
zone_vfs:0:global:class zone_vfs
zone_vfs:0:global:crtime 0
zone_vfs:0:global:delay_cnt 0
zone_vfs:0:global:delay_time 0
zone_vfs:0:global:nread 69700628762
zone_vfs:0:global:nwritten 42450222087
zone_vfs:0:global:reads 14837387
zone_vfs:0:global:rlentime 229340224122
zone_vfs:0:global:rtime 202749379182
zone_vfs:0:global:snaptime 168018.106250637
zone_vfs:0:global:wlentime 153502283827640
zone_vfs:0:global:writes 2599025
zone_vfs:0:global:wtime 113171882481275
zone_vfs:0:global:zonename global
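Sampling the slow-operation counters at an interval (every 5 seconds here) shows whether they are still accumulating:
kstat -p zone_vfs:0:global:1s_ops 5
kstat -p zone_vfs:0:global:10s_ops 5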
The high counts of 1s_ops and 10s_ops are very concerning.
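To tell whether the latency is being added at the physical device level rather than in the VFS layer, a quantized histogram of block I/O completion times can be collected; this is a generic sketch, not specific to this pool:
dtrace -n 'io:::start { ts[arg0] = timestamp; } io:::done /ts[arg0]/ { @["I/O latency (ns)"] = quantize(timestamp - ts[arg0]); ts[arg0] = 0; }'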
I'm thinking it's the controller, but I can't be sure. Does anyone have any ideas, or know where I can get more information?
Asked by user26053
Jun 3, 2015, 07:29 PM
Last activity: Feb 4, 2019, 12:50 PM