
Isolating I/O issue: NVMe SSD or hardware?

0 votes
2 answers
3320 views
Hardware:

- Samsung 980 PRO M.2 NVMe SSD (MZ-V8P2T0BW, 2TB)
- Beelink GTR6, with the SSD in the NVMe slot

Since the hardware arrived, I've installed Ubuntu Server on it along with a bunch of services (mostly in Docker: databases and services like Kafka). After 2-3 days of uptime (the record is almost a week, but usually it's 2-3 days), I typically start getting buffer I/O errors on the NVMe drive (which is also the boot drive):

[screenshot 1: buffer I/O errors]

If I'm quick enough I can still log in via SSH, but the system becomes increasingly unstable until commands themselves start failing with an I/O error. When I did manage to log in, the system seemed to think there were no NVMe SSDs connected at all:

[screenshot 2: no NVMe SSDs detected]

Another instance of the buffer I/O errors on the NVMe drive:

[screenshot 3: buffer I/O errors]

Because of this, and while trying to check everything I could find, I ran fsck on boot to see if there was anything obvious; output like the following is quite common after the hard reset:

```
# cat /run/initramfs/fsck.log
Log of fsck -C -f -y -V -t ext4 /dev/mapper/ubuntu--vg-ubuntu--lv
Fri Dec 30 17:26:21 2022

fsck from util-linux 2.37.2
[/usr/sbin/fsck.ext4 (1) -- /dev/mapper/ubuntu--vg-ubuntu--lv] fsck.ext4 -f -y -C0 /dev/mapper/ubuntu--vg-ubuntu--lv
e2fsck 1.46.5 (30-Dec-2021)
/dev/mapper/ubuntu--vg-ubuntu--lv: recovering journal
Clearing orphaned inode 524449 (uid=1000, gid=1000, mode=0100664, size=6216)
Pass 1: Checking inodes, blocks, and sizes
Inode 6947190 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947197 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947204 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947212 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947408 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947414 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947829 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947835 extent tree (at level 1) could be shorter.  Optimize? yes
Inode 6947841 extent tree (at level 1) could be shorter.  Optimize? yes
Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (401572584, counted=405399533).
Fix? yes
Free inodes count wrong (121360470, counted=121358242).
Fix? yes

/dev/mapper/ubuntu--vg-ubuntu--lv: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/ubuntu--vg-ubuntu--lv: 538718/121896960 files (0.2% non-contiguous), 82178067/487577600 blocks
fsck exited with status code 1
Fri Dec 30 17:26:25 2022
----------------
```
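Since the box usually needs a hard reset after this, the console errors are easy to lose. A minimal sketch of how the previous boot's kernel messages can be kept around and searched afterwards (assuming systemd-journald on Ubuntu Server 22.04; the grep pattern is only illustrative):

```
# Make the systemd journal persistent so kernel errors survive the hard reset
# (with volatile storage the journal is lost on reboot).
sudo mkdir -p /var/log/journal
sudo sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
sudo systemctl restart systemd-journald

# After the next crash, pull the kernel messages from the previous boot and
# look for NVMe controller resets / buffer I/O errors.
journalctl -k -b -1 --no-pager | grep -iE 'nvme|i/o error'
```

(Of course, if the drive drops out completely, later journal writes will fail too, but the earlier errors usually make it to disk.)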
Running smart-log doesn't seem to show anything concerning either, other than the number of unsafe shutdowns (the number of times this has happened so far)...

```
# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 32 C (305 Kelvin)
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 8,544,896
data_units_written                      : 5,175,904
host_read_commands                      : 39,050,379
host_write_commands                     : 191,366,905
controller_busy_time                    : 1,069
power_cycles                            : 21
power_on_hours                          : 142
unsafe_shutdowns                        : 12
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1                    : 32 C (305 Kelvin)
Temperature Sensor 2                    : 36 C (309 Kelvin)
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
```

I have reached out to support, and their initial suggestion (along with a bunch of questions) was to ask whether I had tried reinstalling the OS. I've given this a go too, formatting the drive and reinstalling the OS (Ubuntu Server 22.04 LTS). After that, the issue stayed away for 4 days before it finally showed itself again, this time as a kernel panic:

[screenshot: kernel panic]

Any ideas what I can do to identify whether the problem is with the SSD itself or with the hardware it's slotted into (the GTR6)? I have until the 31st to return the SSD, so I would love to pin down the most likely cause sooner rather than later...

I'm even more concerned after seeing reports that others are having serious drive-health issues with the Samsung 990 Pro: https://www.reddit.com/r/hardware/comments/10jkwwh/samsung_990_pro_ssd_with_rapid_health_drops/

Edit: although I realised those reported issues are with the 990 Pro, not the 980 Pro that I have!

Edit2: someone on Overclockers was kind enough to suggest HD Sentinel, which does show a health metric, and it seems OK:

```
# ./hdsentinel-019c-x64
Hard Disk Sentinel for LINUX console 0.19c.9986 (c) 2021 info@hdsentinel.com
Start with -r [reportfile] to save data to report, -h for help

Examining hard disk configuration ...
HDD Device  0: /dev/nvme0
HDD Model ID : Samsung SSD 980 PRO 2TB
HDD Serial No: S69ENL0T905031A
HDD Revision : 5B2QGXA7
HDD Size     : 1907729 MB
Interface    : NVMe
Temperature  : 41 °C
Highest Temp.: 41 °C
Health       : 99 %
Performance  : 100 %
Power on time: 21 days, 12 hours
Est. lifetime: more than 1000 days
Total written: 8.30 TB
  The status of the solid state disk is PERFECT. Problematic or weak sectors were not found.
  The health is determined by SSD specific S.M.A.R.T. attribute(s): Available Spare (Percent), Percentage Used
  No actions needed.
```

Lastly, none of the tools I tried (such as nvme smart-log above) seem to report a single health metric like this. How can I check this in Ubuntu? A sketch of roughly what I'm after is below.

Thanks!
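For reference, the sketch mentioned above (assuming smartmontools from the standard Ubuntu repositories and that /dev/nvme0 is the right device; nvme-cli is already in use per the smart-log output earlier):

```
# smartmontools understands NVMe and prints an overall pass/fail verdict,
# e.g. "SMART overall-health self-assessment test result: PASSED"
sudo apt install smartmontools
sudo smartctl -H /dev/nvme0

# Full attribute dump: "Percentage Used" and "Available Spare" are the same
# fields HD Sentinel derives its health percentage from.
sudo smartctl -a /dev/nvme0

# The NVMe error log may surface controller-level problems that the
# smart-log counters above don't break out.
sudo nvme error-log /dev/nvme0
```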
Asked by Tiago (101 rep)
Jan 26, 2023, 10:57 AM
Last activity: Jul 18, 2025, 09:03 AM