Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

6 votes

1 answer

10569 views

How to check/fix nvme health?

I'm running Debian stable with a 2 x NVMe RAID 1.  
Here is the hardware/hoster it's running on:  
https://www.hetzner.com/dedicated-rootserver/ex62-nvme?country=us  
Almost every second day, mdadm monitoring reports a Fail event and leaves the array degraded.  
It only disables one partition, as you can see here:

    This is an automatically generated mail message from mdadm
    running on xxx
    
    A Fail event had been detected on md device /dev/md/2.
    
    It could be related to component device /dev/nvme1n1p3.
    
    Faithfully yours, etc.
    
    P.S. The /proc/mdstat file currently contains the following:
    
    Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
    md2 : active raid1 nvme1n1p3(F) nvme0n1p3
          465895744 blocks super 1.2 [2/1] [U_]
          bitmap: 4/4 pages [16KB], 65536KB chunk
    
    md0 : active (auto-read-only) raid1 nvme1n1p1 nvme0n1p1
          33521664 blocks super 1.2 [2/2] [UU]
          
    md1 : active raid1 nvme0n1p2 nvme1n1p2
          523712 blocks super 1.2 [2/2] [UU]
          
    unused devices: <none>

This happens on both disks. One time it's nvme0n1p3 and next time it's nvme1n1p3.  
I then just re-add the failed partition with  

    mdadm --re-add /dev/md2 /dev/nvme0n1p3

or

    mdadm --re-add /dev/md2 /dev/nvme1n1p3

and after the resync it works for a day or two.

In dmesg I found this:

    [94879.144892] nvme nvme1: I/O 311 QID 1 timeout, reset controller
    [94879.252851] nvme nvme1: completing aborted command with status: 0007
    [94879.252970] blk_update_request: I/O error, dev nvme1n1, sector 452352001
    [94879.253091] nvme nvme1: completing aborted command with status: fffffffc
    [94879.253223] blk_update_request: I/O error, dev nvme1n1, sector 68159504
    [94879.253418] md: super_written gets error=-5

I tried to check the health of the devices with these commands, but they don't give me stats like "Reallocated_Sector_Ct" or "Reported_Uncorrect".

    smartctl -x /dev/nvme1

    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Number:                       KXG50ZNV512G TOSHIBA
    Serial Number:                      28SS10F6TYST
    Firmware Version:                   AAGA4102
    PCI Vendor/Subsystem ID:            0x1179
    IEEE OUI Identifier:                0x00080d
    Total NVM Capacity:                 512,110,190,592 [512 GB]
    Unallocated NVM Capacity:           0
    Controller ID:                      0
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Mon May 13 10:34:11 2019 CEST
    Firmware Updates (0x14):            2 Slots, no Reset required
    Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
    Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
    Maximum Data Transfer Size:         512 Pages
    Warning  Comp. Temp. Threshold:     78 Celsius
    Critical Comp. Temp. Threshold:     82 Celsius
    Namespace 1 Features (0x02):        NA_Fields
    
    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     6.00W       -        -    0  0  0  0        0       0
     1 +     2.40W       -        -    1  1  1  1        0       0
     2 +     1.90W       -        -    2  2  2  2        0       0
     3 -   0.0500W       -        -    3  3  3  3     1500    1500
     4 -   0.0050W       -        -    4  4  4  4     6000   14000
     5 -   0.0030W       -        -    5  5  5  5    50000   80000
    
    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         2
     1 -    4096       0         1
    
    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
    Critical Warning:                   0x00
    Temperature:                        47 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    57%
    Data Units Read:                    31,858,921 [16.3 TB]
    Data Units Written:                 293,589,002 [150 TB]
    Host Read Commands:                 4,130,502,428
    Host Write Commands:                889,121,505
    Controller Busy Time:               13,552
    Power Cycles:                       7
    Power On Hours:                     6,720
    Unsafe Shutdowns:                   0
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               47 Celsius
    
    Error Information (NVMe Log 0x01, max 128 entries)
    No Errors Logged

    nvme smart-log /dev/nvme1

    Smart Log for NVME device:nvme1 namespace-id:ffffffff
    critical_warning                    : 0
    temperature                         : 47 C
    available_spare                     : 100%
    available_spare_threshold           : 10%
    percentage_used                     : 57%
    data_units_read                     : 31,858,921
    data_units_written                  : 293,589,023
    host_read_commands                  : 4,130,502,429
    host_write_commands                 : 889,122,059
    controller_busy_time                : 13,552
    power_cycles                        : 7
    power_on_hours                      : 6,720
    unsafe_shutdowns                    : 0
    media_errors                        : 0
    num_err_log_entries                 : 0
    Warning Temperature Time            : 0
    Critical Composite Temperature Time : 0
    Temperature Sensor 1                : 47 C
    Temperature Sensor 2                : 0 C
    Temperature Sensor 3                : 0 C
    Temperature Sensor 4                : 0 C
    Temperature Sensor 5                : 0 C
    Temperature Sensor 6                : 0 C
    Temperature Sensor 7                : 0 C
    Temperature Sensor 8                : 0 C

    nvme smart-log-add /dev/nvme1

    NVMe Status:INVALID_LOG_PAGE(4109)

    smartctl -A /dev/nvme1
    
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF SMART DATA SECTION ===
    SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
    Critical Warning:                   0x00
    Temperature:                        46 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    57%
    Data Units Read:                    31,858,924 [16.3 TB]
    Data Units Written:                 293,591,327 [150 TB]
    Host Read Commands:                 4,130,502,490
    Host Write Commands:                889,172,096
    Controller Busy Time:               13,552
    Power Cycles:                       7
    Power On Hours:                     6,721
    Unsafe Shutdowns:                   0
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               46 Celsius

I only noticed the issue after Apache failed to start and I had to repair the filesystem with fsck.ext4 -f. Before that, I hadn't set up root mail correctly.

So it looks to me like a hardware error, and that I should replace both NVMe drives.  
Is there anything I can try to fix these issues and save the drives? Or at least to get all the SMART values like "Reported_Uncorrect" or "Offline_Uncorrectable"?
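
For reading the NVMe output above: NVMe drives report a fixed SMART/Health log rather than the ATA attribute table, so names such as "Reallocated_Sector_Ct" do not appear there. The following is only a sketch for pulling out the fields that seem closest in spirit and for checking the size figures. The helper names `nvme_units_to_tb` and `nvme_health_fields` are hypothetical, and the conversion assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes.

```sh
#!/bin/sh
# Convert NVMe "data units" to decimal terabytes.
# Assumption: 1 data unit = 1000 * 512 bytes (NVMe SMART/Health log convention).
nvme_units_to_tb() {
    # Strip the thousands separators that smartctl prints.
    units=$(printf '%s' "$1" | tr -d ',')
    LC_ALL=C awk -v u="$units" 'BEGIN { printf "%.1f\n", u * 512000 / 1e12 }'
}

# Keep only the wear and error counters from `smartctl -A` output on stdin.
nvme_health_fields() {
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Media and Data Integrity Errors|Error Information Log Entries):'
}

# Example with the value from the output above.
nvme_units_to_tb 293,589,002
```

With the real tool installed, `smartctl -A /dev/nvme1 | nvme_health_fields` would narrow the output to those five lines. The conversion agrees with the bracketed "[150 TB]" that smartctl prints next to Data Units Written.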



                                

treffner (61 rep)

May 13, 2019, 08:55 AM • Last activity: May 17, 2025, 06:00 AM

2 votes

0 answers

270 views

How to configure smartd, s-nail and selinux to get sending mails to work?

fedora selinux mail-transport-agent smartmontools

I am trying to configure smartd to send mails via s-nail on Fedora 41.
I created a .mailrc file in root's home directory (in which I set the mta variable to send directly via smtps; there is no sendmail installed) and can successfully send mails via:

    echo "Test" | mail -s Test 
I also managed to send mails in a bash script started by a custom systemd service.
But smartd isn't able to send mails. The following error is shown in the log:

    Executing test of /usr/libexec/smartmontools/smartdnotify to  ...    
    Test of /usr/libexec/smartmontools/smartdnotify to  produced unexpected output (163 bytes) to STDOUT/STDERR:
    s-nail: Cannot start /usr/sbin/sendmail: executable not found (adjust *mta* variable)
    s-nail: Cannot save to $DEAD: Permission denied
    s-nail: ... message not sent

SELinux is blocking access to the .mailrc file (so s-nail falls back to /usr/sbin/sendmail as its default):

    type=AVC msg=audit(1744370186.375:606): avc: denied { read } for pid=42644 comm="mail" name=".mailrc" dev="nvme0n1p3" ino=140324 scontext=system_u:system_r:smartdwarn_t:s0 tcontext=unconfined_u:object_r:mail_home_t:s0 tclass=file permissive=0

I tried the suggested

    ausearch -c 'mail' --raw | audit2allow -M my-mail
    semodule -X 300 -i my-mail.pp
    systemctl restart smartd.service
a couple of times until no new SELinux errors appeared. Now I get the following error:

    Test of /usr/libexec/smartmontools/smartdnotify to  produced unexpected output (130 bytes) to STDOUT/STDERR:
    s-nail: could not initiate TLS connection: error:00000000:lib(0)::reason(0)
    /root/dead.letter 23/578
    s-nail: ... message not sent

s-nail can now access the .mailrc file and connect to the server, but communication with the server still fails (error 0?). The content of the mail is written to the dead.letter file instead.

What could be the reason for this? Is it an improper SELinux configuration?
Am I missing an SELinux option? Do I have to switch to a different MTA client?
                                

AckderIII (21 rep)

Apr 10, 2025, 09:17 PM • Last activity: Apr 11, 2025, 11:55 AM

1 vote

3 answers

1472 views

Can we automatically wait the required time for smartmontools/smartctl?

smartctl smartmontools

Can we do something like this in a script (preferably zsh):
    
    smartctl -t long /dev/sda
    smartctl -t long /dev/sdb
    smartctl -t long /dev/sdc
    
    [Wait however long smartctl needs]
    
    smartctl -H /dev/sda
    smartctl -H /dev/sdb
    smartctl -H /dev/sdc

As is obvious, I'm just trying to automate this.
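
One way the waiting step could look, as a minimal sketch rather than a definitive answer: start the tests, then poll each drive until it stops reporting a test in progress. It assumes ATA drives whose `smartctl -c` output contains the text "Self-test routine in progress" while a test runs. The names `selftest_running`, `wait_for_selftest`, `run_long_tests` and the `POLL_SECONDS` variable are made up for this example. It is written in plain POSIX sh, which zsh should also be able to run.

```sh
#!/bin/sh
# Seconds between polls; a hypothetical knob for this sketch, not a smartctl setting.
POLL_SECONDS=${POLL_SECONDS:-60}

# Succeed while the drive reports a self-test in progress.
# Assumption: ATA-style `smartctl -c` output.
selftest_running() {
    smartctl -c "$1" | grep -q 'Self-test routine in progress'
}

# Block until no self-test is running on the given device.
wait_for_selftest() {
    while selftest_running "$1"; do
        sleep "$POLL_SECONDS"
    done
}

# Start long tests on every drive first, then wait for each and print its health.
run_long_tests() {
    for dev in "$@"; do
        smartctl -t long "$dev"
    done
    for dev in "$@"; do
        wait_for_selftest "$dev"
        smartctl -H "$dev"
    done
}

# Only act when devices are passed on the command line.
if [ "$#" -gt 0 ]; then
    run_long_tests "$@"
fi
```

Saved under a name of your choice, it would be called with the drives as arguments, for example `/dev/sda /dev/sdb /dev/sdc`. Starting all tests before waiting lets the drives run them in parallel. Note that `smartctl -H` prints the overall health assessment; the result of the self-test itself is listed by `smartctl -l selftest`.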

                                

Ray Andrews (2615 rep)

Aug 8, 2017, 12:37 AM • Last activity: Nov 9, 2024, 11:05 AM

4 votes

2 answers

1270 views

smartctl lies that NVME has lifespan of ~2800TBW? What is the real lifespan of my NVME?

smartctl nvme smartmontools

smartctl -x on my Samsung SSD 860 EVO M.2 2TB shows:

    Device Statistics (GP Log 0x04)
    Page  Offset Size        Value Flags Description
    0x01  =====  =               =  ===  == General Statistics (rev 1) ==
    0x01  0x008  4            1132  ---  Lifetime Power-On Resets
    0x01  0x010  4            6584  ---  Power-on Hours
    0x01  0x018  6     59675855461  ---  Logical Sectors Written
    0x01  0x020  6      1711777462  ---  Number of Write Commands
    0x01  0x028  6     51882440157  ---  Logical Sectors Read
    0x01  0x030  6      1869976194  ---  Number of Read Commands
    0x01  0x038  6          293000  ---  Date and Time TimeStamp
    0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
    0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
    0x04  0x010  4              97  ---  Resets Between Cmd Acceptance and Completion
    0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
    0x05  0x008  1              40  ---  Current Temperature
    0x05  0x020  1              64  ---  Highest Temperature
    0x05  0x028  1              18  ---  Lowest Temperature
    0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
    0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
    0x06  0x008  4           20530  ---  Number of Hardware Resets
    0x06  0x010  4               0  ---  Number of ASR Events
    0x06  0x018  4               0  ---  Number of Interface CRC Errors
    0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
    0x07  0x008  1               1  N--  Percentage Used Endurance Indicator
                                    |||_ C monitored condition met
                                    ||__ D supports DSN
                                    |___ N normalized value

A paltry 28 TB written sounds a little low for the year I've had this NVMe drive, but it's believable. However, the Percentage Used Endurance Indicator is only at 1%. That would suggest there's still around 100x that, or 2800 TBW, left in this device, which is more than twice the rated 1200 TBW, so it can't be a rounding error. Is smartctl lying? (Not that it would lie; I mean, is my drive lying to smartctl, is smartctl misinterpreting my drive, and so on?) How do I find out for sure the real TBW life remaining in my drive?
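
The arithmetic behind these figures can be checked with a short sketch. It assumes 512-byte logical sectors (the sector size is not shown in the output above) and takes the 1200 TBW rating from the question. `sectors_to_tb` and `percent_of_rating` are hypothetical helper names.

```sh
#!/bin/sh
# Convert a count of logical sectors to decimal terabytes.
# Assumption: 512-byte logical sectors unless a size is given as second argument.
sectors_to_tb() {
    LC_ALL=C awk -v s="$1" -v size="${2:-512}" 'BEGIN { printf "%.2f\n", s * size / 1e12 }'
}

# Express terabytes written as a percentage of a rated endurance in TBW.
percent_of_rating() {
    LC_ALL=C awk -v tb="$1" -v rated="$2" 'BEGIN { printf "%.1f\n", tb / rated * 100 }'
}

# "Logical Sectors Written" from the Device Statistics output above.
written=$(sectors_to_tb 59675855461)
echo "written: $written TB"
echo "share of 1200 TBW: $(percent_of_rating "$written" 1200)%"
```

Under those assumptions the counter works out to roughly 30.55 TB in decimal units, which is about 27.8 TiB and so lines up with the 28 TB figure above, or about 2.5% of the 1200 TBW rating. The sketch only compares host writes against the rating; it says nothing about how the drive derives its own Percentage Used indicator.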

Jack G (269 rep)

Jul 18, 2024, 01:33 AM • Last activity: Jul 18, 2024, 01:42 PM

1 vote

2 answers

752 views

Disable automatic S.M.A.R.T. tests

debian smartctl smartmontools

I have a small (single-board) Zimaboard server that is running Debian 12 (bookworm) 24/7 in my bedroom. This server has a single HDD hooked up, which remains unmounted and in sleep mode except for a daily back-up cycle (after which it is unmounted and goes back to sleep).

Unfortunately, every Monday at 00:45 AM the server decides to wake up the HDD (and myself...) to execute what sounds like a short S.M.A.R.T test. I then have to grab my phone and issue a sleep command to stop the HDD from making its typical HDD noises afterwards (humming, occasional clicks, ...). As you can imagine, this is incredibly annoying, so I want to fix it.

I first looked for crontab schedules (executed as user or root), but I didn't see anything relevant. journalctl --since ... --until ... didn't report anything useful either. The only uncommented line in /etc/smartd.conf says: DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner. I don't see anything there that could point to a weekly maintenance schedule.

Is there any way to find out what triggered the S.M.A.R.T. test (or similar)? Where do I look for log entries? And how do I prevent it from executing in the middle of the night?

MPA (113 rep)

May 4, 2024, 12:40 PM • Last activity: May 6, 2024, 07:51 PM

2 votes

0 answers

490 views

Are SMART offline data collection and offline attributes obsolete?

smartctl smart smartmontools

**TLDR;** I tried to understand the difference between SMART Offline and Always attributes, and thus to understand what SMART offline data collection is, and if I should enable it on my HDDs.

smartmontools' official wiki states that:

> Note that the SMART automatic offline test command is listed as Obsolete in every version of the ATA and ATA/ATAPI Specifications. (...) However it is implemented and used by many vendors.

After some extensive reading on the web, and also some tests, the conclusion I reached is:

- Nowadays SMART offline data collection is obsolete.
- All data is updated in real time (e.g. Offline and Always attributes behave the same way).
- There is no need to enable "Auto Offline Data Collection" (# smartctl --offlineauto=on /dev/sda), nor to ever launch it manually (# smartctl -t offline /dev/sda).
- As for the reason all this offline stuff is still in smartmontools, it's probably to keep it compatible with some very old HDDs that indeed implemented real offline attributes.

Am I right? Or am I missing something?

----------

**MORE DETAILS**

I did some tests on an HDD that has 3 offline attributes (and has Auto Offline Data Collection disabled):

    # smartctl -a /dev/sda
    (...)
    Offline data collection status:  (0x00)    Offline data collection activity
                        was never started.
                        Auto Offline Data Collection: Disabled.
    (...)
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       235 (114 97 0)
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       13381561756
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       20472945077
    (...)

I then wrote some data to that drive, and noted that all 3 attributes were updated in real time. Thus, they are in fact online (or Always) attributes, and not Offline ones. I did the same test on a few other HDDs; the behavior was identical.
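
The before/after comparison described above could be scripted roughly like this. It is only a sketch: `offline_attrs` is a hypothetical helper, and it assumes the ten-column ATA attribute table that `smartctl -A` prints, where the eighth column is UPDATED and the tenth is the start of RAW_VALUE.

```sh
#!/bin/sh
# Print "ID NAME RAW" for every attribute whose UPDATED column reads "Offline".
# Expects ATA `smartctl -A` output on stdin; only the first word of RAW_VALUE is kept.
offline_attrs() {
    awk '$1 ~ /^[0-9]+$/ && $8 == "Offline" { print $1, $2, $10 }'
}

# Intended use (commented out because it needs a real drive):
#   smartctl -A /dev/sda | offline_attrs > before.txt
#   ... write some data to the drive ...
#   smartctl -A /dev/sda | offline_attrs > after.txt
#   diff before.txt after.txt
```

If the two snapshots differ without an offline data collection run in between, the attributes flagged Offline are being updated on the fly, which is the behavior reported above.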

ChennyStar (1969 rep)

Nov 6, 2023, 04:55 PM • Last activity: Nov 6, 2023, 05:01 PM

0 votes

0 answers

871 views

`smartctl` and `smartd` commands not working

hard-disk syslog error-handling smartctl smartmontools

I have been receiving hard disk warnings from the smartd daemon for a while now (every 24 hours), saying that my error log count has increased. I have been trying to investigate this by checking my log files, but I haven't found anything. The errors are reported for my disk /dev/nvme0. To find out what the problem is, I have been trying to scan the disk with the smartctl or smartd commands, but I keep getting "command not found". I have the packages installed, and I can even open the manual pages (man smartctl and man smartd). How can I resolve this issue?

The warning I receive is the following:

tobibox (1 rep)

Oct 23, 2023, 09:30 AM

1 vote

1 answer

232 views

SMART error (CurrentPendingSector) and (OfflineUncorrectableSector)

hard-disk smartctl smart smartmontools

I have been receiving the following error messages every day for several months now, and I do not know how to stop receiving these messages.

CurrentPendingSector

This message was generated by the smartd daemon running on:

   host name:  myhost
   DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 6 Currently unreadable (pending) sectors

Device info:
KingFast, S/N:03112222C0002, FW:U0803A0, 256 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Feb  3 19:41:29 2023 PST
Another message will be sent in 24 hours if the problem persists.

OfflineUncorrectableSector

This message was generated by the smartd daemon running on:

   host name:  myhost
   DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 3 Offline uncorrectable sectors

Device info:
KingFast, S/N:03112222C0002, FW:U0803A0, 256 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Feb  3 19:41:29 2023 PST
Another message will be sent in 24 hours if the problem persists.

smartctl -a /dev/sda

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KingFast
Serial Number:    03112222C0002
Firmware Version: U0803A0
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jul  8 15:44:59 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x11) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       6
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       3335
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       440
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       3
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       86
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       26
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       79004
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       481
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       6
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       114
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5050
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       98
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       6
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       88
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       35
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       86
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       168900
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       815543
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       191939

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3329         -
# 2  Short offline       Completed without error       00%      3325         -
# 3  Short offline       Completed without error       00%      3321         -
# 4  Short offline       Completed without error       00%      3313         -
# 5  Short offline       Completed without error       00%      3309         -
# 6  Short offline       Completed without error       00%      3306         -
# 7  Extended offline    Completed without error       00%      3250         -
# 8  Extended offline    Completed without error       00%      3232         -
# 9  Extended offline    Completed without error       00%      3229         -
#10  Extended offline    Completed without error       00%       976         -
#11  Extended offline    Completed without error       00%       968         -

Selective Self-tests/Logging not supported

I have tried to ignore the 197 and 198 errors in /etc/smartd.conf with

/dev/sda -d removable -n standby -H -l error -l selftest -f -t -I 197 -I 198 -s (S/../.././(01|09|17)|L/../../3/11) -m root -M exec /usr/share/smartmontools/smartd-runner

to no avail. I also do not see any LBA_of_first_error in the self-test section. To me, `SMART overall-health self-assessment test result: PASSED` suggests the drive is healthy, and the self-tests return no errors. My current understanding is that the disk is healthy but is still sending these messages erroneously. Is there something that I'm missing?

The /dev/sda drive is a KingFast 256 GB SSD; I'm not sure whether this is relevant, as I could not find anything online for this particular drive or manufacturer.

How can I stop receiving these messages while still keeping SMART monitoring for other genuine issues on the drive? And how would I fix the issue if this error message really does indicate a problem with the drive? Thanks!

Edit: After running smartctl -t long /dev/sda, I have

smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KingFast
Serial Number:    03112222C0002
Firmware Version: U0803A0
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jul  9 10:05:33 2023 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x03)	Offline data collection activity
					is in progress.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 241)	Self-test routine in progress...
					10% of test remaining.
Total time to complete Offline 
data collection: 		(  600) seconds.
Offline data collection
capabilities: 			 (0x11) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       6
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       3341
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       441
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       3
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       86
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       26
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       79553
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       482
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       6
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       115
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5050
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       98
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       6
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       88
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       46
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       3
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       86
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       170468
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       815560
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       193199

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3337         -
# 2  Short offline       Completed without error       00%      3329         -
# 3  Short offline       Completed without error       00%      3325         -
# 4  Short offline       Completed without error       00%      3321         -
# 5  Short offline       Completed without error       00%      3313         -
# 6  Short offline       Completed without error       00%      3309         -
# 7  Short offline       Completed without error       00%      3306         -
# 8  Extended offline    Completed without error       00%      3250         -
# 9  Extended offline    Completed without error       00%      3232         -
#10  Extended offline    Completed without error       00%      3229         -
#11  Extended offline    Completed without error       00%       976         -
#12  Extended offline    Completed without error       00%       968         -

Selective Self-tests/Logging not supported

The #12 Extended offline test completed without error, so I'm not really sure what I'm supposed to do from here.

**Edit #2:** I have also run the following, which I believe indicates that there are no errors with the drive:

badblocks -sv /dev/sda
Checking blocks 0 to 250059095
Checking for bad blocks (read-only test): done                                                 
Pass completed, 0 bad blocks found. (0/0/0 errors)

dd if=/dev/sda of=/dev/null bs=64K conv=noerror
3907173+1 records in
3907173+1 records out
256060514304 bytes (256 GB, 238 GiB) copied, 485.648 s, 527 MB/s
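
The attribute table above still shows non-zero raw values for IDs 5, 196, 197 and 198 (6, 3, 6 and 3), even though the self-tests and both read-only scans passed. Below is a minimal sketch for pulling just those rows out of `smartctl -A` output so they can be compared over time. It assumes the 10-column ATA attribute layout shown above, and `sector_health` is a made-up helper name. Raw values are vendor-specific, so this is a way to watch a trend, not a verdict on the drive.

```shell
#!/bin/sh
# Print ID, name and raw value for the sector-related attributes
# (5, 196, 197, 198) from `smartctl -A` style output read on stdin.
# Assumption: the raw value is the 10th whitespace-separated column,
# as in the table in this question.
sector_health() {
    awk '$1 ~ /^(5|196|197|198)$/ && NF >= 10 { print $1, $2, $10 }'
}

# Typical use (needs root and a real device; /dev/sda as in the question):
#   smartctl -A /dev/sda | sector_health
```

Saving the output with a date stamp every so often would show whether the counts keep growing or stay put.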

jameszp (93 rep)

Jul 9, 2023, 03:04 AM • Last activity: Jul 18, 2023, 06:01 PM

2 votes

2 answers

999 views

smartd output to screen, not email

notifications smart smartmontools


I'm trying to get smartd working. It seems determined to send messages by email via 'mail', which I've never used. But I recall that a few years ago I had smartd sending its warnings directly to the screen via a popup text box, and I'm trying to figure out how to do that again. The info for the 'screen' command baffles me, and tmux likewise. Or I suppose it could be a notification. When I have a few weeks to study it, I'll get 'mail' working, but for now I'd prefer a popup message anyway.

==================================================

In 'smartd.conf':

    DEVICESCAN -a -m  -M exec notify -M test

... Ok, added full path, much better:

    DEVICESCAN -a -m  -M exec /bin/notify -M test

... 'notify' runs fine from CLI, is an executable script:

    /bin/notify-send "$(systemctl status smartd)"

... but although:

     systemctl restart smartd; systemctl status smartd

... reports no errors, I get no 'test' result.

BTW, so far no results at all using those variables you mentioned.

... 

$ smartd ... shows two notifications, one for each of my two disks!  So why does 'systemctl restart smartd' show nothing?
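
One way to narrow this down is to take the desktop out of the picture and have the `-M exec` script write to a file. Below is a minimal sketch of such a script, assuming it replaces the `/bin/notify` script above. `SMARTD_DEVICE`, `SMARTD_FAILTYPE` and `SMARTD_MESSAGE` are environment variables that smartd exports to the program it runs. The function names and the default log path are made up for illustration. A possible explanation for the difference is that `notify-send` needs the user's session bus, which a system service normally does not have, but that is an assumption and not something confirmed here.

```shell
#!/bin/sh
# Hypothetical stand-in for the /bin/notify script from the question.
# smartd exports SMARTD_* variables to the program named by -M exec;
# this sketch formats three of them into a single line.
format_smartd_alert() {
    printf '%s [%s] %s\n' \
        "${SMARTD_DEVICE:-unknown-device}" \
        "${SMARTD_FAILTYPE:-unknown-type}" \
        "${SMARTD_MESSAGE:-no message}"
}

# Append the formatted line to a log file.
# The default path is an assumption; any location root can write to will do.
log_smartd_alert() {
    format_smartd_alert >> "${1:-/var/log/smartd-notify.log}"
}

# When installing this as the -M exec script, end the file with:
#   log_smartd_alert
```

If a line shows up in the log after `systemctl restart smartd`, then smartd is calling the script and only the popup step is failing.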

Ray Andrews (2615 rep)

Oct 17, 2022, 01:56 PM • Last activity: Oct 17, 2022, 10:08 PM

1 votes

2 answers

1479 views

Why does smartd need a database?

smartctl smart smartmontools


Does smartd need the database, or does smartctl need it?

I saw that the smartmontools GitHub project keeps updating the database:

https://github.com/smartmontools/smartmontools/labels/drivedb 

In my understanding, smartd scans all disks itself, so why does it need a database? What is the purpose of the database in smartd/smartctl?

Mark K (955 rep)

Aug 12, 2022, 03:32 AM • Last activity: Aug 12, 2022, 11:52 AM

0 votes

0 answers

581 views

Why can't smartd find my NVMe device?

nvme smartmontools


greentea smartd: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
greentea smartd: Problem creating device name scan list
greentea smartd: In the system's table of devices NO devices found to scan
greentea smartd: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...

My smartd.conf is unmodified and contains this line:

DEVICESCAN

I changed DEVICESCAN to /dev/nvme0n1 but still got the same error. I tested on Debian 9.

Mark K (955 rep)

Jul 13, 2022, 07:37 AM • Last activity: Jul 13, 2022, 09:53 AM

1 votes

1 answers

1322 views

Smartd how to wake disks only for scheduled scans

smartctl smartmontools


I am using smartd to monitor my disks in Ubuntu and have it configured to run a short scan daily at 2am and a long test at 3am every Saturday morning:

/dev/sda -a -n standby -o on -S on -s (S/../.././02|L/../../6/03) -m name@mail.com

I understand that smartd periodically polls the disks (every 30 minutes?), causing them to wake up if in standby, hence I have added the -n standby flag in the above config. However, this also stops the scheduled scans from running if the disk is in standby.

Is there a way to force the scheduled scans to start at the given times and wake the disks if needed, but stop the periodic polling from waking the disks?
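
For reference, the `-s` expression in the config above is matched against 12-character strings of the form `T/MM/DD/d/HH`, where `T` is the test type, `d` is the day of the week with 1 = Monday, and `HH` is the hour. Below is a minimal sketch for checking by hand which time slots an expression selects. It only imitates the matching with `grep` and does not talk to smartd, and `schedule_matches` and `current_slot` are made-up helper names.

```shell
#!/bin/sh
# Return success if the whole candidate string matches the -s expression.
#   $1 = extended regular expression taken from the -s directive
#   $2 = candidate string in T/MM/DD/d/HH form
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Build the candidate string for a test type at the current local time.
#   $1 = test type letter (S, L, C or O)
current_slot() {
    printf '%s/%s\n' "$1" "$(date '+%m/%d/%u/%H')"
}

# Example, using the expression from the question:
#   re='(S/../.././02|L/../../6/03)'
#   schedule_matches "$re" "$(current_slot S)" && echo "short test slot"
```

With that expression, `S/06/08/3/02` (a short test at 02:00 on a Wednesday) and `L/06/11/6/03` (a long test at 03:00 on a Saturday) both match, while `L/06/08/3/03` does not.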

Dibly (11 rep)

Jun 8, 2022, 09:24 AM • Last activity: Jun 8, 2022, 09:41 AM

1 votes

0 answers

287 views

SmartMonTools: tests get cancelled without any traces

hard-disk smartctl smartmontools


I am testing hard disks with SmartMonTools under Ubuntu 20.04.

Tests for some hard disks are not working - they disappear without leaving any warnings or errors.

Hard drive status before the test (note the time **Sun Mar 13 16:25:12 2022 UTC**):

    smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-104-generic] (local build)
    Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Device Model:     MB4000GDUPB
    Serial Number:    26F5K1J3F17A
    LU WWN Device Id: 5 000039 6db900727
    Firmware Version: HPG3
    User Capacity:    4,000,787,030,016 bytes [4.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Sun Mar 13 16:25:12 2022 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x82)	Offline data collection activity
    					was completed without error.
    					Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)	The previous self-test routine completed
    					without error or no self-test has ever 
    					been run.
    Total time to complete Offline 
    data collection: 		(  120) seconds.
    Offline data collection
    capabilities: 			 (0x7b) SMART execute Offline immediate.
    					Auto Offline data collection on/off support.
    					Suspend Offline collection upon new
    					command.
    					Offline surface scan supported.
    					Self-test supported.
    					Conveyance Self-test supported.
    					Selective Self-test supported.
    SMART capabilities:            (0x0003)	Saves SMART data before entering
    					power-saving mode.
    					Supports SMART auto save timer.
    Error logging capability:        (0x01)	Error logging supported.
    					General Purpose Logging supported.
    Short self-test routine 
    recommended polling time: 	 (   2) minutes.
    Extended self-test routine
    recommended polling time: 	 ( 532) minutes.
    Conveyance self-test routine
    recommended polling time: 	 (   2) minutes.
    SCT capabilities: 	       (0x0025)	SCT Status supported.
    					SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   100   100   050    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0007   100   100   050    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   100   100   002    Pre-fail  Always       -       11957
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   100   100   050    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
      9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       41134
     10 Spin_Retry_Count        0x0013   105   100   030    Pre-fail  Always       -       0
    180 Unknown_HDD_Attribute   0x003b   100   100   001    Pre-fail  Always       -       0
    194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       34 (Min/Max 8/58)
    196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%     41109         -
    # 2  Short offline       Completed without error       00%     41038         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

Begin the **long** test:

    smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-104-generic] (local build)
    Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
    Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 532 minutes for test to complete.
    Test will complete after Mon Mar 14 01:17:34 2022 UTC
    Use smartctl -X to abort test.


Check the test status - test is in progress (time **Sun Mar 13 16:26:05 2022 UTC**):

    smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-104-generic] (local build)
    Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Device Model:     MB4000GDUPB
    Serial Number:    26F5K1J3F17A
    LU WWN Device Id: 5 000039 6db900727
    Firmware Version: HPG3
    User Capacity:    4,000,787,030,016 bytes [4.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Sun Mar 13 16:26:05 2022 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x82)	Offline data collection activity
    					was completed without error.
    					Auto Offline Data Collection: Enabled.
    Self-test execution status:      ( 249)	Self-test routine in progress...
    					90% of test remaining.
    Total time to complete Offline 
    data collection: 		(  120) seconds.
    Offline data collection
    capabilities: 			 (0x7b) SMART execute Offline immediate.
    					Auto Offline data collection on/off support.
    					Suspend Offline collection upon new
    					command.
    					Offline surface scan supported.
    					Self-test supported.
    					Conveyance Self-test supported.
    					Selective Self-test supported.
    SMART capabilities:            (0x0003)	Saves SMART data before entering
    					power-saving mode.
    					Supports SMART auto save timer.
    Error logging capability:        (0x01)	Error logging supported.
    					General Purpose Logging supported.
    Short self-test routine 
    recommended polling time: 	 (   2) minutes.
    Extended self-test routine
    recommended polling time: 	 ( 532) minutes.
    Conveyance self-test routine
    recommended polling time: 	 (   2) minutes.
    SCT capabilities: 	       (0x0025)	SCT Status supported.
    					SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   100   100   050    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0007   100   100   050    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   100   100   002    Pre-fail  Always       -       11957
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   100   100   050    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
      9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       41134
     10 Spin_Retry_Count        0x0013   105   100   030    Pre-fail  Always       -       0
    180 Unknown_HDD_Attribute   0x003b   100   100   001    Pre-fail  Always       -       0
    194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       36 (Min/Max 8/58)
    196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Self-test routine in progress 90%     41134         -
    # 2  Extended offline    Completed without error       00%     41109         -
    # 3  Short offline       Completed without error       00%     41038         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.


Check the test progress again - **no tests running** (time **Sun Mar 13 16:26:46 2022 UTC**):

    smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-104-generic] (local build)
    Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Device Model:     MB4000GDUPB
    Serial Number:    26F5K1J3F17A
    LU WWN Device Id: 5 000039 6db900727
    Firmware Version: HPG3
    User Capacity:    4,000,787,030,016 bytes [4.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Sun Mar 13 16:26:46 2022 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x82)	Offline data collection activity
    					was completed without error.
    					Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)	The previous self-test routine completed
    					without error or no self-test has ever 
    					been run.
    Total time to complete Offline 
    data collection: 		(  120) seconds.
    Offline data collection
    capabilities: 			 (0x7b) SMART execute Offline immediate.
    					Auto Offline data collection on/off support.
    					Suspend Offline collection upon new
    					command.
    					Offline surface scan supported.
    					Self-test supported.
    					Conveyance Self-test supported.
    					Selective Self-test supported.
    SMART capabilities:            (0x0003)	Saves SMART data before entering
    					power-saving mode.
    					Supports SMART auto save timer.
    Error logging capability:        (0x01)	Error logging supported.
    					General Purpose Logging supported.
    Short self-test routine 
    recommended polling time: 	 (   2) minutes.
    Extended self-test routine
    recommended polling time: 	 ( 532) minutes.
    Conveyance self-test routine
    recommended polling time: 	 (   2) minutes.
    SCT capabilities: 	       (0x0025)	SCT Status supported.
    					SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   100   100   050    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0007   100   100   050    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   100   100   002    Pre-fail  Always       -       858
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   100   100   050    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
      9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       41134
     10 Spin_Retry_Count        0x0013   105   100   030    Pre-fail  Always       -       0
    180 Unknown_HDD_Attribute   0x003b   100   100   001    Pre-fail  Always       -       0
    194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 8/58)
    196 Reallocated_Event_Count 0x0033   100   100   010    Pre-fail  Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%     41109         -
    # 2  Short offline       Completed without error       00%     41038         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

In short:
- Sun Mar 13 16:25:12 2022 UTC: no tests running.
- Sun Mar 13 16:26:05 2022 UTC: long test running.
- Sun Mar 13 16:26:46 2022 UTC: no tests running, no test results recorded.

How can I find out **why those tests get cancelled**? Are any logs available?
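
To pin down the moment the test disappears more precisely than with manual checks, the number in parentheses on the "Self-test execution status" line can be logged in a loop. That number is the status byte: the high four bits are the state (15 while a test is running) and the low four bits are the remaining share in steps of 10%, so 249 above reads as "in progress, 90% remaining". A minimal sketch, with made-up function names, that reads smartctl output on stdin and so does not need the drive to try out:

```shell
#!/bin/sh
# Extract the decimal status byte from the "Self-test execution status"
# line of `smartctl -c` or `smartctl -a` output read on stdin.
selftest_status_byte() {
    sed -n 's/^[[:space:]]*Self-test execution status:[[:space:]]*([[:space:]]*\([0-9][0-9]*\)).*/\1/p'
}

# Split the byte into its two halves:
#   high nibble = state (15 means a self-test is in progress)
#   low nibble  = tenths of the test still remaining
decode_selftest_status() {
    code=$1
    echo "state=$((code >> 4)) remaining=$(( (code & 15) * 10 ))%"
}
```

Running something like `smartctl -c /dev/sdX | selftest_status_byte` every few seconds alongside `date -u` (with `/dev/sdX` standing in for the real device) would show when the byte drops from 249 back to 0.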


                                

Andriy (111 rep)

Mar 13, 2022, 05:02 PM

0 votes

1 answers

3897 views

Is my hard drive failing? / Need help with smartctl -a output

hard-disk disk storage smartctl smartmontools

I have an old Seagate 4TB internal drive from a crapped out pc that I was planning to repurpose as a spare drive for gaming. Figured I'd run some smartctl scans on it first just to be safe so I did `smartctl -t short /dev/sdb` and got back results. They looked ok to me bc I didn't see anything listed in the 'WHEN_FAILED' column (and originally I had been mostly concerned with the temperature related errors).

But then I saw [an article from 2018](https://harddrivegeek.com/current-pending-sector-count/) mentioning that 'Current_Pending_Sector' is pretty serious... And mine is not zero... And I did have some errors besides... Since I can't really make sense of whether or not to be concerned about them, I figured I'd try SE.

My best guess so far is that I shouldn't put anything critical on it but that it might be fine to use for games if I symlink the save folders so they exist somewhere else (on a drive with better smart results) and don't mind re-downloading the installed game in the event of drive failure.

Also not sure if the 'READ DMA EXT' errors are indications of imminent failure or if that could be a cable or other one-time event (I can only see errors 35-39 and they all occurred at "16936 hours"... not sure if there's a way to see all of the errors or if literally only the last 5 are stored like it says). OTOH, I didn't have any issues mounting it or copying data off it (it was a relative's and they didn't want it anymore; just some pics/videos off it).

If there's at least decent odds that the drive might have some life left, I don't mind chancing it for less important stuff. But if it is highly likely to fail in the near future, I'd prefer not to waste any time with it for anything but acquiring a new magnet :-) Any advice / recommendations?

Anyway, I reran with `smartctl -t long /dev/sdb`, waited till the next day and ran `smartctl -a /dev/sdb`. Here are the results for that:

I_AM_ROOT@fedora35:~
# smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.7-200.fc35.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    
LU WWN Device Id: 
Firmware Version: CC54
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Dec 17 11:44:49 2021 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 118)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(  168) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 528) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       233492808
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1890
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   044   039   030    Pre-fail  Always       -       678608011490
  9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       30836
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1206
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   061   061   000    Old_age   Always       -       39
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   058   045    Old_age   Always       -       29 (Min/Max 27/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       304
193 Load_Cycle_Count        0x0032   084   084   000    Old_age   Always       -       32204
194 Temperature_Celsius     0x0022   029   042   000    Old_age   Always       -       29 (0 12 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       23293h+16m+41.533s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       19236444339
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       27220280383

SMART Error Log Version: 1
ATA Error Count: 39 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 39 occurred at disk power-on lifetime: 16936 hours (705 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      04:59:22.764  READ DMA EXT
  25 00 40 ff ff ff ef 00      04:59:22.762  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:59:22.736  READ DMA EXT
  25 00 08 ff ff ff ef 00      04:59:22.735  READ DMA EXT
  ef 10 02 00 00 00 a0 00      04:59:22.735  SET FEATURES [Enable SATA feature]

Error 38 occurred at disk power-on lifetime: 16936 hours (705 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      04:59:18.709  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:59:18.696  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:59:18.693  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:59:18.631  READ DMA EXT
  25 00 08 ff ff ff ef 00      04:59:18.631  READ DMA EXT

Error 37 occurred at disk power-on lifetime: 16936 hours (705 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      04:57:53.914  READ DMA EXT
  25 00 08 ff ff ff ef 00      04:57:53.914  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:57:53.882  READ DMA EXT
  ef 10 02 00 00 00 a0 00      04:57:53.881  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      04:57:53.881  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 36 occurred at disk power-on lifetime: 16936 hours (705 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      04:57:49.903  READ DMA EXT
  25 00 08 ff ff ff ef 00      04:57:49.903  READ DMA EXT
  25 00 08 ff ff ff ef 00      04:57:49.903  READ DMA EXT
  25 00 08 ff ff ff ef 00      04:57:49.903  READ DMA EXT
  25 00 08 ff ff ff ef 00      04:57:49.903  READ DMA EXT

Error 35 occurred at disk power-on lifetime: 16936 hours (705 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00      04:57:45.210  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:57:45.181  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:57:45.179  READ DMA EXT
  25 00 00 ff ff ff ef 00      04:57:45.178  READ DMA EXT
  25 00 58 ff ff ff ef 00      04:57:45.149  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%     30817         3723785408
# 2  Short offline       Completed without error       00%     30812         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

zpangwin (1061 rep)

Dec 17, 2021, 05:15 PM • Last activity: Dec 17, 2021, 07:10 PM

2 votes

1 answers

5551 views

Best practices to enable SMART disk notifications on a Linux workstation?

linux smartctl smart smartmontools

I enabled SMART notifications on my laptop running Debian. Basically, I just want a notification to pop up when something goes wrong on a disk. I don't want an email; I think a notification is better suited to the workstation where I spend my days (while emails are of course better for servers). It works, and I even tested it (but what exactly did I test?), but I still have doubts about whether I did it the right way and whether what I did is really useful.

Basically, here is what I did:

1. I installed smartmontools and smart-notifier:

# apt-get install smartmontools smart-notifier

2. I then configured the smartd daemon to monitor /dev/sda and send its messages to the notifier. This is done in /etc/smartd.conf, which contains only one line:

/dev/sda -a -m myUsername -M exec /usr/share/smartmontools/smartd-runner -M test

3. The -M test option in the previous line displays a test notification popup as soon as I restart the smartd daemon (you have to log out and log back in for it to work). And it works: restarting the smartd daemon displays the test notification popup.

4. Finally, I removed the -M test option and restarted smartd again.

So, can I be at ease now? Will this setup send me a popup as soon as something goes wrong with /dev/sda? I have a lot of unanswered questions:

1. With the -M test option, the test notification popup is only displayed when I restart smartd. Nothing is displayed when I restart my laptop and log in (probably because smartd is already running at that point). Can I be confident that a notification will pop up if smartd detects something wrong on my disks?

2. What event exactly will trigger that popup? In other words, what is "something wrong"? man smartd states that:

> smartd will attempt to enable SMART monitoring on ATA devices (equivalent to smartctl -s on) and polls these and SCSI devices every 30 minutes (configurable), logging SMART errors and changes of SMART Attributes via the SYSLOG interface.

And indeed, checking /var/log/syslog, I can see a log entry after 30 minutes (last line):

Jul 30 13:17:06 precision7520 smartd: smartd 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.5-amd64] (local build)
Jul 30 13:17:06 precision7520 smartd: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Jul 30 13:17:06 precision7520 smartd: Opened configuration file /etc/smartd.conf
Jul 30 13:17:06 precision7520 smartd: Configuration file /etc/smartd.conf parsed.
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda, type changed from 'scsi' to 'sat'
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], opened
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], Samsung SSD 850 EVO 2TB, S/N:S2RMNB0J801642K, WWN:5-002538-c407b1fd2, FW:EMT02B6Q, 2.00 TB
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], not found in smartd database.
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_EVO_2TB-S2RMNB0J801642K.ata.state
Jul 30 13:17:06 precision7520 smartd: Monitoring 1 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
Jul 30 13:17:06 precision7520 smartd: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_EVO_2TB-S2RMNB0J801642K.ata.state


Jul 30 13:47:06 precision7520 smartd: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68

But no popup. Maybe because the log entry was just minor information (a 1-degree temperature change)? But then, what kind of event exactly is supposed to trigger the notification?

3. Finally, there are a lot of examples in /etc/smartd.conf, with even more in man smartd.conf, some scheduling (-s) short (-s S) or extended (-s L) self-tests at given intervals. Are those self-tests necessary? Isn't SMART supposed to integrate its own self-test procedures (the SM of SMART stands for Self-Monitoring)? How useful are the results without performing self-tests?

For information, here are my smartctl results for /dev/sda:

$ sudo smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.5-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 EVO 2TB
Serial Number:    S2RMNB0J801642K
LU WWN Device Id: 5 002538 c407b1fd2
Firmware Version: EMT02B6Q
User Capacity:    2 000 398 934 016 bytes [2,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul 30 14:15:22 2021 WAT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
(...)

No self-test seems to have ever been performed:

(...)
General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 265) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.
(...)

Is this data of any use, even without self-tests?

(...)
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       27805
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1055
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       21
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   067   043   000    Old_age   Always       -       33
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       71
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       26330052507

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     14903         -
# 2  Short offline       Completed without error       00%     14709         -
# 3  Short offline       Aborted by host               70%      2733         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

A lot of questions, but basically they all boil down to one: what are the best practices to enable SMART disk notifications on a Linux workstation? I was kind of surprised that googling this question didn't turn up any useful information.
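One check that does not depend on self-tests can be read straight from the attribute table above: according to the smartctl documentation, an attribute counts as failed when its normalized VALUE is less than or equal to its THRESH. The sketch below only illustrates that rule; it is not what smartd runs internally. The function name failing_attributes is made up for this example, and it assumes the ten-column layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read a `smartctl -A` attribute table on stdin and
# print every attribute whose normalized VALUE is at or below its THRESH.
# Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value  = $4 + 0    # normalized value, e.g. 100
        thresh = $6 + 0    # vendor threshold, e.g. 010
        if (value <= thresh)
            print $1, $2, "VALUE=" $4, "THRESH=" $6
    }'
}
```

Usage would look like `sudo smartctl -A /dev/sda | failing_attributes`. With the table quoted above it prints nothing, because every VALUE is still above its THRESH.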

ChennyStar (1969 rep)

Jul 30, 2021, 01:29 PM • Last activity: Jul 31, 2021, 02:39 AM

0 votes

1 answers

386 views

NAS server smartd.conf assistance

linux backup nas smartctl smartmontools

I want to perform the following using smartd: Run a short smartctl test once a week. Run a long smartctl test once a month. Get the results of each run by mail. I tried to read the 'man' page for both smartd and smartd.conf (https://linux.die.net/man/5/smartd.conf), but I just can't seem to understand...

#
# Three disks connected to a MegaRAID controller
# Start short self-tests daily between 1-2, 2-3, and
# 3-4 am.
  /dev/sda -d megaraid,0 -a -s S/../.././01
  /dev/sda -d megaraid,1 -a -s S/../.././02
  /dev/sda -d megaraid,2 -a -s S/../.././03

That doesn't make any sense to me, and I can't understand how to apply that to my use case. Help would be appreciated. Thanks.
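For context, smartd.conf(5) describes the argument of -s as an extended regular expression that is matched against a string of the form T/MM/DD/d/HH, where T is the test type (S for short, L for long), d is the day of the week (1 = Monday to 7 = Sunday) and HH is the hour. The sketch below builds that string for a given date with GNU date and checks whether a schedule would fire. The function names and the example regex (short test on Sundays at 02:00, long test on the 1st of each month at 03:00) are illustrative assumptions, and anchoring the regex to the whole string is assumed here rather than taken from the man page.

```shell
#!/bin/sh
# Build the T/MM/DD/d/HH string for a given test type, date and hour.
# Assumes GNU date (for -d); -u keeps the result independent of the local timezone.
schedule_stamp() {
    # $1 = test type (S or L), $2 = date as YYYY-MM-DD, $3 = hour as HH
    date -u -d "$2 $3:00" "+$1/%m/%d/%u/%H"
}

# Succeed if the -s regex in $1 matches the stamp for type/date/hour in $2..$4.
would_run() {
    schedule_stamp "$2" "$3" "$4" | grep -Eq "^($1)\$"
}

# Hypothetical schedule: weekly short test, monthly long test.
regex='(S/../../7/02|L/../01/./03)'
would_run "$regex" S 2021-03-07 02 && echo "short test would start"
would_run "$regex" L 2021-04-01 03 && echo "long test would start"
```

If the regex behaves as wanted, the corresponding smartd.conf line would look something like `/dev/sda -a -s (S/../../7/02|L/../01/./03)`; the mail side is handled by the -m directive, and smartd.conf(5) lists which events actually produce a mail.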

displainame (3 rep)

Mar 11, 2021, 11:33 PM • Last activity: Mar 12, 2021, 12:05 AM

1 votes

1 answers

743 views

Can I determine the real lifetime with “Error occurred at power lifetime: 19132h” and "Power_On_Hours 0h" in smartctl?

hard-disk external-hdd smartctl smart smartmontools

I just bought a "new" and very cheap HDD online and connected it to my PC through a USB 3.0 HDD enclosure. Running smartctl, I see the following output:

9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       0

Error 4 occurred at disk power-on lifetime: 19132 hours

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xb0)       Completed without error       00%     47354         -
# 2  Vendor (0x71)       Completed without error       00%     47354         -

The complete output: https://hastebin.com/zafejecopu.yaml

What do those errors mean? Can I determine the real lifetime value from the smartctl output? Thanks a lot.
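Since smartctl reports every lifetime figure in hours, the values quoted above are easier to compare once converted. A minimal sketch (the function name hours_to_days is made up for this example) that prints the same "days + hours" breakdown smartctl uses in its own error log:

```shell
#!/bin/sh
# Convert a power-on-hours figure into whole days plus leftover hours.
hours_to_days() {
    hours=$1
    echo "$((hours / 24)) days + $((hours % 24)) hours"
}

hours_to_days 19132   # lifetime recorded in the error log: 797 days + 4 hours
hours_to_days 47354   # lifetime recorded in the self-test log: 1973 days + 2 hours
```

Both figures are far from the 0 shown for Power_On_Hours, which is the discrepancy the question is about.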

sgon00 (457 rep)

Nov 16, 2020, 07:05 AM • Last activity: Feb 24, 2021, 10:59 AM

1 votes

1 answers

642 views

Can I recover a 500GB Seagate Momentus with bad sectors?

partition hard-disk data-recovery badblocks smartmontools

I've received a Seagate 2.5" 5400rpm 500gb HDD that was throwing up a Boot configuration error post some windows updates. I've tried the following on it and nothing seems to work: **First step**: I tried Windows repair to re-install the bootloader but the installer wouldn't interact with this partic...

to recover data from it, which I did successfully (the disk was connected to my laptop through a SATA-to-USB adapter). **Third Step**: I tried formatting the HDD with

but it threw a read error when trying to create the partition table and exited. I was able to delete the old NTFS/FAT32 partitions but not able to create new ones. **Fourth Step**: I started the Windows installer and tried formatting from there, but it threw an error saying it cannot format the disk (again with the HDD in its original machine). After this, things get weird. Sometimes my laptop would recognize the HDD, other times not. **Fifth Step**: I checked the disk with

and it did show read errors in some sectors. I tried to write zeroes to those sectors, which seemed to work, but I'm not sure it did. I tried partitioning the disk, but now

would throw

: cannot open /dev/sdb: Input/output error

. **Sixth Step**: I tried partitioning with

which would open the disk, and it threw some errors, which I chose to ignore. After about 8 ignores for the partition table and some more for the actual partition (which I set to take up the entire disk space, from 1 to 500G), the following happens:

now sometimes shows disk

with partition

sometimes it only shows the disk, sometimes not at all.

now sometimes shows data/executes test, but more often it throws

Device Identity failed: scsi error unsupported scsi opcode

-I /dev/sdb

throws

/dev/sdb: HDIO_DRIVE_CMD(identify) failed: Invalid argument

throws

Bad magic number in super-block while trying to open /dev/sdb1

If I disconnect the disk and reconnect it, I have to recreate a partition table and re-partition it with

log when connecting the drive:

[ 4708.480583] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4708.480592] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4708.480598] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4708.480603] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[ 4708.480610] blk_update_request: critical medium error, dev sdb, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 4708.480617] buffer_io_error: 6 callbacks suppressed
[ 4708.480620] Buffer I/O error on dev sdb, logical block 0, async page read
[ 4708.843190] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4708.843199] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4708.843204] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4708.843210] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 01 00 00 07 00
[ 4708.843216] blk_update_request: critical medium error, dev sdb, sector 1 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 4708.843223] Buffer I/O error on dev sdb, logical block 1, async page read
[ 4708.843229] Buffer I/O error on dev sdb, logical block 2, async page read
[ 4708.843232] Buffer I/O error on dev sdb, logical block 3, async page read
[ 4708.843235] Buffer I/O error on dev sdb, logical block 4, async page read
[ 4708.843238] Buffer I/O error on dev sdb, logical block 5, async page read
[ 4708.843240] Buffer I/O error on dev sdb, logical block 6, async page read
[ 4708.843244] Buffer I/O error on dev sdb, logical block 7, async page read
[ 4708.976204] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4708.976212] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4708.976217] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4708.976223] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[ 4708.976228] blk_update_request: critical medium error, dev sdb, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 4708.976235] Buffer I/O error on dev sdb, logical block 0, async page read
[ 4709.153850] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4709.153860] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4709.153865] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4709.153871] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 01 00 00 07 00
[ 4709.153877] blk_update_request: critical medium error, dev sdb, sector 1 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 4709.153885] Buffer I/O error on dev sdb, logical block 1, async page read
[ 4709.320307] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4709.320316] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4709.320321] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4709.320327] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[ 4709.320333] blk_update_request: critical medium error, dev sdb, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 4709.486795] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4709.486803] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4709.486809] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4709.486814] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 01 00 00 07 00
[ 4709.486820] blk_update_request: critical medium error, dev sdb, sector 1 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 4709.488688] audit: type=1106 audit(1606925626.528:133): pid=2818 uid=0 auid=1000 ses=1 msg='op=PAM:session_close grantors=pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/2 res=success'
[ 4709.489637] audit: type=1104 audit(1606925626.528:134): pid=2818 uid=0 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/2 res=success'
[ 4709.653391] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4709.653395] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4709.653398] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4709.653400] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[ 4709.653403] blk_update_request: critical medium error, dev sdb, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 4709.831007] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4709.831011] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4709.831013] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4709.831016] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 01 00 00 07 00
[ 4709.831018] blk_update_request: critical medium error, dev sdb, sector 1 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 4709.997153] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4709.997162] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4709.997167] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4709.997173] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[ 4709.997179] blk_update_request: critical medium error, dev sdb, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 4710.174596] sd 3:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 4710.174599] sd 3:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[ 4710.174602] sd 3:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
[ 4710.174604] sd 3:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 00 01 00 00 07 00
[ 4710.174607] blk_update_request: critical medium error, dev sdb, sector 1 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 0
[ 4710.174653] ldm_validate_partition_table(): Disk read failed.
[ 4711.206717]  sdb: unable to read partition table

I'm thinking I could write zeroes to the entire HDD, but I'm not sure it would help. Is there any way to recover this HDD?
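To see at a glance which sectors the kernel is complaining about, the log above can be condensed. The following is a rough sketch that assumes the "critical medium error, dev X, sector N" wording shown in the log; the function name failing_sectors is made up for this example.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "device sector count" for every
# sector reported in a "critical medium error" line.
failing_sectors() {
    awk '/critical medium error/ {
        dev = ""; sec = ""
        for (i = 1; i < NF; i++) {
            if ($i == "dev")    dev = $(i + 1)
            if ($i == "sector") sec = $(i + 1)
        }
        sub(/,$/, "", dev)        # "sdb," -> "sdb"
        count[dev " " sec]++
    }
    END { for (k in count) print k, count[k] }' | sort -k1,1 -k2,2n
}
```

Usage would be along the lines of `sudo dmesg | failing_sectors`. In the log above every medium error is reported at sector 0 or sector 1 of sdb, which is where the partition table is read from.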

fluffehStack (125 rep)

Dec 2, 2020, 04:35 PM • Last activity: Dec 2, 2020, 06:07 PM

0 votes

1 answers

201 views

Bind smartd to user session

systemd notify-send smartmontools

I want to see smartd notifications in my desktop environment (GNOME 3), so I've configured smartd to execute a custom script that uses notify-send to notify all logged-in users.

**smartd.conf**:

/dev/sda -m root -M test -M exec /etc/smartmontools/smartd_warning.d/notify -a -n standby,10,q

**smartd_warning.d/notify**:

#!/usr/bin/env sh

IFS=$'\n'
for LINE in $(w -hs)
do
    USER=$(echo $LINE | awk '{print $1}')
    USER_ID=$(id -u $USER)
    DISP_ID=$(echo $LINE | awk '{print $8}')
    sudo -u $USER DISPLAY=$DISP_ID DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$USER_ID/bus notify-send "S.M.A.R.T Error ($SMARTD_FAILTYPE)" "$SMARTD_MESSAGE" --icon=dialog-warning
done

It works correctly only if I restart smartd after I have logged in. Obviously it can't work at boot, because smartd starts before any user has logged in.

[Unit]
Description=Self Monitoring and Reporting Technology (SMART) Daemon
Documentation=man:smartd(8) man:smartd.conf(5)

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/smartmontools
ExecStart=/usr/sbin/smartd -n $smartd_opts
ExecReload=/bin/kill -HUP $MAINPID
StandardOutput=syslog

[Install]
WantedBy=multi-user.target

How can I bind the smartd service to the user session to see those notifications?

Evan (101 rep)

Jun 13, 2020, 05:51 PM • Last activity: Jun 13, 2020, 10:48 PM

1 votes

0 answers

992 views

How to check for file system consistency after power outage

filesystems fsck e2fsck smartmontools

What can I do to check whether a file system (the data in its files and the hardware) is intact or corrupt after a computer is shut down abruptly due to a power outage?

My home desktop computer was shut down by a sudden power outage. The computer automatically rebooted itself after the power came back, and I then shut it down manually in the regular way. The computer runs Ubuntu 18.10, Linux 4.18.0. It has an SSD and an HDD, where the SSD holds the boot, root, and all the essential partitions, and the HDD holds one partition for data files. I think all the file systems are ext4. I want to determine whether any file was corrupted, and whether the SSD or HDD suffered physical damage.

I think I can use smartmontools to check for physical damage.
I have no idea how to check data integrity. It looks like fsck can do some checks on the file system, but it seems I need to unmount the partition to inspect it. How can I run fsck to inspect the SSD? Can I use a USB boot stick which has Ubuntu on it?

I would appreciate any pointers.

----
**Edit**

The 'duplicate question' link nominally answered my question, and I am closing this question.

Among the several methods suggested in the link and the comments to this question, what I actually did is the following. 

I booted the computer, and ran

$ sudo tune2fs -c 1 /dev/sda4

where /dev/sda4 is the SSD. Then, I rebooted. The system started up, showed something about the disk check for a few seconds, and presented the normal log-in screen.

I also did

$ sudo tune2fs -c 1 /dev/sdb1

for the HDD. Upon reboot, the startup screen showed a progress indicator for about a minute, and then the normal log-in screen came up.
I'm not really sure whether there was no error, a problem was fixed silently, or there was an error, but I assume the lack of an explicit warning indicates that there was no error.
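One way to look for a more explicit answer is the superblock summary printed by tune2fs -l, which includes a "Filesystem state" line along with the mount count and last-checked fields. A small sketch, assuming that output format; fs_is_clean is a made-up helper name:

```shell
#!/bin/sh
# Read `tune2fs -l` output on stdin and succeed only if the superblock
# reports the state as exactly "clean" (not e.g. "clean with errors").
fs_is_clean() {
    awk -F': *' '/^Filesystem state:/ { state = $2 }
                 END { exit (state == "clean" ? 0 : 1) }'
}
```

It could be used as `sudo tune2fs -l /dev/sda4 | fs_is_clean && echo "marked clean"`. This only reflects what the last check recorded in the superblock; it says nothing about the contents of individual files.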

Thank you for the comments and the link.

norio (225 rep)

Sep 28, 2019, 02:53 AM • Last activity: Sep 29, 2019, 01:03 AM
