
How to check the health of your server's hard disks


It is a good idea to check your server's disks regularly. With several disk types now in use (spinning HDDs, SSDs and NVMe disks), there are different ways to check each of them. This guide is intended to give you indications only; the line between a good disk and a bad disk is not black and white.

NVMe #

First, get the list of disks with:

# nvme list

If you do not have this installed, use your package manager to install ‘nvme-cli’. You will also need ‘smartmontools’, which provides the smartctl command.

You will see a list of NVMe disks:

# nvme list

Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S435NA0R200095       SAMSUNG MZ1LB960HAJQ-00007               1           0.00   B / 960.20  GB    512   B +  0 B   EDA7602Q


Now that we know the disk is nvme0, we can check its health:

# smartctl -a /dev/nvme0 | grep Used

Percentage Used:                    0%

The closer this value gets to 100%, the closer the disk is to the end of its predicted life. It is, however, VERY important to keep in mind that a disk can be at 100% and still show no issues: this is simply a vendor estimate of how much of the disk's rated life has been used. If you are also getting filesystem issues and slow performance, the disk may be past its best days.
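If you want to check this value from a script, the following is a minimal sketch rather than a definitive implementation. It assumes the ‘Percentage Used:’ line looks like the one shown above; the function names and the 90% warning threshold are hypothetical and only for illustration.

```shell
#!/bin/sh
# Reads `smartctl -a` output on stdin and prints the NVMe
# "Percentage Used" value as a bare number (no % sign).
nvme_percentage_used() {
    awk -F: '/^Percentage Used:/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Reads `smartctl -a` output on stdin and prints ok, warn or unknown.
# The threshold of 90 is an assumption, not a vendor recommendation.
nvme_wear_status() {
    used=$(nvme_percentage_used)
    if [ -z "$used" ]; then
        echo "unknown"
    elif [ "$used" -ge 90 ]; then
        echo "warn"
    else
        echo "ok"
    fi
}
```

For example, as root: smartctl -a /dev/nvme0 | nvme_wear_status. A "warn" here is only a prompt to look closer, since a disk at a high percentage can still be working fine.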

 

SSD #

With SSDs you can use the regular SMART self-tests. Assuming that the disk is /dev/sda, you can run a short or a long test as follows:

# smartctl -t short -a /dev/sda (This is a short self-test)

# smartctl -t long -a /dev/sda (This is a long self-test)

The test runs in the background. Once it has finished, you can check the disk health with:

# smartctl -a /dev/sda

You are then looking for any specific errors logged at the end of the output. These will be very obvious and titled as errors. If you find any, please copy and paste them into a ticket for us to review.

To see the overall health, check the ‘Wear_Leveling_Count’ line (the exact attribute name varies between vendors). Its value starts at 100 and decreases towards 0. Again, this is simply a vendor estimate of the percentage of the disk's life remaining. If you are also getting filesystem issues and slow performance, the disk may be past its best days.
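To pull this value out of the smartctl output in a script, something like the sketch below may help. It assumes the usual smartctl attribute table layout, where the attribute name is the second column and the normalised VALUE is the fourth; the function name is hypothetical.

```shell
#!/bin/sh
# Reads `smartctl -a` output on stdin and prints the normalised VALUE
# column (4th field) for the attribute named in $1, e.g. 099 becomes 99.
# Prints nothing if the attribute is not present.
smart_attr_value() {
    awk -v name="$1" '$2 == name { print $4 + 0; exit }'
}
```

For example, as root: smartctl -a /dev/sda | smart_attr_value Wear_Leveling_Count. If nothing is printed, your disk probably reports wear under a different attribute name.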

 

HDD #

With spinning disks, let us again assume that the disk is /dev/sda. You can run a short or a long test as follows:

# smartctl -t short -a /dev/sda (This is a short self-test)

# smartctl -t long -a /dev/sda (This is a long self-test)

The test runs in the background. Once it has finished, you can check the disk health with:

# smartctl -a /dev/sda

You are then looking for any specific errors logged at the end of the output. These will be very obvious and titled as errors. If you find any, please copy and paste them into a ticket for us to review.

With spinning disks, there are two important lines: ‘Reallocated_Sector_Ct’ and ‘Current_Pending_Sector’. A non-zero reallocated sector count means the disk has had bad sectors but has dealt with them itself. This is not terrible if the count is either low (under 1000) or static and not growing over time. If, however, Current_Pending_Sector stays above zero for more than 24 hours, the disk has problems it cannot deal with itself, and this is critical.
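The two checks above can be sketched as a small script. This is an illustration under assumptions, not a definitive health check: it assumes the usual smartctl attribute table layout (attribute name in the second column, raw value in the tenth), and the function names and status words are hypothetical. A single run cannot tell how long sectors have been pending, so it only reports "pending" and leaves the 24-hour recheck to you.

```shell
#!/bin/sh
# Reads `smartctl -a` output on stdin and prints the RAW_VALUE column
# (10th field) for the attribute named in $1.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10 + 0; exit }'
}

# Reads `smartctl -a` output on stdin and prints one of:
#   pending  - Current_Pending_Sector is above zero; check again later
#   warn     - Reallocated_Sector_Ct is 1000 or more
#   ok       - neither of the above
hdd_sector_status() {
    report=$(cat)
    realloc=$(printf '%s\n' "$report" | smart_attr_raw Reallocated_Sector_Ct)
    pending=$(printf '%s\n' "$report" | smart_attr_raw Current_Pending_Sector)
    # Missing attributes are treated as 0.
    if [ "${pending:-0}" -gt 0 ]; then
        echo "pending"
    elif [ "${realloc:-0}" -ge 1000 ]; then
        echo "warn"
    else
        echo "ok"
    fi
}
```

For example, as root: smartctl -a /dev/sda | hdd_sector_status. Running it daily and comparing the results is one way to see whether the reallocated count is growing or pending sectors are persisting.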