Part 1 of this article, What Are Bad Sectors And How Can You Fix Them? [Part 1], looked at drive hardware and how the controller works behind the scenes to detect and resolve bad sectors during normal operation.

In this conclusion to that discussion, we will look at the tools available from the operating system, drive manufacturers, and other third parties you can use to manage and monitor your drives to keep them as healthy as possible.

Disclaimer: Before running any of the commands in this article, make sure you have a good backup of the drive (see The PC Backup & Restore Guide), since attempts to repair bad sectors may result in corruption of the file system. This means it is possible to lose portions of data files or the metadata that is used to locate files and directories on the volume. Manufacturer and third-party tools can be just as catastrophic as the operating system utilities. This is especially important for utilities that bypass operating system protections and access the drive directly, which is exactly what many of these tools do.

Scanning a Disk for Bad Sectors

Every operating system has tools to scan a disk for bad sectors. Some are automatically invoked during startup if the computer detects an improper shutdown. For example, Windows maintains a “dirty bit” in the Master File Table (MFT) on NTFS volumes or the File Allocation Table (FAT) on FAT16/32 drives.

During boot, the autochk program looks for this flag and, if it is set, runs an abbreviated version of the checks performed by chkdsk on all flagged volumes. A similar process is used by other modern operating systems.
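If you are curious whether the dirty bit is currently set on a Windows volume, fsutil can query it directly (C: here is just an example drive letter; run it from an elevated command prompt):

    fsutil dirty query C: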

Windows

For those who are not afraid of the Windows command line, chkdsk /r or chkdsk /b can be run at any time to look for bad sectors. Either one runs other tests first to verify the consistency of the drive’s metadata before the optional bad sector pass. Depending on the size of the volume in question and the number of directories and files, it can take quite some time to complete. The difference between the two commands is that chkdsk /b also re-evaluates clusters that are already flagged as bad by the operating system.
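As a quick sketch (assuming the volume to scan is D: and you are at an elevated command prompt):

    chkdsk D: /r
    chkdsk D: /b

Note that /b is only valid on NTFS volumes and includes the functionality of /r.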

Windows also has a GUI tool that can be used to perform the same checks. It can be accessed by opening Windows Explorer > right-click the drive to check > Properties > Tools tab > Check now… > check “Scan for and attempt recovery of bad sectors” > Start.

[Screenshot: Windows disk check dialog]

No matter which one you choose, if you are scanning a system or boot drive, it will require exclusive access to the volume and ask you if you want to schedule the scan on the next restart. If it is not a system drive, the scan should begin immediately unless another process has already locked it for exclusive access.

This tool does not mark individual sectors bad; it marks the entire cluster as bad in the MFT or FAT and relocates its contents to an unused cluster elsewhere on the drive. This can happen if the drive hardware cannot remap the bad sector for any reason, such as when its spare sector pool has been exhausted.

Linux

Although the badblocks program can be used to search for bad blocks (sectors) on a disk partition on Linux systems, I recommend you use e2fsck -c (or the appropriate fsck variant for the filesystem you are using) instead. This ensures that the proper parameters are passed to the badblocks program.

Incorrect parameters can cause irreparable damage to the filesystem. The -c parameter performs a read-only test on the volume. If you want to use a non-destructive read-write test, you need to specify the -cc parameter instead.

[Screenshot: e2fsck run completing]

When using -c or -cc, the entire bad blocks list is rebuilt. If you wish to keep the existing entries in the list and merely append new blocks, add the -k (keep) option. If you suspect there has been damage to the drive itself and/or the filesystem, you may also want to add the -p (preen) option, which will attempt to automatically repair any damage and notify you if it cannot fix the errors it finds.
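As a minimal sketch, assuming the partition to check is /dev/sdb1 (substitute your own device) and it is not mounted:

    sudo umount /dev/sdb1          # the filesystem must not be mounted
    sudo e2fsck -c /dev/sdb1       # read-only badblocks scan
    sudo e2fsck -cc -k /dev/sdb1   # non-destructive read-write scan, keeping the existing list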

Manufacturer Tools

Drive manufacturers have their own diagnostic software that may be used to perform surface analysis and control features specific to their drives. Western Digital offers Data Lifeguard for Windows for its drives, while Seagate has SeaTools for Windows, which can be used to test Seagate, Maxtor, and Samsung drives.

Both offer options for testing and repairing their associated drives, but you need to be careful about which tests are destructive and which are non-destructive. In either case, you should still have a current backup before proceeding.

[Screenshot: Western Digital Data Lifeguard]

Third-party Tools

There are also third-party tools such as SpinRite from Gibson Research Corporation that access the drive below the operating system level in order to perform their magic. SpinRite bypasses the BIOS and interacts directly with the hard drive controller. It is primarily a data recovery tool but can also be used to perform surface analysis and verification before putting a new drive into service.

SpinRite does have its limitations. Because it runs on the FreeDOS operating system and uses 28-bit addressing to access the drive, it can only reach the first 2^28 (268,435,456) sectors. So a drive that uses 512-byte sectors will be limited to 128 GB, and a drive using 4K sectors will be restricted to 1 TB.
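Working out those limits:

    2^28 sectors × 512 bytes/sector   = 137,438,953,472 bytes ≈ 128 GiB
    2^28 sectors × 4,096 bytes/sector = 1,099,511,627,776 bytes = 1 TiB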

By setting it up on a bootable disk using the Windows 98 DOS 7 command interpreter, SpinRite 6 can theoretically test the entire drive.

Are Bad Sectors Repairable?

Physical defects from manufacturing, head crashes and most other faults detected by the hard disk controller generally cannot be repaired. Those that have been isolated by the operating system are another story.

[Image: hard drive repair]

Operating System Tools

It is sometimes possible to recover blocks or clusters that have been marked as bad by the operating system. Since a cluster normally spans several sectors and a single bad sector will get the entire cluster marked as bad, the other sectors in that cluster may still be perfectly usable, so it is occasionally possible to recover those clusters.

This is because the hard drive controller may not have dealt with the bad sector before the operating system had a problem with it. Remember, the drive generally doesn’t know something is wrong until it cannot read the sector, and it does not attempt to remap the sector unless there have been numerous failed reads or a write is attempted to that sector after a failed read.

If the hard drive controller has reallocated the bad sector after the operating system marked the containing cluster as bad, re-running the appropriate command to re-evaluate the bad blocks (chkdsk /b for Windows, e2fsck -cc for Linux – you must not use the -k option here since it would keep the current list of bad blocks) should clear it from the list.

SpinRite

SpinRite is one of the tools that claims to be able to recover weak sectors. Even with three decades of working with technology, this is something I am unwilling to trust. The sector was originally marked as bad by the drive controller (or the containing cluster was marked by the operating system) because data could not be reliably read from it. Even if its ability to retain data can be improved, the improvement is likely to be temporary, which should bring a couple of questions to mind:

  1. How temporary is this repair?
  2. Are you willing to trust your data to this sector?

Personally, this is one area where I am unwilling to tread. Much of my data is too important.

Monitoring Drive Status

One of the two best ways to protect the data you have stored on your drives – if you haven’t discerned it from previous comments – is to ensure you have implemented a reliable backup plan.

The other is using software to monitor the status of your drives. Modern hard drives include Self-Monitoring, Analysis and Reporting Technology (SMART) to help determine the health of the drive and predict failures.

Ubuntu, RedHat, and their derivatives have the Disks utility as part of their default installation. It allows you to access the most important SMART counters as well as run both the short and extended SMART tests. There are also command-line tools such as smartctl which can be used to automate checking and reporting of drive status.
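As an illustration of the command-line route, smartctl (part of the smartmontools package) can report drive health and run the self-tests; this assumes the drive is /dev/sda:

    sudo smartctl -H /dev/sda            # overall health assessment
    sudo smartctl -A /dev/sda            # full SMART attribute table
    sudo smartctl -t short /dev/sda      # start a short self-test
    sudo smartctl -l selftest /dev/sda   # review self-test results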

Windows does not supply this capability, so we need third-party tools such as CrystalDiskInfo and Hard Disk Sentinel to handle the job.

SMART Counters

Don’t freak out when you see the values reported by these tools. The threshold value is set by the manufacturer to indicate when an attribute should be considered a problem. The current normalized value will usually be higher than the worst recorded value, and for most counters this is expected. Normalized values range from 1 to 253 (although some manufacturers choose a starting point of 100 or 200 for some attributes) and decrease from that starting point over time; it isn’t necessarily a problem until a value drops below its threshold.
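As a purely illustrative example (the numbers are hypothetical): an attribute reported with VALUE 100, WORST 95, and THRESH 36 is healthy; the monitoring tool only flags a predicted failure if VALUE ever falls to 36 or below.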

[Screenshot: Hard Disk Sentinel reporting a failing drive]

No matter what tool you choose to use to monitor your storage devices, there is a short list of counters you should be concerned with provided your drive supports them:

  • Counter 5 (Reallocated Sectors Count) is a total count of sectors that have been reallocated and placed on the G-LIST since the drive was put into service. This does not include the sectors that were flagged at the factory. The raw data is a true count, so lower is better.
  • Counter 10 (Spin Retry Count) indicates how many times the drive had to retry spinning up to reach operational speed after an unsuccessful first attempt. Increases in this attribute indicate mechanical issues with the drive or a possible power problem.
  • Counter 187 (Reported Uncorrectable Errors) is the number of ECC errors that could not be fixed by the drive controller. Lower is better when looking at the raw value.
  • Counter 188 (Command Timeout) is the number of aborted operations on the device. This is commonly a result of problems with the power supply or the data cable connection. Again, the raw data value should be low.
  • Counter 195 (Hardware ECC Recovered) is a vendor-specific implementation so the values may not always represent identical conditions. In general, it is a count of the number of times ECC correction was required to return the correct data from the drive.
  • Counter 196 (Reallocation Event Count) represents the number of times sectors have triggered a remap event by the controller. It counts both successful and unsuccessful attempts to remap sectors. It is not supported by all manufacturers.
  • Counter 197 (Current Pending Sector Count) is the number of sectors currently marked as unstable; each one will either be remapped when it is next written or cleared from the pending list if a later read succeeds. This counter is decremented once a sector has been remapped or cleared.
  • Counter 198 (Offline Uncorrectable Sector Count) is the total count of errors when reading or writing sectors. If this starts going up, there is a problem with the disk surface or the mechanical subsystem.

Taken by themselves, many of the available counters don’t offer much insight into the overall health of your drives. But when they are taken together, paying particular attention to the ones listed above, you are more likely to spot negative trends so you can prepare for the drive’s inevitable demise.
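If you want to watch just those attributes from the command line, something like the following filters smartctl’s attribute table down to the IDs above (assuming the drive is /dev/sda):

    sudo smartctl -A /dev/sda | grep -E '^ *(5|10|187|188|195|196|197|198) '

Running this periodically (or letting the smartd daemon do the monitoring for you) makes it easier to notice raw values that are creeping upward.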

Conclusion

Even though there are tools available to help predict how much life may be left in your storage devices, they do not obviate the need for a solid, tested backup plan. There is evidence that a large number of drives fail without a single SMART error appearing in their entire history. The same report also shows a high correlation between some of the SMART errors listed above and an extremely abbreviated device lifespan.

For example, the bad sectors indicated in the image above are from a drive that Hard Disk Sentinel Pro estimates has 21 days of life remaining. Two months ago it was reporting 30 days, and I am still waiting to see how much longer it will go before it finds its way to data heaven. This shows that predictive analysis, while indicating that data is at risk, still cannot give a reliable idea of how much time a drive has remaining.

Even though the bad sector count has not increased in months and using HDD Regenerator on the drive to see if it could revive those 77 bad sectors didn’t help, the overall health has still decreased somewhat. It’ll be interesting to see how much longer it survives.

I’m interested in hearing whether anyone else has had similar experiences with SMART monitoring tools. Have you had success in saving your data from disaster by using them? Have they not worked for you at all? How about tools for reviving bad sectors such as SpinRite or HDD Regenerator? Let me know in the comments below!

  1. Gregg Eshelman
    December 1, 2016 at 12:40 pm

    I do miss part of the early personal computer era in the 80's when it was possible to perform a true low level format on a hard drive then FDISK to partition then do a quick FORMAT followed by an error scan. If that found no errors it was good to go. If it did find errors and you wanted a "perfect" disk you could start over and plug all the bad sectors into the low level format.

    Even later, with SCSI, true low level formatting was possible, and often useful to "strengthen" problem sectors by restoring the base data that partitions and file systems rely on.

    I did have one odd old motherboard, a 286 or 386, with a built in IDE controller and a low level format function in its BIOS. While the BIOS couldn't recognize and use drives over 528 megabytes, the LLF function somehow worked on drives up to (IIRC) around 4 gig. Could be that around the time drives got to that size, manufacturers were removing or blocking whatever that board's BIOS was accessing? I kept that motherboard around for quite a while just to use on problem IDE drives to see if they could be revived. Most of the time it worked, if the problem was soft errors.

  2. Gregg Eshelman
    December 1, 2016 at 12:28 pm

    What about resetting SMART counters when all software tests show there's nothing wrong with a drive and a full, unconditional format completes without error? Not a bleeping thing wrong with the drive but something transitory like a power blip or a short heat spike caused one of the counters to erroneously max out.

    Then you're stuck with a bogus warning and having to hit F1 every time when booting up.

    I have a 1TB Western Digital drive that has nothing wrong with it, no PC I have has any issue with it - except one Dell insists it has a "parameter out of range". There ought to be a way to fix that, and if there really is a problem then the SMART counter(s) will quickly max out again, but only if there really is a problem.

    Seems to me that at least 50% of SMART is to sell new drives when "errors" are caused by things that are no fault of the media or mechanism of the drive. Just read a thread on an issue someone was having with a bunch of NAS units. They'd updated the NAS firmware and suddenly 40 hard drives' SMART status went from Good to Normal, but *only* in the NAS units' built in status reporting webservers. With the drives pulled out and tested individually every which way, there was nothing wrong with them. Seems rather fishy to me, but apparently the drives were still under warranty (which is another sore point with the storage industry, measuring warranty from *manufactured* date instead of date of first sale to an end user) so the IT person could simply call up the drive manufacturer and ask for replacements - IF they'd accept the NAS' built in status reporting over their own diagnostic tools.

  3. ferrvittorio
    July 23, 2016 at 9:12 pm

    Hi,
    my pc went to maintenance and hd was substituted by shop with a FUJITSU MHZ2250BH G2 SSD. Accessing now smart data, I find pre-failure to old-age values. Should I "freak-out"? I have a regular back up strategy, but I thought this HD was new - which does not seem to me now.

    • Bruce Epper
      July 24, 2016 at 3:04 am

      That would depend entirely on what values you are looking at. SSDs don't use (or return useful values for) many parameters since they only apply to mechanical drives (spin-up time, seek errors, spin retries, high fly writes, and many others).

      • ferrvittorio
        July 24, 2016 at 4:14 pm

        Obviously .. I do not know this technology SSD. From what I read around, I think I distrust a bit. What values should I look for, then? PS Thanks for the extensive and clearly written article.

        Vittorio

        • Bruce Epper
          July 25, 2016 at 1:00 am

          I just did a lookup of the model number you supplied for the drive and found it is a mechanical HDD, not an SSD. The values you are seeing for the drive should be correct and if you are seeing changes to the raw data with any of the 8 identifiers noted in the article, you should be concerned.

          The major point to take away is your concern should be with the listed counters that are *changing*. It is not as important if they remain static, but if the raw counts are going up, it is finding more errors (or is unable to fix what it is finding) and drive failure is much more likely.

          It should also be noted that even with a drive fresh off of the factory floor, all of these counters will not be zero, but none of them should be excessively large either. The drive does go through testing before being boxed up and shipped out.

          You may also want to check the Event Logs to see if it is recording errors from the Disk subsystem. If you are seeing reallocation events, you should be very concerned about the status of the drive.

          For users with an SSD installed, the counters to watch would be:
          5 Reallocated Sector Count
          177 Wear Leveling Count
          181 Program Fail Count (total)
          182 Erase Fail Count (total)
          183 Runtime Bad Count (total)
          187 Uncorrectable Error Count
          195 ECC Error Rate
          199 CRC Error Count

          Some SSD manufacturers such as Samsung and Crucial ship their own diagnostic tool that can properly report this SMART data, so that should be the first place to check.

          Regardless of SMART data and the contents of the Event Log, having a good, tested backup strategy is paramount. Despite best efforts to develop a means of predicting drive failures, some drives will still fail with no warning whatsoever.

          And finally, thanks for reading the article. It makes me happy you could make use of it.

        • ferrvittorio
          July 25, 2016 at 9:27 am

          Thank you very much for sharing knowledge. You know, my crucial scope is not backup as I have several clouds but .. the hassle to reinstall the whole system. I'll pass my questions onto shop.

          :)

        • Bruce Epper
          July 25, 2016 at 9:48 am

          A system image should be part of the backup plan. Restoration in that case requires 2 things: the latest system image and the latest full backup set. The last recovery I did from a system image took about 2 hours (all of my data resides on non-system drives, so they were untouched).

          Part of that 2 hours was due to my failure to update the system image in 2 months which required a run of Microsoft Update to patch the system after it was restored. Microsoft Update took more than 40 minutes just to *find* the needed updates. If I had kept the system image on that machine up-to-date, the full restore would have taken about 1 hour.

  4. Suder
    June 16, 2016 at 5:30 am

    I did e2fsck -cc but it says the device is mounted. What to do? I am new to Linux.

    • Bruce Epper
      June 16, 2016 at 8:12 am

      The tool cannot scan and fix a mounted filesystem. There are multiple ways you can get around this depending on what you want to accomplish:

      1) If you want to scan the drive that contains /home but does not contain the root filesystem (/), you need to be logged in as root, umount the partition, then run e2fsck.

      2) If you want to scan the root (/) partition, you can simply restart the system with 'shutdown -rF now' which will immediately reboot the machine and run fsck on all partitions.

      If you don't have separate partitions for root and /home, you will have to use the second option.
