How Smart Is Smart?

11 posts in this topic

Posted · Report post

My WHS console says all my disks are fine. WHS Disk Management says all my disks are fine. The Home Server SMART console also says all my drives are OK, and "Failure Predicted" is "False" in every case. But when I view the SMART status of one of my drives, it shows some dire warnings:

Large Pending Bad Sector Count

There are 289 sectors that are waiting to be reallocated.

Disk Health: Critical

The health of this disk is Critical. It is recommended that you replace it as soon as possible.

Uncorrectable Sectors Detected

There are 13 uncorrectable sector errors on the disk.

All the other disks say "Disk Status: Healthy". So what's going on? Is Home Server SMART blowing things out of proportion? How can it say that failure is not predicted and still give these warnings? And if it's right, how can WHS and Disk Management not detect these problems? I'm confused...

Related question: how do you run a manufacturer's diagnostic on a WHS drive? Boot from CD?

Thanks for any help unraveling this.


Posted · Report post

My WHS console says all my disks are fine. WHS Disk Management says all my disks are fine. The Home Server SMART console also says all my drives are OK, and "Failure Predicted" is "False" in every case. But when I view the SMART status of one of my drives, it shows some dire warnings:

All the other disks say "Disk Status: Healthy". So what's going on? Is Home Server SMART blowing things out of proportion? How can it say that failure is not predicted and still give these warnings? And if it's right, how can WHS and Disk Management not detect these problems? I'm confused...

Related question: how do you run a manufacturer's diagnostic on a WHS drive? Boot from CD?

Thanks for any help unraveling this.

Hey Breeze,

"Failure Predicted" relies on the disk manufacturer. The manufacturer sets the thresholds on the individual SMART counters that determine whether "Failure Predicted" is true. In your case, the manufacturer of the disk doesn't think that 13 unrecoverable sectors is a fatal issue.

Home Server SMART examines a lot more SMART counters than the Server Storage tab or Disk Management, and the author has provided his own thresholds for what he considers a fatal issue.

So the short answer is that SMART is unreliable at best, and is implemented differently depending on the manufacturer and model of the disk.
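To make the difference concrete, here is a minimal Python sketch of the two verdicts. It is a simplified model, not how any particular drive or Home Server SMART actually computes things: the normalized values, the manufacturer thresholds, and the tool's raw-count limits (`TOOL_RAW_LIMITS`) are all made-up assumptions. Only the raw counts of 289 and 13 come from the post above.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int      # normalized value reported by the drive (higher is better)
    threshold: int  # manufacturer-set failure threshold for the normalized value
    raw: int        # raw counter, e.g. a number of sectors


def failure_predicted(attrs):
    """Drive-style verdict: true only if a normalized value has dropped
    to or below a nonzero manufacturer threshold (simplified model)."""
    return any(a.threshold > 0 and a.value <= a.threshold for a in attrs)


# Hypothetical limits a third-party tool might apply to the raw counters.
TOOL_RAW_LIMITS = {5: 50, 197: 0, 198: 0}


def tool_warnings(attrs, limits=TOOL_RAW_LIMITS):
    """Tool-style verdict: warn whenever a raw counter exceeds the tool's own limit."""
    return [
        f"{a.name}: raw count {a.raw} exceeds limit {limits[a.attr_id]}"
        for a in attrs
        if a.attr_id in limits and a.raw > limits[a.attr_id]
    ]


# Illustrative disk: normalized values and thresholds are assumptions.
disk = [
    SmartAttribute(5, "Reallocated Sector Count", 200, 140, 0),
    SmartAttribute(197, "Current Pending Sector Count", 200, 0, 289),
    SmartAttribute(198, "Offline Uncorrectable", 200, 0, 13),
]

print("Failure predicted:", failure_predicted(disk))
for warning in tool_warnings(disk):
    print("Warning:", warning)
```

With these assumed numbers the drive-style check stays quiet while the tool-style check raises two warnings, which is the same kind of disagreement described in the original post.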


Posted · Report post

Related question: how do you run a manufacturer's diagnostic on a WHS drive? Boot from CD?

Or floppy. Manufacturer's diagnostics are meant to be run from a bootable disc.


Posted · Report post

Or floppy. Manufacturer's diagnostics are meant to be run from a bootable disc.

And doing so wouldn't be a bad idea either. If nothing else, it would give you peace of mind.


Posted · Report post

Thanks for the tips, guys. I'm going to run offline diagnostics to see what's what.


Posted · Report post

One of the many Linux live ISOs with smartmontools would probably be worth a look. It can be used to initiate the drive's various internal test procedures too.

However, I would be inclined to ditch the drive anyway. All IDE/SATA drives have reserved capacity for relocating bad sectors (which are normal), so there should never be any bad sectors visible from the OS. As an aside, because of this it's not possible to truly erase all data on a disk from the OS, since remapped sectors are no longer addressable from the OS, which can be handy for the forensic services :)
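For anyone trying the smartmontools route: `smartctl -A /dev/sdX` prints the attribute table, `smartctl -t long /dev/sdX` starts the drive's extended self-test, and `smartctl -l selftest /dev/sdX` shows the self-test log afterwards. As a rough sketch, the attribute table can be picked apart in Python like this. The sample text below is illustrative only, not captured from the drive in this thread; just the raw values 289 and 13 are taken from the first post, and `parse_attributes` is a made-up helper name.

```python
# Illustrative text in the layout printed by `smartctl -A`; the normalized
# values, flags, and thresholds here are assumptions, not real readings.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       289
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       13
"""


def parse_attributes(text):
    """Return {attribute id: (name, raw value)} for each attribute row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header line starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            rows[int(fields[0])] = (fields[1], int(raw) if raw.isdigit() else raw)
    return rows


attributes = parse_attributes(SAMPLE_OUTPUT)
for attr_id in (5, 197, 198):
    name, raw = attributes[attr_id]
    print(f"{attr_id:>3} {name}: raw value {raw}")
```

Watching whether the raw values for attributes 5, 197 and 198 change after a full self-test is usually more informative than the single overall health verdict.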


Posted · Report post

I'm posting the following as an FYI. I had a hard time running the WD diagnostics on my WHS system:

1. I originally installed WHS with RAID enabled in the BIOS. WD Diag only works under legacy IDE. I'm not actively using RAID, so I just set the BIOS to IDE for the testing. Apparently this is what you do even if you do use RAID, as long as you don't boot into WHS, but since I haven't tested this, check around to be sure.

2. It still wouldn't run after that because of a stupid defect in the latest WD diagnostic (see here), so I downloaded Ultimate Boot CD via http://www.ultimatebootcd.com/ and ran the WD diagnostic from that.

3. It STILL wouldn't run. It turns out that when it runs from UBCD using default settings, it may stop on "IDLE: Going resident FDAPM ADV:MAX". To get around this, choose the second "optimal" option in the menu that launches after you select the WD diagnostic, then watch for the prompt that asks if you want to install "fdapm" (power management) and uncheck it.

After all that, the WD diagnostics finally ran. The quick test says the drive is OK (!?). Now running the extended test (4.5 hours to go).

TBC...


Posted · Report post

9 hours and 42 minutes later... WD diagnostics reports that the "DRIVE HAS BEEN REPAIRED" along with error code 0223, which according to the relevant WD page means:

"Errors found, but have been repaired successfully. There were media errors that were within the repair capabilities of diagnostic utility. The drive should now be defect free."

OK. So why is SMART still reporting that 289 bad sectors need to be reallocated? Because the maximum has been reached? Because the errors have been corrected but the SMART data wasn't updated?

And why doesn't WHS even recognize this and deal with it directly? It would seem kind of important that a backup system monitors the status of its own backup hardware and takes pre-emptive steps towards preventing disasters. I get the feeling that the prevalent attitude is "Oh it's ok, it's just a consumer server anyway..."

This has really shaken my confidence in WHS. For about $500 you can get a Drobo RAID box that does its own disk analysis and monitoring and keeps a spare drive on hand for immediate substitution and subsequent replacement, on top of self-monitoring of good and bad sectors. My opinion of WHS is changing from tool to toy; another half-baked OS thrown down the customer's throat. No shame.

So it would seem SMART is as smart as the system using it, which in this case is not very smart at all. <_<


Posted · Report post

So it would seem SMART is as smart as the system using it, which in this case is not very smart at all. <_<

No. As Sam stated above "So the short answer is that SMART is unreliable at best, and is implemented differently depending on the manufacturer and model of the disk."

So long as you used the manufacturer's diagnostics that cover your particular drive model, I would believe them over a 3rd-party tool. As to why the counters aren't updated, I wouldn't have a clue. A call to WD might provide an answer.

Bottom line is that even the Drobo would be making predictions based on assumptions.


Posted · Report post

This is really just as it was in the 1980s: the OS still doesn't (and can't) mark as bad those sectors that require extended retries to read. Perhaps there is no way for the drive to tell the OS that it needed to perform some kind of retry mechanism to retrieve the data, which I suppose was the whole point of IDE.

It used to be normal for drives to have a bunch of 'marginal' sectors on them. Back in the day I wrote a little utility that timed reads of every cluster on the disk and flagged as bad in the FAT any that were out of tolerance by a certain amount (hence detecting internal retries). DOS and chkdsk, just as today, always tried to assume a sector was good and avoided flagging it as bad where possible. As said, proper hardware RAID will use many techniques to keep control of this kind of thing, which is why time-limited error recovery is really needed to stop the controller marking the whole drive as bad/offline when a bad sector is found.
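For the curious, the timing idea can be sketched in a few lines of Python. This is not the original DOS utility, just a rough modern analogue under stated assumptions: the function names, the 4096-byte block size, and the "5x the median" tolerance are all arbitrary choices for illustration. On a modern OS the page cache will hide most drive-level retries unless you read the raw device with caching bypassed, so treat the timings as indicative at best.

```python
import statistics
import time


def time_reads(path, block_size=4096):
    """Read a file (or raw device) block by block; return seconds taken per block."""
    timings = []
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.perf_counter()
            chunk = f.read(block_size)
            elapsed = time.perf_counter() - start
            if not chunk:
                break
            timings.append(elapsed)
    return timings


def flag_slow_blocks(timings, factor=5.0):
    """Return indexes of blocks whose read time exceeds factor x the median time."""
    if not timings:
        return []
    limit = statistics.median(timings) * factor
    return [i for i, t in enumerate(timings) if t > limit]
```

A typical use would be `flag_slow_blocks(time_reads(some_path))`, where `some_path` is whatever file or device you want to scan. Unlike the old utility, this only reports suspicious blocks and does not mark anything as bad in the file system.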

On the SMART query, it may be that the manufacturer interprets the data in that field differently. For example, it might be relocated sectors, free relocation buffer or something similar. It probably isn't an absolute number anyway, so take the narrative with a pinch of salt.

But back to the question, I would strongly suggest ditching this disk or at least using it for something else (external backup drive perhaps). For your server you want 100% good disks IMO. Also bear in mind that this IS a consumer software product running on consumer-grade desktop hardware in most cases. Look at the price of a Dell server with however much usable storage you have (even on SATA drives) with hardware RAID, and the reason will be clear :o


Posted · Report post

Thanks for the heads up guys. The smaller Dell servers start at $500 so that may be something to consider, though the hard core enterprise stuff is definitely out of my league. Apart from the 1 to 2 drive redundancy, this is what Drobo claims:

"Self-Healing Technology: With the self-healing technology now incorporated into Drobo S, your data is safer than ever. Even when sitting idle, the Drobo will continually examine the blocks and sectors on every disk, flagging questionable areas. This preemptive “scrubbing” helps ensure your data is being written only to the healthy areas of your drives, and that your data is always safe. And if a drive fails, Drobo S will strive to keep your data in the safest state possible, utilizing the available space on the remaining healthy drives."

Which sounds exactly like what J1mbo is describing. How functional it is is another matter, but along with the automatic RAID swapping, it sure is more comforting than the way WHS does things.

As for ditching the drive: it's less than 2 months old. I'll contact WD and see what they say.

I used to see Winchester hard disks as a great way of storing large amounts of data. Now I'm starting to see them more like ticking time bombs, especially since manufacturers shortened warranties and reduced build quality. I have a 13-year-old 4 GB SCSI drive that still works... it's all about planned obsolescence, isn't it?
