SSD Drives – Are they surviving the times?

SSDs have long been touted as a faster, more reliable way to access data. My most recent experience was with a batch of computers that suffered SSD failures: Intel SSD 520 Series drives (180 GB) in an unnamed vendor's machines, though the vendor hardly matters. The drive claims a stunning mean time between failures of 1.2 million hours, roughly 137 years. The warranty is a standard 5 years, the same span Seagate and Western Digital usually assign to a spinning drive. The lesser 320 model carried a 3-year warranty, which again probably doesn't indicate a failure planned for 3-5 years out. My best guess is that vendors only want to keep writing firmware for so long, and use the warranty window to evolve old technologies out of their support ecosystem.

Vendors regularly tell me on the support hotlines that they "have SSD failures every day," so I wanted to dig in a bit and understand some of the factors. Folks often tell me that the "gates can only open and close so much," but surely if I've had 5 failures of a specific SSD model within 12 months, it's either a manufacturing issue or another component giving out.

My first hard drive was in a Packard Bell Legend desktop, and it held 800 MB. I recall that it never failed, or accumulated more than a few bad blocks, in all of the 10 years I had it kicking around.

Let's dive in below and see more about what's happening inside the box.

What are the factors?

NAND flash cells can sustain only a certain number of program/erase cycles, specified by the endurance rating, before they permanently wear out. SSDs use wear leveling to distribute those cycles evenly across cells.
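
To get a feel for what the endurance rating means in practice, here's a back-of-the-envelope sketch. The numbers (3,000 P/E cycles, a write amplification of 2, 20 GB of host writes a day) are hypothetical, not the 520 Series' actual specs:

```python
# Rough endurance estimate for a consumer SSD (hypothetical numbers).
# Total writable data ~= capacity * rated P/E cycles / write amplification.

def estimated_lifetime_years(capacity_gb, pe_cycles, write_amplification, gb_written_per_day):
    """Return a rough number of years before the NAND wears out."""
    total_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_writes_gb / gb_written_per_day / 365

# A 180 GB drive rated for 3,000 P/E cycles, write amplification of 2,
# and 20 GB of host writes per day:
years = estimated_lifetime_years(180, 3000, 2.0, 20)
print(round(years, 1))  # about 37 years of light desktop use
```

The takeaway: for light desktop use, cell wear-out is rarely the thing that kills the drive; the other failure modes below get there first.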

Individual components can fail and take the entire drive down with them: a NAND die, the SSD controller, or the onboard capacitors. SSDs need internal power regulation, a controller, and supporting circuitry to do the job, and that complexity and local handling of voltage make them susceptible to surges from both internal and external sources.

A firmware bug can feel like a hardware failure, but it's just bad programming. Often the end user has to seek out the updates from vendors like Dell or HP.

NAND flash can't be overwritten in place: once a page has held data, the drive must go through a read-modify-erase-write cycle to replace it, so the more data you churn, the more erasing happens. Drives are consumed over time; they have to mark data as deleted and erase the block before writing to it again, a process that fatigues the NAND flash.
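
A toy model makes the cost concrete. Real controllers use spare blocks and garbage collection to soften this, but in the naive worst case, updating a single page in a full block means relocating every other live page and erasing the whole block (the page-per-block count here is illustrative):

```python
# Toy model of erase-before-write: overwriting one logical page forces the
# drive to copy the block's other live pages and erase the whole block.
PAGES_PER_BLOCK = 64

def rewrite_cost(live_pages_in_block):
    """Physical page writes needed to update 1 page in a block that already
    holds `live_pages_in_block` valid pages (naive model, no spare blocks)."""
    copies = live_pages_in_block - 1  # relocate every other live page
    return copies + 1                 # plus the updated page itself

host_writes = 1
physical_writes = rewrite_cost(PAGES_PER_BLOCK)   # block completely full
print(physical_writes / host_writes)              # write amplification: 64.0
```

That ratio of physical writes to host writes is the "write amplification" factor, and it's exactly the kind of hidden multiplier that fatigues the flash faster than the host's write volume would suggest.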

SSDs lose endurance to bit errors, which occur when electrons leak through cell walls and when program disturbs are created. A program disturb is the unintentional programming of a memory cell inside the flash.
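
The controller survives the occasional flipped bit because every page is stored with error-correcting redundancy. Real SSDs use much stronger BCH or LDPC codes; the minimal sketch below uses a classic Hamming(7,4) code just to show the principle of detecting and repairing a single disturbed bit:

```python
# Minimal single-error-correcting sketch (Hamming(7,4)). Real SSD ECC is far
# stronger, but the idea is the same: redundancy lets the controller detect
# and repair an occasional leaked or disturbed bit.

def encode(d):                       # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):                      # c: list of 7 code bits, <=1 flipped
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3       # 1-based index of the bad bit, 0 = clean
    if pos:
        c[pos - 1] ^= 1              # flip it back
    return [c[2], c[4], c[5], c[6]]  # recovered data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                         # simulate a program disturb flipping a bit
print(correct(word))                 # original data comes back: [1, 0, 1, 1]
```

When the raw error rate climbs past what the code can correct, that's when the bit errors start surfacing as the data loss described above.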

Memory and CPU usage do tend to correlate with SSD failures for many models; for example, a server paging heavily because physical memory is 100% occupied can fail faster than one with ample RAM available.

SSDs are susceptible to bad blocks, which the OS can no longer write to, and this can lead to data loss. If blocks are going bad and being marked inactive on a regular basis, it's a bad sign. Constant resets of the 'raidport' on the disk controller, accompanied by lockups, usually point to this kind of issue as well.

Improvements & Fixes

You need a long (extended) diagnostic test to really isolate the problem. It doesn't always show up in the native diagnostics, Dell's F12 diagnostic boot being one example. Seek out the right diagnostic tool for the job if you can't isolate the fault.

chkdsk /f repairs file system errors, and /r additionally locates bad sectors and recovers readable data; both still work on an SSD. Try them, and consider disabling hibernation and other tweaks to reduce large contiguous files on disk.

Watch for bursts of reallocated sectors, degraded performance, and a forced chkdsk at boot. You can use specific tools to watch bit-level error counters incrementing, like the CRC error count on the drive.
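
Those counters live in the drive's SMART attributes. As a sketch of what watching them looks like, here's a tiny parser over a shortened, made-up sample of the table that `smartctl -A /dev/sda` (from smartmontools) prints; the attribute names are real SMART attributes, the raw values are invented:

```python
# Sketch: pull trouble counters out of smartctl -A style output.
# SAMPLE is a shortened, hypothetical capture, not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   12
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   3
"""

def parse_smart(text):
    """Return {attribute_name: raw_value} from smartctl -A style output."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].isdigit():       # attribute rows start with ID#
            attrs[parts[1]] = int(parts[-1])   # raw value is the last column
    return attrs

attrs = parse_smart(SAMPLE)
print(attrs["Reallocated_Sector_Ct"], attrs["UDMA_CRC_Error_Count"])  # 12 3
```

The trick is trending: log these values periodically and alert on growth between checks, because a counter that's climbing matters far more than its absolute number.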

Tools exist to monitor the SSD, like SSDLife or the SMART info in Defraggler: great programs that give you a deep dive into the drive's health and performance metrics.

SSD vendors need to improve wear leveling, and possibly design devices for specific purposes. We love the performance of SSDs, but they may not be appropriate for every situation, like a write-heavy database SAN.
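
The core idea behind wear leveling is simple even if production implementations aren't. A minimal sketch of the policy, with a hypothetical pool of eight erase blocks: always direct the next write to the least-worn block, so no single hot block absorbs all the erases:

```python
# Toy wear-leveling policy over a hypothetical pool of 8 erase blocks:
# always write to the least-worn block so erase counts stay uniform.
erase_counts = [0] * 8

def write_leveled():
    block = erase_counts.index(min(erase_counts))  # pick the least-worn block
    erase_counts[block] += 1                       # one more erase on it

for _ in range(1000):
    write_leveled()

print(max(erase_counts) - min(erase_counts))  # spread across blocks: 0
```

Real controllers also have to move cold, static data off lightly-worn blocks (static wear leveling), which is where the sophistication the vendors still need comes in.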

Look for firmware on the vendors' sites. Dell has it available on downloads.dell.com, but you won't find those same updates in Dell Command or on the Dell Support Site.

Conclusion

SSDs ARE less likely to fail, but they are more likely to lose data.

Controller maturation is improving, average product endurance is rising, and the standard deviation is falling.

SSD makers need to increase the sophistication of wear-leveling software and spread data out intelligently. No matter how you spread it, though, the cells will eventually wear out. We need better diagnostic tools, failure detection, and optimization built right into the drive.

 

Reference: SSD DataCenter Failures