28 December 2010

Scotty's Wisdom

Message boards can actually be useful for figuring out where an industry is going.  STEC is the principal publicly traded Enterprise SSD vendor, so it is the public bellwether for "Enterprise SSD".  They've been segueing from an SLC-dominant to an MLC-dominant product mix, which has become a topic of discussion, especially recently.  A thread is running now about the qualification of an MLC version of STEC's gold-standard "Enterprise SSD" (ZeusIOPS).  I was moved to contribute the following:

"I canna change the laws of physics"

That will be true in the 23rd century and is true now.  The number of erase cycles of MLC is fixed by the NAND tech; controller IP can only work around it, usually by over-provisioning (no matter what a controller vendor says).  Whether STEC's controller IP is smarter (enough, aka, at the right price) is not a given.  As controllers get more convoluted to handle the decreasing erase cycles (what?  you didn't know that the cycle count is going down?  well, it is, as a result of feature size reduction), SLC will end up being viable:  cheaper controllers, amortized SLC fabs.

If STEC (or any vendor) can guarantee X years before failure, then the OEMs will just make that the replacement cycle.  It would be interesting to see (I've not) the failure distribution functions of HDD and SSD (both SLC/MLC and STEC/Others).  Failure isn't the issue; all devices fail.  What matters is the predictability of failure.  The best case is a step function:  you know that you have until X (hours, writes, bytes, whatever), so you replace at X - delta, and factor that into the TCO equation.
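To make the "replace at X - delta" arithmetic concrete, here is a minimal sketch.  The function name, the parameters, and every number in the usage note are made up for illustration; they are not vendor figures, and a real TCO model would add power, labor, downtime and so on.

```python
import math


def replacement_cost(unit_cost, guaranteed_life_hours, delta_hours, horizon_hours):
    """Drive spend over a planning horizon under a step-function failure model.

    Assumption (hypothetical model): every drive is good until exactly
    guaranteed_life_hours, and each one is swapped at X - delta.
    """
    service_interval = guaranteed_life_hours - delta_hours
    if service_interval <= 0:
        raise ValueError("delta must be smaller than the guaranteed life")
    # Number of drives consumed per slot over the horizon, rounded up
    # because a partly used drive still has to be bought.
    drives_needed = math.ceil(horizon_hours / service_interval)
    return drives_needed * unit_cost
```

For example, with an invented $1,000 drive guaranteed for 30,000 hours, a 4,000 hour safety margin and a ten year (87,600 hour) horizon, `replacement_cost(1000.0, 30000, 4000, 87600)` works out to four drives per slot.  The point is that with a step function the only uncertain input is the price.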


I think the failure function of SSD (in particular, whether and to what extent it differs from HDD) does matter, a lot.  Consumer/prosumer HDDs still show an infant mortality spike.  Since they're cheap, and commonly RAIDed, shredding a dead one and slotting in a replacement isn't a big deal.  Not so much for SSD, given the cost.

I found this paper, but I'm not a member.  If any reader is, let us know.  The précis does have the magic words, though:  Gamma and Weibull, so I gather the authors at least know the fundamentals of math stat analysis.  If only there were an equivalent for SSD.  It's generally assumed that SSDs are less failure prone, since they aren't mechanical; but they are, at the micro level.  Unlike an HDD, which writes by flipping the flux capacitor (!!), the SSD write process involves physical changes in the NAND structure, which is why they "wear out".  Duh.  So, knowing the failure function of SSD (and knowing the FF for NAND is likely sufficient, to an approximation) will make the decision between HDD and SSD more rational.  If it turns out that the FF for SSD moves the TCO below equivalent HDD storage (taking into account short stroking and the like to reach equivalent throughput), SSD as primary store becomes a value proposition with legs.  Why the SSD and storage vendors aren't pumping out White Papers is a puzzlement.  Maybe their claims are a tad grandiose?
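For readers who haven't met Weibull, a rough sketch of why it is the usual tool for this question.  The hazard (instantaneous failure rate) is h(t) = (k/λ)(t/λ)^(k-1): a shape k below 1 gives a falling rate (infant mortality), k equal to 1 gives a constant rate, and k above 1 gives a rising rate (wear-out).  The shape and scale values in the usage note are purely illustrative assumptions, not fitted to any real HDD or SSD data.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t > 0 for a Weibull(shape, scale)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Probability that a device is still alive at time t >= 0."""
    return math.exp(-((t / scale) ** shape))
```

With an invented scale of 1,000 hours, `weibull_hazard(100, 0.5, 1000)` is higher than `weibull_hazard(500, 0.5, 1000)`, the infant mortality pattern, while with shape 3 the ordering flips, the wear-out pattern.  A large shape value approaches the step function described above, which is the case where replacing at X - delta is cheap to plan for.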

The ultimate win will happen when MRAM (or similar) reaches mainstream.  Very Cool.
