09 February 2022

I Have a Code in My Node - part the second

Many years ago, I got into an argument on some of the compute forums over the notion that node shrink is a Never Ending Story. I called that somewhere between a mistake and a lie. Some time later, this became a missive in these here parts (2014). At that time, features were already down to a few dozen atoms across. The writing was on the wall, so far as I could see.

Well, unbeknownst to me, those in the business of making chips have been grappling for years with the fact that chips have become Heisenberg devices. Not badly enough to shut down node shrinking, but getting closer. Today's NYT has a lengthy report on the history and present state of this concern. Turns out, I was right. I await the call from Oslo.

About a year after the first installment in this series, the report tells us:
Companies that run large data centers began reporting systematic problems more than a decade ago. In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists who study hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google's millions of computers had encountered errors that couldn't be detected and that caused them to shut down unexpectedly.
Heisenberg arrives:
A team of researchers attempted to track down the problem, and last year they published their findings. They concluded that the company's [Google] vast data centers, composed of computer systems based upon millions of processor "cores," were experiencing new errors that were probably a combination of a couple of factors: smaller transistors that were nearing physical limits and inadequate testing.
Is there a way around this? Some think the problem can be solved with software. Coders always think that, of course:
One such operation is TidalScale, a company in Los Gatos, Calif., that makes specialized software for companies trying to minimize hardware outages. Its chief executive, Gary Smerdon, suggested that TidalScale and others faced an imposing challenge.

"It will be a little bit like changing an engine while an airplane is still flying," he said.
For myself, Smerdon is a cockeyed optimist. Consider the problem at a purely logical level. Your node has gone Heisenberg, which means that random cells produce random results at random intervals. Now, the software Monitor answer, at the 30,000 foot level, is either firmware (on the metal) or loaded code somewhere below the OS. What happens, of course, when the code space of the Monitor itself falls victim to Heisenberg? The Monitor fails, invisibly, just as before. So you need a Watcher to watch the Monitor, which puts you in the fun house mirror room of infinity. The scheme also costs chip performance, since most (all?) of the features have to be monitored in real time. How much of the chip's workload will be dedicated to monitoring itself? Do you skip the GUI bits and only watch the ALU? I don't pretend to know, and I expect the likes of TidalScale don't know either.
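To make the regress and the overhead concrete, here is a minimal sketch of what a software Monitor of this general sort might look like: run the same work three times, ideally on different cores, and take the majority vote (software triple modular redundancy). Every name in it is hypothetical, and it is emphatically not TidalScale's method; it is just the simplest shape a software answer to silent hardware errors can take.

# Hypothetical sketch: software triple modular redundancy as a "Monitor".
# Run the workload several times and majority-vote on the results.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def work(args):
    x, y = args
    return x * y                      # stand-in for the real workload

def checked(fn, *args, runs=3):
    # The catch: this voter is itself ordinary code on the same suspect
    # hardware, so a flipped bit here is just as invisible as the one it
    # was meant to catch -- hence the Watcher-for-the-Monitor regress.
    with ProcessPoolExecutor(max_workers=runs) as pool:
        results = list(pool.map(fn, [args] * runs))
    winner, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no two runs agreed; hardware fault suspected")
    return winner

if __name__ == "__main__":
    print(checked(work, 6, 7))        # 42, at roughly 3x the compute cost

Note what the sketch buys and what it costs: it catches a single bad run at roughly three times the compute, and the voter is still ordinary code on the same suspect silicon, which is exactly the fun house mirror problem.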

It will be interesting to see how mankind adjusts to a world where "More" is removed from the vocabulary.
