03 February 2012

Damn You, Damocles!

The earlier thread-related post engendered some comments that pooh-poohed the importance of threads going forward. While the points were rational, in that not all of the engines mentioned are threaded in all subsystems on all platforms (and I never said they were), the fact remains that threading, in addition to coring (if that's a real word), is the architecture we'll live with so long as CPUs are silicon based. Lithography can only go so small, and power can only go so high, especially as feature size diminishes. I really don't think that notion is up for debate.

What can be debated is whether a single-threaded engine (all subsystems, that is) can keep up with threaded engines (some or all subsystems). The answer, in the limit, is no. Eventually, the fork in the road will split the performance paths too widely to make even a "free" engine worth the money. Will that happen next year, or within five? Five seems more likely, except that we're working, whether it's recognized or not, in a different Moore world.

Moore's actual observation was that the number of components you could put on a chip at minimum cost per component would double about every 24 months; it was an economic observation (measured in $$$) as much as a technical one. He didn't predict anything about feature size, per se. He did take into account that the Law derived from feature shrinkage, but from a financial point of view. Intel is leading the pack in implementing hardware threading across the board, no question, although not in thread count. It is equally unquestionable that clients, more and more phone-ish devices, are inherently limited in power, given their physical size limits. Given Amdahl's Law and that inherently limited power, the client isn't a solution to the database power situation: past a point, pushing work out to lots of weak devices buys little, because the serial, coordinating part of the work doesn't shrink. In other words, success going forward will rest with those who *do* go Back to the Future, but not by ignoring threading.
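To put a number on the Amdahl point, here's a back-of-the-envelope calculation in Python; the 90/10 parallel/serial split is an assumption, chosen only to show the shape of the curve:

```python
# Back-of-the-envelope Amdahl's Law: even unlimited cores (or client devices)
# can't speed up the serial, coordinating part of the work.  The 10% serial
# fraction below is assumed, purely for illustration.
def amdahl_speedup(parallel_fraction, n_workers):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

for n in (2, 4, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
# roughly 1.82, 3.08, 4.71, 8.77, 9.91 -- never past 10x when 10% is serial
```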

With bidirectional "small" clients (AJAX, WebSockets, et al) talking to servers much as VT-X00 terminals once talked to *nix databases, it behooves us to look back at how those servers functioned, since that is the paradigm which leverages the current, and evolving, hardware. As interTubes communication behaves more and more like RS-232, there is a maximally efficient way to use it. Jamming lots o' data down the line (and leaving logic hanging out there, too) isn't it. The network of tomorrow is much like the client/server in-a-box environment of the first instance. One might argue, tee hee, that The Cloud is uber-centralized data and that clients are/will be relatively passive devices; very much like a VT-220 I fondly remember.

More to the point, with a client/server in-a-box paradigm (which can now be achieved with WebSockets and such) we have a patch of memory for the server, and another patch for the client, with the screen/terminal/smartphone/whatever on the other end of a wire. The screen only has responsibility for display and input. With proper NF and DRI, there's very little active code in this server-resident client. Schweeeet. And, as the previous experience demonstrates (if you've actually been there), all edits are done, input box by input box, in real-time against the live datastore; the VT-X00s are "character mode" devices, as opposed to the "block mode" 3270 mainframe terminals and (regular) HTTP clients of today, which re-connect to send a screen's worth of input. Schweeeeet. All the client does is paint the screen and ship the input back. Schweeeeeeeet. Coders?? We don't need no stinkin' (and do they ever) coders!!
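For the flavor of that character-mode round trip, here's a minimal sketch; plain asyncio stands in for the WebSocket channel, and the handler, port, and "ok" reply are made up for illustration. Each line from the wire is one input box's worth of data, edited against the live datastore before the next one arrives:

```python
import asyncio

async def handle_terminal(reader, writer):
    # One connection == one "terminal"; each line is a single input box's worth of data.
    while True:
        line = await reader.readline()
        if not line:
            break
        field = line.strip().decode()
        # Here the edit would run immediately against the live datastore,
        # field by field, rather than waiting for a whole screen's worth.
        writer.write(f"ok: {field}\n".encode())
        await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_terminal, "127.0.0.1", 8765)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```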

Each of those client patches can run on a thread. Give up on threads, and you give up on half your capacity (or more); it's not just about how much faster a thread switch is than a context switch. Here is a schematic (about halfway down the page) of the i7; a core (cpu) is about 10% of the real estate, or the transistor budget. In other words, while feature size has shrunk over the years, thus pushing up the transistor budget, not much if any of that largess has gone to instruction set implementation. Current Intel chips don't even implement x86 instructions directly in the hardware (it appears that native x86 execution began to disappear with the P4; thanks to Andrew Binstock). So, we get more RISC-like cores/threads. Here's an Intel thought: "[T]he multithreading capability is more efficient than adding more processing cores to a microprocessor."
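To make the "one client patch per thread" idea concrete, here's a sketch using Python's standard-library threading TCP server, which spins up a thread per connected terminal; the handler, port, and reply format are illustrative only, and a real engine would of course do this in C against its own scheduler:

```python
import socketserver

class TerminalHandler(socketserver.StreamRequestHandler):
    # Each connected terminal gets its own thread and its own patch of memory.
    def handle(self):
        for line in self.rfile:
            field = line.strip()
            # Validate/apply the single field against the live datastore here.
            self.wfile.write(b"ok: " + field + b"\n")

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("127.0.0.1", 9000), TerminalHandler) as srv:
        srv.serve_forever()
```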

When asked in the 1970s why his machines were so much faster, Cray said, "shorter wires". That remains true; certainly cpu designers are quite aggressive in the pursuit as they jam ever more features onto nanometer-scale wires. Why software folks continue to think that longer wires are smarter is puzzling. Even if McNealy had been right that the network is the computer (for a local engineering net, maybe; for the interTubes, not so much), there's still the problem of reconciling all those nodes.

While Intel CPUs mostly offer two threads per core, others offer up to eight.

So, what margin of advantage is there to "turbo" mode at the thread level? Turns out, there is a bit of it, with the current Intel chips:
"When there are only two active threads, the Intel Core i7 will automatically ramp up two of its processing cores by one speed grade or bin (133 MHz). When only one thread is active, the Core i7 will ramp up the sole active processing core, not by one bin, but by two bins (266 MHz). However, if there are three or more active threads though, the Core i7 processor will not overclock any of its processing cores." Not a lot, compared to turbo at the core level. So, I still predict that ignoring thread support in the cpu is a losing proposition.

Here's a new posting testing threading in SQL Server. While not a runaway, in virtually all test cases threading improved performance. Now, SQL Server, in default mode, is a locker-based engine, while Postgres is MVCC, so the engine mechanics might have an effect. Given that MVCC databases promise no conflict between readers and writers, one should expect those sorts of databases to gain more from a threaded engine than a locker does; the threads stay active rather than sitting blocked on locks.
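To see the MVCC point in miniature, here's a sketch using psycopg2 against a hypothetical local Postgres table demo(id, val) that already holds one row; the connection strings and names are placeholders. Under Postgres's default Read Committed level the reader neither blocks on the in-flight write nor sees it:

```python
import psycopg2

# Two sessions: one writing, one reading the same row.
writer = psycopg2.connect("dbname=test")
reader = psycopg2.connect("dbname=test")
wcur, rcur = writer.cursor(), reader.cursor()

# The writer updates the row but does not commit, so it still holds its lock.
wcur.execute("UPDATE demo SET val = 'new' WHERE id = 1")

# Under MVCC the reader is simply handed the prior committed version of the
# row -- no waiting, so its thread stays busy.  A lock-based engine in its
# default mode would block this SELECT until the writer finished.
rcur.execute("SELECT val FROM demo WHERE id = 1")
print(rcur.fetchone())   # the pre-update value

writer.rollback()
writer.close()
reader.close()
```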

While it doesn't explicitly compare Oracle on a process OS versus a thread OS, this paper does discuss the advantages, and methods of use, of parallelism and threads. Oracle and Postgres are both MVCC, although the implementations are different. And, this is a Stanford paper discussing the Oracle Windows thread model. Making comparisons between Windows and linux (or any specific *nix) is difficult, in that what we want to know is whether a "good" thread implementation is better than a "good" process implementation, for an otherwise equivalent OS.

Cut to the chase. Since DB2 switched to a threaded engine from a process engine, there ought to be some evidence on the interTubes as to whether this was a Good Thing. This is a presentation by Serge (guru to DB2 weenies); see slides 6, 15, 16. And this is the IBM justification. Another example: this is a paper by SAP for DB2/LUW on HP-UX (not the most popular *nix); note point 1 under "Operating System". IBM didn't move to a threaded model for yucks; they've been treating LUW as a red-headed stepchild for so long that any significant expense would require significant bang for the buck. Finally, at long last, a recent presentation on the new threaded DB2. Note slide 17.

We need two additional pieces to get the most out of the system: multiplexed clients, and fast swap so the clients don't stall. Here's IBM's take on multiplexing. Others can do likewise. The way to keep things running smoothly is Sacrificial Swap©; by that I mean using SSD as swap on the database and/or web server machine. Since SSDs have determinate lifetimes (more a cliff-like End Times than an HDD's gradual decay slope), simply replace the swap drive periodically. Swap on SSD provides vastly better performance.
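On the application side, the multiplexing idea looks roughly like this, using psycopg2's thread-safe connection pool; the pool sizes and DSN are placeholders, and DB2's connection concentrator does the equivalent on the server side:

```python
from psycopg2.pool import ThreadedConnectionPool

# Many client threads share a small set of real database connections, so the
# engine juggles far fewer agents than it has terminals.
pool = ThreadedConnectionPool(minconn=2, maxconn=16, dsn="dbname=test")

def run_edit(sql, params):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
        conn.commit()
    finally:
        pool.putconn(conn)
```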

On a related note, this is an IBM (mostly) paper on bufferpools (as DB2 calls them) and SSD. Of note, this isn't about using SSD as the primary datastore, but only in support of the bufferpools.

My conclusion: threaded models have the lead, and they're not likely to lose it.