05 September 2016

NoSql? No Mas! No Mas!

First they ignore you, then they laugh at you, then they fight you, then you win.
-- Gandhi

It may be a tad early to gloat, but indications are that the NoSql zealots have waved the white flag and admitted that CAP is silly and doing ACID is way more fun. I suppose they deserve hemorrhoids, too. Couldn't happen to a nicer bunch of folks. A bit of interTubes searching confirms the occasional tidbit that drifts by: the thought leaders in the NoSql cabal finally admit that transactions and central control over data consistency ain't such an old-fashioned idea after all. They've discovered it, and will be patenting it soon. Not that NoSql datastores were any kind of innovation, either. Just VSAM files in ASCII with a buzzy name.

Told ya so.

Here's the main wave from the perpetrator of CAP. At least the principal instigator admits the error. The silly part of the whole episode is that partitions really are rare occurrences. They are just extended latency when they occur. Federated RDBMS, which have been around since about 1990 (the general principles since the mid 80s), have handled the situation. Here's a DB2 tutorial from 2003. The semantics are about the same with other such RDBMS.
As the "CAP Confusion" sidebar explains, the "2 of 3" view is misleading on several fronts. First, because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned.

Fact is, distributed RDBMS (both single- and multi-vendor) have existed since at least the early 1990s. And it wasn't just casual; here's a paper on security from Mitre (just down the road from Progress, which supported federation) from 1994. While it's no secret that I'm not a big fan of The Zuck,
Facebook uses the opposite strategy: the master copy is always in one location, so a remote user typically has a closer but potentially stale copy. However, when users update their pages, the update goes to the master copy directly as do all the user's reads for a short time, despite higher latency. After 20 seconds, the user's traffic reverts to the closer copy, which by that time should reflect the update.
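
For the code-minded, the routing rule in that quote fits in a few lines. This is only a sketch of the behavior as quoted, not Facebook's actual code: the `Router` class, its method names, and the "master"/"replica" labels are all made up for illustration, and the 20 second window is the only number taken from the quote.

```python
from dataclasses import dataclass, field

STICKY_SECONDS = 20.0  # the window mentioned in the quote


@dataclass
class Router:
    """Hypothetical router: pin a user to the master copy after a write."""
    sticky_seconds: float = STICKY_SECONDS
    # user_id -> timestamp (seconds) of that user's most recent write
    last_write: dict = field(default_factory=dict)

    def route_write(self, user_id: str, now: float) -> str:
        # Every update goes straight to the master copy.
        self.last_write[user_id] = now
        return "master"

    def route_read(self, user_id: str, now: float) -> str:
        # Reads follow the write to the master for a short time,
        # then revert to the closer (possibly stale) replica.
        written = self.last_write.get(user_id)
        if written is not None and now - written < self.sticky_seconds:
            return "master"
        return "replica"


if __name__ == "__main__":
    router = Router()
    router.route_write("alice", now=100.0)
    print(router.route_read("alice", now=105.0))  # inside the window
    print(router.route_read("alice", now=121.0))  # window has expired
    print(router.route_read("bob", now=105.0))    # never wrote anything
```

Note what the trick buys: the user sees his own write, and everybody else may still see the stale copy until replication catches up.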

So, what's the deal?
Another aspect of CAP confusion is the hidden cost of forfeiting consistency, which is the need to know the system's invariants. The subtle beauty of a consistent system is that the invariants tend to hold even when the designer does not know what they are.

Or, as many RM zealots tell us, high NF schemas reveal facts about data relationships we didn't know before. The schema specifies the invariants, but the data reveals the real world correlations.

Later in the piece, Brewer goes off the deep end:
The essential ATM operations are deposit, withdraw, and check balance. The key invariant is that the balance should be zero or higher. Because only withdraw can violate the invariant, it will need special treatment, but the other two operations can always execute.

This is the Pollyanna view of how banks run ATMs, and transactions generally. His description, and what is assumed by most civilians, is that Your Bank updates Your Account in real time, whether at an ATM or at a human teller. Not true. Accounts are reconciled (sometimes so as to generate overdrafts!) in batch at some time EOD. Much of the big money made on bank hacking happens because the perps know that they have hours to do the deed before the accounts used are reconciled. Sometimes the intermediate accounts never see the deed. COBOL cowboys much prefer batch. They've been doing things that way for six decades. BASE has been the default paradigm in banking since forever.
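
The difference between the two views can be put in a toy sketch. None of this is any bank's real logic: `withdraw_checked` and `post_batch` are hypothetical names, amounts are whole cents, and the batch simply posts the day's journal in the order given.

```python
def withdraw_checked(balance: int, amount: int) -> int:
    """Real-time view from the quote: refuse any withdrawal that would
    break the invariant balance >= 0."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


def post_batch(opening_balance: int, journal: list[int]) -> tuple[int, bool]:
    """Batch view: apply the day's signed amounts at EOD, in order.

    Nothing is refused. The function only reports whether the running
    balance dipped below zero at any point during posting.
    """
    balance = opening_balance
    overdrawn = False
    for amount in journal:
        balance += amount
        if balance < 0:
            overdrawn = True
    return balance, overdrawn


if __name__ == "__main__":
    # Two withdrawals and a deposit against an opening balance of 100.
    print(post_batch(100, [-80, -50, 40]))

    # The real-time check lets the first withdrawal through...
    remaining = withdraw_checked(100, 80)
    try:
        withdraw_checked(remaining, 50)  # ...and refuses the second.
    except ValueError as err:
        print("refused:", err)
```

With the batch version the customer ends the day in the black and still gets flagged as overdrawn; with the real-time check the second withdrawal never happens. Same three operations, different invariants enforced at different times.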

So, with so many CPU cycles, SSD, XPoint, NVRAM, bandwidth, and the like available, why would anyone drop OLTP/ACID on purpose? Back in the thrilling days of yesteryear, when the 360 and 2311 DASD ruled the world, maybe there was no other choice. Times, they are a-changin'.

[For those that keep track of such things, this musing and its title were started before I saw the adverts for the new Roberto Duran movie.]


Roboprog said...

Having spent a few years working on xBase stuff (an ISAM system for PCs, basically) back in the late 80s, I'm familiar with the headaches of a non-ACID system. I don't know why the kids want to go back to a distributed analog of Berkeley-DB (name-value) or MUMPS (Codasyl-ish) type system, either :-)

Some days I miss the explicit index chaining, vs delving into "what did the query planner decide to do *today*?". I don't miss the indices that missed an update, no rollback, dangling foreign keys, unset columns, etc etc etc.

Robert Young said...

dBase II and III - I hated them
Berkeley-DB - never used it
MUMPS - been around it, and devised at MGH before I lived in the North End

"Those who ignore history are doomed to repeat it." That's not the exact quote, which I didn't look up, just my notion of it.

Roboprog said...

Hello again, Robert. FWIW, I just stumbled across a pretty good recap of this (in a discussion about PostgreSQL items needing future attention), in which somebody must explain "why" to one of the newbs :-)