14 January 2016

Buddy, Can You Spara Digm?

Overture:
It is difficult to get a man to understand something, when his salary depends upon him not understanding it.
-- Upton Sinclair

Celko's latest SQL tutorial on simple-talk sat around for a while, nearly off the front page, until the inevitable plaint about DRI not being embraced by the kiddie koder korp:
Joe, your analysis is excellent, but one thing I've seen continually is that many many installations simply refuse to use DRI, and it's enforced via standards.

Many large shops have very little control over the knowledge level of their developers, and they opt to "keep things simple" by disallowing any background processes like DRI to enforce data integrity.

These shops are content to enforce RI via code, often with spectacular failures as a result.

I'd appreciate some feedback on ways to evangelize for this to potential clients.

Common arguments against using DRI include too much background overhead, additional complexity that new developers may not see, and insufficient documentation once the DRI is in place.
[ChrisCarsonSQL]

Thus encouraged, I allowed as how this is not out of ignorance:
Allen Holub (my coding heeero) dealt with the issue, although not specific to RDBMS, in his "Bank of Allen" series on OOD/OOP in the mid-to-late 90s, here: http://www.drdobbs.com/what-is-an-object/184410076

"The only recourse is to change the ROMs in every ATM in the world (since there's no telling which one Bill will use), to use 64-bit doubles instead of 32-bit floats to hold account balances, and to 32-bit longs to hold five-digit PINs. That's an enormous maintenance problem, of course. "
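The float-versus-double half of Holub's complaint is easy to see for yourself. What follows is a minimal sketch, not Holub's code: the `roundtrip` helper and the balance figure are made up for illustration, and Python's standard `struct` module stands in for whatever the ATM firmware would actually use.

```python
import struct

def roundtrip(value, fmt):
    """Pack value into the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# Hypothetical balance: one dollar above 2**24, the point where a
# 32-bit float stops representing every whole number exactly.
balance = 16_777_217.0

as_float32 = roundtrip(balance, "<f")  # 32-bit float, as in the old ROMs
as_float64 = roundtrip(balance, "<d")  # 64-bit double, the proposed fix

print(as_float32)  # the dollar is silently lost
print(as_float64)  # the balance survives intact
```

The arithmetic is the small part of the problem. The representation lives in every client, so the fix has to be shipped to every client.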

Holub illustrates the problem with client code doing what should be done on the server/datastore, so it's not specific to RDBMS, but I am among those who contend that the RM/RDBMS is the first implementation of "objects": they are data and method encapsulated. The coder class of folks simply refuses to cede control/LoC to the datastore. Too much moolah to lose. Their managers are clueless, and desirous of keeping their bloated headcount, so they demur.
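For anyone who has only ever enforced RI in client code, here is roughly what ceding it to the datastore looks like. This is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` module only because it needs no server, and the `customer` and `orders` tables and the `demo_dri` function are hypothetical, not taken from Celko's tutorial.

```python
import sqlite3

def demo_dri():
    """Show the engine, not the client, rejecting an orphaned row."""
    conn = sqlite3.connect(":memory:")
    # SQLite ships with foreign key enforcement switched off.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE customer ("
        " customer_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL"
        "   REFERENCES customer(customer_id))"  # the DRI declaration
    )
    conn.execute("INSERT INTO customer VALUES (1, 'Bill')")
    conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid parent
    try:
        # No customer 42 exists, so the engine refuses the row no matter
        # which application, script, or ad hoc session sent it.
        conn.execute("INSERT INTO orders VALUES (101, 42)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    conn.close()
    return rejected, count

print(demo_dri())
```

The rule is declared once, in the schema, and no client can forget to apply it. That is exactly the LoC the coder class is loath to give up.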

As I began this missive, I found additional comments, one from Celko himself. I feel so proud.

But the issue demands further discussion. After all, the RM is more than 45 years old (measuring from Codd's IBM restricted paper of 1969), and Oracle's first version went commercial in 1979, a decade later (depending on how you measure each "event"). Why should Celko, or even such a nobody as I, need to conduct SQL/RM tutorials for the IT community? Do random physicians post here and there on the benefits of anatomy to other physicians? Of course not. Shouldn't they all have learned the RM/SQL and such in school? Shouldn't development managers have seen the advantages of a smaller data footprint, stronger integrity (ACID), and such? Yet we see NoSql re-invent the VSAM/IMS paradigms of flat-files and hierarchy. And out-of-process transaction control in the manner of CICS (1969)!!

I later responded that the reason for avoiding the RM/RDBMS in practice is not technical, but behavioral. Once again, those pesky motive and incentive™ hold sway over technical benefit or productive efficiency in the soul of a new manager. Motive: keep adding more same-skill coders, whom I know how to hire and boss around, or at least think I do. Incentive: the more of these widget coders I have, the more power and money I get. And that is the key to understanding why NoSql came to be. If one is careful not to rock the dinghy one is currently in, with a little luck, one can grow it to a President Class aircraft carrier (it would have to be US, of course). NoSql is another attempt to co-opt power in the development realm by client-centric coders. Bureaucracy wins, since the currency of bureaucracy is power, and that power rests in the size of budget and headcount. The manager with the largest of each has the most power. Any manager who changes to or adopts a paradigm that needs significantly fewer developers or less maintenance (no one to run around changing ROMs every few weeks, for instance) is doomed.

The quality of the product is irrelevant, so long as it isn't so lousy as to cause losing lawsuits. "Good enough" means that the bureaucracy follows the path of least resistance, which in turn means that development follows from the established skill set. Academic, and self-directed, IT is language-centric. One takes courses in many languages and algorithms. One does not take many courses in understanding and implementing the RM; it's assumed to be "obvious" and thus of no interest. The plain fact that SQL and existing engines do only part of the job set out by Codd for the RM is taken to mean that the RM is flawed, rather than the true state of affairs: both the language and the engines were built to be "good enough" based on 1980s tech. Bah.

In the beginning, Codd didn't specify implementation, and purposely so. Random access storage was expensive, relative to tape, and batch processing was still the norm. While I'm not quite as ancient as Celko, the first system I was involved in building was an MIS (remember that acronym?) for a state program running on an outsourced (and clapped out) 360/30 with a couple of 2311 disk drives and a COBOL compiler. DB2 didn't yet exist, not even Oracle version 2. The coders' heads were cemented in the FOR loop paradigm that persists to this day. And, of course, the FOR loop works best if the data is processed in some single order, RBAR. Thus, simple sorted files on things like SSN or Order Number or whatever were/are the norm. The 2311, by the way, was used as if it were holding sequential files from a tape drive, of course. Three-tape sort/merge done on disk. The advantage: the disk was faster at sequential scans and only one "job" was run at a time. Batch multi-programming came later.

There's the notion that the RM exists to support multi-user systems, primarily some sort of terminal interaction. But such were, at best, rare in 1969. TSO didn't get to IBM mainframes until 1971, and then you had to run MVT. Codd devised the RM in the context of IMS batch processing on OS/360, which, if memory serves, didn't support file sharing across partitions (a narrower notion than the one used today), which is to say across applications. In other words, the benefits of the RM, at that time, were construed within each application silo. It wasn't until much later that having ACID meant that one could have fewer (one?) datastores supporting multiple applications from one engine.

With desktop-sized machines having 10s of gigabytes of memory, 100s of gigabytes of SSD storage and a hefty Xeon CPU for under $10K, why hasn't the canard that "joins are too sloooooooooow!!" (and DRI checking, and so on) been tossed into the dustbin it deserves? Simply because managers are too timid to change paradigms. With a datastore-centric paradigm, client-style coders wouldn't have much more to do than write screen I/O routines, and those can be (and are) generated from the schema. The web has been characterized as a paradigm shift, by some. Yet, from a development standpoint, it is reactionary: a return to terminal batch processing, which is older than terminal time-sharing by a lot. The web engendered yet another round of client-centric code. The early HTTP-based interTubes didn't support persistent connections (too slow and not enough of them available from web servers), so the disconnected local/browser code attempted to do all edit/integrity processing and ship the result to the server. JavaScript grew like Topsy. Of course, data collisions happened all the time. Blame went to the database, of course.

Mentioned recently are the GE commercials where a 20-something coder is defending his decision to write industrial control programs in the face of Angry Birds colleagues. Why can't the RDBMS vendors make similar? Maybe someday.

2 comments:

Todd Everett said...

>>Why can't the RDBMS vendors make similar?

My thoughts exactly. I think you pretty much laid out the first reason. The vendors only make what the customers want, and those customers are the kiddie coders and the bureaucrats you describe. Another reason would be that to create a true RDBMS would be anything but trivial. It seems that no one is yet up to the task.

Robert Young said...

It's always seemed that the major failing of current engines is that they don't disallow non-xNF schemas. Enforcing NF is not trivial either, of course.