29 July 2013

No Mas!! No Mas!! [update]

Perhaps there is something in the water; or more folks read this endeavor than I thought. With somewhat more frequency (not enough to let one get cocky, though), database folks (more or less RM devotees) are writing about the silliness of NoSql. I don't recall a NoSql Zealot (i.e., trainer/evangelist) doing a mea culpa, alas. But this post is more pointed than ever. Sound familiar?
Developers might argue that people can use NoSQL just agreeing on some common rules and information. My answer to this has always been: There is no point in having rules if they are not enforced by something or somebody. SQL has always followed a "think first" philosophy while NoSQL seems to rather fancy a "store first" approach.

Of course, the NoSql zealots, just as their COBOL brethren (whom they routinely disparage for being "old legacy") before them, argue that transactions should be handled in the application. That means some form of rudimentary TPM. And lots of silos of data that can't communicate. And so on. Who needs 40 years of experience building something so simple?
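The "enforced by something or somebody" point is concrete enough to demo. A minimal sketch using Python's sqlite3 and a hypothetical orders/line_items schema (table and column names are mine, not from any system discussed): the engine, not application code, rejects the orphan row.

```python
import sqlite3

# Hypothetical two-table schema; the engine enforces the parent-child rule.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite ships with FK checks off
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE line_items (
    id INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id))""")

con.execute("INSERT INTO orders (id) VALUES (1)")
con.execute("INSERT INTO line_items (id, order_id) VALUES (10, 1)")  # parent exists

try:
    # An orphan row: no application-side check needed to catch it.
    con.execute("INSERT INTO line_items (id, order_id) VALUES (11, 99)")
except sqlite3.IntegrityError as e:
    print("rejected by the engine:", e)
```

The same DRI declaration protects every application, batch job, and ad hoc session that ever touches the tables; the application-side version has to be re-implemented (correctly, every time) in each of them.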

This reflects Pascal's view of xml as just data transport (he's an occasional member of the quote squad):

The fact is that in order for any data interchange to work, the parties must first agree on what data will be exchanged -- semantics -- and once they do that, there is no need to repeat the tags in each and every record/document being transmitted. Any agreed-upon delimited format will do, and the criterion here is efficiency, on which XML fares rather poorly...
-- Fabian Pascal/2005

That's getting on to a decade ago! Some people are just slow learners.
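Pascal's efficiency point is easy to quantify. A toy measurement in Python, with made-up records: once both parties have agreed on the field order (the semantics), repeating the tags in every record only inflates the payload.

```python
# Made-up records for illustration only.
records = [("Codd", 1969), ("Date", 1995), ("Pascal", 2005)]

# XML: the agreed-upon semantics re-stated in every single record.
xml = "".join(
    f"<row><author>{name}</author><year>{year}</year></row>"
    for name, year in records)

# Any agreed-upon delimited format: semantics stated once, out of band.
delimited = "\n".join(f"{name}|{year}" for name, year in records)

print(len(xml), len(delimited))  # the tags dominate the XML payload
```

The gap only widens as the record count grows, since the tag overhead is paid per record while the agreement is made once.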

First, it's Roberto Duran.

Second, every now and again, the nice folks at Database Weekly will include a link to one of these posts in their weekly email notification. This is one of those weeks. There are two links under the NoSql heading, the other being to this piece on an interview with Date and Darwen, said interview conducted by Iggy himself. In the course of the post is this:
The inventor of relational theory, Dr. Edgar "Ted" Codd, himself blessed eventual consistency in the very first paper on relational theory "A Relational Model of Data for Large Shared Data Banks"...

He then gives a lengthy quote. Now, one might argue, and I certainly do, that in 1969 Codd faced a hardware situation utterly different from today's. IMS, which was his target with the RM, was intimately designed to the hardware of the day, and Codd sought to break that bond. While I can't cite him, I believe he made such a statement based on the prevailing practice of batch processing. Even in 1990, accounting modules on RDBMS verticals running on *nix mini machines still did batch updates. While Codd and his acolytes made the point that the RM is divorced from implementation, that quote is clearly inconsistent with that bill of divorcement: it reflects the limited random access support of mainframe disk subsystems of the day. Today we have multi-processor/core/SSD machines that make batch updating as obsolete as high button shoes. I can't quite figure Iggy's point here, either. Is he supporting Amazon's preference for "eventual consistency" (an oxymoron) by quoting Codd, or is he supporting eventual consistency in today's RDBMS engines? Don't know, but neither is needed with today's hardware.


Iggy Fernandez said...

Hello, Robert,

I would argue that the tradeoff between consistency and performance is as important in the wired world of today as it was in Codd’s world. We cannot cast stones at Dynamo for the infraction of not guaranteeing the synchronization of replicated data, because violations of the consistency requirement are commonplace in the relational camp. The replication technique used by Dynamo has a close parallel in the well-known technique of multimaster replication (http://docs.oracle.com/cd/E24693_01/server.11203/e10706/repmaster.htm#BGBGBHFE). Application developers are warned about the negative impact of integrity constraints. For example:

“Using primary and foreign keys can impact performance. Avoid using them when possible.” (http://docs.oracle.com/cd/E17904_01/core.1111/e10108/adapters.htm#BABCCCIH)

“For performance reasons, the Oracle BPEL Process Manager, Oracle Mediator, human workflow, Oracle B2B, SOA Infrastructure, and Oracle BPM Suite schemas have no foreign key constraints to enforce integrity.” (http://docs.oracle.com/cd/E23943_01/admin.1111/e10226/soaadmin_partition.htm#CJHCJIJI)

“For database independence, applications typically do not store the primary key-foreign key relationships in the database itself; rather, the relationships are enforced in the application.” (http://docs.oracle.com/cd/E25178_01/fusionapps.1111/e14496/securing.htm#CHDDGFHH)

“The ETL process commonly verifies that certain constraints are true. For example, it can validate all of the foreign keys in the data coming into the fact table. This means that you can trust it to provide clean data, instead of implementing constraints in the data warehouse.” (http://docs.oracle.com/cd/E24693_01/server.11203/e16579/constra.htm#i1006300)

Most importantly, no DBMS that aspires to the relational moniker currently implements the SQL-92 “CREATE ASSERTION” feature that is necessary in order to provide the consistency guarantee. For a detailed analysis of this anomaly, refer to Toon Koppelaars’s article “CREATE ASSERTION: The Impossible Dream” in the August 2013 issue of the NoCOUG Journal. (http://www.nocoug.org/Journal/NoCOUG_Journal_201308.pdf#page=13)
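[For readers who haven't met the feature, a sketch of what's at stake, using a hypothetical accounts table and Python's sqlite3 (names and constraint are mine, purely illustrative): the one-line declarative ASSERTION that no mainstream engine accepts, versus the trigger workaround.]

```python
import sqlite3

# What SQL-92 would let you declare once, against the whole database:
#   CREATE ASSERTION positive_total CHECK
#     ((SELECT COALESCE(SUM(balance), 0) FROM accounts) >= 0);
# Since no mainstream engine implements ASSERTION, a per-table,
# per-statement trigger is the usual (leakier) workaround:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
con.executescript("""
CREATE TRIGGER positive_total
AFTER INSERT ON accounts
WHEN (SELECT SUM(balance) FROM accounts) < 0
BEGIN
    SELECT RAISE(ABORT, 'total balance would go negative');
END;
""")

con.execute("INSERT INTO accounts VALUES (1, 100)")
try:
    con.execute("INSERT INTO accounts VALUES (2, -500)")
except sqlite3.IntegrityError as e:
    print(e)  # the trigger fires where the ASSERTION would have
```

[A faithful emulation also needs matching UPDATE and DELETE triggers, and one set per table the assertion mentions, which is where the workaround starts to sprawl.]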

Kindest regards,

Robert Young said...

The tradeoff betwixt the two is important, sure. But the calculus is different. Xeon/SSD/hundreds-of-gig-of-RAM *nix machines are totally different beasts from a 1970-era 370/DASD machine. There just isn't any comparison. Codd was very specific that implementation is separate from the RM, and transactions are the heart of the RM, whether Codd chose to emphasize that or not. Without immediately applied DRI, the RM doesn't really offer much: some gain through a smaller byte footprint, but little else.

Moreover, all the industrial strength closed-source RDBMS vendors have been building/acquiring in-memory variants. The only logical reason for such databases is easier implementation of the RM.

Without degrading into a MVCC vs. Locking engine debate, most of what you quote is Oracle specific (and generally true of MVCC engines, Postgres in particular, although its mechanics are rather different from Oracle's). As it happens, I'm not enamoured of the notion of MVCC superiority.

The Agile meme is "fail fast", while the MVCC meme is "fail later". Lockers fail fast. Failure requires intervention, and the sooner the better. Moreover, MVCC engines are notorious for eating machines whole, in one gulp. Not my cup of tea.

This quote:
“For database independence, applications typically do not store the primary key-foreign key relationships in the database itself; rather, the relationships are enforced in the application.”

is mind-boggling. We relationalists have spent 40 years arguing and educating against just such apostasy. The RM is a very early implementation of OO thinking: data and method encapsulated together. With these modern machines, actually implementing the RM is much easier. Access to SSD storage (DRAM or NAND) in random fashion is little different from sequential access, and both are about an order of magnitude faster than on HDD.

Iggy Fernandez said...

eBay's platform is as large as Amazon's, if not larger. Like Amazon, eBay uses schema segmentation and sharding but, unlike Amazon, it uses a conventional RDBMS. eBay has a thousand different logical database instances spread over four hundred servers. They too find distributed transactions too expensive. This gives rise to the occasional inconsistency, which has to be resolved later. I believe that Dr. Codd would have approved of this "eventual consistency" approach. eBay's approach is detailed at http://www.infoq.com/interviews/shoup-ebay-architecture and http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf.

Fabian Pascal said...

There is an unwillingness to say anything negative about whatever the resident hot fad is--that is commercially/professionally risky.

The best indicator of this is "the right tool/method for the job", which implies that there is no objective way to assess tools and practices. Rather, if enough people use it, it must be correct and cost-effective.

In turn this of course implies that there is no scientific foundation, it's all a matter of preferences and trial and error.

But the patina of science/great man (but NOT theory!!!!) never hurts. Isn't it nice to know that Codd approved of dumping consistency?

Robert Young said...

I've just spent some quality time with Codd's 1981/2 ACM paper. In that later paper, written a dozen years or so after the first, he spends considerable time explaining that relational databases (optimizers specifically) are at least as efficient as hand-coded application code.

Then jump ahead to his "12 Rules" articles, where rules 7 through 11 amount to saying: "C happens in real time".

Again, with today's hardware, the only reason to keep flat/hierarchical data around is being too cheap and/or lazy to replace it. Certainly, there's no reason to build new applications (ones which admit to relational structure in the real world; books and videos, not so much) anti-relationally.

ForceCrate said...

I have argued before that the people doing this don't care about the data. It is just noise that comes over wires. If they lose it, it is no big deal. They will make wild claims about how fast they can access the data and what transformations they can perform to manipulate it, but at the end of the day they know that they are storing crap that won't be useful in months. As long as your problem conforms to this solution, you can use NoSQL with a clean heart. (Just don't lie and tell the user they will have real data.)