29 June 2009

Oracle does Data

SeekingAlpha is the most fact-based of the myriad stock "analysis" sites available. Since I spend a good deal of my time these days getting rich day-trading (yeah, sure), I have fallen under its spell.

Today yields a post about Oracle and its "new" data modeler. I haven't tried it out yet (I'm still primarily interested in DB2, sigh), but from the write-up, this sounds like another nail in the de-normalization coffin.

Tick. Tick. Tick.

22 June 2009

Zealous. Conviction.

Some might consider a blog dedicated to the relational model, relational databases, and the extinction of xml as a datastore rather quixotic. Zealous, even. This month marks the 39th anniversary of Dr. Codd's public paper; the 40th of the first paper, internal at IBM, falls in August. All along, Dr. Codd, and most prominently of those who followed, Chris Date, asserted that the relational model and the database were logical constructs. In particular, physical implementation was a vendor detail; vendors were free to use any hardware and coding they wished to support the relational database.

Early on, the implementation of the join was seen as a stumbling block. Hardware based on rotating storage had to be supplemented with ever more ingenious buffering to make joins less costly. Today, industrial-strength RDBMSs such as Oracle and DB2, sitting on sophisticated RAID disk subsystems, can support high normal form relational databases. But they are expensive not only to acquire but also to maintain. They require lots of vendor-specific knowledge, since the file structures and buffer structures are defined neither in the relational model nor in SQL (which is not strictly even a part of the relational model, but few understand that).

I, and others, made the connection a few years ago: free the relational database from rotating storage, and any logical database structure, in particular one in high normal form, becomes hardware neutral on a machine built from multiple cores/processors and solid state disk storage; even with current products, Nirvana had been reached.

Here in rural Connecticut, as in other locations I will admit, the mainframe/COBOL/VSAM mindset (hidebound as it is) still holds sway. Yet I continue to preach. Some say, zealously. Though I haven't yet delved into the xml morass, I have acquired one knucklehead correspondent. Not "Database Debunking" territory, but there is always hope. I gather that some nerve has been pinched. That is a good thing. The TCO equation is becoming ever more difficult to ignore. The ever more widespread development of SSDs, and more importantly of NAND controllers, makes it all inevitable. Any vendor with access to NAND and an OEM controller will be able to produce these drives.

Which brings me to the zealous. Not I, believe it or not. I first became aware of SSD with Texas Memory Systems. But they dealt, at that time, in hundreds-of-kilobucks DRAM-based hardware. It was only in the last few months, with the Intel X25 parts, that I became aware of flash SSD; it turns out that flash SSD had been around for some time, just not very good until the Intel drives. From there I found STEC, which, by all accounts, is the numero uno enterprise vendor. They also are not shy about making their case. This past week, the principals were interviewed by a trade organ, and the CEO had this to say:

"To say that they [new and existing vendors] are competing with STEC is really a misunderstanding. We don't have a direct competitor today. We've got the five customers worldwide that we went after. Basically, we have all of our target customers."


The COO, his brother:

"We see in the next three or four years flash drives from STEC and others wiping out the whole hard drive industry for high-end storage. The biggest guys in the industry are forced to follow in our footsteps instead of us following them."


Of course, they would say good things about their business. The thing is, the industry analysts agree. The fact that there are others in the industry *trying* to take the business is far more significant. If no other storage manufacturer cared, and if there were no startups and micro-cap privates working on flash SSD drives, flash SSD controllers, and flash NAND replacements, then SSD, it could be argued, would just be the niche product it has been for decades, with Texas Memory having it largely to itself. Texas Memory now ships flash SSD, in addition to its DRAM machines.

Repeat: "wiping out the whole hard drive industry for high-end storage". But it looks to me that, along with STEC dominating the high end, Intel et al. will wipe out HDD in consumer computers. Not that this is directly relevant to the RDBMS world in which I live. I left dBase II behind some time ago.

Tick. Tick. Tick.

19 June 2009

Real Time. No Bill Maher, But I'll be Funny. I promise.

The point of this endeavor is to find, and at times create, reasons to embrace not only the solid state disk multi-processor database machine, but also the relational database as envisioned by Dr. Codd. The synergy between the machine and the data model, to me anyway, is so obvious that the need to have this blog is occasionally disappointing. Imagine having to publicize and promote the Pythagorean Theorem. But it keeps me off the streets.

Whilst sipping beer and watching the Yankees and Red Sox (different games on different TVs) at a local tavern, I noticed for the umpteenth time that the staff were using Aloha order entry terminals. Aloha has been around for years, and I've seen it in many establishments. The sight dredged up a memory from years ago. I had spent some time attempting to be the next Ernst Haas, but returned to database systems when it didn't work out. I was working as MIS director for a local contractor, and convinced them that it might be a good idea to replace their TI-990 minicomputer applications with something a tad more up to date. They took me up on the idea, so I had to find replacements for their applications. Eventually, we settled on two applications, both written to the Progress database/4GL. They're still using them.

Progress was and is relational, but not especially SQL-oriented. While talking with the developers of one of the applications (both were general-ledger-driven verticals), the conversation turned to some of the configuration switches available. The ledger posting subsystems each had a switch for real-time versus batch updating. The recommendation was to batch everything except order entry; inventory needed to be up to date, but A/R, purchasing, and the like would put too much strain on the machine. And don't even think about doing payroll updates in real time.

Now, the schema for this application printed out on a stack of 11 by 14 greenbar about a foot thick. There were a lot of tables with a slew of columns. Looking back, not very relational. Looking ahead, do we need batch processing any longer?

The reason for doing batch processing goes back to the first computers, literally. Even after the transition from tape to disk, most code was still sequential (and my recent experience with a Fortune 100 dinosaur confirms this remains true) and left as is. Tape files on disk.

But now the SSD/multi machine makes it not only feasible but preferable to run such code with the switch set to real time. No more batch processing. No more worrying about the overnight "batch window". The amount of updating to tables is, at worst, exactly the same and, at best, less: less when the table structure is normalized, because there is less data to be modified. Each update takes a few microseconds, since the delay of joins against disk-based tables is removed. The I/O load was the reason to avoid real-time updates in databased applications; we're not talking about rocket science computations, just moving data about in the tables and perhaps an addition here and there.
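To make the point concrete, here is a minimal sketch, with hypothetical table and column names, of what "the switch set to real time" amounts to: the order entry transaction posts inventory and the ledger as it happens, rather than queueing the work for an overnight run. Transaction syntax varies a bit by engine.

-- Post an order line and its downstream effects in one transaction,
-- instead of deferring A/R and G/L work to a nightly batch job.
-- (Hypothetical names; generic SQL.)
START TRANSACTION;

INSERT INTO order_line (order_id, item_id, qty, unit_price)
VALUES (1001, 42, 3, 19.95);

UPDATE inventory
SET    qty_on_hand = qty_on_hand - 3
WHERE  item_id = 42;

INSERT INTO gl_posting (account_no, order_id, amount, posted_at)
VALUES ('4000-SALES', 1001, 3 * 19.95, CURRENT_TIMESTAMP);

COMMIT;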

New rule: we don't need no stinking batches.

17 June 2009

And the Beat Goes On

Some more industry news. Compellent Technologies is a smallish (relative to EMC) storage subsystem supplier. They qualified the STEC SSD drives some time ago; among the earliest to do so.

They have confabs, as do most vendors. They put out a press release for their recent meeting. Compellent doesn't, so far as I can find out, care whether they ship HDD or SSD subsystems. But they did run a survey during the fest, and found that 91% of their business partners (78% of customers; i.e., the sheep need to be led) checked the boxes for "I really need to do SSD" and the like. The train keeps gathering speed. Now, we just need the CIO types to realize that SSD systems have much more innovative/disruptive implications for application design.

Tick. Tick. Tick.

Here is the PR.

13 June 2009

Let's Go Dutch

In keeping with the theme of this endeavor, every now and then I return to the basics; which is to say, the relational model. Since I'm neither Date nor CELKO, I seek out published authors who have figured it out (agree with me, he he). Today, a Dutch Treat: Applied Mathematics for Database Professionals. The book discusses a formalized model of data, first developed by the authors' mentor, Bert de Brock, in 1995. As with much that isn't grounded in Flavor of the Month, the ideas took some time to develop, and they remain relevant. Date and Darwen provide a foreword, rather tepid for some reason. I don't quite understand why they aren't explicitly supportive, but there you are.

What de Haan and Koppelaars talk about won't (unfortunately) likely get you that next job doing SQL Server for Wendy's. On the other hand, it will clarify why and how high normal form data structures do what this endeavor seeks: define the most parsimonious data structure, one which is self-regulating and self-defending. As I have been saying, the SSD/multi machine makes this fully doable in the face of the "joins are too slow" rabble. Following the encapsulation discussion earlier, an existing bloated flat file database can be refactored to proper normal form, while existing code (COBOL, C++, java, etc.) reads from views which replicate the old table structures and writes through stored procedures. No sane code writes directly anyway (modulo 1970-era COBOL, alas).
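A minimal sketch of that refactoring pattern, hypothetical names throughout: the legacy flat record survives as a view over the normalized tables, and writes go through a procedure which knows the real structure. The syntax here is generic; each engine spells the procedural bits a little differently.

-- The old flat ORDERS record, re-created as a view over normalized tables,
-- so legacy read paths keep working unchanged.
CREATE VIEW orders_flat AS
SELECT o.order_id,
       c.customer_name,
       c.customer_addr,
       o.order_date,
       o.status
FROM   orders o
JOIN   customer c ON c.customer_id = o.customer_id;

-- Writes are encapsulated in a stored procedure against the base tables.
CREATE PROCEDURE add_order (IN p_customer_id INT,
                            IN p_order_date  DATE)
BEGIN
  INSERT INTO orders (customer_id, order_date, status)
  VALUES (p_customer_id, p_order_date, 'OPEN');
END;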

I won't attempt to rehash the text, but I will give an overview (the link will take you to Amazon, as the eager amongst you have already found). The first part, four chapters, deals with the math: logic, set theory, and functions. This is a rather complete treatment, which is welcome. The second part, five chapters, is titled Application, and is the meat of the effort. Here the authors build both the vocabulary of their modeling approach and the model itself. It is expressed in the language of the math they set out in part 1, not in SQL or the DDL of some particular engine. (That is dealt with in the third part, a single chapter at the end, in Oracle syntax.) I found it quite alien on first reading. The book does demand re-reading, but rewards one with a very clear understanding of what it is possible to do in a data modeling vocabulary.

The model is based on the idea of a database state which is initially correct, and which will only be modified into another state which is also correct. The definition of correct is closed world, and the transition process is database centric, not table centric. Years and years ago, I worked with the Progress database/4GL, which was not a SQL implementation by intent (its 4GL was the basis of programming 99.44% of the time), although it did support SQL. I talked with one of its principal developers, The Wizard, at a conference, who observed that a database, if it really is such, is whole. It is not a collection of files/tables; if you can replace any one table at will, what you have is not a database. It just isn't. That was one of the many epiphanies in life. de Haan and Koppelaars take this approach with no remorse. The object of interest is a database state. Refreshing.

The discussion of constraints is more explicit than usual. They describe constraints at the tuple, table, and database level. With a high normal form view of data, there will be more tuple constraints than is common in legacy databases.
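For the terminology, a hedged illustration on hypothetical tables: tuple constraints are within-row predicates, table constraints span the rows of one table, and database constraints span tables. Standard SQL defines CREATE ASSERTION for the last, but the mainstream engines don't implement it, so in practice it falls to triggers or stored procedures.

-- Tuple level: a predicate over the columns of a single row.
ALTER TABLE order_line
  ADD CONSTRAINT chk_line CHECK (qty > 0 AND unit_price >= 0);

-- Table level: a predicate over the set of rows in one table.
ALTER TABLE order_line
  ADD CONSTRAINT uq_order_item UNIQUE (order_id, item_id);

-- Database level: a predicate spanning tables. CREATE ASSERTION exists in the
-- standard but not in the major engines, hence triggers/procedures instead.
-- CREATE ASSERTION no_lines_on_closed_orders CHECK (NOT EXISTS
--   (SELECT * FROM order_line l JOIN orders o ON o.order_id = l.order_id
--    WHERE o.status = 'CLOSED'));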

The last chapter presents the method using Oracle syntax. The most interesting aspect of the chapter is the evolution of what the authors refer to as the Execution Model. There are six of them, of increasing correctness, implemented both declaratively and with triggers. The trigger code is Oracle specific, in that other databases, SQL Server for example, define change-of-state coverage differently; some databases require separate triggers for each of insert, update, and delete, while others support multiple actions per trigger. And Oracle users have to deal with the dreaded mutating table error; DB2 users do not, since DB2 provides support for before and after images of rows. But Oracle remains the mindshare leader.
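To show the flavor of the difference, here is a rough sketch, in approximately DB2 syntax and with hypothetical names, of a database-level rule enforced by a statement trigger. The REFERENCING clause exposes the affected rows as a transition table, which is what lets the trigger read the table it fires on without tripping over Oracle's mutating-table restriction.

CREATE TRIGGER trg_no_lines_on_closed_orders
AFTER INSERT ON order_line
REFERENCING NEW TABLE AS inserted_rows
FOR EACH STATEMENT
BEGIN ATOMIC
  -- Reject the statement if any new line belongs to a closed order.
  IF EXISTS (SELECT 1
             FROM   inserted_rows i
             JOIN   orders o ON o.order_id = i.order_id
             WHERE  o.status = 'CLOSED') THEN
    SIGNAL SQLSTATE '75001'
      SET MESSAGE_TEXT = 'cannot add lines to a closed order';
  END IF;
END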

So, 300 pages which will, unless you have an undergraduate degree (or nearly) in math, stretch your understanding of how much math there really is in the Relational Model, and how that can be leveraged into far stronger database specifications. Well worth the effort.

07 June 2009

But, I can see Russia from Alaska

As Sarah said, you can see Russia from Alaska. With regard to SSD and relational databases, a view is going to motivate another major change to how we build database applications.

During this down time, I have been keeping track of the SSD producer STEC (both its name and its stock symbol; it started out as Simple Technology), and they continue to announce agreements with large systems sellers, HP being the newest. A notion has begun to bubble in my brain. Said notion runs along these lines. The major sellers of hardware have gotten the SSD religion. The absolute capacity of SSD will not, in any reasonable timeframe, catch up with HDD. This puts the hardware folk in the following situation: they have these machines which can process data at unheard-of speeds, but not the massive amounts of data that are the apple of the flat file KiddieKoders' eye. What to do, what to do?

The fact that STEC and Intel are making SLC drives available, with STEC targeting big server and near-mainframe machines, and the machine makers taking them up on the deal, means that there really is a paradigm shift in progress. Aside: Fusion-io is targeting server-quality drives on PCIe cards. That puzzled me for a bit, but the answer seems clear now. Such a drive is aimed squarely at Google, et al. The reason: the Google approach is one network, many machines. A disk farm is not their answer. Fusion-io can sell a ton of SSD to them. Alas, Fusion-io is private, so no money for the retail investor. Sigh.

The paradigm shift is motivated by the TCO equation as much as by blazing speed. SSD draws a fraction of the power of HDD, generates a fraction of the heat, and thereby demands a fraction of the cooling. So, total power required per gigabyte is through the floor compared to HDD. There is also far less need for RAID-10, since the SSD takes care of data access speed in and of itself. So, one can replace dozens of HDDs with a single SSD, which then sips just a bit of juice from the wall plug. The savings, for large data centres, could be so large that, rather than waiting to build "green fields" projects on SSD/multi machines as they arise in normal development, existing machines should simply be scrapped and replaced. Ah, bliss.
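To put rough, purely illustrative numbers on it: a 15K RPM enterprise spindle draws something like 15 watts, an SSD a handful; swap a couple of dozen short-stroked spindles for one or two SSDs and the array's draw drops by an order of magnitude, before counting the cooling that no longer has to be bought and powered.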

But now for the really significant paradigm shift. My suspicion is that Larry Ellison, not Armonk, has already figured this out. You read it here first, folks. There was a reason he wanted Sun; they have been doing business with STEC for a while.

The premise: SSD/multi machines excel at running high NF databases. The SSD speeds up flat file applications some, but that is not the Big Win. The Big Win is to excise terabytes from the data model. That's where the money is to be made. The SSD/multi machine with RDBMS is the answer. But to get there, fully, requires a paradigm shift in database software. The first database vendor to get there wins The Whole Market. You read that here first, too, folks.

There have been two historical knocks on 4/5NF data models. The first is that joins are too slow. SSD, by itself, slays that; with the SSD/multi machine, joins are trivial and fast. These models also allow the excision of terabytes of data. The second is that vendors have, as yet, not been able (or, more likely, not willing) to implement Codd's 6th rule on view updating. Basically, only views defined as a projection of a single table are updatable in current products. This is material in the following way: in order to reap the maximum benefit of the SSD/multi machine database application, one needs the ability to both read and write the data in its logical structure, which includes the joins. In the near term, stored procedures suffice; send the joined row to the SP, which figures out the pieces and the logical order in which to update the base tables.
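A hedged sketch of that near-term workaround, hypothetical names and generic procedural syntax: the application hands the procedure the columns of one logical (joined) row, and the procedure decomposes it and updates the base tables in the correct order.

CREATE PROCEDURE save_customer_order (IN p_order_id      INT,
                                      IN p_customer_id   INT,
                                      IN p_customer_name VARCHAR(80),
                                      IN p_status        VARCHAR(10))
BEGIN
  -- Parent first...
  UPDATE customer
  SET    customer_name = p_customer_name
  WHERE  customer_id = p_customer_id;

  -- ...then the child.
  UPDATE orders
  SET    status = p_status
  WHERE  order_id = p_order_id;
END;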

But what happens when a database engine can update joined views? Then it's all just SQL from the application code's point of view. Less code, fewer bugs, faster execution. What's not to love? The pressure on database vendors to implement true view updating will increase as developers absorb the importance of the SSD/multi machine, and as managers absorb the magnitude of the cost savings available from using such machines to their maximum facility.
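There is already a halfway house in some engines: Oracle and SQL Server will let you hang an INSTEAD OF trigger on a view, so the application issues a plain UPDATE against the joined view and the trigger does the decomposition. A roughly Oracle-flavored sketch, hypothetical names, assuming a view customer_order_v joining customer and orders:

CREATE OR REPLACE TRIGGER customer_order_v_upd
INSTEAD OF UPDATE ON customer_order_v
FOR EACH ROW
BEGIN
  -- Decompose the logical row back into its base tables.
  UPDATE customer
  SET    customer_name = :NEW.customer_name
  WHERE  customer_id = :NEW.customer_id;

  UPDATE orders
  SET    status = :NEW.status
  WHERE  order_id = :NEW.order_id;
END;

The application side is then just UPDATE customer_order_v SET status = 'SHIPPED' WHERE order_id = 1001; which is the point, though it is still hand-rolled plumbing rather than the engine honoring rule 6 on its own.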

Oracle will be the first to get there, just because IBM seems not to care about the relational aspects of its databases. The mainframe (z/OS) version is just a COBOL file handler, and the LUW version is drinking at the xml trough. Microsoft doesn't play with the big boys, despite protestations to the contrary. With the Sun acquisition, Oracle has the vertical stack to displace historical IBM mainframes. Oracle has never run well on IBM mainframe OSs. Oracle now has the opportunity to poach applications off z/OS at will. It will be interesting.