29 March 2009

The Next Revolution and Alternate Storage Propositions

I've spent the last few days reading Chris Date's latest book, "SQL and Relational Theory". One buys books as much to provide support to the author, kind of like alms, as to acquire the facts, thoughts, and opinions therein. Kind of like buying Monkees albums; one doesn't really expect to hear anything new. I may post a discussion of the text, particularly if I find information not in previous books.

What this post is about is the TransRelational Model [TRM], which this latest Date book resurrects; column stores such as Stonebraker's Vertica; and the impact of the Next Revolution on them. As always, this is a thought experiment, not a report on a Proof of Concept or pilot project for either. Maybe someday.

In Date's eighth edition of "Introduction...", there is the (in)famous Appendix A, wherein he explicates why Tarin's patented Tarin Transform Method, when applied to relational databases, will be "the most significant development in this field since Codd gave us the relational model, nearly 35 years ago", without referencing an implementation. In particular, that "the time it takes to join 20 relations is only twice the time to join 10 (loosely speaking)." When published in 2004, Appendix A led to a bit of a kerfuffle over whether, given the reality of discs, slicing and dicing rows could logically lead to the claimed improvements. I found a paper which says it is the first implementation of TRM. The paper is for sale from Springer, for those who may be interested; you will need to buy it to see what they found.

At the end of "SQL and Relational Theory", in the About the Author section, is a list of some of Date's books, among them "Go Faster! The TransRelational Approach to DBMS Implementation", which "is due for publication in the near future." The same book is "To appear" in Appendix A of the eighth edition. And I had thought it had gone away. The URL provided for Required Technologies, Inc. is now the home of an ultrasound firm.

The column database has been around for a while; Vertica is Michael Stonebraker's version. There is also a blog, The Database Column, which discusses column stores; it makes for some interesting reading. Two of the listed posters are from Vertica.

My interest is this: given the Next Revolution, does either a TRM or a column-store database have a purpose? Or any 'new and improved' physical storage proposition, for that matter? My conclusion is, on the whole, no. The column store, when used to support existing petabyte OLAP systems, may be worth the grief; but for transactional systems, at which the TRM is aiming and from which column stores would extract, not so much. The claim in the eighth edition is that TRM datastores scale linearly with the number of tables referenced in a JOIN, but my thought is that the SSD table/row RDBMS cares not about the number of tables referenced in the JOIN, since access time is independent of access path. In such a scenario, a larger number of tables in the JOIN (assuming that the number of tables is determined by the degree of decomposition) should lead to faster access, since there is less data to be retrieved. As I said in part 2, there is a cost in cycles for the engine to synthesize the rows; the actual timing differences will be determined by real data.

In all, however, it seems to me that plain vanilla table/row 5NF RDBMS on SSD multi-processor machines will have better performance than either TRM or column store on any type of machine. Were I a TRM or column-store vendor, inexpensive SSD multi-processor servers would be making my sphincter uncomfortable.
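
To make the thought concrete, here is a minimal sketch, in SQL, of the sort of multi-table join a decomposed schema requires; the table and column names are mine, invented for illustration. On SSD, each indexed lookup is just another cheap random read, so each additional table in the JOIN costs little beyond the cycles to stitch the rows back together.

    -- Hypothetical 5NF-ish order schema; names invented for illustration.
    CREATE TABLE customer   (customer_id INT PRIMARY KEY, name  VARCHAR(50));
    CREATE TABLE product    (product_id  INT PRIMARY KEY, descr VARCHAR(50));
    CREATE TABLE orders     (order_id    INT PRIMARY KEY,
                             customer_id INT NOT NULL REFERENCES customer (customer_id));
    CREATE TABLE order_line (order_id    INT NOT NULL REFERENCES orders (order_id),
                             product_id  INT NOT NULL REFERENCES product (product_id),
                             qty         INT NOT NULL,
                             PRIMARY KEY (order_id, product_id));

    -- The engine synthesizes the "wide" row at read time; every table touched
    -- is one more indexed lookup, which on SSD carries no seek penalty.
    SELECT c.name, o.order_id, p.descr, l.qty
    FROM   orders     o
    JOIN   customer   c ON c.customer_id = o.customer_id
    JOIN   order_line l ON l.order_id    = o.order_id
    JOIN   product    p ON p.product_id  = l.product_id
    WHERE  o.order_id = 42;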

The sine qua non of RDBMS performance, in any implementation, is the access path on storage. The fastest are in-memory databases, such as solidDB, now from IBM. For production databases at normal organizations, mainstream storage for mainstream databases will be where the action is. Both TRM and column datastores, so far as either has 'fessed up, are an attempt to gain superior performance from standard disc storage machines. Remove that assumption, and there may not be any there, there. Gertrude Stein again. Kind of like making the finest buggy whip in 1920.

Current mainstream databases can be run against heavily cached disc storage, with buffering in both the engine and the storage subsystem. The cost of such systems will approach that of dedicated RAM-implemented SSD storage, since the hardware and firmware required to ensure data integrity are the same. As was discovered by the late 1990s, one level of buffering, controlled by the engine, is the most efficient and secure way to design physical storage.

And for what it's worth, back in the 1970s, before the RDBMS came into existence, there was the "fully inverted file" approach to 'databases'. In essence, one indexed the data in a file on each 'field', and turned all random requests into sequential requests. This appears to be the kernel behind the TRM and column-store approaches. Not new; but if one buys Jim Gray's assertion that density increases will continue to outpace seek/latency improvements, then it makes some sense for rust-based storage. The overwhelming tsunami of data which results may be a problem. If we view a world where storage is on SSD rather than rust, then, as Torvalds says, the nature of file systems changes. These changes have a material impact on RDBMS implementations.
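
In today's SQL terms, the flavor of the fully inverted file is roughly this sketch; the table and index names are invented for illustration, and a real inverted-file system did considerably more. Index every field, and let most requests be satisfied from ordered index entries rather than from rummaging through the rows.

    -- A sketch of "fully inverted" in SQL terms: one index per field.
    CREATE TABLE survey (respondent_id INT PRIMARY KEY,
                         age           INT,
                         region        CHAR(2),
                         income        INT);

    CREATE INDEX ix_survey_age    ON survey (age);
    CREATE INDEX ix_survey_region ON survey (region);
    CREATE INDEX ix_survey_income ON survey (income);

    -- With every field inverted, a query like this can be resolved largely
    -- from the indexes: random probes against rows become ordered walks
    -- through index entries.
    SELECT COUNT(*)
    FROM   survey
    WHERE  region = 'NE' AND age BETWEEN 30 AND 39;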

26 March 2009

The Next Revolution, Part Two

Part 2 was to be a study of my building and testing 5NF versus 0NF. To do that, I needed an Intel X25-E (it is the Real Deal), but at $800 it's not something I would buy casually; soon, just not this week, or for a couple of weeks. Then AnandTech came to my rescue. They published some testing last fall of a group of SSDs, including the X25-M, which got my brain grinding on this whole ball of tar. They also published some technical discussions of how SSDs work, and under certain circumstances don't. For those interested in industrial-strength SSD, I first got to know about it through Texas Memory Systems; they've been around since 1978. One of those I couldn't afford, not this week or next.

I check every now and again, and yesterday found that AnandTech had just published a new X25-E specific study. WooHoo!

Here is the link

The page that is of interest to me, since it validates my premise, is Page 11: Testing in the Real World. The money quote: "One X25-E is 66% faster than eight (!) 15000RPM SAS drives."

What they didn't do, and I'll have to get around to it soon (it would be easier if I had access to a full-fledged, production-sized development system with lots of flat-file data; is that too broad a hint??), is demonstrate how much data can be tossed out of such 0NF applications.

Tossing out redundant data is only part of the gain from SSD multi-processor machines. (As an aside, there is a new blog, "The Grey Lens Man", telling the ongoing tale of tossing out a COBOL/REXX application on AS/400 (now iSeries, last time I looked) in favor of Scala on something else, I guess. Some companies will retire old code. There is hope.) The other part of the improvement lies in housing the data constraints with the data: declared referential integrity, triggers, and stored procedures (not everybody's cup of tea). Doing so reduces the amount of coder-created code; I make the distinction that database constraint writing is done by database developers, not coders. The distinction is based on my experience that coders seek to write as much text as possible, since there is a tacit understanding twixt them and Pointy Haired Bosses that SLOC is _the_ measure of productivity; more text equals a larger paycheck. Database developers, on the other hand, just want the damn thing to work with as little typing as possible. This motivation is what accounts for the growing number of catalog-based CRUD generators appearing like croci in the spring. I expect I'll have more to say about them in due time.
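
For the flavor of it, here is a minimal sketch, with invented table and column names, of housing the constraints with the data rather than re-implementing them in application code:

    -- Declared constraints do the policing; no coder-written validation
    -- routines, and no SLOC to inflate. Names are invented for illustration.
    CREATE TABLE account  (account_id  INT PRIMARY KEY,
                           balance     DECIMAL(12,2) NOT NULL
                                       CHECK (balance >= 0));

    CREATE TABLE transfer (transfer_id INT PRIMARY KEY,
                           from_acct   INT NOT NULL REFERENCES account (account_id),
                           to_acct     INT NOT NULL REFERENCES account (account_id),
                           amount      DECIMAL(12,2) NOT NULL CHECK (amount > 0),
                           CHECK (from_acct <> to_acct));

Every application that touches these tables gets the same rules enforced, whether it asks for them or not; that is the point of putting them with the data.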

Lest I sound too much like a star-struck lover on Madeira, there is a cost, albeit a small one I believe, to 5NF data. In order to return the 0NF data to the application, the engine has to synthesize the rows, and that will take some cycles. So, if we need 1,000 rows from the 0NF table Foo, we could get them as 10 rows from table Able and 100 rows from table Baker, where Able and Baker are a full decomposition of Foo. It is the full decomposition of tables that makes SSD databases useful. It is unlikely in our lifetimes that terabyte xml dumps could be stored economically on SSD. Which is fine with me. The point here is to make better transactional systems. Whatever the Web 2.0 folk propagandize, the money data is in transactional systems.
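
The arithmetic, sketched in SQL with the names from the paragraph above (the columns are invented, and in this toy case Foo is simply the cross join of its two parts):

    -- Able and Baker as a hypothetical full decomposition of Foo.
    -- 10 rows x 100 rows synthesize 1,000 rows of Foo at read time.
    CREATE TABLE Able  (able_id  INT PRIMARY KEY, able_stuff  VARCHAR(20));
    CREATE TABLE Baker (baker_id INT PRIMARY KEY, baker_stuff VARCHAR(20));

    CREATE VIEW Foo AS
    SELECT a.able_id, a.able_stuff, b.baker_id, b.baker_stuff
    FROM   Able a CROSS JOIN Baker b;

    -- The application still sees the 0NF shape it expects...
    SELECT * FROM Foo;
    -- ...but storage holds 110 rows, not 1,000, and the engine spends
    -- cycles, not disc seeks, putting the wide rows back together.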

Ride the wave. Cowabunga, dude.

Elvis has left the building, and Fabian was never here

It occurs to me that, since I am firmly in the Date/Darwen/McGoveran/Pascal camp, there could be a suspicion that this site is a resurrection of "Database Debunkings", and/or that I am an amanuensis or pseudonym for some or all of the above.

Nope. I am a real flesh-and-blood bloke with the name I was slapped with (me bum still hurts) over the fount. I have arrived at my conclusions quite independently, and disagree at times with some or all of the above. I don't, for example, believe that we should drop ACID. And I do find some of CELKO's writings to be quite lucid. I have been doing databases since before there were any widely available relational engines, about the time Dr. Codd released his first paper. Such databases were not relational; they were used to support statistical and operations research, and might now be considered just flat-file systems with a bit of indexing. A decade later, dBase II and Clipper purveyed such and called themselves database systems. Oh yeah, MySql did that, too.

What got this writing exercise started was having the time to delve more fully into the implications of some new-ish hardware; the hope that enough folk could see past the nonsense of the God xml and see how we can build truly relational transactional systems with this new hardware; and the realization that both the writing exercise and the system building could all be done for next to no $$$. Such a deal. Which the cynics among us will point out is exactly what blogging on the esoterica of databases is worth; even so, the value of the new systems is likely rather higher.

I also find amusement in the machinations of suppliers of database software and hardware, and of their bureaucratic clients. Expect some musings on such subjects, such as the IBM/MySql speculations already posted. Forewarned is forearmed.

25 March 2009

DB2 for COBOL-z/OS

There is much ado about the Sun/IBM marriage these days, and I have made my contributions elsewhere, so it is time to do so here.

The issue that concerns me is MySql. On the surface, Sun spending $1 billion for it made no sense at the time, and now we have the Drizzle-ing off of what makes an RDBMS worth having:

Aker has already selected particular functionality for removal: modes, views, triggers, prepared statements, stored procedures, query cache, data-conversion inserts, access control lists and some data types. (Chris Duckett on zdnet/24 jul 2008)

That report says InnoDB is/will be included. Hmmm.

What I see happening, if this goes through, is that IBM will roll up the MySql name and ship DB2 for COBOL-z/OS. I don't see IBM shipping an Oracle bit of kit (InnoDB) with "its" database. At the same time, conversions of COBOL/VSAM applications (I have been involved in a few) are typically done for the purpose of making the new application _look_ like a database system, but all of the ACID remains in the COBOL (no reason to get rid of that, it works. riiiiiiiiiiiiiiight). Classic MySql (up through about 4.0) was just a SQL parser in front of the file system.

This is really all that the COBOL coders want, anyway. They don't understand ACID or Normal Forms. For marketing purposes, the Pointy Haired Bosses need to say that their 40-year-old code is really Brand New and Updated to Modern Technology (SQL-92 is modern by this definition), so off to DB2 it goes. But no one wants to think relationally. Real DB2 for z/OS does include the proper kit to build a proper application, but I have never seen any of it used.

MySql gives them the perfect (or perverse, depending) answer: DB2 for COBOL-z/OS. Coming to a shop near you. Real soon now.

18 March 2009

The Next Revolution, Part One

It all started with a transcript of an interview with Linus Torvalds, in which he remarked that solid state disks would change the way file systems worked once they were relieved of the delays inherent in rotating storage. Well, if file systems change, why not databases? Which got me to musing. I do that from time to time.

A bit later, I ran across a thread on comp.databases.theory and posited this notion. CELKO chimed in that his then to-be-published next book ("Thinking in Sets", and worth having) had a section discussing solid state disks.

So, two Well Known People and solid state disks. There just might be something there, there.

Here's the notion. Databases started out with what became the Network model about 1960 (yes, that long ago). IBM wasn't thrilled (they hadn't worked out the Network model), so they invented the Hierarchical model (as IMS). Dr. Codd was at IBM and wasn't thrilled with IMS, so he wrote up the relational model in a 1969 internal paper, which became a public paper in 1970. IBM was not happy. Here was one of their own taking potshots at, and scoring bullseyes on, what was already a cash cow.

What became Oracle Corp saw the opportunity, and they took it. The first "commercial" relational database was released by them in 1979. From there, the players we all know and love began their struggle over what *is* a Relational Database Management System, what *is* the proper language to be supported internally, and even whether an RDBMS should be about relational data at all.

The most persistent struggle over these 30 years (leaving aside for the moment all that is xml) has been over how relational a production database should be. Since IBM's core (from the point of view of database technology) was then, and is now, COBOL programs running on mainframes, their users remain wedded to the notion that the database is always detrimental to the performance of the whole application. The reason for this sideways thinking is that COBOL coding practices have remained solidly in the data "models" of the 1970s, since, to all intents and purposes, so has the language.

Rather than refactor the data integrity bits out of the code (which likely is still architecturally identical to whatever the initial design was 20 or 30 years ago) and put them in the database with the data, all that is typically done is to move the data from the existing datastore (IMS or VSAM) into DB2/Oracle/whatever. The result is that the database is asked only to retrieve data in disconnected lumps. Send the Order header. Process that. Send the Order lines. Process them. Send the Inventory lines. Process them. And so on.
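
The contrast, sketched in SQL with invented table and column names: the lump-by-lump style makes a round trip per piece and does the thinking in COBOL, where one set-based statement lets the engine do the assembling.

    -- Legacy "lump" style, one round trip per piece (shown as comments):
    --   SELECT ... FROM order_header WHERE order_id = :id;   -- process in COBOL
    --   SELECT ... FROM order_line   WHERE order_id = :id;   -- process in COBOL
    --   SELECT ... FROM inventory    WHERE item_id  = :item; -- process in COBOL

    -- Set-based style: ask the engine for the answer, not the raw lumps.
    SELECT h.order_id,
           l.item_id,
           l.qty_ordered,
           i.qty_on_hand,
           CASE WHEN i.qty_on_hand >= l.qty_ordered
                THEN 'ship' ELSE 'backorder' END AS disposition
    FROM   order_header h
    JOIN   order_line   l ON l.order_id = h.order_id
    JOIN   inventory    i ON i.item_id  = l.item_id
    WHERE  h.order_id = 42;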

Such "legacy" coded applications can be morphed into using the RDBMS to its fullest capabilities, but that is too often, based on back of the bar napkin jottings, dismissed as too expensive. Adding another sand brick to the sand castle is seen as better. From a quarterly bonus for the boss point of view, that may be. But the tide is coming in, and it is called solid state disks.

What is coming: faster, cheaper, easily maintained applications based on the multi-processor solid state disk machine. Building and running these applications will be so economical that even the back-of-the-bar-napkin jottings will succumb.