29 March 2009

The Next Revolution and Alternate Storage Propositions

I've spent the last few days reading Chris Date's latest book, "SQL and Relational Theory". One buys books as much to provide support to the author, kind of like alms, as to acquire the facts, thoughts, and opinions therein. Kind of like buying Monkees albums; one doesn't really expect to hear anything new. I may post a discussion of the text, particularly if I find information not in previous books.

What this post is about is the TransRelational Model [TRM], which this latest Date book resurrects; column stores, such as Stonebraker's Vertica; and the impact of the Next Revolution on them. As always, this is a thought experiment, not a report on a Proof of Concept or pilot project for either. Maybe someday.

In Date's eighth edition of "Introduction...", there is the (in)famous Appendix A, wherein he explicates why the patented Tarin Transform Method, when applied to relational databases, will be "the most significant development in this field since Codd gave us the relational model, nearly 35 years ago", without referencing an implementation. In particular, that "the time it takes to join 20 relations is only twice the time to join 10 (loosely speaking)." When published in 2004, Appendix A led to a bit of a kerfuffle over whether, given the reality of discs, slicing and dicing rows could logically lead to the claimed improvements. I found a paper which claims to be the first implementation of TRM. The paper is for sale from Springer, for those who may be interested; you will need to buy it to see what they found.

At the end of "SQL and Relational Theory", in the About the Author section, is a list of some of Date's books, among them "Go Faster! The TransRelational Approach to DBMS Implementation", which "is due for publication in the near future." The same book is listed as "To appear" in Appendix A of the eighth edition. And I had thought it had gone away. The URL provided for Required Technologies, Inc. is now the home of an ultrasound firm.

The column database has been around for a while; Vertica is Michael Stonebraker's version. There is also a blog, The Database Column, which discusses column stores. It makes for some interesting reading. Two of the listed posters are from Vertica.
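For readers who haven't met the layout: a column store keeps each attribute's values contiguously, rather than each row's, so a scan over one attribute touches only that attribute's data. A toy illustration (the records here are hypothetical, purely to show the two layouts):

```python
# Row store: each record's fields kept together, as in a conventional RDBMS.
row_store = [("Alice", 30, "Boston"), ("Bob", 25, "Chicago")]

# Column store: each attribute's values kept together.
col_store = {
    "name": ["Alice", "Bob"],
    "age":  [30, 25],
    "city": ["Boston", "Chicago"],
}

# An OLAP-style aggregate over one attribute reads only that column;
# the row store would have dragged every field of every record past the CPU.
avg_age = sum(col_store["age"]) / len(col_store["age"])
print(avg_age)  # 27.5
```

That single-column scan is the whole appeal for OLAP workloads; for transactional workloads, which touch most fields of a few rows, the advantage evaporates.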

My interest is this: given the Next Revolution, does either a TRM or a column store database have a purpose? Or any 'new and improved' physical storage proposition? My conclusion is, on the whole, no. The column store, when used to support existing petabyte OLAP systems, may be worth the grief; but for transactional systems, at which the TRM is aiming and from which column stores would extract, not so much. The claim in the eighth edition is that TRM datastores scale linearly with the number of tables referenced in a JOIN, but my thought is that the SSD table/row RDBMS cares not about the number of tables referenced in the JOIN, since access time is independent of access path. In such a scenario, a greater number of tables in the JOIN (assuming that the number of tables is determined by the degree of decomposition) should lead to faster access, since there is less data to be retrieved. As I said in part 2, there is a cost in cycles for the engine to synthesize the rows; the actual timing differences will be determined by the real data. In all, however, it seems to me that a plain vanilla table/row 5NF RDBMS on SSD multi-processor machines will have better performance than either TRM or column store on any type of machine. Were I a TRM or column store vendor, inexpensive SSD multi-processor servers would be making my sphincter uncomfortable.
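The decomposition argument can be made concrete with a toy schema (the tables and data here are hypothetical, and sqlite3 stands in for a real engine): a fully decomposed design means more tables in the JOIN, but each table carries only the columns the query needs, so there is less data to drag off storage, and on SSD the extra index lookups cost little.

```python
import sqlite3

# A decomposed design: one fact per table, joined back together by key.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE person       (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person_phone (person_id INTEGER, phone TEXT);
CREATE TABLE person_city  (person_id INTEGER, city TEXT);
INSERT INTO person       VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO person_phone VALUES (1, '555-0100'), (2, '555-0101');
INSERT INTO person_city  VALUES (1, 'Boston'), (2, 'Chicago');
""")

# The engine synthesizes rows across three tables; on SSD the random
# key lookups this entails are no slower than sequential reads.
rows = cur.execute("""
    SELECT p.name, ph.phone, c.city
    FROM person p
    JOIN person_phone ph ON ph.person_id = p.person_id
    JOIN person_city  c  ON c.person_id  = p.person_id
    ORDER BY p.person_id
""").fetchall()
print(rows)
```

The cycles the engine spends stitching the three tables back together are the cost referred to above; whether they outweigh the reduced I/O is an empirical question for real data.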

The sine qua non of RDBMS performance is access path on storage. The fastest are in-memory databases, such as solidDB, now from IBM. For production databases at normal organizations, mainstream storage for mainstream databases will be where the action is. Both TRM and column datastores, so far as either has 'fessed up, are an attempt to gain superior performance from standard disc storage machines. Remove that assumption, and there may not be any there there. Gertrude Stein again. Kind of like making the finest buggy whip in 1920.

Current mainstream databases can be run against heavily cached disc storage, buffering both in the engine and in the storage subsystem. The cost of such systems will approach that of dedicated RAM-implemented SSD storage, since the hardware and firmware required to ensure data integrity are the same. As was discovered by the late 1990's, one level of buffering, controlled by the engine, is the most efficient and secure way to design physical storage.

And for what it's worth: back in the 1970's, before the RDBMS came into existence, there was the "fully inverted file" approach to 'databases'. In essence, one indexed the data in a file on each 'field', and so turned all random requests into sequential requests. This appears to be the kernel behind the TRM and column store approaches. Not new; but if one buys Jim Gray's assertion that density increases will continue to outpace seek/latency improvements, then it makes some sense for rust-based storage. The overwhelming tsunami of data which results may be a problem. If we view a world where storage is on SSD rather than rust, then, as Torvalds says, the nature of file systems changes. These changes have a material impact on RDBMS implementations.
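The fully inverted file idea fits in a few lines: one index per field, mapping each value to the record numbers that hold it, so that a random lookup on any field becomes a sequential read of a small posting list. A minimal sketch (field names and records are hypothetical):

```python
from collections import defaultdict

# Hypothetical records; in a real 1970's system these would live in a file.
records = [
    {"name": "Alice", "city": "Boston"},
    {"name": "Bob",   "city": "Chicago"},
    {"name": "Carol", "city": "Boston"},
]

# Fully inverted file: one index per field, value -> list of record ids.
indexes = defaultdict(lambda: defaultdict(list))
for rid, rec in enumerate(records):
    for field, value in rec.items():
        indexes[field][value].append(rid)

# A "random" query on any field is now a sequential scan of a posting list.
boston_ids = indexes["city"]["Boston"]
print([records[i]["name"] for i in boston_ids])
```

Note the tsunami: the indexes collectively restate every field of every record, which is roughly why the approach trades storage volume for seek avoidance.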

2 comments:

Anonymous said...

Robert,

I enjoy reading your posts, but had not read this one until recently when reviewing your earlier posts.

You may have already discovered the findings below since this post in 2009. I looked around for more information on the TransRelational Model and Date’s book “Go Faster! …” that you mentioned in this post. I found that the book is now published (2011) and even found a PDF version that can be downloaded for free at http://www.zums.ac.ir/files/research/site/ebooks/it-programming/go-faster.pdf. This PDF copy has advertising in it, but appears to be a complete copy at 287 pages. The book PDF download link is also referenced in Date’s news page http://www.justsql.co.uk/chris_date/chris_date.htm, so I suspect it is legit.

The book mentions that publication was held up due to non-disclosure agreements that expired in 2011 (likely why you hadn't seen the book at the time of your 2009 post). There is also further information about the Tarin Transform Method and a patent on that method (US PTO 6,009,432 – 12/28/1999 – Value-Instance-Connectivity Computer-Implemented Database). I am still in the process of reading the book, but have found some interesting info so far.

I will be interested to hear your comments on this book – perhaps in a future post.

Thanks again for your writings and thoughts.

Scott R.
