Dr. Codd Was Right: 2010

28 December 2010

Scotty's Wisdom

Message boards can actually be useful to the exercise of figuring out where an industry is going. STEC is the principal publicly traded Enterprise SSD vendor, so it is the public bellwether with respect to "Enterprise SSD". They've been segueing from an SLC dominant to an MLC dominant product mix, which ends up being a topic of discussion, especially recently. A thread is running now about the qualification of an MLC version of STEC's gold standard "Enterprise SSD" (ZeusIOPS). I was moved to contribute the following:

"I canna change the laws of physics"

That will be true in the 23rd century and is true now. The number of erase cycles of MLC is fixed by the NAND tech, controller IP can only work around it, usually by over-provisioning (no matter what a controller vendor says). Whether STEC's controller IP is smarter (enough, aka, at the right price) is not a given. As controllers get more convoluted, to handle the decreasing erase cycles (what? you didn't know that the cycle count is going down? well, it is, as the result of feature size reduction), SLC will end up being viable. Cheaper controllers, amortized SLC fabs.

If STEC (or any vendor) can guarantee X years before failure, then the OEMs will just make that the replacement cycle. It would be interesting to see (I've not) the failure distribution functions of HDD and SSD (both SLC/MLC and STEC/Others). Failure isn't the issue, all devices fail. What matters is the predictability of failure. The best to have is a step function: you know that you have until X (hours, writes, bytes, whatever), so you replace at X - delta, and factor that into the TCO equation.

I think the failure function (in particular, whether and to what extent it differs from HDD) of SSD does matter, a lot. Consumer/prosumer HDD still show an infant mortality spike. Since they're cheap, and commonly RAIDed, shredding a dead one and slotting in a replacement isn't a big deal. Not so much for SSD, given the cost.

I found this paper, but I'm not a member. If any reader is, let us know. The précis does have the magic words, though: Gamma and Weibull, so I gather the authors at least know the fundamentals of math stat analysis. If only there were an equivalent for SSD. It's generally assumed that SSDs are less failure prone, since they aren't mechanical; but they are, at the micro level. Unlike a HDD, which writes by flipping the flux capacitor (!!), the SSD write process involves physical changes in the NAND structure; which is why they "wear out". Duh. So, knowing the failure function of SSD (and knowing the FF for NAND is likely sufficient, to an approximation) will make the decision between HDD and SSD more rational. If it turns out that the FF for SSD moves the TCO below equivalent HDD storage (taking into account short stroking and the like to reach equivalent throughput), SSD as primary store becomes a value proposition with legs. Why the SSD and storage vendors aren't pumping out White Papers is a puzzlement. Maybe their claims are a tad grandiose?
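To make the "failure function" idea concrete, here is a minimal sketch using the Weibull distribution mentioned above. The shape and scale values are made up for illustration (they are not measured HDD or SSD data), and the function names are my own. A shape below 1 gives the infant mortality pattern; a large shape pushes nearly all failures toward the end of life, which is about as close to the wished-for step function as a Weibull gets.

```python
import math

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) of a Weibull(shape, scale) lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale):
    """Probability that a device is still alive at time t."""
    return math.exp(-((t / scale) ** shape))

def replace_at(shape, scale, max_failure_prob):
    """Latest replacement time that keeps P(fail before t) <= max_failure_prob."""
    # Invert 1 - exp(-(t/scale)^shape) = p for t.
    return scale * (-math.log(1.0 - max_failure_prob)) ** (1.0 / shape)

# Hypothetical parameters, in arbitrary time units (assumptions, not measured data).
infant_mortality = dict(shape=0.7, scale=5.0)  # hazard falls with age
wear_out = dict(shape=8.0, scale=5.0)          # hazard climbs steeply near end of life

for label, params in (("infant mortality", infant_mortality), ("wear-out", wear_out)):
    early = weibull_hazard(0.5, **params)
    late = weibull_hazard(4.5, **params)
    cutoff = replace_at(max_failure_prob=0.01, **params)
    print(f"{label}: hazard early={early:.4f} late={late:.4f} replace at t={cutoff:.2f}")
```

With a wear-out shaped curve, the "replace at X - delta" rule is just `replace_at` with whatever failure probability the buyer is willing to tolerate; that time then goes into the TCO arithmetic.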

The ultimate win will happen when MRAM (or similar) reaches mainstream. Very Cool.

23 December 2010

Mr. Fusion Powered Database

I keep track of my various database interests with a gaggle of sites and blogs. For PostgreSQL, I've signed up for the Performance sublist. It's mostly about fixing things in parts of the engine in response to questions like: "my query is slower in PG than it is in SQL Server; how come?", and such.

A thread started today that's of interest to this endeavor. It started out with one person wondering why a Fusion-io drive runs so fast, but PG doesn't run any faster. Then another chimed in to say he was setting up PG with Fusion-io drives. Looks to be an interesting discussion. Now, if only my employer would let me buy some of those Fusion-io drives! BCNF to the rescue.

Here's the list.

The thread is titled: concurrent IO in Postgres?

22 December 2010

Django Played Jazz

The PostgreSQL site has been linking to this blog a bit recently, and he's refreshing. I'm going to spend some time looking into it. It could be there's some intelligent life out there after all.

Here's the start of today's entry, if you didn't slide off immediately:

Don't retrieve a whole row just to get the primary key you had anyway. Don't iterate in the app; let the database server do the iteration for you.

And he signs off with this (my heart went pit-a-pat):

but far better is to make the database do all the work

It is shocking how many coders still insist on their for loops in code. I mean, Dr. Codd made that obsolete, in the sense of providing an abstract declarative data model, in 1969/70 (the year depends on whether you were inside or outside IBM then). In a few years, Ingres and Oracle were live. I've concluded that MySql, PHP, java, and the web generally are what motivated the regression to COBOL/VSAM paradigms (that is, data is just a lump of bytes which can only be accessed through bespoke source code). One didn't need to know much, and frankly most webbies didn't and don't, about data to build some snappy web site that just saves and moves gossip. I suppose that's OK for most of the juvenilia that passes for web stuff, but not for the grown ups.
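For anyone who hasn't seen the difference side by side, a small sketch. It uses Python's bundled sqlite3 module purely because it runs anywhere; the `customer` and `orders` tables are invented for the example and have nothing to do with the linked blog's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount INTEGER NOT NULL
    );
    INSERT INTO customer VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 15), (3, 2, 7);
""")

def totals_with_app_loop(conn):
    """The habit being criticized: fetch rows, then iterate in client code."""
    totals = {}
    for cust_id, name in conn.execute("SELECT id, name FROM customer"):
        total = 0
        # One extra query per customer, and whole rows pulled back each time.
        for row in conn.execute("SELECT * FROM orders WHERE customer_id = ?", (cust_id,)):
            total += row[2]
        totals[name] = total
    return totals

def totals_in_database(conn):
    """Declarative version: one statement, the engine does the iteration."""
    sql = """
        SELECT c.name, SUM(o.amount)
        FROM customer c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """
    return dict(conn.execute(sql))

print(totals_with_app_loop(conn))
print(totals_in_database(conn))
```

Both functions return the same totals. The second one makes a single round trip and leaves the access path to the optimizer.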

16 December 2010

An Ounce of Prevention

Andy Lester is a Perl coder, and I loathe Perl, so there has to be a good reason for me to mention him. And that reason is a posting of his, linked from Artima, which contains the following:

This person was one of those programmers who tried for the premature optimization of saving some typing. He forgot that typing is the least of our concerns when programming. He forgot that programmer thinking time costs many orders of magnitude more than programmer typing time, and that the time spent debugging can dwarf the amount of time spent creating code.

Now, when I was young and impressionable, the notion that a developer is paid to think, and not to type, was widely accepted. But I've certainly noted that in recent years, java perhaps the culprit, lots o typing is now the metric. LOC rules, even if most are worthless. Moreover, development by debugging is also normative. Ick.

What might this have to do with the point of this endeavor, you may be asking? Simply that declarative data is so much lazier than typing, and that a BCNF schema is easy to modify (since there aren't covariances to worry about). It does require some forethought, what Spolsky calls BDUF (you should look it up, if it's foreign), but that forethought isn't carved in stone, rather a strategic battle plan which permits many tactics. The "Agile" meme appears to have eaten its children, in that its zealots really, really do believe that all projects can be built from daily hacks by masses of coders. Ick; double Ick.

15 December 2010

Pundit for a Day

I've been reading Cringely for decades, and especially, along with most who read him it turns out, his annual predictions. Since leaving his PBS gig, he hasn't been doing them. Sniff. But today he announced that he would do another, and invited his readers to contribute same. Well. Not one to turn my nose up at the possibility of 15 seconds of fame (he allowed that any reader predictions would be printed with attribution, which he sort of has to do) I offered up what follows.

Just one, sort of.

I've been banging a drum for SSD for a number of years, at least since Intel released their flash version (in true Enterprise, Texas Memory has been shipping DRAM parts for decades, but that's another story).

When STEC, Violin, et al started to build "Enterprise" flash SSD those few years ago, the notion they promoted was that SSD would replace HDD, byte for byte. That didn't happen, largely IMO because the storage vendors (SSD makers and storage OEMs) couldn't develop a value story.

There always was a story: the Truly Relational RDBMS (not the flatfile dumps common in Fortune X00 companies which moved their COBOL/VSAM apps to some database) is (so far) the only thing which exercises the real strength of the SSD: random IOPS. But to get that benefit, you have to have a BCNF (or better) database, and join the shit out of it. The COBOL/VSAM and java apps devs don't think that way; they love their bespoke written loops.

So, what we've got now is SSD as front end cache to HDD arrays. And SSD as game machine and laptop speed up. Enterprise hasn't yet bought SSD as primary storage. Hmmm.

In 2011, we will see that. My guess is Oracle will be the lead. It works this way. Larry wanted Sun, not for java or MySql, but the hardware stack. What Larry needs is that last group of holdouts: IBM mainframe apps. To do that, he needs a credible alternative to the z machine ecosystem.

He has that now, but it ain't COBOL. He needs a value story to get those COBOL/VSAM apps. Whether you buy that Oracle is the best RDBMS or not, Larry can make the case, particularly since his competitors (save IBM) have adopted the MVCC semantic of Oracle.

Pushing highly normalized databases, with most/all of the business logic in the database (DRI and triggers and SP) running on SSD makes for a compelling story. But you've got to spend some quality time building the story and a POC to go with it. Larry's going to do it; he hasn't any choice. And it makes sense, anyway.

Remember, a RDBMS running on SSD is just an RCH from running an in-memory database. You don't need, or want, lots of intermediate caching between the user screen and the persistent store. Larry's got the gonads to do it.

Regards,
Robert Young

09 December 2010

Hand Over Your Cash, or I'll Shoot

The recent events led me to consider, yet again, the SSD landscape. The point of this endeavor is to promote the use of SSD as sole repository/persistence for BCNF databases. The reasons have been written to a great extent.

Since beginning this endeavor, there has been a clear shift in storage vendors' marketing of SSD, whether this shift was proactive or reactive, I do not know. These days, there is much talk of tiering and SSD as cache, less talk of SSD as whole replacement of HDD. Zolt, over at storage search still promotes the wholesale replacement angle, but he seems to be in the minority. I stopped over to copy the URL, and there's an interesting piece from 7 December worth reading, the column headed "MLC inside financial servers new interview with Fusion-io's CEO" (the way the site works, the piece will likely be hard to find in a couple of weeks, so don't tarry).

So, I reveried into the middle of a thought experiment: what difference, if any, does it make whether an SSD is used as sole/primary store or as cache? Well, I concluded that as cache, dirt cheap SSDs are just as good as Rolls-Royce SSDs (i.e., STEC, Violin, Fusion-io, and the like) from one angle, for the simple reason that the data on the SSD is really short-term. From another angle, those cache SSDs had better be really high quality, for the simple reason that the data is churned like a penny stock boiler room, and SSDs need robust design to survive such a pounding.
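The second angle can be put into back-of-the-envelope arithmetic. The sketch below assumes ideal wear leveling and a constant write amplification factor; every number in it is an illustrative assumption, not a vendor specification, and the function name is mine.

```python
def years_until_worn_out(capacity_gb, pe_cycles, host_writes_gb_per_day, write_amplification):
    """Rough NAND lifetime: total writable volume divided by the daily write load."""
    total_nand_writes_gb = capacity_gb * pe_cycles
    nand_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_nand_writes_gb / nand_writes_gb_per_day / 365.0

# Illustrative assumptions only: a 100 GB part rated at 3000 erase cycles.
drive = dict(capacity_gb=100, pe_cycles=3000, write_amplification=2.0)

as_primary_store = years_until_worn_out(host_writes_gb_per_day=50, **drive)
as_cache = years_until_worn_out(host_writes_gb_per_day=1000, **drive)

print(f"primary store: {as_primary_store:.1f} years, cache: {as_cache:.2f} years")
```

Under these assumed loads the same part lasts twenty times longer as a primary store than as a heavily churned cache, which is the tension between the two angles: cheap parts suit short-lived data, but cache duty is what eats erase cycles.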

The OCZ contract announcement leans toward the first answer; $500 doesn't buy much (if any) STEC SSD. With error detection and hot swapping in the array, just pull them as they die and toss 'em in the shredder. I'd unload any STEC shares real soon now. There'll still be full-blown SSD storage for my kind of databases, but the American Expresses are more likely to go the front-end caching route (they've no stomach for refactoring that 1970's data), and for that implementation, commodity (this soon???) SSD is sufficient.

Like a Rolling Stone

One Hit (To The Body), STEC's and Compellent's that is, appears to have happened today. And, I'll 'fess up, I never saw it coming. OCZ, which had looked like a mix of prosumer/consumer SSD builder, is now shipping Enterprise parts. Who knew??? And the stated price is $300-$500. Either some OEM is willing to take a really big chance, or the cost of Enterprise SSD just went over a cliff.

Here's hoping for the latter. Do you get it? SSD for only 2 to 3 times the cost up front!! Not the 10 times (or more) it has been. And if the buyer is EMC???? STEC's corporate sphincter just got puckered.

The devices are the Deneva line, which they only announced back in October? They run SandForce's SF-2000 controllers, and will be shipping by February!! "Watson, the game's afoot."

As I was composing this missive, came word of the Compellent smash, and it's not a technical problem. Compellent is an early adopter of STEC SSD. Months ago, you may remember, a rival, 3PAR, was the hockey puck in a takeover game. Compellent's shares rose, a lot, in sympathy, and kept going. Today's story has it that the company is going to Dell, but for substantially less than the share price bid up by all those plungers. Irrational exuberance strikes again.

08 December 2010

Simple Simon Met a Pie Man

Another case of an interesting thread and an interesting post (mine, of course). And, once again, it's from Simple Talk; on this thread.

Since you mentioned it, I'll beat the drum, yet again, for the necessary paradigm shift (in many places, anyway) which small keyboardless (in practice, even when one exists) devices require.

- Mostly, it was seeing that the best existing tablet and smartphone apps do simple, intuitive things, using simple intuitive interfaces to solve single problems.

As I've said here and elsewhere for some time, by (re-)factoring databases to high normal form (narrow tables, specifically), one gains a number of advantages.

1) such schemas are inherently candidates for UI generation, due to DRI
2) they're inherently robust, due to DRI
3) they're likely most of the way to being "pickable", which is what tablets do
4) given the ability to host high normal form databases on SSD, then building them to such a UI is feasible

Tablets have a long history, in fact; the iPad is nothing new, except for its venue. Those doing VAR systems that work in warehouses have been writing to RF tablets for a couple of decades, and designing to high normal form (or, as often, working around its absence).
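Points 1 and 3 in the list above (UI generation from DRI, and "pickable" screens) can be sketched in a few lines. This is a toy, assuming SQLite via Python's sqlite3 module and an invented `warehouse`/`bin` schema; the `describe_form` helper is hypothetical, not part of any product mentioned here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE bin (
        id INTEGER PRIMARY KEY,
        warehouse_id INTEGER NOT NULL REFERENCES warehouse(id),
        label TEXT NOT NULL
    );
    INSERT INTO warehouse VALUES (1, 'East'), (2, 'West');
""")

def describe_form(conn, table):
    """Map each column to a widget: a pick list if it is a foreign key, else free entry.

    Table and column names are interpolated directly, so they must come from
    the catalog, never from user input.
    """
    # PRAGMA foreign_key_list rows: (id, seq, parent_table, from_col, to_col, ...)
    fks = {row[3]: (row[2], row[4])
           for row in conn.execute(f"PRAGMA foreign_key_list({table})")}
    form = {}
    # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
    for _cid, name, _type, _notnull, _default, _pk in conn.execute(f"PRAGMA table_info({table})"):
        if name in fks:
            parent, parent_col = fks[name]
            choices = [r[0] for r in conn.execute(
                f"SELECT {parent_col} FROM {parent} ORDER BY 1")]
            form[name] = ("pick", choices)
        else:
            form[name] = ("entry", None)
    return form

print(describe_form(conn, "bin"))
```

The narrower the tables, the larger the share of columns that are foreign keys, and so the more of the screen that becomes picking rather than typing.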

07 December 2010

Pink Floyd

For those following that other story, what's Larry up to with Sun, I've been in the it's-about-taking-down-the-mainframe camp from the first nanosecond. It's been kind of a small camp, amongst the Usual Pundit Suspects. But today comes a bit of news along that line of thinking.

It's becoming clearer that Larry wants a database machine that can slurp up all those renegade IBM mainframe folks. He knows he's got to get them off COBOL, somehow too, but first he needs a credible stack. Another brick in the wall.

04 December 2010

Tin Man

I've met the Tin Man. Whilst looking for some MVCC/Locker debate I happened onto a Sybase Evangelist blog. Kind of like what I do, but he gets paid for it. Sigh. Maybe soon.

Anyway, this post is his paean to SSD, and this:
How big is your database?? **light bulb** Those same 10 SSD's get you a whopping 300-600GB of storage. You could just put the whole shooting match on SSD and forget the IO problems. Rep Server stable queue speed issues - vaporized.

Be warned, he makes the leap from Amazon sourced SSD to enterprise database storage (he doesn't mention STEC or Violin, for instance, but appears to be aware of earlier Texas Memory DRAM units); not really going to happen, so his arithmetic is off by a decimal point. But otherwise, he and I are on the same page, especially with skipping the "cache with SSD" silliness, and just storing to SSD. Schweeet. And he knows from schmere.

Now, hold Dorothy's hand.

02 December 2010

Thanks for the Memory

AnandTech has an article about 25nm flash today. Well worth the read. I'm not sure how this affects this endeavor. On the one hand, the physics of smaller feature size makes flash less worthy of enterprise storage. On the other, the increased density supports greater over-provisioning to solve, maybe, the problems. A classic engineering problem.

I stopped by Unity Semiconductor to see if there's any news on shipping of their "new" device. They include this article. If it works, SSD has some calm seas and wind at its back.

24 November 2010

You've Earned a Good Thrashing

One of the aspects of RDBMS on SSD that has been worming around in my lower brain stem for a bit is, what difference does it make whether the engine is locker or MVCC? Now, for those just joining the partay, my database of preference has been DB2, with SQL Server and PostgreSQL and MySql as adjunctants, for the last decade or so. DB2 is the last major database sticking to locker semantics; with good reason, so far as I am concerned. The engine implements a deep locking scheme, and fail fast by the database is smarter than fail late by the user.

The question which has been nagging me is that, for MVCC semantics, data is spread out among both the tables and supporting storage. In Oracle's case, these are rollback segments. MVCC, to use my term, is Read Last Committed semantics; and to do that, the engine has to keep track of changes on the fly such that any query can get any rows from any table *as of* some time/commit/transaction (take your pick of term).
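A toy model of that "as of" bookkeeping, to show where the extra writing comes from. This is emphatically not how Oracle or any real engine lays out its version data; the `VersionedStore` class is invented for illustration, and it only demonstrates the principle that writers add versions while readers pick the newest one their snapshot may see.

```python
class VersionedStore:
    """Toy multi-version store: writers append versions, readers pick a snapshot."""

    def __init__(self):
        self.commit_counter = 0
        self.versions = {}  # key -> list of (commit_number, value), oldest first

    def commit(self, changes):
        """Apply a dict of key -> value as one committed transaction."""
        self.commit_counter += 1
        for key, value in changes.items():
            self.versions.setdefault(key, []).append((self.commit_counter, value))
        return self.commit_counter

    def read(self, key, as_of):
        """Return the value of key as seen by a snapshot taken at commit as_of."""
        visible = None
        for commit_number, value in self.versions.get(key, []):
            if commit_number > as_of:
                break
            visible = value
        return visible

store = VersionedStore()
store.commit({"balance": 100})
snapshot = store.commit_counter   # a long-running query starts here
store.commit({"balance": 40})     # a later writer does not block the reader

print(store.read("balance", snapshot))              # the reader's consistent view
print(store.read("balance", store.commit_counter))  # the latest committed value
print(len(store.versions["balance"]))               # versions kept on storage
```

Every update leaves an old version behind until no snapshot needs it, so the storage underneath sees more writes than the logical change alone would suggest.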

My worry is that garden variety SSD may not be up to storing rollback segments, due to the heavy writing. On a HDD, it's no big deal. But this method of storage necessarily slows down the engine.

In looking for answers, I came across this paper from 2008. You can skip down to slide 22 for the specific discussion. The paper doesn't present evidence, one way or another, about this concern, but does show that putting version data on SSD is a huge performance win. Here's an update to December, 2009 from the main author.

So, in all, I haven't found any clear answers from the literature. What seems clear is that garden variety consumer SSD wouldn't survive (not that I'd ever recommend such parts anyway), and I'm not so sure about prosumer parts. As for the STECs out there, I'm not concerned.

18 November 2010

Food Fight

I do so love it when the Kiddies finally figure out databases; well a couple and a little bit. This PHP post came across my bow, and it's just too much fun not to pass on. Thing is, the blogger looks to be a bit of a young-un. And he takes a good deal of heat from the ridgebrows, but doesn't back down. Good for him. There is hope for us all.

16 November 2010

An Open and Shut Case

There was an OpenSQL camp up in Cambridge last month, and I considered going, but the agenda listed far too many NoSQL projects. I decided that I'd just spend a long weekend being irritated. Today I read Josh Berkus's write up on lwn.net, and this was the one nugget:

Some of the SQL geeks at the conference discussed how to make developers more comfortable with SQL. Currently many application developers not only don't understand SQL, but actively hate and fear it. The round-table discussed why this is and some ideas for improvement, including: teaching university classes, contributing to object-relational mappers (ORMs), explaining SQL in relation to functional languages, doing fun "SQL tricks" demos, and working on improving DBA attitudes towards developers.

There are times (quite often, truth be told) that I wish the relationalists had the gonads of the Right Wingnut Zealots or Tea Baggers or what-have-you. The RM isn't just another data store. By attempting to make nice with folks who refuse to listen, you'll just tick off the folks who do get it, but won't convince those who've no intention of being swayed. Use relational databases to make better systems than the knuckleheads who belittle them. Don't get mad (well, a bit some times), get even.

If you go and read the piece, you'll find the attendees worrying about problems that the commercial vendors (with lots more folks, of course) dealt with years, if not decades, ago. Some of the problems are fully discussed in textbooks; Weikum & Vossen in particular. There Ain't No Such Thing As A Free Lunch. If you want minimal byte footprint, maximum structural integrity, minimum modification hassle, then the RM as embodied in current industrial strength RDBMS's is the way to go. Open Source databases can do the same, so long as they concentrate on implementing the fundamentals, and stop worrying about pandering to the FOTM programming language. Languages will come and go (COBOL and java relegated to Big Business), but the data is forever. Best put it some place safe.

09 November 2010

Convicts and Cane

If you're of a certain age, or were precocious at a young age, you may be familiar with the following lyric: "In the early part of this century, convict labor worked the cane fields on the bottoms of the Brazos river... Go down old Hannah don't you rise no more, if you rise in the morning, bring the judgment day".

I've no idea whether the Thought Leaders at AMD are familiar with old folk songs, but they've labeled their latest Intel beater Brazos. The good folks at AnandTech have some details. While this is a notebook implementation, and not especially pertinent to this endeavor, the graph on page one surely is. I gather it represents AMD's view of machine development over the next years, and that AMD will have victory on judgment day.

That last curve, for what AMD calls Heterogeneous-Core Era, is Brain Viagra. "Abundant data parallelism" is what the SSD/BCNF database is all about. Another step along the Yellow Brick Road. It will be a fun journey.

Ahhhhhhhhh. The Stay Puft Man

I once knew a man, father of a friend of mine, who remarked that one of his sons "costs me a lot of money"; private school and all that. Applications designed to be difficult to maintain are a lot like prodigal sons, the money just seems to fly out the window. Coders remain adamant that their API's are what make software easy to maintain. Baloney. The proliferation of code, much as it was in the 1960's (when COBOL was going to make application development so easy, a manager could do it), is justified on the grounds that the latest New Thing in coding will make all the angst go away. Hasn't happened, now has it? Perhaps we should stop looking to bloatcode for the answer.

SQL Server Central has an article on maintenance. I was moved to post a reply, herewith entered for your approval. Another opportunity to ring the bell for SSD/BCNF systems.

In the world of commercial/business software, aka database systems, the answer to maintenance costs is to embrace SSD/BCNF. Why, you may ask? Let me count the ways.

1) by putting the data and its integrity logic in one place, the server, letting the client code be responsible only for screen painting and data input; one small group of smart database geeks keeps the data under control. compare to the human wave approach of client-side coding. in fact, embracing SSD/BCNF means the data is utterly agnostic to the client code. could not possibly care less whether it's a java screen, or VB screen, or csv file. makes no difference. I offer xTuple as an example; not yet a SS application, not that this matters much.

2) by embracing SSD/BCNF, maintenance amounts to adding columns/tables/rows (a row is a business rule, remember) to the schema. the data hangs together on the RM.

3) by embracing SSD/BCNF, client side code can be generated from the schema. not saying it has to be, or should be (well, yeah, it should), but it can be. at the least, clients should interact through SP.

4) with significantly (and soon to be, massively) parallel servers available for small bucks, what's the most adept application for such machines? well, the relational database engine, of course. client code, not so much; as client coders are discovering.

5) by embracing SSD/BCNF, the byte footprint is an order of magnitude less than it is with the flat-file storage so beloved by C#/java/VB/COBOL coders.

6) as Celko (at least) has written (in "Thinking in Sets"), using auxiliary tables to implement constraints makes maintenance still simpler: just add (or delete) rows to update constraints. for that matter, authorized users can update check constraints and the like from screens; such constraints are just text stored in the catalog.

That should do it for now. Remember, the cost of maintenance is *directly* a function of the code/data structure. The more obscure that structure, the higher the cost. Historically, for those that have been paying attention, coders view life (largely because those that employ them are dumb enough to measure them so) as a LOC exercise. Anything which increases the LOC future is good; likewise, anything which decreases LOC future is bad. Those that employ them often take the same view, though few will admit it. The reason is that such organizations are inherently bureaucratic, and in that environment the one with the deeper org chart gets more money ("I manage 5 managers and 100 staff, you've only got 3 and 50"). Efficiency and productivity really aren't the goal. The hardware and software to solve the issue, in the commercial world, has existed for decades, yet the COBOL/VSAM paradigm persists; only the syntax has changed. That's not an accident. The RM and RDBMS are actively opposed in many shops just because fewer coders would be needed, to do all that maintenance that CIO's complain about; which fact keeps the CIO's org chart growing. Hmmm. Curious.
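Point 6 above, auxiliary tables as constraints, is easy to demonstrate. A minimal sketch, assuming SQLite through Python's sqlite3 module and an invented `order_status` lookup table; it is my illustration of the general idea, not an example taken from Celko's book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE order_status (code TEXT PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL REFERENCES order_status(code)
    );
    INSERT INTO order_status VALUES ('open'), ('shipped');
""")

def try_insert(conn, status):
    """Return True if the engine accepts the row, False if the constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO orders (status) VALUES (?)", (status,))
        return True
    except sqlite3.IntegrityError:
        return False

before = try_insert(conn, "backordered")  # rejected: no such status yet
with conn:
    # The "maintenance": one new row in the auxiliary table, no code change.
    conn.execute("INSERT INTO order_status VALUES ('backordered')")
after = try_insert(conn, "backordered")   # accepted now

print(before, after)
```

The business rule changed without touching a line of client code or even altering the schema; every client that writes to `orders` picks up the new rule at once.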

06 November 2010

Larry, Larry Quite Contrary

Larry, Larry quite contrary, how does your fortune grow? No silver bells or core contributions, that's how. People are such knuckleheads; perpetual Charlie Browns, expecting the football to always be there.

Regular readers may remember this musing where I made the case that Oracle considered MySql a threat, and would do something about it. The EU was right. Here's the latest. Larry is also reining in java. My thought here is that he'd just as soon do the same to java as MySql: a crippled "Open Source" version, and a pay-through-the-nose not so Open version. Might go so far as make it into the Oracle Language.

Who's going to stop him? Perhaps IBM, also heavily invested in java use, will take over the OS version. They'd have to either fork or prop up Harmony; they've not shown any inclination for either move, so far. Are the Armonkers dumb enough not to see that Larry is after their mainframe business? Time will tell, but it sure looks like it so far.

01 November 2010

Beam Me Up, Scotty

Another tidbit from an Artima thread.

Carlos wrote:
Someone wrote an academic paper a few years ago advocating exactly this. They showed that software designed around the idea it may be arbitrarily killed at any time was more reliable, shut down more quickly and had a host of other benefits.

And I responded:
They're called industrial strength database engines. Not trivial to write.

In general, however, the AJAX-ian migration is the attempt to recreate a connected database application, aka VT-100/RS-232/*nix/Oracle. With a phone architecture, we have that. A connected architecture will always outperform a disconnected one, HTTP for example. Managing state goes away, since the datastore always *is* the state. With said datastores on SSD, data control relegated to the server becomes a Good Thing; while the client (phone, pad, whathaveyou) just does painting, input collection, and transfer.

I remain convinced that we're headed back to bound data grids, what was once considered a Microsoft horror (data must be loosely coupled, and all that). As well it might be, per se; but the architecture is superior from both a user experience and data integrity point of view. One fact, one place, one time is fulfilled. Again, it's only a matter of sufficient bandwidth, and your phone/pad/thingee is just a pixelated VT-100 (and a RS-232 Cat-5 wire) connected to a database. Once you've reached that point, there's nothing to be gained from retreating. We only did the web as we did because it started on 56Kb dialup, and that was fast if you could get it (14.4K was not unusual; do any of you actually have experience with BBS's and the nascent web in that circumstance?). A connected web was not envisioned, thus HTTP and the like. For better or worse, most folks are always connected, and mostly do trivial stuff with the facility.

21 October 2010

Soft Core Porn [UPDATED]

You are the Apple of my eye, although I'm no fan of Stevie Wonderful. There's been miles of type devoted to Steve's recent announcement that the next version of Mac portables will run only flash storage (which may or may not be packaged as a SSD). Here's the Journal version.

Not that I'm much of a fan of the Journal, of course, but the story carries this wonderful quote:
"The market is moving from hard drives to flash much faster than it was expected six months ago," he says.
He, in this case, is Philippe Spruch, chief executive of LaCie.

Follow that Yellow Brick Road.

[UPDATE]:

Here's a Forbes article. The answer is as I expected:

Not surprising, however, is the fact the SSD isn't really an SSD per se--it is simply NAND Flash on the primary circuit board paired with a "SSD controller." This is one of the points I made last year in a post addressing Stec Inc. (STEC). Due to the fact that it is much more economical and takes less space to "roll your own" SSD (put Flash chips on a board with a controller), there will eventually be a very limited market for SSDs that duplicate the form factor of a HDD - PC companies and even enterprise storage companies will simply buy SSD controllers and NAND Flash chips. The point here is the controller will be the differentiating chip and that takes us to Marvell Technology (MRVL).

18 October 2010

sTuple, Part the First

I've had an abiding affinity to ERP/MRP database applications for rather a long time. They're the prototypical application which will benefit from BCNF datastores, and thus SSD infrastructure. Too bad, too; they've mostly been around since the 1990's (and earlier, if you count System 38/AS 400/iSeries/whatever-it's-called-now) and their owners aren't much interested in rebuilding. But there are possible exceptions, a few open source ERP applications exist. One that's stateside is xTuple; and, although not Open Source as the Community prefers to use the term, there is a semblance of a codebase/database from which an application can be implemented. The source is available here.

xTuple is Open Core/dual license, as is MySql, so I won't be addressing the full ERP horizon, just the OS version. Some regular readers may remember a post or two when the iPad was released, dealing with the notion of tablet based applications making new headway. Tablet computers have been around in the ERP/distribution world for decades, but as rather expensive specialized devices. The iPad (and what is asserted to be a soon-to-be flood of similars) opens up application development based on picking-not-typing to a much wider world. How many of these application developers actually embrace the freedom (some may think of it as a strait-jacket) of picking to re-factor their databases, rather than attempting to just stuff existing screens into a smaller form-factor remains to be seen. (Aside: when Windows took hold, the term Screen Scraping was born, thus ushering in myriad DOS applications, nicely pixelated and creating the adage "Lipstick on a Pig", foisted on an ignorant public.) As it happens, an xTuple affiliate, Paladin Logic (gotta love that name), has modified xTuple for the iPad. Since I don't have an iPad, or any need to run an ERP application, I haven't bought iTuple; thus I don't know what tack was taken.

In any case, Paladin Logic has demonstrated that some degree of morphing is possible; although not likely by a lone developer. xTuple's sorta-kinda rules don't permit modifications to the base database, without those changes being approved by xTuple for inclusion in the base product. As a result, my interest in xTuple hasn't been reciprocated, since my interest in signing on as a consultant would have been to work on transforming to a BCNF form on SSD. xTuple runs on PostgreSQL, which, while a much more worthy open source database than any other (certainly more so than MySql), doesn't have the knobs and switches present in a true industrial strength database. My preference is DB2.

So, in the spirit of xTuple and iTuple, I begin a journey to sTuple: xTuple on SSD. I'm not going, by any stretch of the imagination, to attempt anything more than a one-off POC. But what the heck, you're getting it for free.

There are a number of reasons to prefer DB2 (the server version, most often termed DB2/LUW; the LUW stands for what you think it does) for pure client/server OLTP applications, which is what xTuple is. For eCommerce/net types of applications, a case can be made for MVCC semantics databases being superior. Oracle and PostgreSQL are the two main proponents of MVCC. I'll note that SQL Server and DB2 have made stabs at imitation. In the SQL Server case, Microsoft has implemented Snapshot Isolation, which is MVCC-lite; while IBM has added some "compatibility mode" syntax munging but still on top of its locker engine.

So, what I've been considering is whether it's feasible to implement xTuple on SSD with DB2. From a cost point of view, for the clientele targeted by xTuple, it's likely a wash. IBM makes available a fully functional, up to date, DB2/LUW. It lacks some of the more arcane stuff, like LBAC, but is otherwise complete. There is a two core/2 gig resource limit; again, for the SMB world, and a BCNF (which is to say, minimized) schema, that shouldn't be a problem. Support is available for the same ballpark cost as PostgreSQL consultants' support.

The last bit is the most iffy. xTuple, and the main reason I got interested, implements the "business logic" in the database. Because this is PostgreSQL, for various reasons, this means lots o stored procedures, and some triggers. Well. One of the reasons given by coders for ignoring 90% of the facilities of an RDBMS (treating it as a file system store) is that it is impractical to port from one database to another, especially where triggers and functions and stored procs are used. Not so fast, buckaroo.

Some years ago, whilst toiling in a Fortune 100 (well, an out of the way minor group) one of those coders asked me about moving from DB2 to Oracle; was there any help for that? Turns out there was, and is. The company/product is called SwisSQL. It's been around since the mid 90's, and its purpose in life is to provide a translation application. It translates schemas and database code. It's not open source, no surprise there, but it does offer a limited time evaluation download. Since that's 30 days, I'll hold off on getting it until I'm ready to go. According to the site, it doesn't offer a PostgreSQL to DB2 code translation, but does for Oracle to DB2. We'll see just how close to Oracle PostgreSQL has gotten; it's said to be quite nearly identical. The biggest issue, from that page, is outer join syntax, but that should no longer be an issue since Oracle has supported ANSI join syntax since (from memory) 9i, thus current SwisSQL should recognize the syntax.

I plan on looking for one, or perhaps two, cases of gross denormalization, and refactoring to DB2 and testing. For that, I'll skip the schema/data translation offered by SwisSQL, which does include PostgreSQL to DB2. From there it's getting the "business logic" ported. That can take two forms. One is to just translate the triggers and procs, the other is to implement the logic with DRI. The whole point of SSD for RDBMS, so far as I'm concerned, is to replace as much code as can be done with data. That means tables, perhaps a slew of auxiliary tables. Joe Celko wrote this up in "Thinking in Sets", so I'll not make a big deal of it here. Well, other than to say that using a One True Lookup Table (mayhaps more than One) implementation isn't out of the question using DB2 on SSD.
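For those who'd like to see "replace code with data" in the flesh, here's a minimal sketch. It's Python driving SQLite rather than DB2, purely because that runs anywhere, and the tables (ship_method, sales_order) are invented for illustration; they are not from the xTuple schema. The rule "a sales order may only use a known shipping method" lives in a lookup table and a foreign key, not in a trigger or proc.

```python
import sqlite3

# Hypothetical tables for illustration; not taken from the xTuple schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE ship_method (
        code TEXT PRIMARY KEY
    );
    CREATE TABLE sales_order (
        id          INTEGER PRIMARY KEY,
        ship_method TEXT NOT NULL REFERENCES ship_method(code)
    );
    INSERT INTO ship_method (code) VALUES ('GROUND'), ('AIR');
""")


def try_insert(method):
    """Return True if the engine accepts the row, False if DRI rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales_order (ship_method) VALUES (?)", (method,))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("AIR")
rejected = try_insert("CARRIER PIGEON")

# Changing the rule is a data change, not a code change.
with conn:
    conn.execute("INSERT INTO ship_method (code) VALUES ('CARRIER PIGEON')")
accepted_after = try_insert("CARRIER PIGEON")
```

The point of the last three lines: the business rule changed, and nobody touched a line of procedural code.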

While I'm at it, in this preamble, a word or two on why DB2 is better for this sort of application. There is but one decent dead trees DB2 book, "Understanding DB2 ...", which has a chapter devoted to each of the major structural components of any database engine: the storage model, the process model, and the memory model. There's a lot of meat in those pages. The on-line doc site doesn't replicate those chapters, alas. Here's the storage section. The short answer is that DB2 allows one to associate a (set of) table with a page size, a tablespace (set of tables/indexes), and buffer memory. So, one can have a group of auxiliary tables supporting data constraints which are pinned to memory. Very useful, that. PostgreSQL, and most other databases, use globally allocated buffering, which leads to buffer flushing when you least want it. PostgreSQL supports only one page size per installation; it's set when you compile PostgreSQL.

This can all work, as the iTuple implementation shows, because the database store is agnostic to its clients. As it happens, stock xTuple client code is a Qt implementation, which is to say C++; not my favorite or oft used language. In theory, at least, xTuple clients could be java or PHP or COBOL (OK, the last was a joke); any language with a client binding to PostgreSQL to call the procs. For these purposes, I'll just be using Aqua to access the data initially. If I get motivated to re-factor an entire client request/screen (don't know yet), then the existing client won't know the difference. That assumes that this can be isolated enough, since the rest of the code will still be looking to a PostgreSQL database. No promises on that part.

Off on the Yellow Brick Road.

07 October 2010

The Sands of Time, Much Less of It

Bloody Hell. SandForce have announced the specs for their next generation controller(s), the SF-2xxx series. And, our friends at AnandTech have published a short "review" of the spec.

This is getting bloody ridiculous. The SF-2xxx specs out better than the STEC Zeus, by a long way. The Zeus has been considered the creme-de-la-creme of enterprise flash SSD. The SandForce controller, if it works as described, puts STEC in a bind, and the enterprise SSD buyer in the driver's seat.

And then there's this, from a Reg article:
"Barry Whyte, an SVC (SAN Volume Controller) performance expert and master inventor at IBM, thinks the previously standard 15,000rpm 3.5-inch disk drives could vanish from enterprise array's performance tier in 18 months."

Come on folks, BCNF databases are yearning to be free, free I say. We must do this.

06 October 2010

I Hear You Wanna be A RAP Star

The intent of this endeavor has always been to demonstrate that the relational model, when implemented in a BCNF database, is not only the most efficient storage model, but also the most robust. One of the side effects of BCNF databases is that they are ideal for code generation. After all, catalogs are simply plain text (and with none of the noise level of xml) and thus sources for text munging.

Today another code generation framework came by. While it doesn't make any BCNF claims (that I can find, at least) nor does it promote SSD as appropriate to its implementation, both fit precisely.

Here's the web site and over here is a White Paper describing the use of the system. Since it's not only SQL Server but also implemented as a Visual Studio project, and I'm on linux only, I've no way to test it. I've gone through the White Paper, and it does look interesting. I will note that the silliness of naming all tables with a prepended "TB" is a tic we should all have outgrown by now.

There isn't any mention of parsing of either check constraints or stored procedures in the generation of the UI. The check constraint bit shouldn't be any more difficult than the foreign key business, although the stored procedure wouldn't be a walk in the park. With a BCNF catalog, one could define lookup tables as real tables as foreign key targets. An SSD stored database would make that same, same. For that matter, using a database such as DB2, where one can define separate tablespaces/bufferpools, such (I'll assume small) lookups won't be heavily modified and can be assigned to a dedicated bufferpool of sufficient size that none of the rows will be evicted. Done, go home.
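To make the "lookup tables as foreign key targets" bit concrete, here's a small sketch of a catalog-driven pick-list generator. It's Python against SQLite (not SQL Server, and not the framework in the White Paper), and the tables are invented. The only catalog call used is SQLite's PRAGMA foreign_key_list; a DB2 or SQL Server version would read that engine's own catalog views instead.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_status (code TEXT PRIMARY KEY);
    CREATE TABLE purchase_order (
        id     INTEGER PRIMARY KEY,
        status TEXT NOT NULL REFERENCES order_status(code)
    );
    INSERT INTO order_status (code) VALUES ('OPEN'), ('CLOSED');
""")


def pick_lists(conn, table):
    """Map each foreign key column of `table` to the values it may take."""
    lists = {}
    # Each row is (id, seq, parent table, from column, to column, ...).
    # Identifiers are interpolated because they come from the catalog itself,
    # not from user input.
    for _id, _seq, parent, from_col, to_col, *_rest in conn.execute(
            f"PRAGMA foreign_key_list({table})"):
        rows = conn.execute(
            f"SELECT {to_col} FROM {parent} ORDER BY {to_col}").fetchall()
        lists[from_col] = [r[0] for r in rows]
    return lists


choices = pick_lists(conn, "purchase_order")
```

Each key in `choices` is a column the UI should render as a pick list, and the values are what goes in it. No hand-written screen code knows about order_status at all.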

05 October 2010

Jingle Bells

Santa's early this year. Well, the notice of his sleigh contents anyway. Here's the next X-25M spec, courtesy of AnandTech. The magic words: 25nm process, and a "power safe write cache", and 600Gig. Woo hoo.

Lead, follow, or get the hell out of the way. The BCNF database is comin' through.

27 September 2010

Someday My Prince Will Come

Lordy, lordy, I've found my Prince Charming. Well, in the sense of a soul mate who understands what's going on with Oracle, and why. He doesn't delve into the ultimate goal: the IBM mainframe customer base, but he gets it.

Have a read. While not a Goldman analyst, only a The Street columnist, he should qualify as a Mainstream Pundit. I think so anyway. He's cute. Got good eyes and nice hair.

The key point is understanding that Cloud is a highly fungible term. While some define it as a resource farm used by scads of unaffiliated applications (typically, Public Cloud), Larry understands that it just means centralized data and thin (even, dare we say it, dumb) terminals. Whether the resources are "Public", ala Amazon/Google/FooBar, or "Private", ala XYZ corporate computing infrastructure, it's all about location, location, location. From Larry's point of view that means an Oracle database sitting somewhere; maybe even an Oracle datacenter. Running all those newly announced integrated applications.

22 September 2010

Paranoia Strikes Armonk

Since Oracle bought Sun, I've been (uniquely, so far as I know) insisting that Larry's goal was to suck up IBM's mainframe clients, since they're the last significant number (and of really good size) of "legacy" installs to be had. Until today, the Usual Pundits haven't agreed. Ah, but news is news.

Today's Times tells us about the annual lovefest, and in the process, explicates Larry's goal. Following are a couple of quotes.

"But through its acquisition spree, Oracle moved well beyond the database and into business software, buying up the important products that companies use to keep track of their technology infrastructure, employees, sales, inventory and customers."

IBM did this in the mainframe world, initially by supplying the applications themselves. Once they killed off the Seven Dwarves (wikipedia for: IBM Seven Dwarves, for a decent bit of history), and the DoJ got under their saddle, letting software vendors work on the machines was allowed. Larry's strategy was clear before, but can't be ignored now: buy up the Oracle based application software used by the Fortune X00 (and let your fingers do the searching for how well that's gone for the clients of the bought out companies), then build a machine that's tied to the database. Just what IBM has, for now at least.

"With Sun, Oracle has found a way to sell customers hardware bundled with all that software in a fashion similar to that of its main database rival, I.B.M. Oracle executives say they can build better, faster, cheaper products this way by engineering complete systems rather than requiring customers to cobble together the parts."

Well, monocultures (the term often used to describe Windows, and explain the virus vulnerability of it) are never a good thing.

"But customers are objecting to Oracle's moves. For example, some of Sun's largest former customers consist of the large Wall Street players, and they pushed back this year when Oracle moved to limit their choices around the Sun technology. Oracle ultimately gave in to their pleas, reaffirming deals that would let Hewlett-Packard and Dell offer prized Sun software on their hardware."

That first picture in the article is the newest toy, the Exalogic machine. Letting my fingers do the searching, I came up with this Oracle page. Of note; it's built on SSD. Armonk, we have an attack!!

16 September 2010

Be Careful What You Wish For

As the saying goes, "Be careful what you wish for, you just might get it". I was visiting one of my favorite SSD sites, storagesearch.com, which led me to Solid Access, the text under the Technology tab. In their spiel was this:

I/O acceleration is achieved in applications by off-loading I/O-demanding files ("hot files", typically less than 5% of the content) onto an SSD for processing at RAM speed and using mechanical disks (or RAIDs) to process the remaining "cold files". This instantly improves the efficiency of the application servers by recovering CPU cycles formerly lost in I/O wait loops.

Why did this get my attention, you may ask? I will answer. Well, duh! Of course! Multiprogramming, even on multicore/processor hardware, depends on "idle" cycles, and such idle cycles predominantly derive from I/O waits. Mainframers learned this in the 1960's. If there is a proliferation of SSDs, whether my version where SSD is the sole storage medium, or the case where SSD is merely cache, one will see a reduction of I/O waits. Multiprogramming becomes much more problematic, expensive, and delicate, in this circumstance.

With I/O waits, scheduling is easy (well, as easy as that sort of thing gets); the application that *can't* do anything is skipped in favor of one which can use the cpu. What happens when all applications the OS sees are ready and rarin' to go? Algorithms will need to be developed which rank applications' priorities, somehow or another. Do the order entry folks get priority, or the accountants updating the GL? Who's the top dog?
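A toy model of the difference, in Python. This is not how any real kernel scheduler is written; the Task class, the priority numbers, and the two task lists are all made up to show where the decision comes from in each case.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int      # hypothetical ranking; higher runs first
    io_blocked: bool   # True while waiting on storage


def pick_next(tasks):
    """Choose a task to run: skip blocked ones, then rank whatever is left."""
    ready = [t for t in tasks if not t.io_blocked]
    if not ready:
        return None  # everyone is waiting; the cpu idles
    return max(ready, key=lambda t: t.priority)


# Rotating rust: the I/O wait makes the decision for the scheduler.
disk_era = [Task("order_entry", 1, True), Task("gl_update", 1, False)]

# SSD: nobody waits, so the choice falls to an explicit priority
# that somebody had to assign.
ssd_era = [Task("order_entry", 2, False), Task("gl_update", 1, False)]
```

With `disk_era` the priorities never matter; with `ssd_era` they are the whole story, and somebody has to pick them.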

*nix operating systems have nice, but nice, ultimately, depends on human intervention. I'd wager that nice is unknown to most coders whose work ends up on *nix systems. It's going to be interesting.

I just let my fingers do the searching, and found this article, which does mention SSD and scheduling together, although I don't see any direct discussion of removing I/O waits and multiprogramming. So, there is some consideration out in the literature; this paper is March, 2010.

We aren't in Kansas anymore. SSD proliferation may mean that multiprogramming is passé; each core/processor will be dedicated to an application/process, since there isn't exploitable I/O wait in the machine anymore. If storage is on PCIe, or similar, I/O wait gets yet slimmer. Oh my! OS's would be greatly simplified, since multiprogramming support is the gnarliest part of any industrial strength OS.

What about database engines? They act like an OS, in that they have much the same responsibility: serving multiple clients, although the clients' needs are constrained. In the old days, there were Pick and AS/400; "integrated" OS and database. Could we see linux/db in the next few years? Might be. It's been accepted wisdom in the *nix world for many years that one should reduce, or even eliminate, OS buffering in favor of the database doing all of that itself; after all, what point is there to having multiple memory images of a file that's only of interest to the engine? Hmm.

15 September 2010

Watch Yourself, Streak

A few posts ago I predicted that the emergence of iPad (and similar) would bring the warehouse tablet paradigm into wider use. I also predicted that such use (being at much smaller dimensions) is a perfect opportunity for BCNF databased systems to prosper at the expense of flatfile bloatware.

Well, Dell is pushing Streak for medical uses. Boy howdy. Head 'em up, move 'em out. Real databases will win.

Now, if I can just find some folks in this benighted neck of the woods who've figured it out, too.

10 September 2010

What Am I Bid for This Fair Maiden????

There's movement in the storage bin, these days. I missed out on the 3Par explosion, by a few hours. Alas. Today brings STEC into the buyout rumour factory.

Such a wonderful opportunity to speculate on the future, and I won't refuse.

The assumed buyers are Dell, IBM, Oracle, EMC. I'm not convinced that any buyout is in the making, so I didn't go and buy more. My reasoning is simple: I've not seen any evidence that any of the assumed players (nor any other) understand that flash SSD is useful for relational databases, done as Dr. Codd instructed. As generic byte storage, nope. The assumed players, especially Dell and EMC, haven't said or done anything to indicate they've made the connection.

The lack of connection is not surprising, since BCNF data storage means either green field development (a small corner of the enterprise space which consists largely of running 40 year old COBOL through DB2 or Oracle, stuffed full of simpleminded files), or refactoring those file image databases to BCNF (or something that looks an awful lot like that). Enterprises don't do that sort of thing. Mark Hurd (you've heard of him?) made HP "profitable" by cutting out any activity that smacked of R&D or innovation. Larry just hired him, so that tells you all you need to know about whether Oracle would embrace SSD.

That leaves IBM as possible buyer. STEC is qualified to IBM now. Why would a non-storage company (IBM or Oracle or ...) want to own STEC? The only rational reason is to own the controller IP STEC has established, and either take it off the market or re-price it significantly higher. Re-pricing is a non-starter. There are a number of, and growing, controller vendors, taking ever more clever approaches. Fusion-io is one; SandForce another. STEC's "enterprise" controller is generally believed to be superior, and has replaced other vendors. On the other hand, STEC has been pushing its lower performing drives, the Mach class, presumably due to customer pressure to move lower priced parts. The Zeus parts have higher margins; STEC has admitted that.

IBM buying STEC would be a watershed event. The company has been shedding physical production for the better part of a decade; as a result, bringing production in house would be an immense change. They aren't likely to do that in order to sell the parts; I just don't see that. We're left with sequestering STEC controller technology in IBM. Could they break supply contracts with EMC, et al? Probably not. If they could make STEC exclusive to IBM, how would they exploit exclusive access to STEC tech? Again, SSD isn't competitive for massive byte store. The only way is to push the RELATIONAL part of RDBMS. IBM hasn't shown that they get it. Most of their database revenue (and virtually all of that growth over the last decade) has come from moving Fortune X00 companies from COBOL/VSAM to COBOL/DB2. Such companies aren't interested in doing anything more than getting the bytes from VSAM to DB2; DCLGEN is the extent of the database design.

So, in the end, while it would make me a few bucks and rock my world, I don't see IBM buying STEC. Dell or EMC wouldn't make any material difference, while Oracle has Sun's quasi-SSD flash appliance already.

Tempest in a teapot.

01 September 2010

Touching Me, Touching You (Second Chorus)

Diligent readers know that, while this endeavor began with my having some free time to make a public stand for the full relational model/database due to the availability of much less expensive flash SSD (compared to DRAM SSD, which have been around for decades) in a "normal" OLTP application, the world changed a bit from then to now. In particular, the iPad. I've mentioned the implications in earlier postings.

Now, as regular readers know, the iPad is not especially new, from a semantic point of view. Tablets have been in use in warehouse software applications (MRP/ERP/Distribution) for a very long time. (This is just a current version.) I programmed with them in the early '90s.

But the iPad does mean that mainstream software now has a new input semantic to deal with: touch me, touch me, not my type. So, it was with some amusement that I saw this story in today's NY Times. Small-ish touch screens mean small bytes of data, a bit at a time. The 100 field input screen that's been perpetuated (in no small measure as a result of the Fortune X00 penchant for "porting" 1970's COBOL screens to java or php) now for what seems like forever is headed the way of the dodo. It simply won't work. And the assumption that "well, we'll just 'break up' those flatfiles into 'sections'" will fail miserably. There'll be deadlocks, livelocks, and collisions till the cows come home.

BCNF schemas, doled out in "just the right size" to Goldilocks, are the way forward. Very cool.

31 August 2010

What's It All Mean, Mr. Natural?

The issue is: what does Oracle think it can win? The answer appears to be a fat license fee from Google. The fact that some Google folk once worked at Sun is irrelevant. Dalvik was built independently of the jvm, and doesn't resemble it. It does not translate/compile java bytecode/classfiles on the fly. The development is done in java (SE, I believe, via Harmony/Apache; if they use the ME, then there's trouble). Once the .class file exists, it is translated to Dalvik .dex format. This is no different from using C to write a java compiler. Or using java to write any other DSL. Unless, and I don't know the answer, there is verbiage somewhere that .class files "must be" run on a certified jvm, then Google is fine.

27 August 2010

Ya Know How Fast You Was Goin', Son?

So yeah, boy, do ya know how fast you was goin'? Turns out, speed isn't everythin'. Just read up on the tests of mid to high-end SSDs. Here's AnandTech's page. And a quote from the Vertex Limited Edition: "Saturating the bandwidth offered by 3Gbps SATA, the Crucial RealSSD C300 and OCZ Vertex LE are the fastest you can get. However, pair the C300 with a 6Gbps controller and you'll get another 70MB/s of sequential read speed." And these are just retail/consumer parts.

I've seen (didn't note the cite, alas) articles stating that enterprise SSDs need to go mano-a-mano with controllers. This is easy to understand.

So, is SPEED the reason to use SSD? Well, of course not. The reason for SSD is BCNF (or higher, dare I submit) datastores. Those ridiculous speed numbers are for "sequential" reads, and sequential only happens (in the physical reality meaning of the word) on bare silicon. And, in any case, SSD won't be price competitive with rotating rust for quite some time. Both rust and silicon have physical limits, it's just that the silicon limit happens at a much lower volumetric density when used for persistent storage. You're not going to be storing all those 3D dagger-in-your-neck B-movies you just have to have forever on silicon.

It is likely a Good Thing that these teenage SSDs are running into Boss Hogg; they need to find a more meaningful purpose in life. I've got just that.

25 August 2010

Oslo

Once again, the folks at simple-talk have loosened my tongue. The topic is Oslo, or more directly, its apparent demise. I was moved to comment, shown below.

I'll add a more positive note; in the sense of what Oslo could/would be.

I've not had much use for diagramming (UML, etc) to drive schema definition/scripting. Such a process seems wholly redundant; the work process/flow determines the objects/entities/tables, and converting these facts to DDL is not onerous.

OTOH, getting from the resulting schema to a working application is a bear. There have been, and still are, frameworks for deriving a codebase from a schema. Most are in the java world (and, not coincidentally I believe, COBOL some decades past and not the first; it's that ole enterprise automation worship), but without much fanfare. I suspect, but can't prove, that Fortune X00 companies have built internally (or, just as likely, extended one of the open source frameworks) such frameworks.

This is what I thought Oslo was to be: a catalog driven code generator. My obsession with SSD these days (and it's looking more and more like the idea is taking hold Out There) still convinces me that BCNF catalogs can now be efficiently implemented. Since such catalogs are based on DRI, and such kinds of constraints are easily found and translatable, generating a front end (VB, C#, java, PHP, whatever) is not a Really Big Deal. Generating, and integrating, code from triggers, stored procs, check constraints, and the like is a bit more work, but with more normalization, constraints become just more foreign keys, which are more easily translated.

That's where I expected Oslo was headed. This is not an ultimate COBOL objective, but a "drudge work" tool for database developers (and redundancy notice for application coders, alas). Such a tool would not reduce the need for database geeks; quite the contrary, for transaction type database projects we finally get to call the tune. Sweet.

And, I'll add here, that the shift that's clearly evident with iStuff and Droids leads to another inevitable conclusion. A truly connected device, and phone devices surely are, means we can re-implement the most efficient transaction systems: VT-100/*nix/database. Such systems had a passive terminal, memory mapped in the server, so that the client/server architecture was wholly local. Each terminal has a patch of memory, and talks to the database server, all on one machine. No more 3270/web disconnected paradigm. With the phone paradigm, the client application can be on the phone, but it has a connection to the server. For those with short memories, or who weren't there in the first place, the client/server architecture was born in engineering. The point was to ship small amounts of data to connected workstations that executed a heavy codebase, not to ship large amounts of data to PCs that execute a trivial codebase. The key is the connection; http is a disconnected protocol, and is the source of all the heartburn. The frontrunners will soon have their fully normalized databases shipping small spoonfuls of data out to iPads/Droids. That's the future.

18 August 2010

The A-Team

A busy day. SanDisk just released news about their latest SSD. How does it fit the point of this endeavor? Read on, MacDuff.

Here's my floating in the clouds (well...) concept of the use of such a device.

Let's say you're running Oracle Financials to a Big Droid, mentioned in that post from earlier today. How does an embedded 64G SSD fit in? How about this: the Big Droid has SQLite installed, talking to that SSD, OF on the linux machine is fully normalized (I've no direct experience with OF, but I'll guess that it's been de(un)-normalized). The Big Problem(tm) with web based applications is the bloat load of data passed over the wire (increasingly virtual wires) from all those fat flat files coders love (I'm talking to you, xml).

Lots of local storage changes that equation. Rather than synthesizing the joined rows on the server, and sending the result set over the wire, we can install SQLite (or similar, SQLite is currently in the Droid stack) on the Big Droid, and send only the normalized rows, letting SQLite store them to a receiving table. SQLite then synthesizes the bloat rows, which the Big Droid App can see and do what it wants with same. After the User makes any changes, SQLite sends back the delta (normalized) rows. Wire traffic drops by a lot, as much as an order of magnitude.
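A back-of-the-envelope version of that, in Python with the standard sqlite3 module. Everything in it is made up for illustration: the two tables, the row counts, and JSON as the wire format. Nothing comes from Oracle Financials or any Android SDK, and how much you save depends entirely on how fat the repeated columns are.

```python
import json
import sqlite3

# Hypothetical data: 3 customers, 300 orders.
customers = [(i, f"Customer {i}", f"{i} Main Street, Springfield")
             for i in range(3)]
orders = [(n, n % 3, n * 10) for n in range(300)]

# Server-side join: every order row repeats its customer's name and address.
joined = [(oid, customers[cid][1], customers[cid][2], amt)
          for oid, cid, amt in orders]
joined_bytes = len(json.dumps(joined))

# Normalized shipment: each customer row travels exactly once.
normalized_bytes = len(json.dumps({"customer": customers,
                                   "sales_order": orders}))

# Device side: store the skinny rows, let SQLite rebuild the wide ones.
device = sqlite3.connect(":memory:")
device.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE sales_order (
        id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
""")
device.executemany("INSERT INTO customer VALUES (?, ?, ?)", customers)
device.executemany("INSERT INTO sales_order VALUES (?, ?, ?)", orders)
rebuilt = device.execute("""
    SELECT o.id, c.name, c.address, o.amount
    FROM sales_order o JOIN customer c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
```

The device ends up with exactly the rows the server would have sent (`rebuilt` matches `joined`), but the wire only carried `normalized_bytes` instead of `joined_bytes`.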

To get really on the edge, Oracle on the linux server could *federate* those SQLite clients and write to the SQLite tables *directly*. Normalized, skinny tables. Almost no data has to go over the wire. And they once said that Dick Tracy's wrist radio could never happen.

To quote my Hero Hannibal Smith, "I love it when a plan comes together".

[UPDATE]
OK, so perhaps I should have figured that I'm not the first person, although it seemed so since my circle of web sites haven't talked about it, to see that native apps on iStuff/Droid have a natural client/server architecture which can exploit RDBMS on the server (the SSD sort I'm promoting). Native apps, not web/http stuff. So, here's the first article that came up when I let Google do the searching.

The money quote:
In those cases where they actually need to capture data, they require ultra-simple applications that shape the device into a very specific tool that follows the most optimized data capture process possible. Indeed, this is what iPad is good for - it affords developers the opportunity to move the technology aside, replacing it with a shape-shifting experience. Successful data-centric apps will transform the experience and cause the technology to melt away.

Quite some number of posts ago, I made the point that iStuff changes the input paradigm; to picking, not typing. And that picking lends itself (since picking has to be reduced to some manageable number of choices) to normalized data; huge scrolling screens with dozens (hundreds, I've seen) of input fields just won't work. Again, there is existing prior art; the whole host of tablet based ERP modules.

Of course, I've not delved into the SDK's for these devices (don't have a Smart Phone), so it could be that none of my notions is possible. But SQLite is in the Droid stack, so I'd be willing to bet a body part that it is fully doable. Does this sound like it? And this is the framework, also using SQLite on the device.

So, yes, you can do tcp on Android, ignore the skateboard and scroll down to 13 May. Not quite ready for Prime Time, but really, really close; once you've got tcp, ya gots da database. Yummy.

Hannibal was right.

Black, No Cream, No Sugar

The Oracle/Google fight is too interesting not to write about. I've, until now, only contributed to various posts on various blogs, so here's my latest (from an Artima thread), somewhat expanded.

- java *is* Oracle's core, even before the buyout (more later)

- java ME is a bust, but Dalvik is a winner. If you're Oracle why not try to get some of that pie? They'll waste a lot of time and money if Google doesn't settle "Real Soon Now", which I don't think they will.

- to the extent that cloud, SaaS, PaaS, WhateveraaS gains mindshare, Oracle either needs to quash it or get a wagon for the wagon train. This attack could accomplish either; Dalvik is made to go away, or Oracle gets it through a free cross-license. I mean, why not run Oracle Financials on a big Droid? Why not? It's not much different, semantically, from OF/*nix/VT-220, just with pixels. Folks run around warehouses today with tablets and WiFi, why not go all the way?

- there was a time when COBOL was the language of the corporation (still is in some parts of some corporations), and there was/is an ANSI standard COBOL, but no one bothered much with it (in the corporation). IBM had its own version, and that runs on its mainframes/minis. Oracle has made java the language of its corporate applications. It might be, they think, a Good Thing if there's Oracle java and some ANSI-java that no one cares about. IBM, unlike M$, forked java in a compliant way, too. If one believes, as I do, that part of the game plan in taking Sun was to build a platform to attack the IBM mainframe business (the last existing fruit on the tree), then having a market dividing stack of Oracle database/java/Sun machines makes some sense; a way to lock-in clients top to bottom.

Larry has always had a strategic view of business; he just wants to have the biggest one. Buying Sun has to be seen in that context. The question observers have to answer: how does buying Sun support that strategy? The knee-jerk reaction was java. Then it was MySQL (if you review the initial objections, they related to control of java; only later was MySQL considered). Again, the largest part of the existing computing pie that Larry has no part of is mainframe (and IBM has the largest part of mainframe computing as it's ever had); I think he wants that, in the worst way. In order to do that, he has to have an alternative. The database is one-third of that. Oracle has been eating DB2's lunch, off mainframe, for years and it keeps getting worse. DB2, thanks to a special codebase just for the mainframe, is the only meaningful database for the mainframe.

To break the cycle, Larry has to have a combine of applications/language/machine which makes a case. Building the Total Oracle Stack(CR) is what he has to do. I've just spent a decade in the financial services industry, and there, the language of choice has become javBOL (or COBava): java syntax used in a COBOL sort of way, largely by COBOL coders re-treaded. DB2 still rules, but with the falling price, and rising power, of multi-core/processor/SSD linux machines (largely on X86 cpu's) Larry has an opening. Those re-treaded COBOL coders are nearing end-of-life, literally. While some number of Indians are conscripted into COBOL to backstop the shortage, none hangs around very long; domestic CS graduates still "won't do the work". But COBOL's days are numbered; there's just too much else to do in CS that's interesting and doesn't require such a stupid language.

Larry can make the case to switch to Oracle applications now that he has a stack, if he can control java.

12 August 2010

The Crucial Difference: Micron Sized

I missed this earlier AnandTech test which is referenced from today's newest. At least, I think so.

This is a long-ish announcement piece, not a test, so we'll have to wait on that. Both the P300 and C300 use Marvell controllers, which are not widely used. The P300 is labeled as Micron, not Crucial. We'll see. The photo isn't even the P300. If it doesn't come with a SuperCap, we can conclude that Micron isn't really serious.

I will say, just having scanned the earlier piece, that anyone who even tests a database with RAID 5 has significant issues with database design and implementation. That koder kiddies will do so is no excuse.

10 August 2010

Take the A Train

Rails and I have a contorted history. I first engaged when I was looking around for schema based code generation tools, and Rails had Scaffolds. Turned out that DHH didn't like Scaffolds, and they kind of disappeared from the Rails landscape, late 1.x time frame.

I've peeked in every now and again since, so today I wandered over here via the Postgres site. I don't yet know whether Mr. Copeland is a database guy at heart, or yet another coder pretending to be one. OTOH, these are his notes from a talk given by two others, who, if the notes are to be believed, haven't drunk the Koder Kool Aid (it really was Flavor Aid, for those who get it). Of particular piquancy:

9:00 One query, 12 joins - complicated, but query time goes from 8 seconds to 60 ms.

20:00 Use constraints, FKs, etc to preserve data integrity - "anything you don't have a constraint on will get corrupted"

42:00 Do analytics in the database. Saw speed improve from 90s to 5s and saved tons of RAM.

1:01:40 Tune PostgreSQL - shared_buffers, work_mem, autovacuum, etc. Rely on community knowledge for initial configuration.

The "Use constraints" one is really, really important. The notion that only the application code should edit input is the wedge issue. Iff the code will only, forever, be the sole user of the data (and you *know* that's baloney) should the application code do it all. And, in that case (presumably because "performance" can only be attained by ignoring RDBMS' services) suck up your cojones and write bespoke file I/O like a real COBOL coder. The RDBMS, modulo vanilla MySql, is going to provide the services anyway. Otherwise, never trust ANY client input. In most cases, never trust ANY client read request (bare sql). The purpose of the client code is to display (perhaps to another program or file, not just a screen) data and pass back data. That's it.
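To make the constraints point concrete, here is a minimal sketch using Python's bundled sqlite3 module rather than PostgreSQL. The two tables are cut-down, hypothetical versions of the Personnel/Dependents tables used elsewhere on this blog, and the `try_insert` helper is my own invention for illustration. The engine, not the client, decides what gets in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ships with foreign key enforcement switched off; turn it on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE personnel (
    emp  INTEGER PRIMARY KEY CHECK (emp > 0),
    dept INTEGER NOT NULL CHECK (dept BETWEEN 1 AND 99)
);
CREATE TABLE dependents (
    emp       INTEGER NOT NULL REFERENCES personnel (emp),
    dependent TEXT    NOT NULL,
    PRIMARY KEY (emp, dependent)
);
""")

def try_insert(sql, params):
    """Return True if the engine accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "valid employee":    try_insert("INSERT INTO personnel VALUES (?, ?)", (1000001, 42)),
    "dept out of range": try_insert("INSERT INTO personnel VALUES (?, ?)", (1000002, 250)),
    "orphan dependent":  try_insert("INSERT INTO dependents VALUES (?, ?)", (999, "A")),
    "valid dependent":   try_insert("INSERT INTO dependents VALUES (?, ?)", (1000001, "A")),
}
for case, accepted in results.items():
    print(f"{case}: {'accepted' if accepted else 'rejected'}")
```

Any other client that ever touches these tables gets the same treatment, which is the whole point.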

Which brings me to the analytics note. PG is a little short on analytical functions; DB2/Oracle/SqlServer all support SQL-99/03 functions and add more, but use what's there. The same can be said for ETL, too; in most cases sql will get the job done. What the ETL crowd don't get is that the database is closed over its datatypes. There are some syntactic issues going from vendor A to vendor B databases, but the engines are quite capable of transforming from one consistent state to another all on their lonesomes.
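A small sketch of what "sql will get the job done" can look like, again with sqlite3 standing in for a real engine. The `dept_summary` table and the sample rows are made up for illustration; the transformation never leaves the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE personnel (
    emp    INTEGER PRIMARY KEY,
    dept   INTEGER NOT NULL,
    salary REAL    NOT NULL
);
INSERT INTO personnel VALUES (1, 10, 100), (2, 10, 300), (3, 20, 500);

-- Hypothetical reporting table, populated from the operational one.
CREATE TABLE dept_summary (
    dept       INTEGER PRIMARY KEY,
    headcount  INTEGER NOT NULL,
    avg_salary REAL    NOT NULL
);
""")

# The "ETL" step is one statement: the engine reads, aggregates, and writes.
with conn:
    conn.execute("""
        INSERT INTO dept_summary (dept, headcount, avg_salary)
        SELECT dept, COUNT(*), AVG(salary)
        FROM personnel
        GROUP BY dept
    """)

summary = conn.execute(
    "SELECT dept, headcount, avg_salary FROM dept_summary ORDER BY dept"
).fetchall()
print(summary)
```

No rows cross the wire to a client just to be summed and shipped back.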

One of the knobs I miss from DB2 is the ability to assign bufferpools (DB2's term) at the tablespace level. PG now has tablespaces, but so far as I can see, buffering is at the engine/instance level. Someday.

08 August 2010

Why Don't You Even Write?

Do you recall, while porting the SSD test from DB2 to PostgreSQL (see, I'm even capping as one is supposed to), that I lamented not being able to write out rows to multiple tables in a Common Table Expression? DB2 can't either, nor have I yet confirmed whether any ANSI level, including 2003, specs it.

But I've just found this presentation for PG. And here's a test of it. Boy-howdy. Finally, a feature that DB2 doesn't have; well soon, maybe. In any case, have a read, it's really cool.

06 August 2010

Mr. Fielding (not of Tom Jones)

Our friends at simple-talk have their weekly (?, thereabouts) interview, with Roy Fielding. He "invented" REST, and has some things to say about it, but this quote made me chuckle, because I've been there, and felt the pain:

You must remember REST is a style and SOAP is a protocol, so they are different. One can compare RESTful use of HTTP to SOAP's use of HTTP. In contrast, I don't know of a single successful large architecture that has been developed and deployed using SOAP. IBM, Microsoft, and a dozen other companies spent billions marketing Web Services (SOAP) as an interoperability platform and yet never managed to obtain any interoperability between major vendors. That is because SOAP is only a protocol - it lacks the design constraints of a coherent architectural style that are needed to create interoperable systems.

SOAP certainly had a lot of developer mindshare, though that was primarily due to the massive marketing effort and money to be found in buzzword-based consulting.

One of the hallmarks, at least to me, about REST is that its verbs (from HTTP) match, to a tee, those of the relational database. Yet, maybe for that reason, the procedural goop of SOAP (lots 'o rinsing needed) enveloped the Fortune X00. Oh well, someone will pay the price.
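The correspondence I have in mind can be sketched in a few lines. This is my own illustrative mapping (POST/GET/PUT/DELETE onto INSERT/SELECT/UPDATE/DELETE), not anything from Fielding, and the `handle` function and one-table schema are hypothetical; there is no HTTP server here, only the verb dispatch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (emp INTEGER PRIMARY KEY, dept INTEGER NOT NULL)")

# Assumed mapping: one HTTP verb, one SQL verb.
STATEMENTS = {
    "POST":   "INSERT INTO personnel (emp, dept) VALUES (:emp, :dept)",
    "GET":    "SELECT emp, dept FROM personnel WHERE emp = :emp",
    "PUT":    "UPDATE personnel SET dept = :dept WHERE emp = :emp",
    "DELETE": "DELETE FROM personnel WHERE emp = :emp",
}

def handle(verb, emp, dept=None):
    """Run the statement for the verb; only GET returns any rows."""
    with conn:
        return conn.execute(STATEMENTS[verb], {"emp": emp, "dept": dept}).fetchall()

handle("POST", 1, 10)
after_post = handle("GET", 1)
handle("PUT", 1, 20)
after_put = handle("GET", 1)
handle("DELETE", 1)
after_delete = handle("GET", 1)
print(after_post, after_put, after_delete)
```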

05 August 2010

The End of the Yellow Brick Road?

There is an analysis of the Intel settlement at AnandTech, which talks about Intel, AMD, NVIDIA and how they may, or may not, be getting along swimmingly henceforth. But buried in the discussion is the fact that support for the PCIe bus by Intel was part of the suit and settlement. In particular, Intel is only committed to support the bus through 2016.

The article goes on at some length about NVIDIA not being on the same page as Intel and AMD; the analysis of PCIe support is only in GPU terms. But not just super duper gamers graphic cards use that connector. Fusion-io, among a growing number of others, does too. There have been changes to "standard" disk drive connections over the years, so in one sense this could be viewed as just business as usual in the computer business. On the other hand, will EMC, IBM, and such be as easily convinced that Fusion-io (or whoever) has gone on the right path to SSD implementation? Knowing that all of your storage protocol could disappear at a known point in the future might give one pause.

02 August 2010

A Crossword Puzzle

I thought I would take some time this weekend to stage the ultimate test: the cross-join. Now, the tables I have at my disposal for this test, with enough rows to make it interesting, are Personnel and Dependents. While one would not expect to meaningfully cross-join such tables (excepting data for certain primitive societies/religions), they do serve the direct purpose.

So, for those not familiar, the cross-join (old syntax):

Select count(*) from personnel, dependents

I chose count(*) simply to remove the screen painting cost from the exercise; I merely want to measure the data cost. There are 1,200,240,012 synthesized rows.
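For those who have never run one: the old-syntax cross-join pairs every row of one table with every row of the other, so the count is simply the product of the two row counts. A toy-sized sketch in sqlite3, with deliberately tiny, made-up tables (5 employees, 3 dependents each), shows the arithmetic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE personnel (emp INTEGER PRIMARY KEY);
CREATE TABLE dependents (
    emp       INTEGER NOT NULL,
    dependent TEXT    NOT NULL,
    PRIMARY KEY (emp, dependent)
);
""")
with conn:
    conn.executemany("INSERT INTO personnel VALUES (?)",
                     [(e,) for e in range(1, 6)])
    conn.executemany("INSERT INTO dependents VALUES (?, ?)",
                     [(e, d) for e in range(1, 6) for d in "ABC"])

n_personnel = conn.execute("SELECT COUNT(*) FROM personnel").fetchone()[0]
n_dependents = conn.execute("SELECT COUNT(*) FROM dependents").fetchone()[0]
# Comma-separated FROM list with no join predicate: the cross-join.
n_cross = conn.execute("SELECT COUNT(*) FROM personnel, dependents").fetchone()[0]
print(n_personnel, n_dependents, n_cross)
```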

I ran the query against the SSD database, and the HDD database (well, sort of).

The timing for SSD: 452.87 seconds, or about 8 minutes.

The timing for HDD: well, it never finished.

I initially ran both with 5 bufferpools, in order to force hard I/O in both cases. The SSD tables ran just fine. When I ran the HDD tables, it eventually errored out with a bufferpool exhaustion error. So, I increased the bufferpools for the HDD database to 100, and let 'er rip. 3 hours (about) later it errored out with a divide error.

A, somewhat, more fair test might be the cost of a range query between the two structures, that is a PersonnelFlatCross with the billion plus rows versus the normalized tables. If I can get DB2 to load the table, I'll give it a try.

29 July 2010

C3PO's Basket

Tim Bray's blog is one I read for reasons opaque, even to me, and has this recent posting. After all, he's the source of the scourge of mankind, xml. But I visit anyway. In the course of the comments, I made the following:

@Ted Wise:

languages that don't expend too many CPU cycles or chew up too much memory since the first will kill your battery and the second is a limited resource.

Perhaps the answer is Flash with a SSD driver under SQLite (or similar). You move local data to the SQL engine on dense and cheap Flash, saving DRAM for code. This would entail writing RDBMS explicit code in the applications, which may not be to the liking of typical client coders. The 25nm parts are due in a few months.

The post questions which, if any, other language is appropriate to the 'Droid, and Ted Wise questions whether anything other than java (or other C-like compiled) is appropriate. That assertion led to my comment, the relevant part of which is above. I had been thinking for some time that always connected, sort of, devices on a network are semantically identical to the VT-220/RS-232/database/unix systems of my youth. In such a semantic, with appropriately provisioned multi-core/processor/SSD machines, BCNF databases with server-side editing of screens is perfectly, well, appropriate. Back to a future worth living (as opposed to the 3270 old future of the current web).

Wise raises a valuable question: can the Phone be provisioned with enough cpu/DRAM to support any dynamic (interpreter implemented) language using conventional local data? If not, then why not off-load the data to a lightweight (in both cpu and DRAM) database engine? With data stored in SSD/Flash, there is need for little data buffering in the database engine (DRAM), and since this is a single user, although likely multi-tasked, application, concurrency requirements can be ameliorated by segregating each application's data in the "database".
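One way to read "segregating each application's data" is a separate SQLite file per application, so no two applications ever contend for the same database. That reading is my assumption, and the sketch below is desktop Python rather than anything 'Droid-specific; `open_app_store`, the `kv` table, and the temporary directory are all hypothetical.

```python
import os
import sqlite3
import tempfile

# Stand-in for the device's Flash-backed data directory.
DATA_DIR = tempfile.mkdtemp()

def open_app_store(app_name):
    """Open (creating if needed) the private database file for one application."""
    conn = sqlite3.connect(os.path.join(DATA_DIR, f"{app_name}.db"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn

mail = open_app_store("mail")
maps = open_app_store("maps")

with mail:
    mail.execute("INSERT INTO kv VALUES (?, ?)", ("last_sync", "2010-07-29"))

mail_rows = mail.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
maps_rows = maps.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
print(mail_rows, maps_rows)
```

Each application sees only its own file, so a write in one never blocks a read in another.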

Sounds like a fine idea, to me.

28 July 2010

Railing at Rails

I try not to perpetuate the Blogsphere Echo Chamber, by merely linking to others' writings, but sometimes the urge can't be resisted. Today is such a day. I don't know Andrew Dunstan, beyond surfing to his blog when it is linked from the PostgreSQL page, as it is today. He takes up Hansson's diatribe against RDBMS (not the first or only time Hansson has printed such idiocy), as I have on more than one occasion. Stupid is as stupid does.

I'll laugh out loud when Hansson finally realizes that the database centric paradigm, aided and abetted by SSD's driven by multi-core/processor machines, puts him in the "legacy, we don't do that anymore except for maintenance" bucket. Client driven applications are the New Dinosaurs(tm), just like the Olde Dinosaurs(tm) -- all that COBOL/VSAM code from the 1960's these young-uns think they're way, way beyond and better than.

24 July 2010

Linus was Right (and so am I)

If you've visited more than once or twice, you've noticed that the first quote, chronologically, is from Linus Torvalds. In his 23 July entry, Zsolt talks with Fusion-io about an SSD-specific approach to filesystem access. They call it a Virtual Storage Layer (VSL).

"The thinking behind the VSL is to provide software tools which enable developers to communicate in the new language of directly accessible flash SSD memory in a way which breaks away from the cumbersome restrictions and limitations of 30 year old software models which are layered on legacy hard disk sectors."

While Fusion-io initially named their devices as SSD, they eventually stopped doing so, and explicitly say that their devices are not disk drives.

It begins to look like Linus was right. The main issue now is that no standard exists, and reading between the lines, Fusion-io would be pleased if developers wrote to their protocol. They don't call it lock-in, but a rose is a rose is a rose. We'll see.

22 July 2010

Is the Death of COBOL Finally Happening?

Could it be? The wicked witch is dead? We can stroll down the Yellow Brick Road?

Some recent announcements hint that may be happening. COBOL may, finally, be melting. First, here is what Larry has to say. Then we have IBM's z machine announcement. In both cases, the emphasis is on analytics and databases, not COBOL, which has been IBM's bread and butter for decades. IBM bought SPSS recently, I gather because they couldn't get SAS, and the z announcement stresses analytics.

Larry, on the other hand, is doing essentially the same thing: stressing hardware for databases, and java. As I wrote when the Sun deal was in the making, Larry sees the mainframe business as the last piece of fruit on the tree. There is mighty opportunity to rebuild all that stuff out of COBOL into something else; could be java, but it will be real database oriented. Don't forget that Sun and Oracle have been really interested in SSD.

Yummy.

Bill & Ted Return Home to Find BI Corpses

The purpose of this simple test of SSD versus HDD for the basic structure of the BCNF datastore, the inner join, was to show that joins are, effectively, costless. Some object to the word "costless", asserting that joins always have some cost. But all queries have cost, of course. The root question is whether joins cost more or less relative to the flatfile image alternative. For real applications, I believe the answer to be costs less.

There is also the byte bloat aspect. My vision of the not too distant future looks like the BI vendors will be in bad shape. The entire point of BI schemas, and code to process same, was that joins were "too expensive" for operational datastores. The reaction was to extract operational data, explode it, then scan it against odd indexes. Odd from the point of view of operational data, that is. But now that SSDs can handle joins with aplomb, is there much point to all of that effort? You need additional servers, software to do the ETL exercise (even if you're smart and just use SQL), the BI software to create all those not-really-SQL-schemas-but-just-as-arcane (if not more so, see Business Objects) data definitions, and separate "experts" in the specific BI product you've bought into.

What if someone showed you how you can do all that BI reporting from a normalized, compact, schema; just like the normalized, compact, schema you've now got for your OLTP operational application on SSDs? Why spend all that effort to, quite literally, duplicate it all over again? Aside: there is software marketed by storage vendors, and some separately, which does nothing but remove redundant data before writing to storage; I find that one of the funniest things in life. Rube Goldberg would be so proud.

Yesterday I read in an article about the origin, disputed, of the term "disruptive technology". The point of that part of the article was that what gets labeled "disruptive" mostly isn't; it's mostly marketing hype. Well, SSD/multi machines aren't just hype. The retail SSD was invented to simplify laptops, mostly to reduce power draw and heat. Thus we have the 2.5" form factor persisting where it makes little sense, the server and power user desktop. Once the laggards in the CTO/CIO corner offices finally get beaten up enough by their worker bees (you do know that leaders, of techies, mostly follow, yes?) to use SSDs to change structure, not merely as faster spinning rust, the change will be rapid. Why? Because the cost advantages of going to BCNF in RDBMS are just so massive, as a TCO calculation, that first adopters get most of the gravy.

Remember the 1960's? Well, remember what you read ABOUT the 1960's? There was a gradual shift of wifey from the kitchen to, initially, professional occupation. For those families that had a working wifey early on, they made out like bandits, since the economy overall was structured to a single income household. As time went on, the economy overall shifted (through inflation caused by all those working wifeys) to requiring a two income household to maintain the previous level.

And so it will go with SSD. There is a significant difference, one hopes, between this transition and that from tape to disc. The early disc subsystems were explicitly called Random Access Storage, but COBOL was the lingua franca of the day, and had already accumulated millions, if not billions, of lines of sequential (tape paradigm) processing code and an established development paradigm. So disc ended up being just a more convenient, and a bit faster, tape. Today is different. Today we have the SQL (almost Relational) database which can exploit SSD in a way that the byte bloat flatfile paradigm can't. It's a good time to be young and disruptive. Cool.

21 July 2010

Bill & Ted's Excellent Adventure, Part 6

Porting to Postgres will be no big deal. Well, not so far. Here are the changes needed just to create the tables:

CREATE TABLE personnel
(emp INTEGER NOT NULL
,socsec CHAR(11) NOT NULL
,job_ftn CHAR(4) NOT NULL
,dept SMALLINT NOT NULL
,salary DECIMAL(7,2) NOT NULL
,date_bn DATE NOT NULL --WITH DEFAULT
,fst_name VARCHAR(20)
,lst_name VARCHAR(20)
,CONSTRAINT pex1 PRIMARY KEY (emp)
,CONSTRAINT pe01 CHECK (emp > 0)
--,CONSTRAINT pe02 CHECK (LOCATE(' ',socsec) = 0)
--,CONSTRAINT pe03 CHECK (LOCATE('-',socsec,1) = 4)
--,CONSTRAINT pe04 CHECK (LOCATE('-',socsec,5) = 7)
,CONSTRAINT pe05 CHECK (job_ftn <> '')
,CONSTRAINT pe06 CHECK (dept BETWEEN 1 AND 99)
,CONSTRAINT pe07 CHECK (salary BETWEEN 0 AND 99999)
,CONSTRAINT pe08 CHECK (fst_name <> '')
,CONSTRAINT pe09 CHECK (lst_name <> '')
,CONSTRAINT pe10 CHECK (date_bn >= '1900-01-01' ));

CREATE UNIQUE INDEX PEX3 ON PERSONNEL (DEPT, EMP);

Beyond the commented out stuff, Postgres won't accept # in names, so those appended had to go. Had to do the same for Dependents and PersonnelFlat. The Insert routine needed a bunch of changes and ends up not looking at all like the DB2 data, but is populated, which is all I care about at this point (OK, this is a big deal; I wish I'd have sprung for SwisSQL then, and it turns out, there is a review).
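The # removal is mechanical enough to script. Here is a rough sketch of a helper (hypothetical, `strip_hash_suffix` is my name for it) that drops a trailing # from identifiers in DB2 DDL or queries. It is a plain regex, so it would also mangle a # sitting inside a quoted string literal; check the output by eye.

```python
import re

def strip_hash_suffix(sql):
    """Drop a trailing '#' from identifiers such as emp# or socsec#.

    Caveat: this is a blind regex and does not skip string literals.
    """
    return re.sub(r"\b(\w+)#", r"\1", sql)

db2_ddl = "CONSTRAINT pex1 PRIMARY KEY (emp#)"
db2_join = "JOIN dependents AS d ON d.emp# = p.emp#"

pg_ddl = strip_hash_suffix(db2_ddl)
pg_join = strip_hash_suffix(db2_join)
print(pg_ddl)
print(pg_join)
```

It does nothing for LOCATE or WITH DEFAULT; those still have to be handled by hand, as in the commented-out lines above.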

Now for the results:

select * from personnel where 1090000 < emp and emp < 1100000 -- for the join-ed table on SSD
28ms

SELECT p.*
,d.dependent
,d.dep_fst_name
,d.dep_lst_name
FROM personnel AS p
JOIN dependents AS d ON d.emp = p.emp
WHERE 1090000 < p.emp AND p.emp < 1100000
60ms

Not quite up to DB2, but not tuned, either. Still rather faster than HDD. Now, where's that Yellow Brick Road?

20 July 2010

You're My One and Only

I wonder how the "MySql is all the database you'll (and I'll) ever need" folks are feeling now?

18 July 2010

Organic Ribald Machines

These are the worst of times, these are the worst of times. I saw that somewhere the last day or two, but I don't remember where. Anyway, I'm reading these slides from a link on the PostgreSQL site. You *have* to see this. Eviscerates the ORM crowd. As much fun as a barrel of monkeys.

A quote or two (which I've said more than once):

ORMs encourage pathological iteration.

ORMs generally have bizarre transaction models.

ORM-think is one of the driving forces behind "NoSQL" databases.

Bill & Ted's Excellent Adventure, Part 5

In today's episode, our fearless heroes set out to build some larger tables. The Sample database tables, and The Mother Lode, both demonstrate the thesis. But let's create some more complex, and larger, scenarios just to see for ourselves.

We'll start with Birchall's figures 989 and 990, wherein we'll use the magic of CTE to populate our test data, running to 100,000 rather than 10,000 as he does. Aside: I've always wished that CTE could be used to insert both independent and dependent at one go, but alas, not that I can divine. One can buy software (Red Gate, the folks who run simple-talk, are one vendor) which generates test data; if that's not niche I don't know what is.

We start with the base single table:

CREATE TABLE personnel
(emp# INTEGER NOT NULL
,socsec# CHAR(11) NOT NULL
,job_ftn CHAR(4) NOT NULL
,dept SMALLINT NOT NULL
,salary DECIMAL(7,2) NOT NULL
,date_bn DATE NOT NULL WITH DEFAULT
,fst_name VARCHAR(20)
,lst_name VARCHAR(20)
,CONSTRAINT pex1 PRIMARY KEY (emp#)
,CONSTRAINT pe01 CHECK (emp# > 0)
,CONSTRAINT pe02 CHECK (LOCATE(' ',socsec#) = 0)
,CONSTRAINT pe03 CHECK (LOCATE('-',socsec#,1) = 4)
,CONSTRAINT pe04 CHECK (LOCATE('-',socsec#,5) = 7)
,CONSTRAINT pe05 CHECK (job_ftn <> '')
,CONSTRAINT pe06 CHECK (dept BETWEEN 1 AND 99)
,CONSTRAINT pe07 CHECK (salary BETWEEN 0 AND 99999)
,CONSTRAINT pe08 CHECK (fst_name <> '')
,CONSTRAINT pe09 CHECK (lst_name <> '')
,CONSTRAINT pe10 CHECK (date_bn >= '1900-01-01' ));

-- CREATE UNIQUE INDEX PEX2 ON PERSONNEL (SOCSEC#); we'll skip this one from Birchall, in that it generates collisions and we really don't need it for these purposes
CREATE UNIQUE INDEX PEX3 ON PERSONNEL (DEPT, EMP#);

Next, we need both the flatfile version and the joined version of some dependent table, and Dependents is the obvious candidate. We'll use 12 for the number of Kids (we'll ignore Spouses for this exercise; we could just add an Attribute...), they're cheaper at that amount.

Here's, truncated, the PersonnelFlat table:

CREATE TABLE PersonnelFlat
(emp# INTEGER NOT NULL
,socsec# CHAR(11) NOT NULL
,job_ftn CHAR(4) NOT NULL
,dept SMALLINT NOT NULL
,salary DECIMAL(7,2) NOT NULL
,date_bn DATE NOT NULL WITH DEFAULT
,fst_name VARCHAR(20)
,lst_name VARCHAR(20)
,dep1ID VARCHAR(30) NOT NULL
,dep1_fst_name VARCHAR(30)
,dep1_lst_name VARCHAR(30)
...
,dep12ID VARCHAR(30) NOT NULL
,dep12_fst_name VARCHAR(30)
,dep12_lst_name VARCHAR(30)
,CONSTRAINT pexf1 PRIMARY KEY (emp#)
,CONSTRAINT pe01 CHECK (emp# > 0)
,CONSTRAINT pe02 CHECK (LOCATE(' ',socsec#) = 0)
,CONSTRAINT pe03 CHECK (LOCATE('-',socsec#,1) = 4)
,CONSTRAINT pe04 CHECK (LOCATE('-',socsec#,5) = 7)
,CONSTRAINT pe05 CHECK (job_ftn <> '')
,CONSTRAINT pe06 CHECK (dept BETWEEN 1 AND 99)
,CONSTRAINT pe07 CHECK (salary BETWEEN 0 AND 99999)
,CONSTRAINT pe08 CHECK (fst_name <> '')
,CONSTRAINT pe09 CHECK (lst_name <> '')
,CONSTRAINT pe10 CHECK (date_bn >= '1900-01-01' ));

-- CREATE UNIQUE INDEX PEXF2 ON PERSONNELFlat (SOCSEC#); as above
CREATE UNIQUE INDEX PEXF3 ON PERSONNELFlat (DEPT, EMP#);

(For the loads, I had to put the bufferpools back to 1,000, otherwise DB2 locked up for both versions.)
Here's the, truncated, data generator:

INSERT INTO personnelFlat
WITH temp1 (s1,r1,r2,r3,r4) AS
(VALUES (0
,RAND(2)
,RAND()+(RAND()/1E6)
,RAND()* RAND()
,RAND()* RAND()* RAND())
UNION ALL
SELECT s1 + 1
,RAND()
,RAND()+(RAND()/1E6)
,RAND()* RAND()
,RAND()* RAND()* RAND()
FROM temp1
WHERE s1 < 100000)
SELECT 1000000 + s1
,SUBSTR(DIGITS(INT(r2*988+10)),8) || '-' || SUBSTR(DIGITS(INT(r1*88+10)),9) || '-' || TRANSLATE(SUBSTR(DIGITS(s1),7),'9873450126','0123456789')
,CASE
WHEN INT(r4*9) > 7 THEN 'MGR'
WHEN INT(r4*9) > 5 THEN 'SUPR'
WHEN INT(r4*9) > 3 THEN 'PGMR'
WHEN INT(R4*9) > 1 THEN 'SEC'
ELSE 'WKR'
END
,INT(r3*98+1)
,DECIMAL(r4*99999,7,2)
,DATE('1930-01-01') + INT(50-(r4*50)) YEARS
+ INT(r4*11) MONTHS
+ INT(r4*27) DAYS
,CHR(INT(r1*26+65))|| CHR(INT(r2*26+97))|| CHR(INT(r3*26+97))|| CHR(INT(r4*26+97))|| CHR(INT(r3*10+97))|| CHR(INT(r3*11+97))
,CHR(INT(r2*26+65))|| TRANSLATE(CHAR(INT(r2*1E7)),'aaeeiibmty','0123456789')

,'A'
,CHR(INT(r1*26+65))|| CHR(INT(r2*26+97))|| CHR(INT(r3*26+97))|| CHR(INT(r4*26+97))|| CHR(INT(r3*10+97))|| CHR(INT(r3*11+97)) || 'A'
,CHR(INT(r2*26+65))|| TRANSLATE(CHAR(INT(r2*1E7)),'aaeeiibmty','0123456789') || 'A'
...
,'L'
,CHR(INT(r1*26+65))|| CHR(INT(r2*26+97))|| CHR(INT(r3*26+97))|| CHR(INT(r4*26+97))|| CHR(INT(r3*10+97))|| CHR(INT(r3*11+97)) || 'L'
,CHR(INT(r2*26+65))|| TRANSLATE(CHAR(INT(r2*1E7)),'aaeeiibmty','0123456789') || 'L'

FROM temp1;
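If the DB2 dialect obscures the trick, the same recursive-CTE row generator can be sketched in sqlite3. This is a stripped-down, assumed equivalent (two columns, 1,000 as the limit, SQLite's `random()` in place of `RAND()`), not a port of Birchall's code. In this sketch the WITH clause goes in front of the INSERT rather than after INTO.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE personnel (
        emp  INTEGER PRIMARY KEY,
        dept INTEGER NOT NULL CHECK (dept BETWEEN 1 AND 99)
    )
""")

with conn:
    conn.execute("""
        WITH RECURSIVE temp1 (s1) AS (
            SELECT 0                        -- seed row
            UNION ALL
            SELECT s1 + 1 FROM temp1        -- each pass adds one row
            WHERE s1 < 1000                 -- stop condition
        )
        INSERT INTO personnel (emp, dept)
        SELECT 1000000 + s1,
               ((random() % 99) + 99) % 99 + 1   -- pseudo-random dept in 1..99
        FROM temp1
    """)

row_count, min_emp, max_emp = conn.execute(
    "SELECT COUNT(*), MIN(emp), MAX(emp) FROM personnel"
).fetchone()
print(row_count, min_emp, max_emp)
```

Note that the counter starts at 0 and the last row generated has s1 equal to the limit, so the table ends up with limit + 1 rows.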

Now, we need the table for the dependents, named, Dependents:

CREATE TABLE Dependents
(
emp# INTEGER NOT NULL
,dependent# CHAR(1) NOT NULL
,dep_fst_name VARCHAR(20)
,dep_lst_name VARCHAR(20)
,CONSTRAINT dep1 PRIMARY KEY (emp#, dependent#)
);

To keep the joined and flat file tables in sync, we'll load both the Personnel and Dependents data from PersonnelFlat.

INSERT INTO personnel
SELECT emp#, socsec#, job_ftn ,dept ,salary ,date_bn ,fst_name , lst_name FROM personnelflat;

And, the, truncated, data loader for Dependents (the NOT NULL is there from habit, not really needed here since they all got a dozen kids, but would be when converting some arbitrary flat file):

INSERT INTO dependents (emp#, dependent#, dep_fst_name, dep_lst_name)
SELECT emp#,
dep1ID AS dep,
dep1_fst_name,
dep1_lst_name
FROM personnelFlat
WHERE dep1ID IS NOT NULL
...
UNION ALL
SELECT
emp#,
dep12ID AS dep,
dep12_fst_name,
dep12_lst_name
FROM personnelFlat
WHERE dep12ID IS NOT NULL
ORDER BY emp#, dep;

For grins, here's the timing (same buffering, self-tuning on, and all logging is on a HDD directory) for the Dependents data:
SSD: 21s
HDD: 2m 17s
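The flat-to-normalized load is easier to see at toy scale. Below is a minimal sqlite3 sketch with only two dependent slots and made-up names, and with one employee deliberately given a single kid so the IS NOT NULL filter has something to do (which means the dep ID columns are nullable here, unlike the NOT NULL ones above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE personnelflat (
    emp           INTEGER PRIMARY KEY,
    dep1id        TEXT,
    dep1_fst_name TEXT,
    dep2id        TEXT,
    dep2_fst_name TEXT
);
CREATE TABLE dependents (
    emp          INTEGER NOT NULL,
    dependent    TEXT    NOT NULL,
    dep_fst_name TEXT,
    PRIMARY KEY (emp, dependent)
);
INSERT INTO personnelflat VALUES
    (1000000, 'A', 'Ann', 'B', 'Bob'),
    (1000001, 'A', 'Cal', NULL, NULL);
""")

# One SELECT per repeating group, glued together with UNION ALL.
with conn:
    conn.execute("""
        INSERT INTO dependents (emp, dependent, dep_fst_name)
        SELECT emp, dep1id, dep1_fst_name
        FROM personnelflat
        WHERE dep1id IS NOT NULL
        UNION ALL
        SELECT emp, dep2id, dep2_fst_name
        FROM personnelflat
        WHERE dep2id IS NOT NULL
    """)

rows = conn.execute(
    "SELECT emp, dependent, dep_fst_name FROM dependents ORDER BY emp, dependent"
).fetchall()
print(rows)
```

The flat table carries an empty slot for the second employee; the Dependents table simply has no row.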

Now, return to 5 bufferpools, 1 prefetch, self-tuning off. Let's do a simple range select on Personnel.

select * from personnel where 1090000 < emp# and emp# < 1100000 -- for the join-ed table on SSD
19ms

select * from personnelflat where 1090000 < emp# and emp# < 1100000 -- for the flat table on HDD
140ms

Lastly, the joined select:

SELECT p.*
,d.dependent#
,d.dep_fst_name
,d.dep_lst_name
FROM personnel AS p
JOIN dependents AS d ON d.emp# = p.emp#
WHERE 1090000 < p.emp# AND p.emp# < 1100000
25ms -- SSD
257ms -- HDD

While not TPC-C exercising, more evidence that SSD/BCNF datastores are more better. With transactional data that would otherwise span hundreds of gigs in flatfiles, one can have a few gigs in relational. And faster, too.

In this example, I didn't fake up a large independent table to couple to a small dependent table. In the canonical customer/order/orderline example, the size difference between them is significant; at least in the systems I've worked on over the years. This is where much of the byte bloat is saved. Note, too, that since I load a dozen kids across the board, the Dependents table is full; in the Real World, the flatfile would have, depending on how the engine works, allocated space for the dozen as either empty space or NULLs, but the Dependents table would have only allocated for living Kids.
