09 January 2014

Big Dig, Big Data, Big Deal?

Among the largest old city rehab efforts in the history of the country was The Big Dig in Boston. It finally finished, late and over budget. But it includes one of the prettiest bridges on this side of The Pond. Why is it that any European country manages to do civil engineering with greater beauty in its homeliest structures than the USofA does in its best? Why is it that virtually every "innovation" in automobiles since Henry Ford was created by some European company? Just asking.

Recently, this endeavor mused on the Death of Big Data. Or, perhaps, high morbidity. Watson has been getting ink recently on blogs, so it's not a surprise that IBM would take the opportunity to discuss the machine. What I can't tell is whether Watson is sui generis, or a model shippable in quantity. From the wiki description, it's built from off-the-shelf parts. Except, of course, for the software. What's even more interesting: Watson doesn't make it to the top 500 of supercomputers, and appears to be I/O bound by *hard drives*:
According to John Rennie, Watson can process 500 gigabytes, the equivalent of a million books, per second. IBM's master inventor and senior consultant Tony Pearson estimated Watson's hardware cost at about $3 million. Its performance stands at 80 TeraFLOPs which is unfortunately not enough to place it at Top 500 Supercomputers list. According to Rennie, the content was stored in Watson's RAM for the game because data stored on hard drives are too slow to access.

I guess these smart folks never heard of SSD!!!

What's even more interesting: some of that software is Prolog:
We required a language in which we could conveniently express pattern matching rules over the parse trees and other annotations (such as named entity recognition results), and a technology that could execute these rules very efficiently. We found that Prolog was the ideal choice for the language due to its simplicity and expressiveness.

Some background, some my own, on Prolog.
- it was created within a couple of years of Codd's relational model paper (1970 for Codd, about 1972 for Prolog)
- it uses what amounts to a normalized, in-memory database. most Prologs refer to this data as the "database".
- while at OMS in the early 90s, a couple of my colleagues attempted to build an AI sub-system for the main product, medical pre-qualification, in its database/4GL (Progress). never got very far, if only because Progress has never been particularly relational or normal in application
- while at CSC I had to endure a Prolog mutant called GraphTalk, which CSC had bought up a few years before from France. of course. my colleagues at CSC turned this mutant into COBOL/VSAM coding. Yum.
- one of the current uses of Watson is in medical diagnosis. hmm. twenty years too soon was I.
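
For anyone who hasn't seen the facts-plus-rules style, here is a minimal sketch of the idea. It is Python rather than Prolog, and it is not how Watson or any real Prolog engine is implemented: the facts, the predicate names (`symptom`, `indicates`), the patients, and the `?`-prefixed variable convention are all made up for illustration. The "database" is a flat set of tuples, and a rule is a list of patterns matched against it.

```python
# Facts, stored the way a Prolog "database" holds them: one flat tuple per fact.
# Every name below is hypothetical and exists only for this illustration.
FACTS = {
    ("symptom", "alice", "fever"),
    ("symptom", "alice", "cough"),
    ("symptom", "bob", "fever"),
    ("indicates", "fever", "flu"),
    ("indicates", "cough", "flu"),
    ("indicates", "cough", "cold"),
}


def match(pattern, fact, bindings):
    """Match one pattern against one fact; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # A variable: bind it, or check it agrees with an earlier binding.
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            # A constant that does not match this fact.
            return None
    return new


def solve(goals, bindings=None):
    """Yield every set of bindings that satisfies all goals, left to right."""
    bindings = {} if bindings is None else bindings
    if not goals:
        yield bindings
        return
    first, rest = goals[0], goals[1:]
    for fact in FACTS:
        extended = match(first, fact, bindings)
        if extended is not None:
            yield from solve(rest, extended)


def possible_conditions(patient):
    # Roughly the Prolog rule:
    #   possible(P, C) :- symptom(P, S), indicates(S, C).
    goals = [("symptom", patient, "?s"), ("indicates", "?s", "?c")]
    return sorted({b["?c"] for b in solve(goals)})


if __name__ == "__main__":
    for who in ("alice", "bob", "carol"):
        print(who, possible_conditions(who))
```

The rule itself is the two-line `goals` list; everything else is the generic matcher. That is the gist of the compactness claim the Prolog crowd makes: in a real Prolog the matcher comes with the language, and only the facts and rules are yours to write.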

There's still a commercial version of Prolog/datastore called Amzi! (yes, the ! is part of the name, just as in Yahoo!). And guess what? Its major market is business rule and decision support implementations. As it happens, Prolog syntax/semantics is more alien to C-inculcated coders than even R or SQL. But, according to its zealots, Prolog systems are orders of magnitude more compact than imperative (e.g., Java/C/FORTRAN) equivalents. Kind of like what relational zealots say about RM/SQL databases versus flat files.

So, today's Times has a puff piece from IBM on the use and future of Watson. As others have concluded, but with suspicion, IBM sees Watson as central to its commercial success.
IBM's elevation of Watson is the biggest illustration yet of the technology industry's faith that so-called Big Data holds promise for the economy -- and the failure so far to meet that promise.

Big data is just descriptive statistics, since one has all the numbers. Look at any Baby Stat book, and an early chapter (and likely the shortest in the book) will cover all one needs to know about descriptive statistics. I know, I know. Big Data is really about correlation and finding the correct distribution. Mostly, for commercial uses, it's about finding a few golden correlation needles in a haystack of choices by millions of people. So you can spit more enticing ads at a few of them. I wonder how many of these Big Data projects were ever subjected, a priori that is, to a rigorous cost/benefit analysis? While Watson is a multi-million dollar machine, most Big Data can be handled using R or PL/R on a pumped up Dell. So, in such cases, one need only find a few silver needles.
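
To make the "descriptive statistics plus a correlation hunt" point concrete, here is a small sketch. It is in Python rather than R or PL/R, and the three data series (`ad_views`, `purchases`, `noise`) are invented for illustration; nothing here comes from a real Big Data project. Since one has all the numbers, it uses the population standard deviation rather than the sample one.

```python
import math
import statistics


def describe(values):
    """Descriptive statistics for a complete population of values."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # pstdev, not stdev: we hold every number, so nothing is being estimated.
        "pstdev": statistics.pstdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: one series that tracks purchases and one that does not.
ad_views = [1, 2, 3, 4, 5, 6]
purchases = [2, 4, 6, 8, 10, 12]
noise = [5, 1, 4, 2, 6, 3]

if __name__ == "__main__":
    print(describe(purchases))
    print("ad_views vs purchases:", pearson(ad_views, purchases))
    print("ad_views vs noise:", pearson(ad_views, noise))
```

The needle hunt is this same calculation repeated across a great many column pairs, which is why a pumped up Dell usually suffices.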

The apostates are beginning to crawl out:
Likewise, IBM will have to sharpen its focus and what it delivers, said Henry D. Morris, an analyst at the consulting company IDC. "Big Data by itself isn't value, it has to deliver recommendations about what to do," he said. "They have to show people not just analysis, but action. They understand that there are challenges ahead."

By the way, I'd love to get invited to the Watson party (fat chance, of course): the staff will be located in the East Village. If you have to ask where the East Village is, you're so uncool.
