27 March 2013

Quo Vadis Data Science?

Insanity: doing the same thing over and over, yet expecting a different result.

"Sherman, set the WABAC for 1964".

The mid-60s to the end of the decade were a watershed period in the computer world. 1964 saw the announcement of System/360 from IBM. The 360 was IBM's attempt to merge the business class machines with the scientific class machines (each class being built on disparate architecture), producing one machine architecture that covered the entire circle of need. Turns out, on that measure the 3x0 machines have been an abject failure. CDC, thence Cray, took the science side of things; only the 370/158 had much credibility in the lab. Since then, highly parallel multi-processor RISC machines have ruled, some even from IBM; none on the 360 ISA. As commercial machines, they've decimated the Seven Dwarves. They began the road to coding as king, installing COBOL to the throne; where it still sits.

Then came DEC and the minicomputer. For anyone interested in computing, Tracy Kidder's "The Soul of a New Machine" should be read. While about a DG machine designed in the late 70s, the book describes how a new machine was built from discrete parts. It took real engineering to do this.

Finally, we get Dr. Codd in 1969 (publicly, 1970), who presented a math-based approach to data. Whether or not his paradigm has yet been assimilated remains an open question. Most, even within the industry, conflate SQL with the RM. On the whole, one hears "we must denormalize for speed" more than any other assertion.

Real engineering and math were a problem for all those young men who wanted to "do computers". One first needed a BSEE. In the hierarchy of curricula, the most difficult (win/place/show) were/are: math, physics, electrical engineering. More folks wanted in, but most couldn't cut the brain mustard. What to do, what to do?

What was done: invent a new curriculum, lighter on the maths, but still, "do computers". We did what the USofA turns out to do best: dumb something down. Thus was born Computer Science. The first such department, depending on how one defines CS, is likely Carnegie Institute of Technology (today, Carnegie Mellon University). Easy on the maths, heavy on the coding. Here's a proposal from 1994 which intends to drop COBOL from the curriculum. Was COBOL ever part of CS, you ask? Well, yes, yes it was, as late as 1982. The link should take you to page 96. Look at the faculty advert second down from the large IMS/Upjohn ad in the top left of the page. Yes, COBOL was the basis of CS. Anyone who's had the experience of Enterprise Java has seen loads of COBOL, in that other syntax. Kind of like a baby chewing razor blades funny.

So now we have much the same thing going on. Those who can't handle the rigors of operations research and statistics and mathematical analysis need a safe harbor, where they can claim to "do data". The wiki article says, "Data science requires a versatile skill-set. Many practicing data scientist commonly specialize in specific domains such as marketing, medical, security, fraud and finance fields." "Swell", to quote Dirty Harry. Marketing and finance, professional shills and thieves? More to the point: these are endeavors defined by rules that humans make up and can change by fiat. It's all soft serve, no hard science. I guess there's a bright future for data science. Just don't ask too many questions. There's a reason that financial quants brought the world's economy to its knees: it made them scads of money if they could pull it off, so they did. The "laws" of commerce were flouted or changed to suit the situation, thus destroying the rigor of the analysis.

From "Introduction to Data Science", by Jeffrey Stanton:
Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information. Although the name Data Science seems to connect most strongly with areas such as databases and computer science, many different kinds of skills - including non-mathematical skills - are needed.

IOW, "we don't need no education". That could have been pulled from a 1970 job advert for an EDP position (ADP, if it were for the US government); it's just what COBOL applications did/do.

For a particularly trenchant take on the situation:
So why many scientists find data-driven research and large data exciting? It has nothing to do with science. The desire to have datasets as large as possible and to create giant data mines is driven by our instinctive craving for plenty (richness), and by boyish tendency to have a "bigger" toy (car, gun, house, pirate ship, database) than anyone else. And whoever guards the vaults of data holds power over all of the other scientists who crave the data.
The comments get a bit cranky with him.

The Big Data meme is increasingly melded with the Data Science meme. The blind leading the blind. Gad.

While this missive was in progress, the Sunday NY Times offered a darkly pessimistic take on Big Data.
[David C. Vladeck, a professor of law at Georgetown University] offers this example: Imagine spending a few hours looking online for information on deep fat fryers. You could be looking for a gift for a friend or researching a report for cooking school. But to a data miner, tracking your click stream, this hunt could be read as a telltale signal of an unhealthy habit -- a data-based prediction that could make its way to a health insurer or potential employer.

In other words, Big Data is valuable only when it's misused; stealing from the ignorantly passive users by creating a valuable needle in the haystack of that data tsunami. There is a Big Brother, but it ain't the Damn Gummint, it's rapacious capitalists. Go looking for a fat fryer, and your health premium doubles. In order to pay for all this Big Data deep mining, the miners have to find ways to monetize what little flecks of gold they find. Since insurance is either required (automobiles; Progressive has a nefarious program ongoing) or necessary (health), it is a fertile ground to plow.

Dumbing down, again. It's the 1-2-3/Excel-ization of data, yet again. Anybody, with an hour or two of training, will be able to do deep, detailed analysis. Yeah, right. The inappropriate quants (bailed out math/physics grad students) gave us The Great Recession (abetted, one should acknowledge, by a certain laxity of purpose in government). The US Senate report is especially chilling. One has to wonder what Data Scientists will concoct.

Mike Holmes has a running promo for his new, here in the US, episodes of "Holmes Inspection". In the promo he's sitting in a work area (which looks CGI, and thus fake), and talking about his father's advice. "If you can't do it right, don't touch it. Get the hell out and do something else." The Data Scientists may end up a use case.
