16 December 2011

Lies, Damn Lies, and Statistics

The Other-R folks have posted a recent entry which references an EMC paper (here if you follow the breadcrumbs) on the state of Data Analysis and Business Intelligence, from the point of view of practitioners. The blog post makes some useful points, but misses others.

I'm referring to the graph in the original post, which is on page 3 (on my screen) of the EMC paper.

What this graph tells me, mostly, is that BI folks are still tied to MS, Excel in particular. Data analysts, not so much; although they'll be tied to corporate policy in such venues.

A few words about each.

Data Storage: SQL Server is tops, which means that most folks, in both camps, are tied to corporate group level machines, not the Big Iron. It's been that way for decades; the analysts have to extract from the Big Iron, and crunch on their own PCs. The categories Other SQL, Netezza, and Greenplum leave room for the Triage with PL/R approach, since the latter two are Postgres-derived and Other SQL is likely as much Postgres as MySQL (yuck!). The category is, possibly, misleading if one jumps to the conclusion that companies are MS centric with their data.

Data Management: No real surprise here. Excel is the tool of choice. Way back when I was teaching PC software courses, 1-2-3 was the spreadsheet of choice and all data went through it, and Excel inherited the mindset that a spreadsheet was sophisticated analysis. It is a bit unnerving to realize that so much of what corporations decide is supported by such drek. Note: the BI folks, in the past executive assistants and "secretaries", still use spreadsheets a lot. The Data folks, the other way round. There is small comfort in that. The presence of BASH (or Korn or ...) and AWK (Python and Perl too, but not quite so much; each has bespoke language I/O in the mix) is interesting, in that it means that a fair amount of data is clear text ASCII files. Think about that for a second.

Data Analysis: Clearly, the Data folks use stat packs while the BI folks mostly don't. SAS, SPSS, and Stata leading says that the EMC client base is largely large corporate, which isn't a surprise. What is a surprise is the absence of Excel. On the other hand, the original paper says this, right next to the graph: "While most BI professionals do their *analysis* and data processing in Excel, data science professionals are using SQL, advanced statistical packages...", which corresponds to my experience (emphasis mine).

Data Visualization: The absence of R is suspect, as any R user would understand.

And, finally, this has nothing to do with Big Data, in any case. BD is just another attempt to money-spin by those with an agenda. Janert, in his book "Data Analysis...", makes clear that BD isn't worth the trouble (my inference). The point being that population data, which is what BD offers, is just descriptive stats, and smart data folks aren't interested in descriptive stats. Sports fans, well yeah.
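
To make the descriptive-versus-inferential point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the simulated "population", the sample size of 200, and the helper names describe_population and estimate_from_sample are all hypothetical, and the interval uses a plain normal approximation. With the whole population in hand, the mean is simply computed; only the sample calls for an estimate with an interval around it.

```python
import math
import random
import statistics


def describe_population(values):
    """With every member in hand, the mean is a computed fact, not an estimate."""
    return statistics.fmean(values)


def estimate_from_sample(sample, z=1.96):
    """Return (mean, low, high) using a normal-approximation interval.

    z=1.96 corresponds to a nominal 95% interval; this is a sketch,
    not a substitute for a proper stat pack.
    """
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, mean - z * stderr, mean + z * stderr


# Hypothetical data: a simulated population, then a small sample drawn from it.
rng = random.Random(2011)
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]
sample = rng.sample(population, 200)

pop_mean = describe_population(population)
est, low, high = estimate_from_sample(sample)

print(f"population mean (descriptive): {pop_mean:.2f}")
print(f"sample estimate (inferential): {est:.2f}, interval {low:.2f} to {high:.2f}")
```

The first number needs no error bars because nothing is being inferred; the second is where the statistics actually happens.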
