15 March 2014

Big Dummies

Yet another cautionary tale from the Annals of Big Data.

Big Data, at best, becomes an exercise in descriptive stats. At worst, it's a colossal waste of time and money.

Some points:
- Irreproducible research isn't of much use; that it was done "internally" by/for Google makes no difference.
- Big isn't always better: in the case of Flu Trends, the Google folks (and if there were mathematical statisticians involved, they should be ashamed) didn't have a clue about measurement or sampling, much less inference.

What the Googlers didn't, or wouldn't, comprehend is that while the data was Big, it wasn't population data. It was sampled data (with no controls, apparently), and sampled data is subject to the constraints of frequentist (or Bayesian, if that's your bowl of porridge) inferential stats. None of that was done, of course.
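To make the point concrete, here's a minimal Python sketch (a toy, not Google's actual data or method) of why Big doesn't rescue a biased sample: a huge convenience sample whose inclusion depends on the measured value misses the population mean, while a small random sample lands close to it. The population, sample sizes, and selection rule are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 1 million individuals, true mean 0.
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# "Big" convenience sample: inclusion probability depends on the value
# itself (only certain people show up in the search logs, say), so one
# tail is over-represented.
inclusion = rng.random(population.size) < 1 / (1 + np.exp(-2 * population))
big_biased = population[inclusion]  # hundreds of thousands of rows

# Small but properly random sample.
small_random = rng.choice(population, size=500, replace=False)

print(f"true population mean:            {population.mean():+.3f}")
print(f"big biased sample (n={big_biased.size}): {big_biased.mean():+.3f}")
print(f"small random sample (n=500):     {small_random.mean():+.3f}")
```

The biased sample's mean stays well away from the truth no matter how many rows it has; the random sample of 500 gets within sampling error.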
Put another way, it's not uncommon to hear the argument that "computer algorithms have reached the point where we can now do X." That's fine in and of itself, except, as the authors put it, it's often accompanied by an implicit assumption: "therefore, we no longer have to do Y." And Y, in these cases, is the scientific grunt work of showing that a given correlation is relevant, general, driven by a mechanism we can define, and so forth.
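Here's the other half of the problem, sketched the same way: skip the grunt work of Y, screen enough candidate series against a target, and chance alone will hand you an impressive in-sample correlation that evaporates on fresh data. Everything below is simulated noise; the counts of observations and candidate "terms" are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_candidates = 50, 100_000  # few observations, many candidate terms
target = rng.normal(size=n_obs)
candidates = rng.normal(size=(n_candidates, n_obs))

# Pearson correlation of every candidate with the target, vectorized:
# standardize both, then take the mean of the products.
t = (target - target.mean()) / target.std()
c = (candidates - candidates.mean(axis=1, keepdims=True)) / \
    candidates.std(axis=1, keepdims=True)
corrs = c @ t / n_obs

best = np.argmax(np.abs(corrs))
print(f"best in-sample |r| out of {n_candidates}: {abs(corrs[best]):.2f}")

# The "winning" candidate against a fresh draw of the target collapses
# back to noise.
new_target = rng.normal(size=n_obs)
r_new = np.corrcoef(candidates[best], new_target)[0, 1]
print(f"same candidate, out of sample:  {r_new:+.2f}")
```

With 50 observations and 100,000 pure-noise candidates, the winner's in-sample |r| comes out north of 0.6, which is exactly the kind of number that gets a correlation declared "found" when nobody bothers with Y.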

Kiddies are so damn lazy these days.
