27 July 2015

schizophRenia [update]

More than one post on R-bloggers (and, I suspect, the count will grow in days to come) references this IEEE post on computer language popularity. The R-bloggers posts are laudatory: "R is becoming the Next Big Thing" and such.

But, the emperor has no clothes. I just checked CRAN, twice in a minute or so. The first check showed 6911 packages, the second 6915. It's a cancer. OK, a bit strong. But, the point is: R isn't a programming language. It's a statistical command language which is also programmable with a common syntax. In particular, one needn't (and likely, shouldn't) view R through the lens of C++ or Java or even PHP. The value of R lies in the dirt-common stat routines it implements.

More and more, one reads that the Real R Programmers are dissatisfied with performance or capabilities, and grouse. A lot. Most often, they grouse about leaving the R world for Rcpp (well, maybe that's only a step) or Julia or Python. Let them go.

Much of the corpus of R packages comes from grad students in need of creating new work in order to satisfy thesis/dissertation requirements. (The same reason we've seen Bayes take over the field; frequentist methods cover the world, and a thesis/dissertation has to cover "new ground", so Bayes was dug up from his grave to give grad students some way to be "new". Gad.) Writing code is the avenue. The fact that it's a wholly redundant exercise is not relevant to the grad student. For working data folk, using R to do mainstream analysis is where the best bang for the buck comes from.

Once again, before posting, new information arises. This time, Dirk takes another swipe at Hadley. Poor Dirk.
Hadley is a popular figure, and rightly so as he successfully introduced many newcomers to the wonders offered by R. His approach strikes some of us old greybeards as wrong---I particularly take exception with some of his writing which frequently portrays a particular approach as both the best and only one. Real programming, I think, is often a little more nuanced and aware of tradeoffs which need to be balanced. As a book on another language once popularized: "There is more than one way to do things."

Poor Dirk. "Nuance" is just a euphemism for "ambiguity". Languages, whether human or computer, that promote ambiguity generally fail. English is the archetypal human language which affords no known structure. Of all the alphabet-based languages, it is the most difficult either to learn as a second language or to have as one's first language when learning another alphabet-based language. The mindset of English is chaos. And so it is with programming languages. The "more than one way to do things" language is Perl, the product of a right-wing Christian. It's a mess, and widely despised.

On the other side of the coin, one finds Python, built by a European mathematician, and Eiffel, ditto. Both seek to be as close to fully orthogonal in syntax and semantics as they can get.

There should be one-- and preferably only one --obvious way to do it.

Exactly one way to do anything: in stark contrast to Perl's philosophy of there is more than one way to do it, Eiffel follows Bertrand Meyer's Principle of Uniqueness: "The language design should provide one good way to express every operation of interest; it should avoid providing two."

R will, in time, fail. It is an amateur language built by amateurs. To the extent it is used as a stat command language (its original purpose), it will succeed, but if the "R is a programming language" crowd get control, it will fail, because as a programming language it has far more warts than rosy cheeks.

And, of course, the RM:
The principle of orthogonal design (abbreviated POOD) was developed by database researchers David McGoveran and Christopher J. Date in the early 1990s, and first published "A New Database Design Principle" in the July 1994 issue of Database Programming and Design and reprinted several times.

Which is not to say, sadly, that SQL engines enforce such. Or as Holub has said, you've got "Enough Rope to Shoot Yourself in the Foot".

Well, turns out I'm not the only one.
Revolutionary Dave:
I couldn't agree with the sentiment more, and I too [wish] the field of Statistics had more respect for solving these "mundane" (i.e. non-mathematical), but important problems.

Here's what Dave's agreeing with:
"There are definitely some academic statisticians who just don't understand why what I do is statistics, but basically I think they are all wrong. What I do is fundamentally statistics. The fact that data science exists as a field is a colossal failure of statistics. To me, that is what statistics is all about. It is gaining insight from data using modelling and visualization. Data munging and manipulation is hard and statistics has just said that's not our domain."

And this is the revelatory bit, the one that makes my heart skip more than a single beat:
During this first job, Wickham began to reflect on better ways to store and manipulate data. "I've always been very certain that I could come up with a good way of doing things," he explained, "and that that way would actually help people." Although he didn't know it at the time, he believes it was then that he "internalized" the concept of Third Normal Form, a database design concept that would become central to his future work. Third Normal Form is essentially a manner of structuring data in a way that reduces duplication of data and ensures consistency. Wickham refers to such data as "tidy," and his tools promote and rely on it.
[my emphasis]

And, of course, Third Normal Form or its logical extension Organic Normal Form™ is just an implementation of the orthogonal principle. Great minds gather together.
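
For anyone who hasn't lived in a database, here is a minimal sketch of what "reduces duplication of data and ensures consistency" means in practice. It is plain Python rather than R, and it is not Wickham's tooling; the table, the column names, and the function `split_3nf` are all made up for illustration. The flat table repeats `dept_name` on every row even though it depends only on `dept_id`, and the split stores each such fact exactly once.

```python
# Hypothetical flat table: dept_name depends on dept_id, not on the key emp_id
# (a transitive dependency), so the same fact is repeated on several rows.
flat = [
    {"emp_id": 1, "name": "Ann", "dept_id": 10, "dept_name": "Stats"},
    {"emp_id": 2, "name": "Bob", "dept_id": 10, "dept_name": "Stats"},
    {"emp_id": 3, "name": "Cy", "dept_id": 20, "dept_name": "IT"},
]


def split_3nf(rows):
    """Split the flat rows into an employee table and a department lookup.

    Raises ValueError if the flat data already disagrees with itself about
    a department's name, which is the inconsistency the split prevents.
    """
    employees = []
    departments = {}
    for row in rows:
        # Record each department name once; a second, different name for
        # the same dept_id means the source data is inconsistent.
        known = departments.setdefault(row["dept_id"], row["dept_name"])
        if known != row["dept_name"]:
            raise ValueError(f"conflicting names for dept_id {row['dept_id']}")
        # Employees keep only attributes that depend on emp_id itself.
        employees.append(
            {"emp_id": row["emp_id"], "name": row["name"], "dept_id": row["dept_id"]}
        )
    return employees, departments


employees, departments = split_3nf(flat)
print(employees)
print(departments)
```

Each fact now has one home, so renaming a department is a single update rather than a hunt through every row. That is the "one way to say it" idea again, applied to data instead of syntax.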
