25 October 2011

From Sea to Shining Sea

As a follow-up, or update, to the Triage piece, I offer up this post from an R-blogger. As it stands, there's no code (the author pleads ugliness), but he does applaud ggplot2. The latter I expected, in that Wickham's book has a section (5.7) on using maps, though not much detail.

Of more interest is the data source, shown as CCES on the plots. Turns out that this is the Cooperative Congressional Election Study. While not real-time data, as Sparks demonstrates, R and ggplot2 can show the impact of both categorical and discrete variables over a map. For the Triage project, one would need internal real-time (or close to it) data for the effort to be worthwhile, but I'd wager it would be.
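Sparks withheld his code, but the general shape of a ggplot2 choropleth is no secret; it's the approach in Wickham's section 5.7. A minimal sketch, assuming the maps package is installed, with an invented 'support' measure standing in for whatever CCES variable gets mapped:

# Invented data; Sparks's actual code isn't published, so this shows
# only the general shape of such a plot, not his method.
library(ggplot2)
library(maps)

states  <- map_data("state")                   # lower-48 boundary polygons
regions <- unique(states$region)
fake    <- data.frame(region  = regions,
                      support = runif(length(regions)))  # hypothetical 0-1 measure
choro   <- merge(states, fake, by = "region")
choro   <- choro[order(choro$order), ]         # restore vertex drawing order

ggplot(choro, aes(long, lat, group = group, fill = support)) +
  geom_polygon(colour = "white")

Categorical variables work the same way: map fill to a factor and ggplot2 switches to a discrete scale.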

11 October 2011

A Model Citizen

While it is gratifying to be published by Simple Talk (so many more eyes that way), it isn't a platform where I can continue to prattle on at will. Each piece they publish is, most of the time, a stand-alone effort. Since the piece was already rather long, there was one tangent I elected not to include, it being a separate issue from the task under discussion.

"That subject: cleavages." Well, I only wish (and if you know from whence that quote came, bravo). No, alas, the topic is what to do with regard to fully understanding "bang for the buck". I elided that in the piece, since the point was to show that a useful stat graphic could be generated from the database. But how to discover the "true" independent variables of electoral primacy, and their magnitude? Could it be that with all the data we might have, both for free on the intertubes and costly which we generate, our best model is only 30% predictive? To reiterate, the exercise isn't to predict who'll win (FiveThirtyEight has been spectacular), but rather which knobs and switches a given organizations can manipulate to *change* a losing situation.

If you'll recall, most of the explanatory variables weren't of a continuous nature, that is, real numbers. The fitted lines in the scatterplots used a variation on simple linear regression, one that allows the best slope to differ over ranges of the data. The technique doesn't account for the fact that most of the explanatory variables are either categorical (yes/no) or discrete (strongly disagree to strongly agree).
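The piece didn't need the variation named; lowess-style local regression is one such technique (whether it's the one used there, see the piece itself). A minimal sketch on invented data, showing why a slope that follows the data beats a single global one:

set.seed(7)
x <- runif(200, 0, 10)
y <- ifelse(x < 5, 2 * x, 10 + 0.2 * (x - 5)) + rnorm(200)  # slope shifts at x = 5
plot(x, y)
abline(lm(y ~ x), lty = 2)     # one global slope splits the difference
lines(lowess(x, y), lwd = 2)   # the local fit tracks both regimes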

For this kind of mixed-data regression, one typically uses analysis of covariance (aka ancova). R, as one would expect, provides this. The Crawley book devotes a full chapter to ancova; I'll direct you there. Some say that discrete independent variables can be used directly in simple linear regression. Others would run to ANOVA immediately. Some distinguish categorical variables (gender) from discrete scaled variables (the 5-point agree scale on gun control). It is, suffice to say, not a slam dunk any way you go.
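The mechanics are short enough to sketch in base R, in the spirit of Crawley's chapter; the variables here (spend continuous, stance categorical) are invented:

set.seed(42)
dat <- data.frame(
  spend  = runif(100, 0, 10),                           # continuous predictor
  stance = factor(sample(c("yes", "no"), 100, replace = TRUE))
)
dat$share <- 40 + 2 * dat$spend +
             5 * (dat$stance == "yes") + rnorm(100, sd = 3)

full    <- lm(share ~ spend * stance, data = dat)  # separate slopes and intercepts
reduced <- lm(share ~ spend + stance, data = dat)  # common slope, separate intercepts

anova(reduced, full)   # does the interaction (differing slopes) earn its keep?
summary(reduced)

Crawley's chapter works through exactly this sort of model simplification, dropping the interaction when the F test says it isn't needed.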

Exploratory data analysis, what R is particularly good at, is where the apparatchiks should be spending much of their effort (not worrying about the entrails of Rails!). That money is the driver of winning is an assumption, and frequently wrong in the real world. Since their organization is large, national in scope, and full of dollars to spend, spelunking through all available data is the directive. That assumes, of course, that winning elections, without regard to policy positions, is the goal. Think of selling nappies.
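What does that spelunking look like in practice? The first pass in R is a few lines; the polling-flavored data below is invented:

set.seed(3)
polls <- data.frame(spend   = runif(50, 0, 10),
                    turnout = runif(50, 0.3, 0.7),
                    share   = rnorm(50, 50, 5))
summary(polls)   # min/quartiles/mean/max, one variable per column
pairs(polls)     # every pairwise scatterplot at a glance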

While the goal of the piece was to display something simple to the Suits, the real goal is a more accurate predictive model, which will be implemented with traditional text output. The same is true of selling nappies. The analogy is not so far-fetched, as this book demonstrates; there have been similar treatises in the years since.

10 October 2011

By The Numbers

There's that famous quote from The Bard: "The fault, dear Brutus, is not in our stars, but in ourselves, that we are underlings." As my fork in the Yellow Brick Road tracks more towards (what's now called) Data Science, various notions bubble to the surface. One lies in an age-old (within my age, anyway) dispute between traditional (often called frequentist) math stats and those who follow the Bayesian path. From my point of view, one those on the other side don't necessarily grant, Bayesian methods are merely a way to inject bias into the results. Bayesians refer to this "data" as prior knowledge, but, of course, the arithmetic can't distinguish between objective prior knowledge and fudging the numbers.
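To make the complaint concrete: with a Beta prior on, say, a default rate and binomial data, the posterior mean blends prior and data by simple arithmetic, and that arithmetic is indifferent to whether the prior is knowledge or fudge. A toy sketch, all numbers invented:

defaults <- 30; loans <- 200   # observed data: a 15% default rate

# Beta(a, b) prior + binomial data -> Beta(a + defaults, b + loans - defaults)
posterior_mean <- function(a, b) (a + defaults) / (a + b + loans)

posterior_mean(1, 1)    # flat prior: about 0.153, close to the data
posterior_mean(2, 98)   # "optimistic" prior centered near 2%: about 0.107

Same data, a third fewer projected defaults, and nothing in the math flags the difference.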

So, I set out this morning, being Columbus Day (a day honoring Discovery for some, invasion for others), to see whether there're any papers floating about the intertubes discussing the proposition that our Wall Street quants (those who fudged the numbers) bent Bayesian methods in their work. As I began my spelunking, I had no prior knowledge about the degree to which Bayes had taken over the quants, or not; quants could still be frequentists. On the other hand, it is quite clear that Bayesian methods are far more mainstream than when I was in grad school. Could Bayes have taken significant mindshare? Could the quants (and their overseer suits) have abused the Bayesian method to, at the least, exacerbate, at the most, drive The Great Recession? It seemed to me likely; any crook uses any available tool. But I had no proof.

Right off the bat, search gave me this post, which references a paper (at a pay site) from the Sloan Management Review. The SMR paper puts the blame on risk management that wasn't Bayesian. You should read the post; while it discusses the SMR paper on its merits (the paper itself I couldn't read, of course), it also discusses the flaw in Bayes (bias by the name of judgment) as it applies to risk management.

Continuing. While I was a grad student, the field of academic economics was in the throes of change. The verbal/evidence/ideas approach to scholarship was being replaced by a math-y sort of study. I say math-y because many of the young Ph.D.s were those who had flunked out of doctoral programs in math-y subjects. Forward-thinking departments recruited them to take Samuelson many steps further. These guys (almost all guys, back then) knew little if anything about economic principles, but department heads didn't care. These guys could sling derivatives (initially the math kind, but eventually the Wall Street kind) on the whiteboard like Einstein. I noted the problem then, in the 1970s. This paper touches on the issue (linked from here): "These lapsed physicists and mathematical virtuosos were the ones who both invented these oblique securities and created software models that supposedly measured the risk a firm would incur by holding them in its portfolio." Nice to know it only took 40 years for the mainstream pundits to catch up.

And, while not specifically about Bayesian culpability, this paper makes my case, a thesis I realized about 2003 and have written about earlier: "Among the most damning examples of the blind spot this created, Winter says, was the failure by many economists and business people to acknowledge the common-sense fact that home prices could not continue rising faster than household incomes." One of those D'oh! moments. McElhone, the Texas math stat, introduced me to the term 'blit': 5 pounds of shit in a 4-pound sack. By 2003, and certainly after, the US housing market had become rather blit-y. The article is well worth reading. There are links to many other papers, and it raises the question of the models used by the rating agencies. Were these models Bayesian? Were the rating agencies injecting optimism?

That leads to one last paper, which I'll end with, as it holds (so far as I am concerned) the smoking gun, one I found blindingly obvious back in 2003: "Even in the existing data fields that the agency has used since 2002 as 'primary' inputs into their models they do not include important loan information such as a borrower's debt-to-income (DTI)..."

This few minutes' trek through the intertubes hasn't found a direct link between Bayes and the Great Recession. I know it's out there. I need only posit such as an initial condition to my MCMC (look it up).
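For anyone who does look it up: MCMC just random-walks its way to the posterior, starting from wherever you posit. A bare-bones Metropolis sketch, targeting a standard normal purely for illustration:

set.seed(1)
log_target <- function(x) -x^2 / 2      # log of an unnormalized N(0, 1) density

n <- 10000
chain <- numeric(n)
chain[1] <- 5                           # the posited initial condition
for (i in 2:n) {
  proposal <- chain[i - 1] + rnorm(1)   # symmetric random-walk proposal
  accept   <- log(runif(1)) < log_target(proposal) - log_target(chain[i - 1])
  chain[i] <- if (accept) proposal else chain[i - 1]
}
mean(chain[-(1:1000)])                  # near 0 once the start burns off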

07 October 2011

Book 'Em, Danno

For those of us of a certain age, the notion of physical books is important. I recommend any and all of Nick Carr's books, which deal, in significant measure, with ... books.

After finally figuring out where the house is, UPS dropped off my copy of Cleveland's "Visualizing Data" a day late (the widely-regarded-as-incompetent Post Office, FedEx, and the Pizza Guys all manage to find it). It's published by Bell Labs/AT&T (back when it still sort of was, 1993) and Hobart Press, which is kind of down the street from Bell Labs. Hobart's only listed books are Cleveland's.

What makes me giddy is what's printed at the end of the Colophon (few books even have one any longer). This is it:
Edwards Brothers, Inc. of Ann Arbor, Michigan, U.S.A., printed the book. The paper is 70 pound Sterling Satin, the pages are Smythe sewn, and the book is covered with Arrestox linen.

This is a real book. See you in a bit. Time to do some reading.

04 October 2011

King Kong, Enter Stage Right

Well, the Gorilla just sat on the couch. Oracle OpenWorld has this announcement.

Buried kind of deep is this:
Oracle R Enterprise: Oracle R Enterprise integrates the open-source statistical environment R with Oracle Database 11g. Analysts and statisticians can run existing R applications and use the R client directly against data stored in Oracle Database 11g, vastly increasing scalability, performance and security. The combination of Oracle Database 11g and R delivers an enterprise-ready deeply-integrated environment for advanced analytics.

OK, so now King Kong has adopted R. Do you see a trend?

03 October 2011

Don't Pay the Piper

Big news day, today. And yet more of interest. We don't need no education.

This is specific to Britain, of course. Note that tuition is £9,000 (at today's rate, about $15,000), which is a piddling amount here in the USofA; community college might be cheaper, in-state and all that. More reactionary, back-to-the-dark-ages assertions. Education isn't just vocational; that's why the education business has both VocEd and real college. And sure, if you want to be an Excel whiz, then learning all that math-y and logic-y stuff is boring and a waste of time. I mean, how much do ya need to know to slap together a PHP web site?

Deja Vu, Yet Again

My (single?) long-time reader may recall that I concluded the Oracle buy of Sun wasn't about Java or Solaris or any software; it was about stealing the one segment of computing Larry didn't own: the mainframe, the IBM mainframe. I was initially alone, so far as I could see, although in the months following I would read an occasional story tending toward the hardware motivation. If memory serves, some Mainstream Pundits explicitly stated that hardware was dead in the new Oracle.

Time to feast on some baked bird, crow specifically. Here's the latest from Oracle.

"'We want to take IBM on in their strongest suit, which is the microprocessor,' said Ellison."

Oracle may, or may not, be able to pull it off. Given that IBM's DB2, off mainframe at least, is adopting (well, it depends on who's defining that word) MVCC semantics, one could conclude that Oracle has gotten the mindshare part of the problem solved.

Political Science

A while back, Simple Talk offered me an article, suggesting that something controversial would be appropriate. I pondered for a bit, and decided not to throw Molotov cocktails as I usually do here. Instead, based on some abortive conversations with apparatchiks in Washington, I set out to demonstrate how one can generate dashboard-style graphs, using stat output from R, all within the database. In this case, the database is Postgres. Here's the piece. Enjoy.
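The piece has the real code; as a taste, here's a minimal sketch of the PL/R route, one way to run R inside Postgres. The plr extension, the results table, and the function name are all assumptions here, and this may or may not mirror the article's approach:

CREATE OR REPLACE FUNCTION bar_chart(fname text) RETURNS text AS $$
  # This function body is R, executed inside Postgres by PL/R.
  dat <- pg.spi.exec("SELECT district, share FROM results")  # hypothetical table
  png(fname)                              # render to a file the dashboard can serve
  barplot(dat$share, names.arg = dat$district)
  dev.off()
  fname                                   # hand the file path back to SQL
$$ LANGUAGE plr;

-- then: SELECT bar_chart('/tmp/share.png');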

01 October 2011

Are We There Yet?

An update on the world of (semi) serious SSD is in order. The Intel 710 is the successor, sort of, to the X25-E. AnandTech has a review and status update. Worth the read for the industry background alone.

The clearest description, and the most logical: "Fundamentally, Intel's MLC-HET is just binned MLC NAND."

I'll mention in passing that AnandTech is dipping a toe into "Enterprise SSD" reviews with a piece on OCZ. Not that OCZ is really serious, of course; the SandForce controllers depend on cleartext data streams, which are getting ever scarcer in the Enterprise.