11 October 2011

A Model Citizen

While it is gratifying to be published by Simple Talk, so many more eyes that way, it isn't a platform where I can continue to prattle on at will. Each piece they publish, most of the time, is a stand alone effort. Since the piece was already rather long, there was one tangent I elected not to include, since it is a separate issue from the task being discussed.

"That subject: cleavages." Well, I only wish (and if you know from whence that quote came, bravo). No, alas, the topic is what to do with regard to fully understanding "bang for the buck". I elided that in the piece, since the point was to show that a useful stat graphic could be generated from the database. But how to discover the "true" independent variables of electoral primacy, and their magnitude? Could it be that with all the data we might have, both for free on the intertubes and costly which we generate, our best model is only 30% predictive? To reiterate, the exercise isn't to predict who'll win (FiveThirtyEight has been spectacular), but rather which knobs and switches a given organizations can manipulate to *change* a losing situation.

If you'll recall, most of the explanatory variables weren't of a continuous nature, that is, real numbers. The fitted lines in the scatterplots used a variation on simple linear regression to fit. The variation dealt with the differing best slopes over ranges. The technique doesn't account for the fact that most of the explanatory variables are either categorical (yes/no) or discrete (strongly disagree to strongly agree).

For this kind of mixed data regression, one typically uses analysis of covariance (aka, ancova). R, as one would expect, provides this. The Crawley book devotes a full chapter to ancova. I'll direct you there. Some say that discrete independent variables can be used directly in simple linear regression. Others would run to ANOVA immediately. Some distinguish categorical variables (gender) from discrete scaled variables (the 5 point agree scale on gun control). It is, suffice to say, not a slam dunk any way you go.

Exploratory data analysis, what R is particularly good at, is where the apparatchiks should be spending much of their effort (not worrying about the entrails of Rails!). Assuming that money is the driver of winning is an assumption, frequently wrong in the real world. Since their organization is large, national in scope, and full of dollars to spend; spelunking through all available data is the directive. That assumes, of course, that winning elections, without regard to policy positions, is the goal. Think of selling nappies.

While the goal of the piece was to display something simple to the Suits, determining a more accurate predictive model, which will be implemented with traditional text output, is the real goal. Same is true of selling nappies. The analogy is not so far fetched, as this book demonstrates; there have been similar treatises in the years since.

No comments: