29 November 2011

What's The Difference

To continue with the Triage project, I've spent a day or two with more graphics texts (about which I'll be musing anon), and getting more familiar with the mapping scenarios.

Separate from the scatterplot matrix data shown in Triage, which would be used to measure the micro components of a campaign, is the question of displaying national trend, twixt Us-uns and Them-uns. For that one turns to map graphics, which is a whole other world. Still in R, mind, but not statistical in nature.

What I have recently found is this site, which replicates a US map with 2004 election results. Now, our Apparatchiks won't be downloading zip files from outside sources, of course. On the other hand, the files make for a perfect dive board for the PoC. Load them into PG, swapping Republican for Bush and Democrat for Kerry and Other for Nader (that's not much of a stretch!). Just for completeness, I'd found much earlier (but can't find that I'd cited), this map exercise, but as of now, the author has been too embarrassed to post the R that does it. While only some form of income data (not specified), it is a follow-on (linked to) to an election stream map set, also not supplied with the R that made it. Nevertheless, one can conclude that with enough time, this is a task suited to R. As mentioned in an earlier post, the animation bits are likely via googleVis.

I'll be using his data, since it provides a basis and I don't have to concoct some, though not the R he used (still using the stock R from Wickham). It's not clear how the numbers were derived.

What is really useful about the 2004 map posting is the data source: a county level count. Get these into a PG table, and we have a surrogate for data which our Apparatchiks would have, and which we can further expand with relatively simple SQL; just to see how a map would change. The notion for this part of the Triage effort is to measure the effect of national campaign spending, post some event/ad/debate/foo, at the POTUS/party level; a RNC/DNC (or 501/527/foo group) view of the country.

Here's the new PG table where we load:

CREATE TABLE public.election (
state varchar(25) NULL,
county varchar(25) NULL,
tot_precincts int4 NULL,
precincts_reporting int4 NULL,
republican int4 NULL,
democrat int4 NULL,
other int4 NULL,
constraint pk_election unique(state, county)
TABLESPACE pg_default

And we get it loaded thus (concated from the state/county files in the zip):

copy public.election from '/databases/rawdata/2004election/output.txt' using delimiters ';' csv header

Note that column names are underscored, rather than camelCase, since PG forces quoting to use anything in the database if there are Caps in names. Yuck.

And here's the PG + PL/R (I've left it as is; comment/uncomment to generate each of the maps, this is the difference map, shown last. The first set are for the two event maps, while the other is for the diff map):

CREATE OR REPLACE FUNCTION "public"."us_graph" () RETURNS text AS
states <- map_data("state")
#elections <- pg.spi.exec ('select state, sum(republican) as "Republican", sum(democrat) as "Democrat" from election where event_number = 2 group by state order by state');
elections <- pg.spi.exec ('SELECT a.state, sum(a.republican - (SELECT b.republican FROM election b WHERE b.event_number = a.event_number - 1 and a.state = b.state and a.county = b.county)) as Republican FROM election a where a.event_number = 2 group by a.state ORDER BY a.state ')
elections$state <- tolower(elections$state)
elections$republican <- elections$republican/10000
choro <- merge(states, elections, sort = FALSE, by.x = "region", by.y = "state")
choro <- choro[order(choro$order), ]
#p <- qplot(long, lat, data = choro, group = group, fill = Republican / Democrat, geom="polygon", asp=.6)
p <- qplot(long, lat, data = choro, group = group, fill = republican, geom="polygon", asp=.6, main = "Poll Shift", xlab = "", ylab = "")
p + labs(y = "", x = "")
p + opts(panel.grid.major=theme_blank(), panel.grid.minor=theme_blank(), panel.background=theme_blank(), axis.ticks=theme_blank())
p + scale_x_continuous("")
p + scale_y_continuous("") + coord_map()
p + opts(axis.text.x = theme_blank(),axis.text.y = theme_blank(), axis.title.x = theme_blank(), axis.title.y = theme_blank(), axis.tick.length = unit(0, "cm"), axis.ticks.margin = unit(0, "cm"))
p + scale_fill_gradient(limits = c(0, 90))

All that spinach for the library calls got eliminated by making an .Rprofile in postgres user's home with the following line:


You could also call out the libraries explicitly; both ways work. The additional spinach is various directions to eliminate the lat/long grid on the maps. None work!

Here's the Event 1 map:

Now, let's update the table to include an event_number (easier than using a date, anyway) and an event_type. That way, we can generate maps in sequence, but also note what sort of event just/last happened. We could also generate maps sequences for only certain sorts of events (they'd be in a check constraint).

So, let's make some new data:

insert into election (select state, county, tot_precincts, precincts_reporting, republican * .8, democrat * 1.2, other, 2, 'foo' from election where event_number = 1);

We wouldn't get such dramatic shifts (modulo Swift Boats) in the real world, but this is PoC territory.

This yields a new Event 2 map:

I'm still grappling with my main wish list item: showing the changes in the colors. As it stands, each map takes the full gamut, leaving the legend to display the shifts; doesn't do that all that well. Viewed another way, why not show the delta of polling strength (vote displays are a bit late, after all)? We can do that with a single map. How to get the data out of the election table? For that a correlated subquery is sufficient. It's that big SQL statement.

Here's what the delta map looks like:

What we see is the shift, in absolute, not relative, numbers. So Texas looks to be more Democrat from Event 1 to Event 2 just because it started with more votes; same with California.

Getting rid of the lat/long grid is still a problem, but then, this is a free PoC. Cheap at half the price.

15 November 2011

The Red and the Blue

I was going to build a US map (using R facilities) showing net federal funds at the state level, but found there are a colossal number of these already. No need to demonstrate that yet again. The point would be to demonstrate doing so within the RDBMS, following in the Triage piece's footsteps. I'll just show the code to generate a US map, as shown by Wickham's book.

CREATE OR REPLACE FUNCTION "public"."test_graph" () RETURNS text AS
library(maps, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(plyr, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(proto, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(reshape, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
library(grid, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.13/")
library(ggplot2, lib.loc="/home/postgres/R/x86_64-unknown-linux-gnu-library/2.14/")
states <- map_data("state") arrests <- USArrests 
names(arrests) <- tolower(names(arrests)) 
arrests$region <- tolower(rownames(USArrests)) 
choro <- merge(states, arrests, by = "region")
choro <- choro[order(choro$order), ] 
print(qplot(long, lat, data = choro, group = group, fill = assault, geom = "polygon", asp = .6) + borders("state", size = .5)) 

Rather more text than the Triage demonstration. R is built on a multi-user model, but is normally used as a standalone application on a PC. And then, there's the *nix issue. The upshot is that nothing need be done to use "base" modules, and those include the scatterplot matrix in the Triage piece. As mentioned in the piece, R supports (at least) two other graphics engines: lattice and ggplot2. Lattice is an extension of base graphics, while ggplot2 is an implementation of a grammar based graphics engine. This Grammar of Graphics is documented (but not a code base) in Wilkinson's book, at 712 pages and no code, we'll see (just Amazoned it)!

This map was created with ggplot2 functions, although no database data is used. It is necessary to call out each package/library explicitly, as well; PL/R doesn't know to load dependent packages, alas. In the context of the Triage piece, the application would show the net position of the party's candidates by state, along a Blue/Red vector. Just so happens that the R installation includes some state level data, which Wickham uses illustratively. One might extrapolate that Red States are more violent than Blue States, on the whole. Not that I'm making such an extrapolation, of course.

Loading of non-base libraries can be done in one of two ways: if the R engine library directory has global write permission (not normally so under *nix) then any package (which is then called a library in use, yeah, I know) loaded by any user goes to the directory and can be referred to directly by PL/R; on the other hand in usual installs, each user has packages installed to a local directory. Since postgres (the engine) runs as postgres the user, the packages need to be installed by postgres (the user) from an R session. In a corporate (and political campaigns are very much so) environment, standards and conventions would need to be established.

Ideally, what I'd want, following on the thesis of the Triage piece, is a clickable map (states), but that gets into non-rectangular html buttons (Google maps, I'd wager); not a topic I'm conversant with, yet. Whether it would make sense to generate the map in R with the cruft needed to implement the button logic is another puzzle. I think not, but not sure; R doesn't impress me as a strong string manipulation language. Ideally, then, the map would not only be generated by R, but each state would be a button, which would call a second R function in postgres to show the county/municipal/zip map. Could be a bit of work, but your candidates are worth it.

Here's the picture (this is a png, since Blogger won't chew pdf):

09 November 2011

Honesty in Government

[UPDATE] -- copied Sales the first time, same issue.

As I transition into data scientist, which means re-adding my stats mojo to my RDBMS mojo (not replacing the latter with the former, by the way), I've come across more than a few postings and writings in the statosphere about truth in data. The writing is always by data professionals (not lobbyists and the like, near as I can tell), and the point is always that the data is truth. By truth one means the most accurate picture of the real world, unadorned by propaganda.

Today's Federal data dump includes September wholesale inventories. They were down .1%. Here's the quote: "..were $462.0 billion at the end of September, down 0.1 percent (+/-0.2%)* from the revised August level." What's the starry thingee, one might ask? Well, it's the link to a footnote.

Here's the footnote:
"* The 90 percent confidence interval includes zero. The Census Bureau does not have sufficient statistical evidence to conclude that the actual change is different from zero."

Two points to note about the footnote: 1) the CI is 90% level, which is very generous and 2) it spans 0, which means what the note says. I wonder how many of the reports about the report will bother to tell us about that.

Here's the link to the original; click the link for Excel or PDF.

07 November 2011

Write Me a Song So Lovely

It's mostly considered declasse' to simply refer to somebody else's post on some subject. And I don't do so very often, but this post on writeable CTEs is too much to pass up, not least because I've yet to move to 9.1, and won't likely until it hits 9.1.3 (currently at 9.1.1). What makes the post of interest: Brown states that PG is alone with wCTE, and some quality time with searching supports that. My interest rests on the motivation for this endeavor: put, and keep, the logic in the datastore and let the client pretty-pretty the screens as it wishes. Writeable CTEs take this notion yet another step forward.

Not that I'd choose a joined table with a calculated, by code, column. But that's another story. I'd be more inclined to implement an inventory/order line update, since that fits with the canonical example of join. For example, if your customers are mixed with regard to price changes, some will get updated prices on unfilled order lines, the rest not. In that circumstance, you'd want to update some order lines when the inventory (or price break) table is updated. That sort of thing. 9.1.3 shouldn't be too far away.

Good on 'ya, as Texans say.