30 April 2012

The Tail Wags the Dog, 95% CI

There's been a spate of R pieces recently, dealing with R as a programming language, and in particular its assumed deficiencies. Here, and here, and here are examples.

It's a bunch of tails wagging the dog, and doesn't address the real question: how to make R the de facto stat pack where SAS, SPSS, and Stata tread currently. As mentioned in the Triage piece, some reviews of R are concerned with how much is R, how much C, and how much Fortran. Various reviewers have been puzzled by the poles: more R than expected, and less R than expected. There are reported to be 3,800 packages in CRAN, and rather fewer (554) in Bioconductor. Call it 4,400 in round numbers. Assume that a package has, on average, 3 authors, which I think is generous, given how many grad students are involved (hell, Hadley Wickham does ggplot2 all by his lonesome). That's 13,200 folks.
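
As a quick sanity check on the arithmetic (the package counts are the figures quoted above; three authors per package is this post's own generous assumption):

```python
# Back-of-envelope count of R "developers" from the figures above.
cran_pkgs = 3800        # reported CRAN package count
bioc_pkgs = 554         # reported Bioconductor package count
total_pkgs = cran_pkgs + bioc_pkgs          # 4,354; call it 4,400 in round numbers
authors_per_pkg = 3                         # assumed average, likely generous
developers = 4400 * authors_per_pkg         # using the rounded package count
print(total_pkgs, developers)               # 4354 13200
```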

So, we have 2,000,000 useRs and 13,200 "developers" (not counting the core team maintaining the language). Which group should the "language" serve? Clearly, the 2,000,000. In particular, an insurgency into the SAS/SPSS beachhead will not be supported by emphasizing R as a coders' paradise (it isn't; too many warts), rather than as an analysts' golden sword. It seems to me, having used most stat packs and "real" programming languages over the years, that this divided-duties situation is what makes for some of the oddities of R. Oddities from both a command writer's point of view and a coder's. The first link has all the gory details. I have used one 4GL, Progress, which was the best in high button shoes for databases in the early 1990s, and which was bootstrapped. But the audience was other coders (the report generator had its own syntax, a bit of RPG), not analysts, so having one syntax for two groups of coders wasn't a big deal. With R, the two constituencies are much more different.

One can build a language successfully without being a (group of) language designer(s) by profession: Perl and Ruby being the two best known examples. Contrast with python (defined by a mathematician) and java (language builder). Which of these one finds most comfortable as a working syntax says more about oneself than about the language. For what it's worth, python. I've read Chambers's book, and a good many others, and I'm still not clear about why or how R's syntactical oddities are supposed to uniquely serve the purpose of stats. Clearly, the vector paradigm comes from Fortran and BMDP and does fit. The rest, not so much. And it is true that numerical programming has been drifting from Fortran to C for some time; one can argue that this represents a lowering of the semantic level, and thus isn't helpful.

As I commented on a post, Rcpp is the likely platform for development going forward; how soon, I can't say. Such a transition will mean that stat folks won't be the driving authors anymore, unless they choose to be real coders too (or primarily). This opinion is all based on the received wisdom that what's holding R back from displacing SAS/SPSS is speed. I don't think it's the open source thing, really. After all, even IBM uses linux. The file-based structure of SAS/SPSS does have advantages.

25 April 2012

Typhoid Marys

It turns out that there is an identifiable patient zero (well, two) to The Great Recession. "Money, Power and Wall Street" is a four-part series, broadcast as two two-hour segments on "Frontline" on PBS. The first two parts aired last night, and the remaining two will air next Tuesday night. Check local listings for time, but 9 PM is standard. Also, stations typically re-broadcast the first part in the week between. Again, check local listings. It's worth watching.

I've kept track of the Great Recession analyses, mostly through newspaper and magazine accounts, not by reading every book written. Can't afford the time or the money, as this endeavor is gratis. The reason I'm admonishing folks to watch is that there is new, to me at least, information. The piece is constructed from interviews with participants, professional observers (other Fed members, for example), and authors. Paulson, Geithner, and Bernanke are only shown in news snippets. I doubt we'll see interviews next week.

The story is told chronologically, and begins in 1994. Here's where it gets creepy, for me anyway. Regular readers may recall my telling of what it was like being an economics grad student in the early 1970s. Up to then, economics, business, and finance existed, and were taught, from the point of view of policy and historical evidence. Policy A intended to elicit response B, and did so in country X at time Y. Structural requirements were M, N, and O, and were strongly in place. That sort of thing. Learn from experience, and propose policies that logically met rational expectations of greedy people. Not a whiff of algebra in sight. Then came the fruition of Samuelson, and both grad and undergrad departments wanted to make it all more rigorous.

The early insurgents into these departments were, mainly, those who couldn't or wouldn't make it as mathematicians, physicists, or math stats. The insurgency was led by the dregs. In particular, these people had little to no idea of the policies or history of the fields into which they were enticed. What they could do was manipulate algebra a tad better than a (econ, business, or finance) Ph.D. from 1960. For this they got a safe teaching position and a steady paycheck.

Now, run the clock forward about a generation, to 1994 in Boca Raton. This is where it all started: with two 20-something GRRLSSS from JP Morgan who invented the CDO. Their interviews are sliced up and meted out in sound bites through the two hours. Toward the end of the program, they kinda, sorta admit that they hadn't any idea what they were doing. A generation of quant insurgency, when they'd taken over the fields, and they still hadn't any idea. The orphans had taken over the orphanage.

The program doesn't solely blame them, and it does go a bit overboard in blaming Clinton and the Democrats. The facts, even those presented in the course of the program, are that it was the banks and the Republicans who created the legislation. Clinton was gulled and followed along. Notably, Schumer is not mentioned at all.

Two points stuck out to me. First, no one states the obvious: that housing prices are inextricably tied to median income, which was either stagnant or falling, depending on which numbers you looked at, during the Dubya years. Therefore, for housing prices to explode, corruption had to be at play. No mo money. How could households afford these McMansions on ditch digger wages? They couldn't, of course, so it was at least willful ignorance by banks and regulators. Also, the program focuses on banks, which makes sense from the point of view of fall 2008, but from a forensic point of view, it was mortgage companies which pushed the subprime and ARM pollution into the system. This is mentioned, but only in passing.

Second, only one person, an author, states the other obvious point. The alleged reason these two quants created the CDO was to minimize risk. They state that the process had existed in other corners of finance for some time; they merely brought it into retail. Lame excuse. Financial manipulation, credit default swaps in particular, doesn't reduce risk. It merely rearranges it, much like the deck chairs on the Titanic. While Lehman may have felt it had sold off the risky bits, when they all started to stink, the coverage didn't exist, simply because it couldn't. The mortgage companies and banks had, largely, put all of their eggs in one, and the same, basket; one which had been manipulated to appear both high return and low risk. And they had done that because Greenspan sought to delay Dubya's recession by cratering interest rates. The program, last night anyway, doesn't tell us that.

On the whole, watch part one, and tune into part two next Tuesday.

22 April 2012

Solid Rock

Here in sunny Connecticut, land of steady habits (well, until the insurance folks dived into the subprime mess), there are classic rock radio stations. Not surprising, since baby boomers run the insurance industry, and the insurance industry (still) runs Connecticut. One of those stations is WPLR, and it's not owned by Clear Channel. CC, by the bye, is a Bain Capital company. Think of that as you will. Not surprisingly, using and writing about PL/R makes me feel like a rock star. Mark Knopfler, mayhaps.

But just a short note that upgrading PG from 8.4 to 9.1 turned out not to blow out PL/R, so it wasn't necessary to do a rebuild, with all that entails. It is necessary to build PG with plr language support, so a drop-in from the repository won't work. Triage continues to evolve. Darwin would be so proud.

17 April 2012

Things That Go Bump in the Night

Is the Web sustainable? That's a question for quants and investors and policy makers and inventive types. I've mentioned a few times here that advert-based webbiness is doomed, in any long term. Since the users are bifurcated twixt those staring at the browser, clicking, and the real clients, those paying for the clicks, sites which rely on adverts are susceptible to any vehicle (not even of the Web) which provides more bang for the advert buck. The sites attempt to serve two masters, but it's the advert buyer who gets his way. This ticks off the users, who then install AdblockPlus, or similar. This is why Google runs scared, all the time. MySpace looms over all.

The fickleness of advert buyers can't be overestimated. From the point of view of a quant intent on predicting where the Web is going (whether for stock picking reasons or development trends or ...), it's all a wedge of black swans; there aren't any quantitative breadcrumbs to follow, just occasional tsunamis which destroy the Thought Leaders. Today, Seeking Alpha posted a paean to Pandora. This is not to say I'm recommending anyone buy Pandora, only that the true nature of Webbiness is finally getting the attention it deserves.

So long as the Web is based on the same model as newspapers, the Web is just as vulnerable.

16 April 2012

Yogi and Booboo [UPDATED]

For those who like to gamble in stocks, Seeking Alpha is a pumper's delight, balanced by the occasional hit-job on Chinese reverse mergers. Yesterday brought an interesting piece on SAP, with a reasonably complete backstory but a somewhat uninformed title. The useful part is the description of recent history. What's wrong is that SAP, from inception, had its own "database" to support the various applications, and no other datastore could be used. It took Oracle some considerable effort to get access, here and here. And then there's ABAP, the barbarous 4GL source language, in German, of course. That may have changed in recent years.

The main subject of the piece is HANA, SAP's in-memory database, which it intends to market as a separate product. Both Oracle and IBM, in the last few years, have acquired in-memory database companies, although there's not been a whole lot of press about them since the initial acquisition stories. SAP isn't doing anything new here.

A quote from the piece:
"... it's the apps that count, and increasingly, it's the database that makes the apps. Once again."

I couldn't agree more.

But here's what doesn't get said. In-memory databases will have to be compact; DRAM storage will always be more expensive per byte than SSD or HDD. The RM to the rescue. There is no other data model which is more parsimonious with data (if any of the alternatives can even be called models). High NF structures will be key to efficient implementation of in-memory databases. For those with long memories, Texas Memory began in the SSD business with DRAM devices, and held to that for a long time, perhaps too long. With high NF schemas, in-memory databases aren't penalized for joins or CTEs. Data is purely orthogonal, thus robust, and with 64-bit memory addressing, all data ought to be equally accessible. I've gone on about that before, so just insert here.


Found an additional article, which had this to say:
"The longer term benefits of HANA will require new software to be written -- software that takes advantage of objects managed in main memory, and with logic pushed down into the HANA layer."

Yet another implication that the data is logic, and logic is data.

10 April 2012

Following The Money

For those who still watch the Golf (the Masters) on TV: you've likely seen the "ads" about the shitty performance of American students in math and science. Exxon/Mobil paid for them. One wonders why.

From a few years ago, an article describing Exxon's outsourcing of high value jobs to India.

Beyond that bit of hypocrisy is the simple fact: folks do tend to follow the money. With the service economy meme of the last couple of decades, enrollments in science/math/engineering have fallen, while those in money laundering have risen, on a relative basis. Pavlov was mostly right: unless a person considers some field a "calling", they'll do whatever pays the most, regardless of the nature of the work. Welcome to social Darwinist capitalism.

Here's the science/engineering data. No similar data came up as I searched, but this paper from SFSU did. Yes, just one university, but very detailed. Just compare what happened to EE and Corporate Finance. It wasn't until after the Great Recession that enrollment in finance collapsed.

07 April 2012

The Borg of Stats

I've been meaning to write up, and did post comments in various places concerning, the infiltration of R into RDBMSs, beyond the wonderful PL/R for PostgreSQL.

Well, the folks at Revolution Analytics have published a piece laying out the universality, in a manner of speaking. You will be assimilated.

Must Be The Season of the Witch

Great weeping and wailing and gnashing of teeth greeted the publication of the March monthly employment number. Note: that one number. As previously predicted here, the January and February numbers were boosted by the seasonal adjustment weights, and the piper would be paid in March.

Here are the unadjusted numbers (as usual, in thousands) for the three months:

Mar. 141,412
Feb. 140,648
Jan. 139,944

So, the raw sample value is increasing at a roughly steady rate. The NY Times has a long article, with only a glancing mention of the seasonal adjustment effect.
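
The month-over-month changes in the raw numbers quoted above can be checked with a quick sketch:

```python
# Unadjusted payroll figures from the table above, in thousands of jobs.
raw = {"Jan": 139_944, "Feb": 140_648, "Mar": 141_412}

# Month-over-month deltas: roughly steady raw growth, despite the
# gloom over the single seasonally adjusted March number.
deltas = {"Feb-Jan": raw["Feb"] - raw["Jan"],   # 704
          "Mar-Feb": raw["Mar"] - raw["Feb"]}   # 764
print(deltas)
```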

04 April 2012

Red Hot Chili Peppers

Since mid-day yesterday, and the release of the current FOMC minutes, Mr. Market has been doing a Norman Bates berserker. Therein lies a tale.

A few miles from the drafty garret in which I type this deathless prose lie the remains of Scovill Corporation. Scovill, the company, made brass widgets and, from there, some consumer goods. Its near-namesake, Wilbur Scoville, is better known for his scale of chili pepper heat. The way I've come to know the Scoville Scale is thus: people are asked to compare two glasses of water, one of which is pure (or sugar) water, the other of which is water that's been doused with chili pepper extract. The ratio of water to chili at which the tasters can't tell the difference is the Scale number for that sample. Simple enough.
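
The dilution procedure described above reduces to a single ratio. A toy sketch, with made-up sample values (real Scoville testing uses standardized extracts and taster panels):

```python
def scoville_rating(parts_water_at_threshold, parts_extract=1):
    """Rating = ratio of water to pepper extract at the point where
    tasters can no longer detect the heat."""
    return parts_water_at_threshold / parts_extract

# e.g. if the heat disappears at 5,000 parts water to 1 part extract,
# the sample rates 5,000 on the scale (hypothetical numbers):
print(scoville_rating(5000))  # 5000.0
```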

The same could be said of interest rates, in the pure sense: what is the amount of interest needed to persuade people to forgo payment today until some time in the future? Of course, the simpleminded answer is: "more, lots more". But to get the real number, one has to either intuit or experiment; find the Scoville Number (in the heads of savers) for money. Economists call this the time value of money or, more accurately, the rate of time preference. Finance folks tend to view the definition the other way round: the available no-risk interest rate, take it or leave it. Note that this has nothing to do with deciding what to do with the money twixt now and then. Interestingly, Wikipedia falls down this time. The article is written, I suspect, by a finance graduate, not an economist, since it *assigns* earned interest as the value. That's not what the time value of money is: it is the premium that must be paid for the time spent not holding the money.

Long-term interest rates on no-risk (do such exist?) bonds (typically, government) are most often taken as a proxy for this time premium. By that measure, how are we doing? Here's a section from Wikipedia on interest rates. And a short quote: "...the Austrian School of Economics sees higher rates as leading to greater investment in order to earn the interest to pay the depositors." The problem with that assertion, of course, is that fiduciary manipulation doesn't, and never has, generated real returns in the economy. No amount of fiduciary meddling will create another Edison or Einstein. Real returns come from technological improvements; new plant and equipment yield more bang for the buck and thus generate real returns on real investment. No matter what the Right Wingnuts say, that last clause is all one needs to know.

What we know now, without question, is that it is much, much easier to generate higher monetary returns through financial corruption than it is by inventing new, superior technology. The former requires only stealth, while the latter requires intelligence. One is more common than the other.

Once again, back to the Wiki; here's a precis of 19th century US monetarism. I don't know the provenance of this page, but it provides a list of interest rates going back to 1800. It does show, although to a lesser extent than I recall reading elsewhere, that long-term interest rates in the 19th century were at times below short-term ones. Orthodox economists deny that can happen. It was true that much of 19th century USofA was deflationary. Why? The usual suspect is gold. It is in limited supply, so as economies grow, deflation has to occur; only so much to go around. The Right Wingnuts don't wish to understand that, of course.

The basic notion that long-term interest rates are necessarily higher than short-term rates has been debated. On the face of it, it isn't the term of the investment which determines the value (rate of return), but the quality of the technology, weighted by its market control (monopoly and monopsony; oligop- versions as well). A short-term investment which enables a high return quickly is more valuable than a cruder, but longer lived, investment. In the world of IT these days, a couple of fiscal quarters is all one gets.

So, what then is the time value of money? It seems to be about 2 to 3%. Current rates, with near deflation, aren't so far out of bounds.
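
To put that 2 to 3% in concrete terms, here is a minimal discounting sketch. The rates are this post's estimate of the pure time preference, not market data:

```python
def present_value(future_amount, rate, years=1):
    """Discount a future payment back to today at the given annual rate:
    PV = FV / (1 + r)^t."""
    return future_amount / (1 + rate) ** years

# The premium for waiting a year on $100, at the estimated 2-3% time
# preference, is a couple of dollars:
for r in (0.02, 0.03):
    print(round(present_value(100, r), 2))  # 98.04, then 97.09
```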

The Horn of Plenty

SSD news has flooded in the last couple of days. Head over to AnandTech for Intel's latest, as well as LSI (a fairly new player and new home of SandForce silicon) and OCZ.

Of most interest is the OCZ Vertex 4, built with in-house Indilinx silicon. Prosumer/consumer gear begins to approach enterprise performance; then again, OCZ continues to tout its invasion of the Enterprise Space. One aspect I noticed in the Vertex 4 piece is that they're reporting random I/O in MB/s rather than IOPS. Thus one can see that total bytes moved isn't hugely different twixt sequential and random, for the Indilinx silicon. I do wonder how soon the two components of SSD performance, NAND size and controller sophistication, will collide. At some point, the unreliability of decreasing feature size in the NAND will demand so many cycles in the controller that SSD performance will level out. And, perhaps, decline. Not yet, though.
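
The two ways of reporting random I/O are related by the transfer size, so one is recoverable from the other. A sketch, with hypothetical throughput figures (not OCZ's published numbers):

```python
def iops_from_throughput(mb_per_s, io_size_kb=4):
    """Convert random-I/O throughput (MB/s) into operations per second:
    IOPS = throughput / bytes-per-operation. Decimal units throughout."""
    return (mb_per_s * 1000) / io_size_kb

# e.g. 200 MB/s of 4 KB random reads works out to 50,000 IOPS:
print(iops_from_throughput(200))  # 50000.0
```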

The Vertex 4 doesn't sound ready for Prime Time Enterprise (no supercap), but for development machines building RDBMS applications using full-fledged enterprise SSD, boy howdy is this yummy. (Perversely, OCZ's share price is doing a swan dive!) Less and less reason to ignore Dr. Codd.