25 September 2016

Thought For The Day - 25 September 2016

It's advert time, but you should watch out for reruns of "Parts Unknown", since Bourdain somehow managed to get Obama to have dinner at a noodle shop in the middle of Hanoi. They had an adult conversation on various topics. Now, think for a moment whether that could have happened with King Donald of Orange.

22 September 2016

Dew Drop Inn, The Good News Cafe - part the second

It's kind of quick for a part the second, but Nate Cohn has let another cat out of another bag.
Well, well, well. Look at that. A net five-point difference between the five measures, including our own, even though all are based on identical data. Remember: There are no sampling differences in this exercise. Everyone is coming up with a number based on the same interviews.

With regard to the Census/BLS earnings surveys, here's how to decline. More bad data. Yum. Ayn Rand would be proud to turn back the clock to 1800.

20 September 2016

Dew Drop Inn, The Good News Cafe

The regular reader may recall the admonition in these endeavors that macro analysis is fraught with danger: nearly all the data is from sample surveys, of varying quality and coverage. This reader, who hasn't been living under a rock or as "The Martian", is aware that 2015 income has been widely reported as having risen in the last year. For the first time in many years. The reporting, in some places, does admit that the 2015 level is still below 2007 levels; but who's counting?

Here's the public release:
The increases of 5.3 percent and 5.4 percent for family and nonfamily households were not statistically different.
In fact, if you read through the various sections, that sentence repeats and repeats and repeats...

A question about stat sig makes it all a bit worse:
The Census Bureau uses 90 percent confidence intervals and 0.10 levels of significance to determine statistical validity. Consult standard statistical textbooks for alternative criteria.
-- here

So, right off the bat, we've got squishy differences. By the usually accepted stat sig threshold, .05, it wouldn't even be close.

It gets better. There's a link to a spreadsheet with some underlying numbers. If you look at these numbers, the claim is that most are stat sig at .10!
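To see how much work the choice of threshold is doing, here's a minimal sketch of the normal-approximation test. The point estimate and standard error below are invented for illustration, not the Census figures; the point is only that the same difference can pass at the Bureau's .10 level and fail at the conventional .05:

```python
# Hypothetical numbers -- NOT the Census estimates -- chosen so the
# test statistic lands between the two critical values.
diff = 2.1   # estimated difference, percentage points (made up)
se = 1.2     # standard error of the difference (made up)

z = diff / se   # normal-approximation test statistic

z_90 = 1.645    # two-sided critical value at alpha = 0.10 (Census's choice)
z_95 = 1.960    # two-sided critical value at alpha = 0.05 (the usual default)

print(f"z = {z:.2f}")
print("significant at .10:", abs(z) > z_90)
print("significant at .05:", abs(z) > z_95)
```

Run it and the first test says "significant", the second says "not even close to the usual bar". Same data, different headline.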

If you follow the first quote link (page 6):
The effect of nonresponse cannot be measured directly, but one indication of its potential effect is the nonresponse rate. The basic CPS household-level nonresponse rate was 14.9 percent. The household-level CPS ASEC nonresponse rate was an additional 15.8 percent. These two nonresponse rates lead to a combined supplement nonresponse rate of 28.3 percent.

Following on, one finds the description of imputing missing responses:
Multiple imputation is a general approach to analyzing data with missing values. We can treat the traditional sample as if the responses were missing for income sources targeted by the redesign and use multiple imputation to generate plausible responses. We use a flexible semiparametric imputation technique to place individuals into strata along two dimensions: 1) their probability of income recipiency and 2) their expected income conditional on recipiency for each income source.
Much of that document is devoted to describing how this is done.
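For readers who've never seen imputation in the wild, here's a toy sketch of the general shape: stratify, fill each hole with a plausible donor from the same stratum, repeat M times, and combine the estimates. The strata, incomes, and donor rule below are all invented; the Census procedure is far more elaborate, but the skeleton is the same:

```python
import random
import statistics

random.seed(42)

# Toy data: reported incomes, with None marking nonresponse. The two
# strata stand in for the survey's dimensions (recipiency probability,
# expected income); every number here is invented for illustration.
strata = {
    "low":  [12_000, 15_000, None, 14_000, None],
    "high": [80_000, None, 95_000, 88_000],
}

M = 5  # number of imputations

def impute_once(values):
    """Fill each missing value with a random donor from the same stratum."""
    donors = [v for v in values if v is not None]
    return [v if v is not None else random.choice(donors) for v in values]

estimates = []
for _ in range(M):
    completed = [v for vals in strata.values() for v in impute_once(vals)]
    estimates.append(statistics.mean(completed))

# Combine across the M completed datasets for the point estimate.
print(f"combined mean income estimate: {statistics.mean(estimates):,.0f}")
```

Note what the sketch makes obvious: with nearly 30 percent of the supplement imputed one way or another, a chunk of the headline number is drawn from the donors, not the respondents.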

All surveys must deal with non-responses, so those in the business wouldn't find such a process out of band. For civilians, not so much. As I said, most civilians likely believe that all these numbers are full measures, probably from IRS. Would that it were so. Without the raw data, it's not possible (well, for humble self) to parse out whether the wonderful increase was an artifact of imputation. But it could be.

So, should we guess that the Kenyan President ordered the minions at Census and BLS to put a heavy thumb on the scales? No. Having done data and stats for the government (not public facing, though), I can report there's a good deal of resistance to the corner office dudes telling us what to do. Case in point: Farkas at FDA resigned rather than be party to the data fuck up that was eteplirsen. There's been a good deal of fiddling with the sampling underlying the surveys (yes, more than one for these data), described at length in the background docs. Is all of this fiddling enough to turn nearly a decade of the 1% getting richer and the 99% having kids the other way 'round? Could be. Certainly a question for those with the data to answer.

19 September 2016

Pandering Central

Well, being a biostat just got more difficult. FDA, specifically the Boss in Charge Dr. Janet Woodcock, decided by fiat to approve Sarepta's eteplirsen, branded Exondys. The MoA data presented by the sponsor were not statistically different from 0, in aggregate, and there was no evidence that what dystrophin (the target protein) was produced was clinically meaningful. Woodcock threw not only the outside panel of experts, but also her staff under the bus. Both sets of experts saw eteplirsen for what it is: an (estimated) $400,000 saline solution. You wonder why healthcare costs go nuts? This is one of the main reasons. The DMD parents, who lobbied ceaselessly for approval, win. Or so they think. When the boys die on schedule, they'll be really angry. That's not a good thing. If Exondys made some level of difference for all DMD patients, there might be some sense to this. But Exondys only affects 13% of DMD patients.

Sarepta is on the hook to conduct a confirmatory trial. I wouldn't hold my breath; they'll continue to find excuses, just as they've done so far.

It's a sad day for data.

06 September 2016

Thought For The Day - 6 September 2016

Vegetarian, vegan, gluten-free and such dieters can be tweaked by simply telling them a bit of history. Among the dinosaurs, the herbivores were the big, slow, fat, dumb ones; while the carnivores were the fast, lean, smart ones.

05 September 2016

NoSql? No Mas! No Mas!

First they ignore you, then they laugh at you, then they fight you, then you win.
-- Gandhi

It may be a tad early to gloat, but indications are that the NoSql zealots have waved the white flag and admitted that CAP is silly and doing ACID is way more fun. I suppose they deserve hemorrhoids, too. Couldn't happen to a nicer bunch of folks. A bit of innterTubes searching confirms the occasional tidbit that drifts by: the thought leaders in the NoSql cabal finally admit that transactions and central control over data consistency ain't such an old fashioned idea after all. They've discovered it, and will be patenting it soon. Not that NoSql datastores were any kind of innovation, either. Just VSAM files in ASCII with a buzzy name.

Told ya so.

Here's the main wave from the perpetrator of CAP. At least the principal instigator admits the error. The silly part of the whole episode is that partitions really are rare occurrences; when they do happen, they mostly amount to extended latency. Federated RDBMS, which have been around since about 1990 (the general principles since the mid 80s), have handled the situation. Here's a DB2 tutorial from 2003. The semantics are about the same with other such RDBMS.
As the "CAP Confusion" sidebar explains, the "2 of 3" view is misleading on several fronts. First, because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned.

Fact is, distributed RDBMS (both single- and multi-vendor database) have existed since at least the early 1990s. And it wasn't just casual; here's a paper on security from Mitre (just down the road from Progress, which supported federation) from 1994. While it's no secret that I'm not a big fan of The Zuck,
Facebook uses the opposite strategy: the master copy is always in one location, so a remote user typically has a closer but potentially stale copy. However, when users update their pages, the update goes to the master copy directly as do all the user's reads for a short time, despite higher latency. After 20 seconds, the user's traffic reverts to the closer copy, which by that time should reflect the update.
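The routing rule Brewer describes is simple enough to sketch in a few lines. This is a toy model, not Facebook's actual plumbing; the names, the 20-second window, and the dict-as-datastore are all stand-ins:

```python
# Toy sketch of a read-your-writes window: after a user writes, route
# that user's reads to the master for WINDOW seconds, then fall back to
# the nearby (possibly stale) replica. All names/values invented.

WINDOW = 20.0
master = {}
replica = {}          # replicated asynchronously in real life; empty here
last_write = {}       # per-user timestamp of the most recent write

def write(user, key, value, now):
    master[key] = value
    last_write[user] = now

def read(user, key, now):
    if now - last_write.get(user, float("-inf")) < WINDOW:
        return master.get(key)      # fresh, but higher latency
    return replica.get(key)         # close, but potentially stale

write("u1", "page", "v2", now=0.0)
print(read("u1", "page", now=5.0))    # inside the window -> hits master
print(read("u1", "page", now=30.0))   # window expired -> stale replica
```

In the sketch the replica never catches up, so the late read returns nothing; in production the bet is that 20 seconds is long enough for replication to do its job.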

So, what's the deal?
Another aspect of CAP confusion is the hidden cost of forfeiting consistency, which is the need to know the system's invariants. The subtle beauty of a consistent system is that the invariants tend to hold even when the designer does not know what they are.

Or, as many RM zealots tell us, high NF schemas reveal facts about data relationships we didn't know before. The schema specifies the invariants, but the data reveals the real world correlations.
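The "schema specifies the invariants" point can be made concrete with a throwaway SQLite session. The tables and columns below are invented for illustration; the point is that a declared constraint holds no matter which application writes the rows, with no designer in the loop:

```python
import sqlite3

# Minimal sketch: the schema declares the invariants (non-negative
# balance, referential integrity), and the engine enforces them.
# Table and column names are made up for this example.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("""CREATE TABLE account (
    id INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0))""")
con.execute("""CREATE TABLE txn (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES account(id),
    amount INTEGER NOT NULL)""")

con.execute("INSERT INTO account VALUES (1, 100)")
try:
    # A transaction against a nonexistent account: the schema says no.
    con.execute("INSERT INTO txn VALUES (1, 999, 50)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

No application code ever checked for account 999; the invariant held anyway. That's the subtle beauty Brewer is pointing at.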

Later in the piece, Brewer goes off the deep end:
The essential ATM operations are deposit, withdraw, and check balance. The key invariant is that the balance should be zero or higher. Because only withdraw can violate the invariant, it will need special treatment, but the other two operations can always execute.

This is the Pollyanna view of how banks run ATMs, and transactions generally. His description, and what is assumed by most civilians, is that Your Bank updates Your Account in real time, whether at an ATM or human teller. Not true. Accounts are reconciled (sometimes so as to generate overdrafts!) in batch at some time EOD. Much of the big money made on bank hacking happens because the perps know that they have hours to do the deed before the accounts used are reconciled. Sometimes the intermediate accounts never see the deed. COBOL cowboys much prefer batch. They've been doing things that way for six decades. BASE has been the default paradigm in banking since forever.
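Here's a toy version of that batch world, to contrast with Brewer's check-at-the-terminal story. Accounts, amounts, and the reconciliation rule are all invented; it's BASE in miniature, not any bank's actual system:

```python
from collections import defaultdict

# Toy model: ATMs accept and journal withdrawals with no balance check;
# the invariant (balance >= 0) is only examined at end-of-day batch.
balances = {"alice": 100}
journal = []

def atm_withdraw(account, amount):
    journal.append((account, -amount))   # no check at the terminal

atm_withdraw("alice", 80)
atm_withdraw("alice", 80)   # a second ATM, hours before reconciliation

def reconcile_eod():
    """End-of-day batch: apply the journal, flag overdrafts after the fact."""
    totals = defaultdict(int)
    for account, delta in journal:
        totals[account] += delta
    overdrawn = []
    for account, delta in totals.items():
        balances[account] += delta
        if balances[account] < 0:
            overdrawn.append(account)   # invariant violated, noticed late
    return overdrawn

print(reconcile_eod())   # the overdraft surfaces only in batch
print(balances)
```

The invariant Brewer calls "key" was violated at lunchtime and nobody noticed until the nightly run, which then cheerfully bills the overdraft fee. That gap between the deed and the reconciliation is exactly the window the bank hackers exploit.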

So, with so many cpu cycles, SSD, XPoint, NVRAM, bandwidth, and the like available, why would anyone drop OLTP/ACID on purpose? Back in the thrilling days of yesteryear, when the 360 and 2311 DASD ruled the world, maybe there was no other choice. Times, they are a-changin'.

[For those that keep track of such things, this musing and its title were started before I saw the adverts for the new Roberto Duran movie.]

04 September 2016

Physicists Aren't From Mars

Was toddling back from the grocery this morning, and the wind from Hermine made its way to South Butt Fuck. For reasons unknown, that reminded me of "The Martian", which reminded me that the most vocal complaint about the movie (and, I guess, the book which I've not read) was the initial premise, that a wind storm on Mars would be powerful enough to endanger the MAV.

But, to me anyway, that wasn't the dumbest McGuffin in the movie. That was the Rich Purnell Maneuver, whereby the Hermes mothercraft is slingshot back to Mars to collect Watney. The story takes place in 2035 and that's important.
The Mariner 10 probe was the first spacecraft to use the gravitational slingshot effect to reach another planet, passing by Venus on February 5, 1974, on its way to becoming the first spacecraft to explore Mercury.
-- Wikipedia

The movie shows Purnell sitting in a cold room with hundreds of servers connected to his laptop, ostensibly to do the harder than hard calculations. Give me a break. NASA had a 60-year Rip Van Winkle moment?