31 May 2012
If you haven't already, you need to get a copy of the R Inferno, and you need to read the latest presentation from Pat Burns.
If for no other reason than these two slides:
1984
This was the year that a mediocre actor who already had Alzheimer's would be re-elected to preside over a big pile of nuclear weapons. A very Strangelovian event.
And, not the least reason, he states, right there on page 25, in words similar to mine: "The second driver of problems with R is that it is both a programming language and an interactive language." The way I phrase it: R is both command language and programming language, but both amount to the same thing. Ah, camaraderie.
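To make the point concrete, a minimal sketch in base R (stock mtcars data; the function name is mine): the same expression serves as a one-off command at the prompt and as a building block in a program.

```r
# Interactive use: type it, read the answer, move on.
summary(lm(mpg ~ wt, data = mtcars))

# Programming use: the very same expression, wrapped for reuse.
fit_mpg <- function(df) summary(lm(mpg ~ wt, data = df))
fit_mpg(mtcars)
```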
And, on page 34, he sets any arrhythmic heart ("THERE'S so much STUFF!!!") back to normal: "If you think you can learn all of R, you are wrong. For the foreseeable future you will not even be able to keep up with the new additions."
And, one last quote: "It should, and hopefully will, replace lots of the data analysis that is currently done in spreadsheets." Boy howdy!
23 May 2012
Picture <- Thousand Words
Way back when, I worked as a Fed, initially and for most of the time for McElhone, who was a math stat from Iowa State ("Yes, sir. Pigs and corn."). This from 1975 onward. Early on, he decided that we (the Office of Analytic Methods) should have our own computer. There wasn't room in our office for a 360 or even a PDP-10. When you want to buy something, you have to have purchasing authority for the type of good or service, the cost has to fit your budget, and the specific good or service has to be on a GSA schedule.
Since this was pioneering days for microprocessors, and GSA wasn't necessarily a geek fest, McElhone found that the Tektronix 4051 was listed on the calculator schedule, not the computer one. "Boy howdy", to quote the man. He had authority to buy calculators (but not Real Computers, of course), since statisticians used calculators a lot for small projects (HP and TI still sell stat calculators); the kind of projects that didn't demand BMDP on the mainframe. And it was within budget.
It soon arrived, and later we got disk storage. Well, it was a Shugart (not yet morphed into Seagate, sort of; it's a long story) 8" floppy drive. The box was about 3 feet by 2 feet by 6 inches, sat on a movers' dolly, and weighed a lot.
I am catapulted down fading memory lane because the 4051 was a graphics workstation. The tube was a storage tube, so it didn't flicker, but erasing anything meant erasing the whole screen. We digitized the world. We studied employment discrimination ("Yes, Mr. Secretary, we need to know what the baseline level of acceptable discrimination is."). Today, Rob Hyndman announced a new edition of his forecasting book. This is the quote that jogged the memory:
"We emphasize graphical methods more than most forecasters. We use graphs to explore the data, analyse the validity of the models fitted and present the forecasting results."
Deja vu all over again.
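In that spirit, the 4051 workflow of explore, fit, and eyeball boils down to a few lines of R these days. A minimal sketch, using Hyndman's own forecast package and the stock AirPassengers series:

```r
# Fit a model to a monthly series, then judge it the graphical way:
# data, fitted model, and a 24-month forecast fan on one plot.
library(forecast)                 # Hyndman's forecasting package
fit <- auto.arima(AirPassengers)  # automatic ARIMA model selection
plot(forecast(fit, h = 24))
```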
22 May 2012
Diseconomies of Scale
Is the cloud a good idea, or a bad idea? The case for "bad idea", from the point of view of both clients and providers, has been made here before, but based only on experience and logic. Now, I'm kind of a fan of Seeking Alpha, mostly for comic relief. Most posts there are either blatant pumps or dumps. Every now and again, I'll find one that's data based and rational. Here's one on AWS; as the author points out, though, it's true for any cloud provider.
If the cloud were the greatest thing since sliced bread, IBM would still have, and be making tons of cash from, its Service Bureau.
16 May 2012
Cheap at Twice The Price [update]
[update] I transposed the tape capacity from the article. Now fixed. And I'll take the opportunity to admit there's a bit of apple/orange here: there are records of IBM prices from the 360 days, while prices of qualified SSD/HDD today aren't publicly published, so prosumer parts are quoted.
There's a new paradigm out there, first (in my memory) manifest by Sun with the F5100 "pure flash" storage (not, repeat not, SSD) about three years ago. EMC has bought up an outfit named XtremIO, which is "pre-revenue". To the extent that a controller can skip implementing filesystem protocols, it will be faster (Fusion-io, for example), and might even be a tad cheaper.
What's been obvious for a while is that the wholesale replacement of HDD by SSD, as asserted by zealots a couple of years ago, isn't going to happen; the price gap isn't bridgeable. Or is it? The transition from tape drives to DASD (as IBM named it, at least since the 360) also faced a price mountain, or cliff, depending on how you look at it.
Here's a random review of the price differential. It claims a 12 times ratio. Let's go to Amazon and do a quick check.
Intel 320/300GB -- $505
Seagate 300GB/10K -- $240
Seagate 300GB/15K -- $200
Hmmm. The Seagate prices are all over the place, but in the general vicinity. In any case, not 12 times, modulo specific drives' quality. And there are the TB cheapies, but we'll ignore them for now.
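A quick sanity check on those street prices, taking the numbers as listed:

```r
# $/GB from the Amazon spot check above.
ssd_per_gb <- 505 / 300    # Intel 320, 300GB
hdd_per_gb <- 240 / 300    # Seagate 10K, 300GB
ssd_per_gb / hdd_per_gb    # ~2.1x, a far cry from 12x
```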
So, back in 1964, what was the price difference between a 9 track tape drive and 2311 DASD? Amazingly, there's a list. The 2311 was $25,510. You also needed disk packs, as the 2311 was removable. A pack stored 7.25MB. On the tape side, I could find information on the 2420 at 1600bpi. Cost: $54,600, plus another $55,400 for the control unit, which supported eight drives (we'll take 1/8 per drive, or $6,925). A 2400 ft tape, at 1600bpi, could accommodate 40MB.
Assuming packs and tape cost nothing (packs cost more), this is what we get:
DASD -- $3,518/MB
Tape -- $1,538/MB
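The arithmetic behind those figures, for the record:

```r
# 1964 $/MB, using the prices and capacities above.
dasd <- 25510 / 7.25               # 2311 drive over one pack's capacity
tape <- (54600 + 55400 / 8) / 40   # 2420 plus 1/8 of its control unit, over one tape
dasd          # ~3518
tape          # ~1538
dasd / tape   # ~2.3x
```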
The difference is on the order of today's common sense notion of SSD vs. HDD. In time, DASD won out, although (sadly, sniff) largely running sequential file formats. The tape/DASD capacity ratio is sort of where we sit now with HDD/SSD.
So, Mr. Peabody muses that we've been here before, and new and better won. I feel so much better.
13 May 2012
Meet Forrest Gump
How often is one left speechless? At a loss for words? Well, today's NY Times has a story that did so. Back? OK, what would you think of a physician who was surprised that antibiotics or surgery works? Well, that's the level displayed by this knucklehead, described as an actuary and math professor. And I quote: "We started with simple calculations and moved on to more involved ones. To me, the results were astounding: statistical sampling worked." Yeah, right! First semester sociology undergrads taking their watered down baby stat course might be astounded. Further evidence that quants in financial services may be more the problem than the solution.
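For what it's worth, the "astounding" result is a few lines of R. A sketch with simulated claim amounts, since the real data isn't at hand:

```r
# Statistical sampling "works": a small random sample recovers the
# population mean. Claim amounts are simulated, for illustration only.
set.seed(1)
population <- rexp(1e6, rate = 1/500)   # a million synthetic policy claims
mean(population)                        # ~500, the full-table answer
mean(sample(population, 1000))          # close, from 0.1% of the rows
```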
So, then we read this: "As far as I knew, no one had proposed this model." There's not enough detail to know what that model is, but the fact is that sampling from relational database engines has been around for a long time. It's not as straightforward as pulling balls from an urn, of course. Rows may, or may not, be stored in some key order. They may be stored in insertion order. If stored on SSD, in particular, they'll be scattered thither and yon on the silicon, which actually makes the process more efficient. Moreover, on what attribute(s) should randomness be enforced?
To make matters worse, he states: "I believed I had a solution to this cumbersome and costly process: create subgroups from the database, sample policies from each, repeat the process several times, then combine the results." If that's not stratified random sampling (prior art up an elephant's butt), I'll eat my hat. And yeah, sampling has been done in RDBMS for a very long time. Here's a SQL Server 2000 version. And here's a very long thread on sampling from 2005. Moreover, TABLESAMPLE has been a SQL standard for about a decade, although not implemented by all engines for that long. The notion that one can patent sampling will cause Snedecor to rotate in his grave. Please!
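What he describes is, step for step, textbook stratified sampling. A minimal R sketch; the policy table and its columns are made up:

```r
# Stratified random sampling: split on a stratum attribute, sample
# within each stratum, combine. Table and column names are invented.
set.seed(42)
policies <- data.frame(
  id    = 1:10000,
  line  = sample(c("auto", "home", "life"), 10000, replace = TRUE),
  claim = rexp(10000, rate = 1/500)
)
strata  <- split(policies, policies$line)
sampled <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 100), ]))
tapply(sampled$claim, sampled$line, mean)   # per-stratum estimates
```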
Whether there really is anything unique, and therefore patentable, here is impossible to say based on the text. While there remains a good deal of grey, algorithms aren't, and shouldn't be, patentable. One man's considered opinion. Given that Apple got patent protection for a rectangle (admittedly, in Germany), my guess is the patent will issue, and that's too bad. They may actually lose.
(My taste in allusive puns as titles may have stretched as thin as a spider's thread this time, so: Gump made the remark about a box of chocolates (I've never watched the film), and at that time in the USofA, if one was of the lower-middle class, then that box was most likely a "Whitman's Sampler". Mea Culpa.)
10 May 2012
SS Disc Hits Iceberg; All Hands Lost [update]
What's been happening to the SS Disc on its voyage to the New World? Well, the three better known pure SSD vendors (Fusion-io, STEC, and OCZ) have wet farted the bed with their quarterly reports in the last couple of weeks. Today (Thursday, 10 May), EMC is reported to have bought a private flash appliance builder. While the tribulations of the SSD vendors are not a good thing for the RDBMS/SSD insurgency, EMC making a statement about flash likely is.
[UPDATE]
Found my way here, and left a comment, of course. At some point folks will understand that de-dupe is post-hoc/ad-hoc normalization. One can only pray to Codd.
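A toy R illustration of that point (data invented): de-dupe factors repeated values out into a lookup table, which is just what a normalized schema does up front.

```r
# De-dupe as after-the-fact normalization: repeated strings get factored
# out into a lookup table keyed by an id.
orders <- data.frame(order_id = 1:6,
                     customer = c("acme", "acme", "bolt", "acme", "bolt", "bolt"))
customers <- data.frame(cust_id = seq_along(unique(orders$customer)),
                        name    = unique(orders$customer))
orders$cust_id  <- match(orders$customer, customers$name)
orders$customer <- NULL   # the duplicated strings now live once, in customers
```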
03 May 2012
A Chain's Weakest Link
LinkedIn is a fair-haired child these days. Some view it as the up-and-comer. It reports later today after Mr. Market goes home, after a stop at the Bull and Bear Pub, of course. I'm not so sure.
I've been Linked In for some months, perhaps a year. In the last couple of months, most of the groups have become spam jars. Today's latest and greatest affront: an ad for tattoos in Database Developers and Architects. Beware the slippery slope.
01 May 2012
Watch More TeeVee
Reminder: Part 3/4 of "Money, Power, and Wall Street" is tonight at 9 PM, most places. Set your teevee timer.