31 May 2018

A Hop, a Skip, and a Jump

Free at last, Free at last, Thank God almighty we are free at last.
-- Martin Luther King

OK. So what does such a quote have to do with a site, titularly at least, devoted to the RM, stats, and data generally? Well, this new report on Intel's Optane DIMM.

[On a side note, the earlier piece on Optane SSD has a comment from the author to the effect that the byte addressability of Optane is of no value for files. That's true for fixed-sector non-mainframe HDD, and later SSD, but not true for CKD DASD:
IBM CKD subsystems initially operated synchronously with the system channel and can process information in the gaps between the various fields, thereby achieving higher performance by avoiding the redundant transfer of information to the host. Both synchronous and asynchronous operations are supported on later subsystems.
]

Current SQL engines do their work with a hop, a skip, and a jump. Optane, with the proper motherboard, chipsets, and (heavily?) re-written applications, can go straight from CPU to datastore. We could, then, do what Dr. Codd said: "all at once". Yum.

The current paradigm is from CPU registers to on-chip cache to memory to SSD/HDD; more or less a hop, a skip, and a jump. Now, for SQL databases (or any serious datastore), what generally happens is that the engine writes to the trans log (more or less synchronously), which is at some later time flushed to the "permanent" datastore on SSD/HDD. A system/application crash is supposed to lose only open transactions; all committed trans exist in durable storage, either in the trans log or in named tables. Wouldn't it be cool if transactions committed to durable storage immediately?
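
To make the log-then-flush dance concrete, here is a minimal sketch in Python. It is a toy, not how any real engine is written: the class `TinyEngine`, its method names, and the JSON file layout are all made up for illustration, and an ordinary temp directory stands in for the SSD/HDD.

```python
import json
import os
import tempfile


class TinyEngine:
    """Toy engine (hypothetical): synchronous log append, deferred table flush."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "trans.log")
        self.table_path = os.path.join(directory, "table.json")
        self.cache = {}  # in-memory buffer of committed rows

    def commit(self, key, value):
        # Make the change durable in the log before acknowledging the commit.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # The row itself only changes in memory for now.
        self.cache[key] = value

    def checkpoint(self):
        # At some later time, write the buffered rows to the "permanent" table.
        with open(self.table_path, "w", encoding="utf-8") as table:
            json.dump(self.cache, table)
            table.flush()
            os.fsync(table.fileno())

    def recover(self):
        # After a crash, committed rows are rebuilt from the table plus the log.
        rows = {}
        if os.path.exists(self.table_path):
            with open(self.table_path, encoding="utf-8") as table:
                rows = json.load(table)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    rows[record["key"]] = record["value"]
        return rows


workdir = tempfile.mkdtemp()
engine = TinyEngine(workdir)
engine.commit("acct-1", 100)
engine.checkpoint()
engine.commit("acct-2", 250)  # committed, but never checkpointed

# Simulate a crash: a fresh engine sees only what reached durable storage.
recovered = TinyEngine(workdir).recover()
print(recovered)
```

The second commit never reaches the table file, yet it survives the simulated crash because the log was synced at commit time. That is the two-step durability that byte-addressable persistent memory might collapse into one.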

As the piece says:
Intel has been laying the groundwork for application-level persistent memory support for years through their open-source Persistent Memory Development Kit (PMDK) project, known until recently as NVM Library. This project implements the SNIA NVM Programming Model, an industry standard for the abstract interface between applications and operating systems that provide access to persistent memory. The PMDK project currently includes libraries to support several usage models, such as a transactional object store or log storage. These libraries build on top of existing DAX capabilities in Windows and Linux for direct memory-mapped access to files residing on persistent memory devices.
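
For a feel of what "direct memory-mapped access to files" looks like from application code, here is a rough sketch using Python's standard `mmap` module. It does not use PMDK at all. An ordinary temp file stands in for a file on a DAX-mounted persistent memory filesystem, and the record layout (`RECORD`) and region size are invented for the example.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<qq")  # hypothetical record layout: (row id, balance)
REGION_SIZE = 4096             # hypothetical size of the mapped region

# An ordinary temp file stands in for a file on a DAX-mounted pmem filesystem.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, REGION_SIZE)

region = mmap.mmap(fd, REGION_SIZE)
# Update the record in place with a plain memory store: no write() call.
RECORD.pack_into(region, 0, 42, 1000)
# Ask the OS to push the modified page out to the backing file.
region.flush()
region.close()
os.close(fd)

# Reopen the file and read the record back through a fresh mapping.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), REGION_SIZE, access=mmap.ACCESS_READ) as view:
        row_id, balance = RECORD.unpack_from(view, 0)
os.remove(path)
print(row_id, balance)
```

On a regular disk the `flush()` still goes through the page cache and the block layer. The promise of DAX on persistent memory, as the quoted piece describes it, is that the mapping points at the storage itself.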

Application re-write may be non-trivial:
Intel will be offering remote access to systems equipped with Optane DC Persistent Memory so that developers can prepare their software to make full use of the new memory. Intel is currently taking applications for access to this program. The preview systems will feature 192GB of DRAM and 1TB of Optane Persistent Memory, plus SATA and NVMe SSDs. The preview program will run from June through August. Participants will be required to keep their findings secret until Intel gives permission for publication.

But for the RM world, Optane offers the ability to do what Dr. Codd said, "all at once".

One might ask, in a semantic sense, how a SQL engine on such a machine might do its work. Let's chew on that.

First, what about the divide between trans log and named tables? Do we still need a trans log? The blockchain example, which is a database-less trans log, suggests we do. It is needed for audit purposes, too. Does the trans log still need to be the bottleneck to the named tables? Not with such Optane storage. The engine would just do immediate writes to the log and the table(s); that's the duration of the transaction.

Second, do we still need buffers? Maybe, but maybe not. The purpose of buffers is to mediate between fast memory and slow disk (now SSD, but still slower). According to the report, 512GB DIMMs will be available. Current boards go to eight slots, which works out to 4TB (8 x 512GB). How many applications need more than that? Google and Amazon and Facebook. Commercial applications? Not so much. Taking into account the data reduction side-effect of 3rd, 4th, or 5th normal form, maybe most would be happy with that much storage.

Third, typical applications don't keep all data in hot storage even now, so 4TB looks to cover many, if not most, use cases.

Fourth, some DIMMs would be normal DRAM to hold the OS, application code, and the code's invariant data; according to the report and its comment stream, this appears to be supported. It would certainly make sense to do this.

So, how would a SQL engine work with such a machine? The engine and its data would reside in DRAM, while the tables and (active) log would live in Optane. By active log, we mean that the engine would flush completed transactions to SSD as needed; in reality, only long-running transactions (you don't write such monsters, do you?) would be in the active log. There would be no need for memory buffers, but the notion of locks needs to be considered. Current practice, whether locking or MVCC, relies on memory buffers to "improve" performance. A simple, single-row update would be written directly to table and log. Normal memory locking would suffice. Multi-row/table update? Umm. MVCC depends on multiple images of rows to segregate transactions, but would there still be any point to MVCC semantics with Optane storage? For now, I'd say not. Since updates (could/should) happen "all at once", collisions don't occur; each client sees the current state of each row all the time, modulo the duration of the actual write. Isolation amounts to memory locks. Cool.
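
As a back-of-the-envelope illustration of "isolation amounts to memory locks", here is a toy sketch in Python. Everything in it is hypothetical: `DirectStore`, `transact`, and `move_one` are invented names, a plain dict and list stand in for tables and log held in persistent memory, and a single global lock is the crudest possible choice, not a claim about how a real engine would do it.

```python
import threading


class DirectStore:
    """Toy store (hypothetical): log and table written together under one lock."""

    def __init__(self):
        self.table = {}  # stands in for named tables in persistent memory
        self.log = []    # stands in for the active log in persistent memory
        self._lock = threading.Lock()

    def transact(self, compute_changes):
        # The transaction lasts exactly as long as the lock is held:
        # read current rows, then write the log entry and the table together.
        with self._lock:
            changes = compute_changes(self.table)
            self.log.append(changes)
            self.table.update(changes)

    def read(self, key):
        with self._lock:
            return self.table.get(key)


def move_one(table):
    # A multi-row update: move one unit from row "a" to row "b".
    return {"a": table["a"] - 1, "b": table["b"] + 1}


def worker(store, times):
    for _ in range(times):
        store.transact(move_one)


store = DirectStore()
store.transact(lambda table: {"a": 500, "b": 500})

threads = [threading.Thread(target=worker, args=(store, 100)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store.read("a"), store.read("b"), len(store.log))
```

Four clients hammer the same two rows with no row versions and no buffer pool, and the totals still balance. Whether one lock (or many finer ones) scales on real hardware is exactly the open question.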

Years ago, the diligent reader will remember, I had occasion to work with TI-990 machines. One ran a Ryan McFarland COBOL application on a TI machine, while the other ran the chip version on a custom board/OS/language (ah, those were the days). What was (nearly?) unique about the 990 architecture was that it was registerless, sort of. All code and data resided in memory, and there were three CPU registers, one of which was a pointer to the current application's base address in memory. A context switch required only resetting these registers. Each application retained its context in situ. I wonder whether TI will resurrect it?
The TI-990 had a unique concept that registers are stored in memory and are referred to through a hard register called the Workspace Pointer. The concept behind the workspace is that main memory was based on the new semiconductor RAM chips that TI had developed and ran at the same speed as the CPU. This meant that it didn't matter if the "registers" were real registers in the CPU or represented in memory. When the Workspace Pointer is loaded with a memory address, that address is the origin of the "registers".
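
The workspace idea is easy to mimic in a few lines of Python. This is a simplified model, not an emulator: the class `Tiny990` and its methods are invented for the example, it assumes 16 word-sized registers laid out at consecutive word addresses starting at the Workspace Pointer, and it ignores whatever bookkeeping the real context-switch instructions perform.

```python
class Tiny990:
    """Toy model (hypothetical): workspace registers live in main memory."""

    def __init__(self, memory_bytes=256):
        self.memory = [0] * (memory_bytes // 2)  # 16-bit words
        self.wp = 0  # Workspace Pointer: byte address of register 0
        self.pc = 0  # program counter
        self.st = 0  # status register

    def _word_index(self, n):
        if not 0 <= n < 16:
            raise ValueError("workspace register must be 0..15")
        # Register n sits n words past the Workspace Pointer.
        return (self.wp + 2 * n) // 2

    def get_reg(self, n):
        return self.memory[self._word_index(n)]

    def set_reg(self, n, value):
        self.memory[self._word_index(n)] = value & 0xFFFF

    def switch_context(self, new_wp, new_pc):
        # Only the hard registers change; every task's "registers"
        # stay where they are in memory.
        old = (self.wp, self.pc, self.st)
        self.wp, self.pc = new_wp, new_pc
        return old


cpu = Tiny990()
cpu.wp = 0x20
cpu.set_reg(1, 111)  # task A's R1

saved_wp, saved_pc, _ = cpu.switch_context(new_wp=0x40, new_pc=0x100)
cpu.set_reg(1, 222)  # task B's R1 lands in different memory

cpu.switch_context(saved_wp, saved_pc)
task_a_r1 = cpu.get_reg(1)
print(task_a_r1)
```

Nothing is copied on the switch; task A's register comes back untouched because it never left memory. Swap "semiconductor RAM" for "persistent memory" and the appeal for a datastore engine is easy to see.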

Whether Optane has much value-add for most applications, I don't know. But for heavy multi-user/single datastore applications such as SQL engines, yeah man!!

This, of course, is merely a first-pass thought experiment. More to come.
