21 August 2020

One Classy Dame - part the sixth

There are two issues/problems, and one contra, with using PMEM/Optane to the fullest extent:
1 - is it a net efficiency win to update in place from cpu cache lines to Optane, rather than going through all the caches, buffers, and the usual memory hierarchy used with conventional HDD/SSD machines?
2 - is it possible to update in place with Optane and implement ACID conditions using only memory latches, aka no traditional locks?
3 - (the contra) after all, RDBMS engines already update their buffers in place, so why not PMEM?

The notion, mentioned here in terms of AS/400 (actually first released years earlier on the S/38) single level storage (with a database object, rather than a file protocol) and TI-990 memory-to-memory execution, is that RDBMS (or any transactional application, for that matter) would be a major beneficiary of persistent memory if it could be written to while skipping past all those caches, buffers, and non-key data that exist in the standard compute architecture. One might envision a new cpu architecture designed specifically for persistent memory, rather than volatile memory + durable storage. But that's well into the future, if ever. What can be done today with Optane and the X86 ISA? Turns out, some pretty neat stupid pet tricks.
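For the curious, here's roughly what that x86 path looks like today, as a minimal sketch rather than anything lifted from Intel's docs: an ordinary store into a DAX-mapped file, then CLWB to push the dirty cache line(s) toward the media and SFENCE to order it. The mount point and file name are invented, and it assumes an fsdax filesystem backed by Optane and a CPU with CLWB support (compile with -mclwb).

```c
/* Minimal sketch: store to a DAX-mapped Optane region, then CLWB + SFENCE.
 * Path and sizes are illustrative only. */
#include <fcntl.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PMEM_LEN 4096

int main(void)
{
    /* /mnt/pmem is assumed to be an fsdax (DAX) filesystem on Optane. */
    int fd = open("/mnt/pmem/demo", O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, PMEM_LEN) != 0)
        return 1;

    /* With DAX, MAP_SHARED stores go at the media; no page cache in between. */
    char *pmem = mmap(NULL, PMEM_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED)
        return 1;

    /* Ordinary store into what is now byte-addressable persistent memory. */
    strcpy(pmem, "hello, persistence");

    /* Flush the touched cache line(s) and fence: once the SFENCE retires,
     * the data is in the persistence domain. */
    for (uintptr_t p = (uintptr_t)pmem & ~63UL;
         p < (uintptr_t)pmem + strlen(pmem) + 1; p += 64)
        _mm_clwb((void *)p);
    _mm_sfence();

    munmap(pmem, PMEM_LEN);
    close(fd);
    return 0;
}
```

That's the whole trick the rest of this piece leans on: the store itself is the I/O.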

To wit, I found a paper Intel released last fall that takes a detailed look at what can be done. In a previous essay, One Classy Dame - part the second, the Oracle implementation (of unknown numbers in the wild) was discussed.

The Performance section of this Intel paper has this caveat:
Due to persistent memory's low latencies—as compared against traditional HDDs and SSDs— and the lack of a queue of requests in the device (see Figure 1), only synchronous I/O is recommended. Doing I/O in an asynchronous fashion adds extra software overhead and, as we have already discussed above and will see in the performance numbers shown below, persistent memory shifts the I/O bottleneck from hardware to software. If your application can't wait for an I/O operation to complete even at low latencies, then asynchronous I/O makes sense. In that case, however, you may need to reevaluate if persistent memory is the right technology for your I/O needs as you will lose the media speed advantage.
[my emphasis]

In other words (mine): if you write small, fast transactions (in the pre-persistent memory world), then you'll likely get into warp speed with Optane and App Direct. On the other hand, if you write 'long lived' transactions, not so much. Moreover, in most multi-user/application databases, simultaneous users don't generally update the same rows and columns; the exception is 'batch' transactions, and even those are generally run in some kind of pseudo-single user mode. With more granular 'locking', the opportunities for transaction collision diminish.

That said, synchronous writes are not incompatible with ACID, of course. Reading up on Weikum and Vossen, you'll see that transaction integrity rests on whether a history of concurrent updates can be transformed into a serializable (that is, serial-equivalent) history. The notion, after all, is that these are set operations, and each appears to the user to be 'all at once' and time ordered amongst themselves. With persistent memory, hardware gets most of the way there. RDBMS have to engineer their way around asynchronous writing with locks, latches, and buffer managers, especially when running at coarser than row-level locking (or with the massive copying of MVCC). The specifics of how Oracle and SAP have used Optane are sketchy at best.
Intel has devised adaptations for specific software such as SAP, but a general method for adapting application software would be helpful. And this is where Formulus Black, a venture-backed in-memory computing vendor, comes in.

This paper from SAP provides a bit of detail.
For example, Intel and SAP engineers demonstrated how a server, equipped with DRAM only and the SAP HANA 2.0 SPS 03 platform with 6 terabytes of data, required 50 minutes to start, including data loading. By contrast, a server equipped with a combination of Intel Optane DC persistent memory and DRAM reduced the start time of the SAP HANA 2.0 SPS 03 platform by 12.5x to only 4 minutes.

Not the most useful metric, but rather concrete. Of more note:
App Direct Mode is the only mode that is currently supported by SAP HANA 2.0 SPS 03. In App Direct Mode, Intel Optane DC persistent memory modules appear as byte-addressable memory resources that are controlled by SAP HANA 2.0 SPS 03. In this mode, the persistent memory space is controlled directly by SAP HANA.
...
Note that SAP HANA uses persistent memory for all data that resides in the column data store.
[my emphasis]

In other words, code is in DRAM while data is in Optane, and may be updated in place.
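As a rough illustration of that shape (a sketch using PMDK's libpmem, not anything SAP has published), here's a 'column' mapped from a DAX file as byte-addressable memory and updated in place, persisting only the touched bytes. The path and column layout are made up; link with -lpmem.

```c
/* Hedged sketch of the App Direct shape: column data in a byte-addressable
 * persistent mapping, updated in place, code and scratch state in DRAM. */
#include <libpmem.h>
#include <stdint.h>
#include <stdio.h>

#define NROWS 1024

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map (creating if needed) a file on a DAX filesystem as one flat column. */
    int64_t *price_col = pmem_map_file("/mnt/pmem/price_column",
                                       NROWS * sizeof(int64_t),
                                       PMEM_FILE_CREATE, 0666,
                                       &mapped_len, &is_pmem);
    if (price_col == NULL)
        return 1;

    /* Update-in-place: an ordinary store, then persist just the touched bytes.
     * pmem_persist() does the flush-and-fence dance internally. */
    price_col[42] = 1995;
    pmem_persist(&price_col[42], sizeof(price_col[42]));

    printf("row 42 now %ld, is_pmem=%d\n", (long)price_col[42], is_pmem);
    pmem_unmap(price_col, mapped_len);
    return 0;
}
```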

Nor have I found any report laying out why direct writing from cpu cache lines to persistent memory can't be done. From the Oracle documents and the SAP report just above, that's what appears to be going on. It would seem that, from a transaction point of view, it is only necessary that there be a start-log-entry record, then the data, then an end-log-entry record, all of it living in a log that also sits in persistent memory (and is not some sequential write store, either). That log record would carry the engine's unique row identifier (never visible to the user, btw), a timestamp/transID, column identifiers, and new values for said columns. Then, for an update: latch the 'old' data, which might be just the modified columns, change the column values in place to the 'new' data, release the latch on the log record, and release the latch on the 'old', now 'new', data. For an Insert, once committed, simply adding a pointer to what was originally the 'log' record of the Insert should suffice. Log records for committed data are not needed.

Now, what kind of logging is this (noting that logging, in the minds of many, is still conceived in sequential terms, since that was the fastest write to HDD)? In this scenario, a log record can be anywhere in the persistent memory patch, carrying only the flags necessary to mark it as a log record. Recall that RDBMS logging is solely for integrity, not application history, so once the 'new' data is committed/durable, the log image is of no use. Now, I'll suppose there are myriad applications written on RDBMS that ignore the mantra of 'small transactions done really fast' and thus depend on large logs generated from 'long lived' transactions for crash recovery. Ain't no help for those dumb as a sack of hair.
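To make that sequence concrete, here is a minimal, single-column sketch of it: write and persist the log record, latch, change the value in place, persist the touched bytes, then mark the record committed. Everything in it is invented for illustration (the record layout, the state flags, the spinlock latch, the pool path), it leans on PMDK's libpmem for the flush/fence work, and it waves away the hard parts such as torn writes and the recovery scan.

```c
/* Sketch of latch-only, update-in-place with a log record that can live
 * anywhere in the persistent pool. Not a recovery-proof design. */
#include <libpmem.h>
#include <stdatomic.h>
#include <stdint.h>

enum { LOG_FREE = 0, LOG_ACTIVE = 1, LOG_COMMITTED = 2 };

struct log_rec {              /* sits in persistent memory, any address */
    uint64_t state;           /* LOG_ACTIVE while in flight, LOG_COMMITTED after */
    uint64_t trans_id;        /* timestamp / transaction identifier */
    uint64_t row_id;          /* engine's internal row identifier */
    uint32_t col_id;          /* which column is being changed */
    int64_t  new_value;       /* the after-image; no before-image kept */
};

struct row {                  /* also in persistent memory */
    atomic_uint latch;        /* 0 = free, 1 = held; a handful of instructions */
    uint64_t row_id;
    int64_t  col[4];
};

static void latch_acquire(atomic_uint *l)
{
    unsigned expected = 0;
    while (!atomic_compare_exchange_weak(l, &expected, 1))
        expected = 0;         /* spin; no lock manager, no queue */
}

static void latch_release(atomic_uint *l)
{
    atomic_store(l, 0);
}

/* One committed single-column update: log record first, then the in-place
 * change, then mark the record committed so recovery can ignore it. */
static void update_column(struct log_rec *rec, struct row *r,
                          uint64_t trans_id, uint32_t col_id, int64_t new_value)
{
    /* 1. Fill and persist the log record ("start" + payload + "end" in one). */
    rec->trans_id  = trans_id;
    rec->row_id    = r->row_id;
    rec->col_id    = col_id;
    rec->new_value = new_value;
    rec->state     = LOG_ACTIVE;
    pmem_persist(rec, sizeof(*rec));

    /* 2. Latch the row (or, more finely, just the touched column), change the
     *    value in place, and persist the touched bytes. */
    latch_acquire(&r->latch);
    r->col[col_id] = new_value;
    pmem_persist(&r->col[col_id], sizeof(r->col[col_id]));
    latch_release(&r->latch);

    /* 3. The new value is durable, so the log image is of no further use. */
    rec->state = LOG_COMMITTED;
    pmem_persist(&rec->state, sizeof(rec->state));
}

int main(void)
{
    size_t len;
    int is_pmem;

    /* Carve one record and one row out of a single mapped pool (path invented). */
    char *pool = pmem_map_file("/mnt/pmem/pool", 4096, PMEM_FILE_CREATE, 0666,
                               &len, &is_pmem);
    if (pool == NULL)
        return 1;

    struct log_rec *rec = (struct log_rec *)pool;
    struct row *r = (struct row *)(pool + 128);
    r->row_id = 7;
    update_column(rec, r, 1001, 2, 42);

    pmem_unmap(pool, len);
    return 0;
}
```

On a crash, recovery would only need to sweep for records still marked active; everything marked committed is dead weight, which is exactly why the log image stops mattering the moment the in-place data is durable.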

Unlike with common row locking (and MVCC), the whole row wouldn't be replicated. Moreover, non-key columns could be updated 'at will' by any writer. Even in these olden days, last writer wins; that would still be true, just faster. There being no locks to contend with, it's possible (I think, don't ask me for details) with persistent memory to latch only the key/unique columns on a row. Now that would speed things up considerably.
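In the same vein, a hedged fragment (mine, not anything from the vendors) of what 'latch only the key columns' might look like: the key keeps a latch like the one in the sketch above, while a non-key column is a single atomic store, persisted after the fact, where concurrent writers simply race and the last writer wins.

```c
/* Fragment: latch guards only the key; non-key columns are last-writer-wins. */
#include <libpmem.h>
#include <stdatomic.h>
#include <stdint.h>

struct row2 {
    atomic_uint key_latch;        /* guards only the key/unique columns */
    uint64_t    key;              /* change only while holding key_latch */
    _Atomic int64_t qty;          /* non-key: no latch, last writer wins */
};

void bump_qty(struct row2 *r, int64_t new_qty)
{
    atomic_store(&r->qty, new_qty);          /* no lock, no latch, just a store */
    pmem_persist(&r->qty, sizeof(r->qty));   /* make whoever won durable */
}
```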

Here is a very old report (aka, not bleeding edge ideas these days) on memory resident RDBMS, from IBM, of all places.
A latch operation typically involves far fewer instructions than a lock operation, as a latch's data structures are statically allocated and directly addressable. In fact, in R* [Williams 82], a distributed relational database system prototype developed at IBM research, a latch and unlatch operation used about 20 CISC (IBM 370) instructions, which was roughly an order of magnitude less than a lock and unlock operation [Yost 92, Gray 89, Lehman 89]. Thus, given a latch's relatively short pathlength, one might dismiss the concern of latches imposing a significant overhead in the overall storage component pathlength.

So, one might expect that such dubious applications wouldn't implement App Direct use of Optane, but would simply use Optane SSD as a faster, well, SSD. That won't buy much compared to re-engineering to use Optane as persistent memory.
