Exadata X8M uses Xeon SP CPUs, Optane DIMMs, and RoCE (RDMA over Converged Ethernet) running over 100 Gbit Ethernet. RoCE enables Oracle's database to access persistent memory directly, bypassing the OS, network, and I/O software stacks.
RDBMS that do their own file management beneath the OS are not new, so Oracle isn't breaking ground with that part. It also means Oracle doesn't have to wait for Linux to support SCM directly, in the sense that SCM isn't just another file.
There remains conflict over how to use SCM: either as a direct data store (row store, in RDBMS terms) or as an 'intermediate' filesystem surrogate, as in this paper:
... replacing hard drives with SCMs often forces either major changes in file systems or suboptimal performance, because the current block-based interface does not deliver enough information to the device to allow it to optimize data management for specific device characteristics such as the out-of-place update.
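To make the dichotomy concrete, here's a minimal sketch in C of the two access modes: treating SCM as a byte-addressable store via mmap, versus funneling the same row through the conventional file/block interface. The DAX mount point and the record layout are my inventions, not anything from the paper.

```c
/* Sketch: two ways to use SCM, assuming a DAX-capable filesystem
 * mounted at /mnt/pmem0 (hypothetical path and record layout). */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE 4096

int main(void) {
    /* Mode 1: SCM as a direct, byte-addressable store. The row lives
     * in persistent memory; a store plus a flush makes it durable. */
    int fd = open("/mnt/pmem0/rows", O_CREAT | O_RDWR, 0644);
    ftruncate(fd, SIZE);
    char *pmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(pmem, "row: (42, 'alice')");   /* update in place, no serialization */
    msync(pmem, SIZE, MS_SYNC);           /* portable stand-in for a CPU cache flush */

    /* Mode 2: SCM as a filesystem surrogate. The same row goes through
     * the block-oriented I/O stack, at whole-page granularity. */
    int fd2 = open("/mnt/pmem0/rows.dat", O_CREAT | O_WRONLY, 0644);
    write(fd2, "row: (42, 'alice')", 18);
    fsync(fd2);

    munmap(pmem, SIZE);
    close(fd);
    close(fd2);
    return 0;
}
```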
As you might expect, I'll vote for something called an 'object' store, or 'persistent buffer store', etc. The thrust being to eliminate all that 'impedance mismatch' that some coding folks like to throw at RM advocates. All industrial strength RDBMS know how to do transactions within their buffers; some manage the subsequent I/O to disk themselves, while others use the OS's filesystem to handle that. But both are doing the translation from 'row object' to 'file'. Why? Of course, because most OSs (AS/400 et seq. possibly excepted) store data as files. It's also worth noting that the original 360, and its successors, were not file oriented in the sense of *nix and its successors. They're based on the CKD protocol, which, if you twist your neck just right, can be viewed as a row store.
It is a self-defining format with each data record represented by a Count Area that identifies the record and provides the number of bytes in an optional Key Area and an optional Data Area. This is in contrast to devices using fixed sector size or a separate format track.
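For the curious, that self-describing record is easy to picture in code. A rough C rendering (field names are mine; the real channel protocol is more involved):

```c
#include <stdint.h>

/* Illustrative layout of a CKD (Count-Key-Data) record. The count
 * area identifies the record and gives the lengths of the optional
 * key and data areas that follow it on the track. */
struct ckd_count_area {
    uint16_t cylinder;    /* CC: cylinder part of the track address */
    uint16_t head;        /* HH: head part of the track address */
    uint8_t  record;      /* R: record number on the track */
    uint8_t  key_length;  /* KL: 0 means no key area */
    uint16_t data_length; /* DL: bytes in the data area */
};

struct ckd_record {
    struct ckd_count_area count; /* self-describing header */
    uint8_t key[255];            /* optional key area, key_length bytes used */
    uint8_t data[];              /* variable-length data area: the 'row' */
};
```

Squint at `data_length` plus a variable-length payload and you can see the row store hiding in a 1964 disk format.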
There is at least one book addressing the question. Here's a snip from the Amazon page:
Existing DBMSs are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of legacy DBMSs are unnecessary and will degrade the performance of data-intensive applications.
In their earlier paper, the authors hit the nail on the head (and confirm the notion that sent my fingers walking through the Yellow Googles in the first place):
Consider a transaction that inserts a tuple into a table. A DBMS first records the tuple's contents in the log, and it later propagates the change to the database. With NVM, a DBMS can employ a logging protocol that avoids this unnecessary data duplication. The reason why NVM enables a better logging protocol than WAL is two-fold. The write throughput of NVM is more than an order of magnitude higher than that of an SSD or HDD. Further, the gap between sequential and random write throughput of NVM is smaller than that in SSD and HDD. Hence, a DBMS can flush changes directly to the database in NVM during regular transaction processing [15, 14, 12, 64, 40, 62, 80].
[links active in the cite]
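A sketch of what that buys you, using PMDK's libpmem (the tuple layout and pool path are my inventions): instead of appending a WAL record to one file and later writing the dirty page to another, the engine installs the tuple in persistent memory and flushes it once.

```c
/* Sketch of flushing a tuple straight to NVM with PMDK's libpmem,
 * instead of the WAL-then-page double write. Tuple layout and path
 * are hypothetical. Build with: cc insert.c -lpmem */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

struct tuple {
    long id;
    char name[56];
};

int main(void) {
    size_t mapped_len;
    int is_pmem;
    struct tuple *table = pmem_map_file("/mnt/pmem0/table", 1 << 20,
                                        PMEM_FILE_CREATE, 0644,
                                        &mapped_len, &is_pmem);
    if (table == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* "Insert": write the tuple in place... */
    table[0].id = 42;
    strcpy(table[0].name, "alice");

    /* ...and make it durable with one cache-line flush plus fence.
     * No separate log record, no later page write-back. */
    if (is_pmem)
        pmem_persist(&table[0], sizeof table[0]);
    else
        pmem_msync(&table[0], sizeof table[0]);

    pmem_unmap(table, mapped_len);
    return 0;
}
```

That one `pmem_persist` is the whole durability path for the data itself; once the gap between sequential and random writes shrinks, the append-only log loses much of its reason for being.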
The crux of the matter: the community has built the notion of the transaction on 'slow' disk drives and 'fast' memory, with the transaction happening in memory but only durable when flushed to disk. That boundary hurts enough that many, if not most, industrial strength RDBMS have offered the choice to do all I/O under the engine's control, ignoring the OS facility. In the *nix world this is referred to as 'raw I/O'.
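For those who haven't met it, raw I/O on today's Linux mostly means O_DIRECT: the engine hands the kernel an aligned buffer and the page cache gets out of the way. A minimal sketch (the device path is hypothetical, and alignment requirements vary by device):

```c
/* Sketch of 'raw' I/O on Linux: O_DIRECT bypasses the page cache so
 * the engine controls its own buffering. Path is hypothetical. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096   /* O_DIRECT wants block-aligned sizes and buffers */

int main(void) {
    int fd = open("/dev/sdb1", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        return 1;
    memset(buf, 0, BLOCK);
    memcpy(buf, "committed row image", 19);

    /* The engine, not the kernel's cache, decides when this hits disk. */
    if (pwrite(fd, buf, BLOCK, 0) != BLOCK)
        perror("pwrite");

    free(buf);
    close(fd);
    return 0;
}
```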
Here's another take on the meaning/purpose of SCM:
This is good, of course, but it got me wondering whether it followed from requirement A — no rewrite of applications — that the solution B automatically follows — that the likes of Optane persistent memory must reside in I/O space. After all, such persistent memory was created to be directly attached to the processor chips, and be byte addressable just like RAM. Think paradigm busting, outrageously fast commits of data to persistent storage. Said differently, can a processor complex be created with directly attached persistent memory and where the typical use of that system does not require changes to the applications?
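The 'directly attached, byte addressable' half of that question is roughly what Linux's MAP_SYNC mmap flag exposes: a store instruction into a mapped DAX file is a store to the persistent medium. A hedged sketch (the DAX mount point is my assumption; this fails cleanly on non-DAX filesystems):

```c
/* Sketch: byte-addressable persistence via MAP_SYNC on a DAX
 * filesystem (e.g. ext4 or xfs mounted with -o dax). The mount
 * point is hypothetical. Requires Linux 4.15+ and _GNU_SOURCE. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/mnt/pmem0/commits", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0) {
        perror("setup");
        return 1;
    }

    /* MAP_SYNC guarantees the mapping itself is durable: once the CPU
     * caches are flushed, the stored bytes survive power loss. */
    long *counter = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (counter == MAP_FAILED) {
        perror("mmap");   /* fails on non-DAX filesystems */
        return 1;
    }

    *counter += 1;                            /* the 'commit' is one store... */
    msync(counter, sizeof *counter, MS_SYNC); /* ...plus a flush for good measure */

    munmap(counter, 4096);
    close(fd);
    return 0;
}
```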
And, surprise surprise, this author remembers AS/400!
Another key — and here very applicable — concept basic to the IBM i operating system is that of single-level storage. Even decades ago with the System/38, SLS meant that when your application used a secure token as an address to access data, it did not matter whether that data was first found on disk or in RAM. Even after a system restart — say, one occurring due to a power failure — you restarted using exactly the same address token to address the same data.
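In conventional code a raw pointer dies with the process; the SLS trick can be approximated in ordinary C by persisting offsets within a mapped region rather than virtual addresses, much as PMDK's object store does with its own IDs. A toy sketch (pool path and layout mine):

```c
/* Toy approximation of single-level storage: persist offsets, not
 * pointers, so the same 'address token' works after a restart even
 * if the region maps at a different virtual address. Layout mine. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define POOL_SIZE 4096

struct pool_header {
    uint64_t record_off;   /* durable 'token': offset of the record */
};

int main(void) {
    int fd = open("/mnt/pmem0/pool", O_CREAT | O_RDWR, 0644);
    ftruncate(fd, POOL_SIZE);
    char *base = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    struct pool_header *hdr = (struct pool_header *)base;
    if (hdr->record_off == 0)          /* first run: 'allocate' a record */
        hdr->record_off = sizeof *hdr;

    /* The token survives restarts; the virtual address need not. */
    char *record = base + hdr->record_off;
    printf("record at offset %llu (address %p this run)\n",
           (unsigned long long)hdr->record_off, (void *)record);

    msync(base, POOL_SIZE, MS_SYNC);
    munmap(base, POOL_SIZE);
    close(fd);
    return 0;
}
```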
Finally, this author, likely not by intention, stabs MVCC in the gut (fine by me):
See the difference, along with the impact on throughput as a result? The locks, required in any case since time is passing, are held for a minimum of time. The probability of any subsequent transactions seeing these locks decreases significantly. Subsequent transactions don't as often need to wait, and when they do their wait time is far less. In our train metaphor used earlier, a train doesn't even get built anywhere nearly as often. Life is good.
Die MVCC, DIE!!!