Database management system engine in Rust: storage overview

stencillogic stencillogic
4 min readNov 11, 2021

In this article storage organization of DBMS project (https://github.com/stencillogic/db-core) will be reviewed.
To remind, the project represents schema-less key-value store with ACID properties. It can be considered as a back-end of a relational DBMS with SQL support, non-relational, e.g. document-based, graph, or any other kind of DBMS. More about it in the previous article: https://stencillogic.medium.com/dbms-engine-in-rust-696ce79d18df
There are three types of stores in use:
1. Data store
2. Versioning store
3. Checkpoint store

Data store

Data store serves as the main storage of objects (records, documents, nodes, etc.). Data is stored in blocks of fixed size. Block is the smallest portion of data that can be stored to or loaded from disk. Blocks are organized in extents. Extent is a sequence of blocks and a unit of extension of a store.
In the current implementation all stores represented by files in the file system.
Sample file structure:

File header is an extent with number 0. Blocks in the file header contain file-level metadata and bitmap of free/full extents. Each bit corresponds to an extent in the file. Bitmap is used to search for the place in the file where data can be written. If bit is set then corresponding extent is considered full.
Each extent has header as well. The extent header consists of 1 or more blocks and contains extent-level metadata and bitmap of free/used blocks. Each bit in the bitmap corresponds to a block in the extent. If all bits in the bitmap are set to 1 then extent is considered full, and corresponding bit for the extent is set to 1 in the file header block.
After header blocks in extent follow data blocks. Each data block has a fixed-size header and directory of entries. Entries have variable size and contain data itself. The number of entries in the block can vary as well. New entries can be added, and some existing entries can be deleted with time. Entries are added from the bottom of the block while the header and directory of entries are kept at the top.
If block becomes full then corresponding bit is set in the bitmap of extent header. Contrary, when some space becomes available in the block the bit is set to 0.
When data is being written the system finds a free block, allocates entry there, and writes data into the entry. If entry size is too small to fit all the data the entry can be extended if there is enough of free space in the block for it. If there is no free space available then a new entry in another block is allocated, and previous entry will have a pointer to the newly allocated entry. This way data store allows storing objects of arbitrary size as linked list of entries.

Checkpoint store

Checkpoint store serves as a snapshot of data as of certain moment in time (checkpoint mark). It is used to restore database state after failure or restart.
Before any data block in the main data store can be modified its original state (the state as of checkpoint mark) must be saved in the checkpoint store.
When system starts, it copies all blocks from the checkpoint store to the main data store, then finds the latest checkpoint mark in the transaction log and applies changes starting from that mark.
Checkpoint requires time to complete. Only completed checkpoint can be used for restore. As a result, checkpoint store is used a little differently than the main store. The bitmaps of free/used extents and blocks are not used. Instead, blocks in the checkpoint store are added sequentially one by one.
Since checkpoint can be in uncompleted state the previous checkpoint’s data must be kept until checkpoint completion. To separate uncompleted checkpoint from completed the latter uses odd extent numbers while in-progress checkpoint uses extents with even numbers. During the next checkpoint roles change.

Versioning store

Versioning store serves as MVCC (https://en.wikipedia.org/wiki/Multiversion_concurrency_control). Each entry before modification is first copied to the version store. The versioned entry is used when transaction rolls back or when other transactions need to read corresponding version of the entry. E.g. when we use read committed isolation level we need not to read uncommitted changes which are present in the main store, we will use committed version of data stored in the version store.
Each entry in the main store has a pointer to an entry in the version store. Transaction can use it to find required version of the entry.
Each entry in the version store has a pointer to an entry in the main data store which is used during rollback as location for restore.
Also, entry in version store has pointer to previously created by the current transaction version entry. In other words, all versions created by the transaction are in linked list, which is navigated when transaction rolls back.
Version store represents circular list of extents. With time, oldest and already not used extents are removed from the list, and after can be reused by adding at the head of the list to accommodate new versioning entries.
Blocks from the version store are not checkpointed. Version store is discarded after system restart.

To sum up

There are three types of storage which serve different purposes. Checkpoint store allows restoring state of the database as of certain point in time and is used during system startup. Versioning store serves for transaction isolation and rollback. And the main store serves as a structured storage for the data.

--

--