A customer asked me to build a transaction-log-based Change Data Capture method for SAP Hana. This is a good opportunity to explain how such a solution works, which difficult questions to ask during a vendor presentation, and to clear up a couple of misconceptions, e.g. the claim that it is always the most efficient method.
Every database works in a similar way. A DML statement like insert into… is executed by a database user and the effects of this statement are applied to the database. But instead of updating the database file right away, these change records are persisted into a file containing a continuous stream of all changes, and a batch of these changes updates the database file later. This two-stage write is done for performance reasons. Appending changes to a log file is much quicker than merging the changes into the actual database file for every single record. With Hana this is most obvious: writing a single record into a database file would mean decompressing the entire table partition – all columns, all rows! – adding the new row and compressing the entire partition again. A delta merge.
This log file is called the transaction log. In Oracle it is also called the redo log; a more generic term is the Write Ahead Log (WAL).
SAP Hana uses the transaction log for these use cases:
As Hana does not provide an API to the transaction log, it misses out on other features:
So let’s build a piece of software that reads the Hana transaction log file and produces the changes. The parts to implement are:
The parsing part is obvious. Any CDC program must be able to understand the log file contents and produce a useful output, either by reading the file directly byte by byte and making sense of the content, or by using an API the database provides. And yes, it truly is a binary file where even single bits impact how to read the next piece of information.
Oracle has the LogMiner API, Hana has no API. That is the reason why Fivetran/HVR came up with a product of their own and the reason SAP does not provide a certification for transaction log readers. No API, nothing to certify against.
The second point can be tricky. To visualize, let’s assume the user executed an insert statement at 09:00 and saved (committed) the change in the database at 11:00. Only at 11:00 are the changes visible to others in the database, because only then are they confirmed to be permanent. The user could have executed a rollback of the changes at any time before the commit. Hence only at 11:00 should the CDC solution propagate the change information.
This uncommitted data must be stored for as long as the transaction is open in the database, and that can be hours or days. No assumption can be made, and pausing the log read until all previous transactions are closed is not an option either.
The biggest problem is the data stored in the transaction log. It contains the bare minimum, for performance reasons. For an insert that is fine, but an update table1 set name=’Sue’ where pk=1; writes the new value and the row pointer only. None of the other columns, not even the primary key, can be found in the transaction log. This is okay for a system replication, maybe even for a database replication, although replaying the update in the target is expensive as well. For all other use cases, batch CDC, realtime streaming to various targets, all column values must be sent, maybe even the before and after values.
A delete contains just the row pointer of the row to be deleted in the transaction log.
Oracle provides a feature called supplemental logging. The user can tell the database which columns should be added to the transaction log, e.g. the primary key columns. For a CDC solution all columns are needed, but adding all of them would slow down the database operations for the users significantly. Hence Oracle GoldenGate is provided as a better option for CDC. In Hana no supplemental logging feature exists.
The requirement is to create a record that contains all column values, not just the changed ones, to make it easier for the consumers. As the transaction log contains only the bare minimum information about a change, namely the rowid and the changed column values, the other data must be read from somewhere. Reading that data from the database would have side effects, as the database provides only the current committed version of the record and not the version matching the transaction log entry.
If we are not too strict and accept incorrect combinations, it is still a possibility, but likely not a good idea. Example: a user updated an existing record at 09:00 modifying the first name and at 09:05 the address fields. If the log reader is fast enough and the change record is found immediately after the commit, fetching the then-current record gives the correct version. But if the log reader finds the first update only at 09:07, it would produce a change record containing the new name and the new address as well, and the second update would look as if there was an update of the address fields where nothing changed. And this assumes we know the primary key from the transaction log, which we do not. We know the rowid only, and if we read the record of the first change, its rowid no longer exists after 09:05! That would be sort of an eventual consistency model, which likely is not acceptable.
Hence the only option is to remember the records externally. If an insert is found, the full record is stored with the corresponding rowid. If an update is found, the record is copied, the changed values and the new rowid are applied, and the copy is appended as well. Occasionally the outdated rowids are deleted to save space. When an existing table is added to the CDC process, its current data including the rowids must be initial-loaded first.
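To make this more concrete, here is a minimal sketch of such an external store, assuming a plain in-memory map keyed by rowid (a real implementation would persist it; all class and method names are made up for illustration):

import java.util.HashMap;
import java.util.Map;

/** Keeps the full-width version of every row, keyed by rowid. Sketch only. */
public class RowImageStore {
    private final Map<Long, Map<String, Object>> rowsByRowid = new HashMap<>();

    /** Insert: the log contains all column values, store them under the new rowid. */
    public void onInsert(long rowid, Map<String, Object> allColumns) {
        rowsByRowid.put(rowid, allColumns);
    }

    /** Update: copy the old image, apply the changed columns, append under the new rowid.
     *  The old rowid is kept, so re-reading older log positions still resolves. */
    public Map<String, Object> onUpdate(long oldRowid, long newRowid, Map<String, Object> changedColumns) {
        Map<String, Object> fullImage = new HashMap<>(rowsByRowid.get(oldRowid));
        fullImage.putAll(changedColumns);
        rowsByRowid.put(newRowid, fullImage);
        return fullImage; // the full-width change record for the consumers
    }

    /** Delete: the log contains only the rowid, the full image comes from the store. */
    public Map<String, Object> onDelete(long rowid) {
        return rowsByRowid.get(rowid);
    }
}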
The only other option is to write all column values into the transaction log. This is possible with Oracle supplemental logging, but the consequence is a massive slowdown of all insert/update/delete operations in the database. Unfortunately many CDC products using Oracle require exactly that, and suddenly a CDC solution is no longer the most efficient method. I would strongly advise against using supplemental logging on all columns and hence against CDC software requiring it! Unfortunately, most installations of Debezium reading from Oracle are using LogMiner and hence supplemental logging on all columns of all CDC-enabled tables must be enabled.
A particularly nasty case is uncommitted transactions in the source. This happens frequently and is normal for databases: a user makes a change in a screen and saves it a few hours later. Quite normal, and the log reader must be able to deal with it.
A first approach would be to remember all open transactions in memory. But a single update orders set archived=’Y’ where year < 2023; would execute as a single transaction with billions of records and thus exceed any memory available.
Another option would be to read the commit events from the transaction log and then read it again to find all the data belonging to this commit. For small transactions and with some caching this is doable, but the billion-row update would require reading the entire log twice. Worse, as the transaction log is not indexed, the entire time range the transaction was open must be read again.
So the only feasible solution is to read the log immediately, remember the read data temporarily, and produce the changes at commit time.
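A minimal sketch of this commit-time buffering, assuming a hypothetical dispatcher class (a real reader would spill large transactions to disk instead of keeping everything in memory):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Buffers change records per transaction and emits them only when the commit arrives. Sketch only. */
public class CommitTimeDispatcher {
    // For a huge transaction this map must spill to disk, memory alone is not enough.
    private final Map<Long, List<ChangeRecord>> openTransactions = new HashMap<>();

    public void onChange(long transactionId, ChangeRecord record) {
        openTransactions.computeIfAbsent(transactionId, id -> new ArrayList<>()).add(record);
    }

    public void onCommit(long transactionId, Target target) {
        List<ChangeRecord> changes = openTransactions.remove(transactionId);
        if (changes != null) {
            changes.forEach(target::emit); // only now the changes are propagated downstream
        }
    }

    public void onRollback(long transactionId) {
        openTransactions.remove(transactionId); // discard, the data never became permanent
    }

    interface Target { void emit(ChangeRecord record); }
    record ChangeRecord(long rowid, Map<String, Object> values) {}
}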
A similar problem to long-running transactions arises at a CDC restart. The log reader starts reading from the last successfully processed position and will find a commit for a transaction whose changes lie in the past. In this example the log reader starts in the middle of transaction 2.
It must get all missing changes of this transaction from somewhere. Hence yes, all rows of uncommitted transactions must be persisted and removed later.
That is also the reason why old versions of records should be kept. In the full-width-record paragraph there was the statement that “changed values and the new rowid [are] appended” to an external persistence. Why append and not replace? Because of error handling and recovery. When the log reader re-reads data from the log that has been partially processed, it must be guaranteed that all rowids with their column values can still be found. And if we allow that, we can also allow the user to rewind and reprocess from an earlier start date, e.g. if the target had a problem.
One point that has not been discussed yet is how long the transaction log files are kept by the database. And the answer is: a very short time! Transaction log files have a fixed size and when one is full, the next file is used. During that time the database merges the data from the old transaction log file into the database files, and once that is completed the old transaction log file can be overwritten.
In other words, the transaction log files are a cyclic buffer of usually three files. Which raises the question what the log reader should do if it falls behind reading the data.
Some CDC mechanisms in combination with certain databases, for example SQL Server with its CDC API or the Sybase Replication Server, have the means to tell the database which transaction log file is still needed. Then the database is not allowed to overwrite that transaction log file and will block all incoming changes until it is freed. A dangerous approach.
But databases themselves have the same requirement to read older transaction log files, for example a database restore followed by reapplying all transaction log files since the backup to bring the database to the most current state. So what databases do is archive the transaction logs to another location, mostly a simple file copy.
For the transaction log reader this means it should read the online transaction logs for the most current data and the archive log files for older data. Transitioning between the two is a bit tricky to implement but can be done.
Hana and Oracle both use archive logs, and in the case of Hana an archive log is a transaction log with an additional 4k of data at the beginning. The archive log name also tells which block ids it contains, so it is very easy to find the first archive log file needed. And the system view M_LOG_SEGMENTS shows the state of all transaction log files.
SELECT VOLUME_ID, PARTITION_ID, SEGMENT_ID, FILE_NAME, FILE_OFFSET,
STATE, USED_SIZE, LAST_COMMIT_TIME, MIN_POSITION, MAX_POSITION
FROM SYS.M_LOG_SEGMENTS
WHERE VOLUME_ID = 2;
The MIN/MAX_POSITION is the page id which gets used in the archive log file name. For example, one archive log file is found at
/hana/shared/HXE/HDB90/backup/log/DB_HXE/log_backup_2_0_767588992_767595712.1628258469399
because the second row in the screenshot above says the online log is free, meaning it is not currently in use and the archive log copy has already finished. The name also contains the volume id and the timestamp in Unix epoch format.
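With that, locating the right archive log file for a given log position is a matter of parsing the file names. A small sketch, assuming the name pattern log_backup_<volume>_0_<minpos>_<maxpos>.<epoch> seen above (the parsing details are an assumption based on this one example):

import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds the archive log file whose position range contains the requested position. Sketch only. */
public class ArchiveLogLocator {
    // e.g. log_backup_2_0_767588992_767595712.1628258469399
    private static final Pattern NAME = Pattern.compile("log_backup_(\\d+)_0_(\\d+)_(\\d+)\\.(\\d+)");

    public static File find(File backupDir, int volumeId, long position) {
        File[] files = backupDir.listFiles();
        if (files == null) {
            return null;
        }
        for (File f : files) {
            Matcher m = NAME.matcher(f.getName());
            if (m.matches()
                    && Integer.parseInt(m.group(1)) == volumeId
                    && Long.parseLong(m.group(2)) <= position
                    && position < Long.parseLong(m.group(3))) {
                return f; // this archive log covers the requested page id
            }
        }
        return null; // not archived (yet), read the online log segments instead
    }
}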
The Hana transaction log is organized in 4k pages. Each page starts with a header containing useful information like the page id, the timestamp of the change and length details needed to parse it properly.
The payload of a page consists of one to many blocks. A block again has a header, and the first important pieces of information are the block type and the detailed operation. These can be insert, update, delete, upsert, truncate, dictionary changes, savepoint, rollback, commit, … any kind of change happening to the database.
If the block type is about a table, e.g. an update of a table, the next data is the container (= a table partition) and the number of rows impacted. The latter is important because there can be different kinds of statements, which are all encoded as densely as possible.
The difference is the number of rows and the number of distinct values. In the first case one row is updated with one value. In the second case 100 rows are updated with a single value, e.g. a statement like update orders set archived=’Y’ where year < 2023; that assigns the same constant to every row. And in the third case 100 rows are updated with 100 distinct values, e.g. when the new value depends on the row itself.
The payload of the block aligns with that. In the first case it contains the before and after rowid, the information which columns are contained, how the values are encoded, and the value itself.
In the third case 100 before rowids are stored, a single after rowid, the column information and 100 values. The after rowids start with the single after rowid and end with that rowid + 99. This saves another 99 times 8 bytes in the log ;-).
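A tiny sketch of expanding that dense encoding back into individual change records, assuming the layout described above (field and class names are made up for illustration):

/** A multi-row update block: n before rowids, one starting after rowid, one or n values. Sketch only. */
public class MultiRowUpdateBlock {
    long[] beforeRowids;   // one entry per affected row
    long firstAfterRowid;  // only the first after rowid is stored in the log
    Object[] values;       // a single shared value or one value per row

    public ChangeRecord[] expand() {
        ChangeRecord[] records = new ChangeRecord[beforeRowids.length];
        for (int i = 0; i < beforeRowids.length; i++) {
            long afterRowid = firstAfterRowid + i; // consecutive, hence stored only once
            Object value = values.length == 1 ? values[0] : values[i];
            records[i] = new ChangeRecord(beforeRowids[i], afterRowid, value);
        }
        return records;
    }

    record ChangeRecord(long beforeRowid, long afterRowid, Object value) {}
}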
Example: This binary data has the equivalent SQL statement of
0x0190: 81 00 01 01 01 CA 00 00 00 00 C1 20 01 01 01 00
0x01A0: 09 00 00 00 00 00 00 00 53 21 20 04 00 00 00 00
0x01B0: 00 00 00 00 08 48 65 6C 6C 6F 31 2E 32 7F 01 00
The bytes 48 65 6C 6C 6F 31 2E 32 in the 0x01B0 row are the ASCII characters of the string Hello1.2.
update USER1.TEST1 set
  COL1 = 'Hello1.2',
  $rowid$ = 384
where $rowid$ = 383;
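Just to illustrate the byte-level work: in the dump the value appears to be stored length-prefixed, 08 followed by eight character bytes. A hedged reading routine could look like this; the real encoding rules are of course more involved and undocumented:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Reads a length-prefixed string value from a log block, as observed in the dump above. Sketch only. */
public class ValueDecoder {
    public static String readString(ByteBuffer block) {
        int length = block.get() & 0xFF;                   // 0x08 in the example
        byte[] chars = new byte[length];
        block.get(chars);                                  // 48 65 6C 6C 6F 31 2E 32
        return new String(chars, StandardCharsets.UTF_8);  // "Hello1.2"
    }
}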
This is obviously just a fraction of all the knowledge needed to decode the transaction log but it shows the general principle.
Furthermore, database files can be encrypted and the Hana database can be running in scale-out mode.