Database replication tuning

QuestDB replication is tunable.

Replication can either:

  1. React to transaction events more quickly
    1. Lower latency with more network overhead
  2. Package more transactions together
    1. Higher latency with less network overhead

Before we dig into it, consider:

Overview‚Äč

In large part, tuning comes down to 3 key configuration options.

These options are set in your QuestDB server.conf:

PropertyDefault Value
cairo.wal.segment.rollover.size2097152 (2MiB)
replication.primary.throttle.window.duration10000 (10s)
replication.primary.sequencer.part.txn.count5000 (transactions)

Default settings‚Äč

By default, replication increases network utilization by around 9x compared to the baseline seen ingesting via the InfluxDB Line Protocol. This default optimizes for lower latency and higher activity, where responsiveness is most desirable.

cairo.wal.segment.rollover.size=2097152
replication.primary.throttle.window.duration=10000
replication.primary.sequencer.part.txn.count=5000
Network traffic with default settings
Network traffic with default settings

We'll go through an example tuning scenario and explain each option.

The following represents 12 hours of replication activity.

To demonstrate, we will trace network usage for a single table receiving one row with 31 columns per second.

Breakdown of key settings‚Äč

Tuning is a balance between 3 main options.

What do these options do?

cairo.wal.segment.rollover.size‚Äč

In QuestDB each table replicates in a cycle independently of other tables. Thus each of the database's network client connections writes its data to its own segment in the Write-Ahead Log (WAL).

Once a WAL segment reaches a certain size, it's closed and a new one is opened. The smaller the size, the more often segments roll over. This results in less overall network overhead.

PropertyDefault ValueUpdated Value
cairo.wal.segment.rollover.size2097152 (2MiB)262144 (256KiB)

However, small files may be inefficient or expensive to store in your object storage. For example, AWS S3 Intelligent Tiering, requires that a file must be over 128KiB.

Note that if you are replicating with a file system such as NFS, larger files will give better throughput for point-in-time recovery. As such, including 8x compresison, the default 2097152 (2MiB) is appropriate for many cases. But you can go smaller too, depending on your object store setup.

replication.primary.throttle.window.duration‚Äč

Next, replication transfers WAL segment files to the object store. If a segment is allowed to close first then it will only be uploaded once, otherwise it will be re-uploaded multiple times, each time with more data. This causes write amplification.

To reduce it, set to a longer throttle value:

PropertyDefault ValueUpdated Value
replication.primary.throttle.window.duration10000 (10s)60000 (60s)

This value represents your maximum replication latency tolerance. The default is 10 seconds (10000), but could be 60 seconds (60000), five minutes (300000) or whatever best suits your operational requirements.

Note that even with a longer value, QuestDB still actively manages replication to prevent a backlog. Therefore, increasing this duration will not result in a pile-up of data needing replication, even when there's a burst of activity.

If minimizing network traffic is a priority for your production environment, opting for a longer duration reduces the frequency of replication operations, providing better network efficiency minor replication delay.

replication.primary.sequencer.part.txn.count‚Äč

Data ingested into QuestDB is written to multiple WAL segments, one open segment per connection. However, for a given table transactions themselves are recorded in a central sequencer transaction log. The number of stored transactions is flexible and are packaged together as a "part".

We recommend that these "parts" remain as small as is reasonable for your object store, since the relevant parts are uploaded (or reuploaded) each and every replication cycle. By default, each part holds 5000 "txn" records.

PropertyDefault ValueUpdated Value
replication.primary.sequencer.part.txn.count5000 (transactions)1000 (transactions)

Since each record is 28 bytes, once compressed - usually at ~6x - we expect parts to be around ~23KiB each. Customize the part size as is appropriate for your case, but note that you can only change this value when initially enabling replication.

note

The setting is fixed for a given table once set.

Impact of compression‚Äč

Compression is Key! Although it can lead to fuzzy calculations if we're not clear on detail.

The tables below help inform calculations appropriate for your object store.

For our default scenario:

WAL PartDefault ValueExpected CompressionEstimate Size in Object Store
WAL Segment2 MiB8x~256 KiB
Sequencer Part5000 txns (*28 = 136 KiB)6x~22.7 KiB

For our compressed scenario:

WAL PartUpdated ValueExpected CompressionEstimate Size in Object Store
WAL Segment256 KiB8x~32 KiB
Sequencer Part1000 txns (*28 = 136 KiB)6x~4.5 KiB

Summary‚Äč

In summary, to tune for network efficiency:

  • Make the segments and sequencer parts as small as your specific object store can handle
  • Make the replication window duration as long as you can tolerate for your requirements

For example, reducing the segment size to 256KiB, the sequencer parts to 1000 records, and increasing the replication window duration to one minute, the network traffic reduces to 57% of the ILP ingestion traffic.

All together, the tuned, network effective settings are:

cairo.wal.segment.rollover.size=262144
replication.primary.throttle.window.duration=60000
replication.primary.sequencer.part.txn.count=1000
Network traffic with network efficiency settings
Network traffic with network efficiency settings