When I initially set up my CNPG (CloudNativePG) cluster, I overlooked the option to enable data checksums. Data checksums ensure data integrity in PostgreSQL by allowing the system to detect corruption in data pages. When enabled, a checksum is calculated for each data page and stored alongside it. Upon reading, the checksum is recalculated and compared to the stored value, so corruption can be detected early. This guide will help you enable data checksums on an existing CNPG cluster.

If you plan ahead, you can enable data checksums during the cluster initialization by adding the following to your cluster spec:

spec:
  bootstrap:
    initdb:
      dataChecksums: true
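
For context, a complete minimal manifest with checksums enabled might look like this (the name, instance count, and storage size below are placeholders, not values from a real cluster):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main            # placeholder cluster name
spec:
  instances: 3
  storage:
    size: 10Gi
  bootstrap:
    initdb:
      dataChecksums: true  # checksums are set at initdb time
```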

For clusters where this step was missed, PostgreSQL provides the pg_checksums tool to enable or disable checksums after initialization. However, this is an offline operation: the database must be shut down while it runs. Given CNPG’s focus on high availability, we take each instance offline one at a time, so the cluster as a whole keeps serving traffic throughout.

Here’s a step-by-step guide to enabling data checksums on an existing CNPG cluster:

  1. Check if Checksums are Already Enabled

    Run the following SQL command to get the status:

    SHOW data_checksums;
    

    If the output is off, you need to proceed with enabling the checksums.
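
    If you prefer not to open a psql session by hand, the CNPG plugin (installed in step 3) offers a psql shortcut; a sketch, assuming your plugin version provides the `kubectl cnpg psql` subcommand:

    ```shell
    # Runs the query against the cluster's primary via the CNPG plugin.
    kubectl cnpg psql --namespace <cluster-namespace> <cluster-name> -- -c "SHOW data_checksums;"
    ```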

  2. Backup Your Database

    Before performing any manual operations on your database, ensure you have a current backup. This step is crucial to prevent data loss in case something goes wrong.
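
    If your cluster already has a backup destination configured in its spec, the CNPG plugin can trigger an on-demand backup; a sketch, assuming an object store is set up:

    ```shell
    # Requests an on-demand backup before touching any data files.
    kubectl cnpg backup --namespace <cluster-namespace> <cluster-name>
    ```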

  3. Install the CNPG Plugin

    Ensure you have the CNPG plugin installed. This plugin helps manage CNPG clusters and perform various operations more efficiently. Refer to the official documentation for installation instructions.
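
    For reference, two common installation routes from the project documentation (verify against the current docs, as these may change between releases):

    ```shell
    # Via the official install script:
    curl -sSfL \
      https://github.com/cloudnative-pg/cloudnative-pg/raw/main/hack/install-cnpg-plugin.sh | \
      sudo sh -s -- -b /usr/local/bin

    # Or via krew:
    kubectl krew install cnpg
    ```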

  4. Fence the Instance

    To safely take an instance offline, we need to fence it. Fencing ensures the instance is isolated and no new connections are made. Use the following command:

    kubectl cnpg fencing on --namespace <cluster-namespace> <cluster-name> <cluster-instance>
    

    However, this did not work for me, so I applied the following annotation to the cluster instead:

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      annotations:
        cnpg.io/fencedInstances: "[\"<cluster-name>-<cluster-instance>\"]"
    
  5. Enable Data Checksums

    Once the instance is fenced, CNPG shuts down the PostgreSQL process but keeps the pod running, so you can exec into the pod and run pg_checksums. This command must be run while the database is offline:

    pg_checksums --enable --progress --verbose
    

    When run, pg_checksums scans all data blocks, calculating and storing a checksum for each page. This process can be time-consuming depending on the size of your database:

    72497/72497 MB (100%) computed
    Checksum operation completed
    Files scanned:   9675
    Blocks scanned:  9279637
    Files written:  0
    Blocks written: 0
    pg_checksums: syncing data directory
    pg_checksums: updating control file
    Data checksum version: 1
    Checksums enabled in cluster
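
    For reference, the command can also be run from outside the pod with kubectl exec; a sketch, assuming CNPG's default data directory of /var/lib/postgresql/data/pgdata:

    ```shell
    # Runs pg_checksums inside the fenced instance's pod.
    # The pod name follows the <cluster-name>-<cluster-instance> pattern.
    kubectl exec -it --namespace <cluster-namespace> <cluster-name>-<cluster-instance> -- \
      pg_checksums --enable --progress --verbose \
      --pgdata /var/lib/postgresql/data/pgdata
    ```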
    
  6. Unfence the Instance

    Now that data checksums are enabled for this instance, you can remove the fence by either removing the annotation or running the following command:

    kubectl cnpg fencing off --namespace <cluster-namespace> <cluster-name> <cluster-instance>
    
  7. Repeat for the Remaining Instances

    Once the instance has been unfenced, wait for replication to catch up. You can check the replication status with:

    kubectl cnpg status --namespace <cluster-namespace> <cluster-name>
    

    Repeat this process for the remaining instances until they all have data checksums enabled.
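
    The whole fence / enable / unfence cycle can be sketched as a loop. This is a rough outline for a hypothetical three-instance cluster named pg-main, with the replication wait simplified to a fixed sleep; in practice, check `kubectl cnpg status` between instances instead:

    ```shell
    #!/usr/bin/env bash
    set -euo pipefail

    NAMESPACE=default   # placeholder namespace
    CLUSTER=pg-main     # hypothetical cluster name

    for i in 1 2 3; do
      # Fence the instance so CNPG shuts down PostgreSQL but keeps the pod up.
      kubectl cnpg fencing on --namespace "$NAMESPACE" "$CLUSTER" "$i"

      # Enable checksums while the database is offline.
      kubectl exec --namespace "$NAMESPACE" "${CLUSTER}-${i}" -- \
        pg_checksums --enable --progress --pgdata /var/lib/postgresql/data/pgdata

      # Unfence and give replication time to catch up before the next instance.
      kubectl cnpg fencing off --namespace "$NAMESPACE" "$CLUSTER" "$i"
      sleep 60
    done
    ```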

Refer to the PostgreSQL documentation on data checksums and the pg_checksums tool for more detailed information on checksums and the enabling process.