PostgreSQL to Kafka Data Replication

NineData Data Replication supports replicating PostgreSQL data to Kafka topics.

Overview

Use this page to create a PostgreSQL to Kafka replication task and understand the JSON payload written to each Kafka topic.

Before you begin

Add the source PostgreSQL data source and target Kafka data source to NineData. For instructions, see Add data sources.
The target system is Kafka 0.10 or later.
Make sure the source account has the following permissions:
Replication type	Permissions
Full Replication	CONNECT, SELECT
Incremental Replication	SUPERUSER
For incremental replication, open postgresql.conf and configure the following parameters. If you do not know the file location, run SHOW config_file; in the psql client.
- The wal_level parameter of the source data source must be logical. To confirm the current value, run SHOW wal_level in the client.
- Set wal_sender_timeout to 0. This parameter interrupts replication connections that stay idle for longer than the configured number of milliseconds. The default value is 60000 milliseconds. Setting it to 0 disables the timeout. To confirm the current value, run SHOW wal_sender_timeout in the client.
- Set max_replication_slots to a value greater than 1. This parameter specifies the maximum number of replication slots that the server supports. The default value is 10.
- Set max_wal_senders to a value greater than 1. This parameter specifies the maximum number of concurrent WAL sender connections. The default value is 10. To confirm the current value, run SHOW max_wal_senders in the client.
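
The parameter requirements above can be sketched as a small check. This is a minimal illustration, not part of NineData: the function name `check_replication_settings` is hypothetical, and it assumes you have already collected the output of the `SHOW` commands into a dictionary of strings.

```python
from typing import Dict, List


def check_replication_settings(settings: Dict[str, str]) -> List[str]:
    """Return a list of problems found in the given PostgreSQL settings.

    `settings` maps parameter names to the raw strings printed by SHOW,
    for example {"wal_level": "logical", "wal_sender_timeout": "0"}.
    """
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be logical")
    # SHOW prints a disabled timeout as "0"; any other value means it is active.
    if settings.get("wal_sender_timeout") != "0":
        problems.append("wal_sender_timeout must be 0")
    for name in ("max_replication_slots", "max_wal_senders"):
        value = settings.get(name, "")
        if not value.isdigit() or int(value) <= 1:
            problems.append(f"{name} must be greater than 1")
    return problems
```

An empty list means the four parameters satisfy the requirements listed above. A missing parameter is reported as a problem rather than silently accepted.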

Restrictions

Before you run a replication task, assess the performance of the source and target data sources. Run full replication during off-peak hours when possible because the initial load consumes read and write resources.
Each replicated table must have a primary key or a unique constraint, and column names must be unique. Otherwise, duplicate rows may be replicated.

Procedure

Sign in to the NineData Console.
In the left navigation pane, click Replication > Data Replication.
On the Replication page, click Create Replication.

On the Source & Target tab, configure the fields in the table, and click Next.

Parameter	Description
Name	Enter a name for the data synchronization task. To make the task easier to find and manage later, use a meaningful name. Up to 64 characters are supported.
Source	The data source that contains the objects to synchronize.
Datahub Project	Select the target Datahub Project. Data from the source data source will be written to the specified Project.
Kafka Topic	Select the target Kafka Topic. Data from the source data source will be written to the specified Topic.
Delivery Partition	When delivering data to a Topic, specify the partition to which the data is delivered. Deliver All to Partition 0: Deliver all data to the default partition 0. Deliver to different partition by the Hash value of [databaseName + tableName]: Hash data across different partitions. NineData uses the hash value of the database name and table name to calculate the target partition, ensuring that data from the same table is delivered to the same partition during hash delivery.
Type	Select the replication type. Full: Synchronize all objects and data from the source data source, namely full data replication. The switch on the right enables periodic full replication. For more information, see Periodic Full Replication. Incremental: After full synchronization completes, perform incremental synchronization based on the logs of the source data source.
Incremental Started	Required only when Type is Incremental. From Started: Use the current replication task start time as the baseline for incremental replication. Customized Time: Select the point in time from which incremental replication starts. Select a time zone based on the region of your business. If the configured time point is earlier than the current replication task start time and DDL operations occurred during that period, the replication task will fail.
Target Table Exists Data (Required when Full is selected)	Pre-Check Error and Stop Task: Stop the task when data is detected in the target table during the precheck stage. Ignore existing target data and append to it: When data is detected in the target table during the precheck stage, keep that data and append the new data. Clear target existing data before write: When data is detected in the target table during the precheck stage, delete that data and then write the new data.
Incremental data conflict handling strategy for target table (Required when Incremental is selected)	Runtime error: During incremental replication, report an error when target data already exists and wait for manual intervention. Do not update target data: During incremental replication, do not write data when target data already exists, and continue subsequent tasks. Update target data: During incremental replication, overwrite the target data when target data already exists.
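
The hash option for Delivery Partition can be illustrated with a short sketch. The hash function NineData uses is not stated on this page, so CRC32 is only a stand-in here, and the name `choose_partition` is hypothetical. The point is the property described in the table: the same database name and table name always map to the same partition.

```python
import zlib


def choose_partition(database_name: str, table_name: str, partition_count: int) -> int:
    """Map a table to a partition so the same table always gets the same one."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    key = (database_name + table_name).encode("utf-8")
    # CRC32 is an assumed stand-in; the real hash used by NineData may differ.
    return zlib.crc32(key) % partition_count
```

With a single partition the result is always 0, which matches the Deliver All to Partition 0 option.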

On the Objects tab, configure the parameters in the table, and click Next.

Parameter	Description

To create multiple replication tasks with the same replication objects, import a configuration file. Click Import Config, click Download Template to download the template, edit the file, and then click Upload to upload it and import the objects in bulk. The configuration file uses these fields:

Parameter	Description
`source_table_name`	The source table name of the object to synchronize.
`destination_table_name`	The target table name that receives the synchronized object.
`source_schema_name`	The source schema name of the object to synchronize.
`destination_schema_name`	The target schema name that receives the synchronized object.
`source_database_name`	The source database name of the object to synchronize.
`target_database_name`	The target database name that receives the synchronized object.
`column_list`	The list of columns to synchronize.
`extra_configuration`	Additional configuration information. This field supports: `column_rules`: Defines column mappings and value rules. Field descriptions: `column_name`: Original column name. `destination_column_name`: Specifies the target column name. `column_value`: Specifies the column value, which can be an SQL function or a constant value. `filter_condition`: Specifies row-level data filtering conditions. Only rows that meet the conditions are replicated.

Tip: Example of extra_configuration:

{
  "extra_config":{
    "column_rules":[
      {
         "column_name": "created_time",
         "destination_column_name": "migrated_time",
         "column_value": "current_timestamp()"
      }
    ],
     "filter_condition": "id != 0"
  }
}

In this example, created_time is mapped to migrated_time, the target column value is changed to current_timestamp(), and only rows whose id value is not 0 are synchronized.

For a complete example of the configuration file, see the downloaded template.

On the Mapping tab, configure each column to replicate to Kafka. By default, all columns of the selected table are replicated. If source or target metadata changes while you configure mappings, click Refresh Metadata. After completing the configuration, click Save and Pre-Check.

On the Pre-check tab, wait for NineData to complete the precheck. After the precheck passes, click Launch.
- Select Enable data consistency comparison to start a data consistency comparison task based on the source data source after synchronization completes. Based on the selected Type, Enable data consistency comparison starts at these times:
  - Full: Starts after full replication completes.
  - Full+Incremental, Incremental: Starts when incremental data is consistent with the source data source for the first time and Delay is 0 seconds. Click View Details to view synchronization delay on the Details page.
- If the precheck fails, click Details in the Actions column for the failed check item, review the cause, fix the issue, and then click Check Again to run the precheck again until it passes.
- Items with Warning in Result can be fixed or ignored if required.
On the Launch page, the Launch Successfully message appears, indicating that the synchronization task has started. Then perform these actions:
- Click View Details to view the execution status of each stage of the synchronization task.
- Click Back to list to return to the Replication task list page.

Result

Sign in to the NineData Console.
In the navigation menu, select Replication > Data Replication.

On the Replication page, click the ID of the target synchronization task to open the Details page. The task details page shows the following information.

No.	Feature	Description
1	Synchronization Delay	The synchronization delay between the source and target data sources. `0` seconds means Kafka has caught up with the source data source.
2	Configure Alerts	When the task fails, NineData notifies the selected channel through the configured alert. For more information, see Operational Monitoring Overview.
3	More	Pause: Pause the task. Only tasks in the Running status can be paused. Terminate: End a task that is incomplete or listening (that is, in incremental synchronization). A terminated task cannot be restarted, so proceed with caution. Delete: Delete the task. A deleted task cannot be recovered, so proceed with caution.
4	Full Replication (Displayed in scenarios including full replication)	Displays the progress and detailed information of full replication. Select Monitor to view monitoring metrics for the full replication process. During full replication, select Flow Control Settings on the monitoring page to limit the rate of writing to the target data source. The unit is MB/s. Select Log to view the execution logs of full replication. Select the icon to view the latest information.
5	Incremental Replication (Displayed in scenarios including incremental replication)	Displays various monitoring metrics for incremental replication. Select Flow Control Settings to limit the rate of writing to the target data source per second. The unit is rows/second. Select Log to view the execution logs of incremental replication. Select the icon to view the latest information.
6	Modify Object	Displays the modification records of the synchronization object. Select Modify Objects to configure the synchronization object. Select the icon to view the latest information.
7	Expand	Displays detailed information of the current replication task, including Type, Replication Objects, Started, etc.

Appendix 1: Data format description

PostgreSQL data replicated to Kafka is stored in JSON. NineData splits source rows into JSON objects, and each JSON object represents one Kafka message.

Full replication phase: The number of PostgreSQL rows stored in a single message is determined by the `message.max.bytes` parameter, which is the maximum message size allowed in the Kafka cluster. The default value is about `1000000` bytes, or roughly 1 MB (the exact default varies by Kafka version). Adjust this value in the Kafka configuration file to store more PostgreSQL rows in each message. If the value is too large, Kafka may need more memory per message and cluster performance can drop.
Incremental replication phase: A single message stores one PostgreSQL row.
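
The difference between the two phases can be sketched as a batching rule. This is a simplified illustration under stated assumptions, not NineData's implementation: the helper `batch_rows` is hypothetical, and it measures only the serialized rows, ignoring the other fields of the message.

```python
import json
from typing import Dict, Iterable, Iterator, List


def batch_rows(rows: Iterable[Dict], max_bytes: int) -> Iterator[List[Dict]]:
    """Group rows so each batch serializes to at most max_bytes where possible."""
    batch: List[Dict] = []
    for row in rows:
        candidate = batch + [row]
        size = len(json.dumps(candidate).encode("utf-8"))
        if batch and size > max_bytes:
            # Adding this row would exceed the limit, so close the current batch.
            yield batch
            batch = [row]
        else:
            batch = candidate
    if batch:
        yield batch
```

A larger `max_bytes` packs more rows into each batch, which mirrors the effect of raising `message.max.bytes` during full replication. A row that exceeds the limit on its own is still emitted as a batch of one, which in a real cluster would be rejected by Kafka.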

Each JSON object contains the following fields:

Field	Field type	Description	Example
serverId	STRING	Information about the data source that owns the message. Format: `<connection address:port>`.	`"serverId":"47.98.224.21:3307"`
id	LONG	The message record ID. This field increments globally and can be used to identify duplicate message consumption.	`"id":156`
es	INT	Unix timestamp. The meaning depends on the task phase: Full replication phase: Start time of the full replication task. Incremental replication phase: Event time of each source change event.	`"es":1668650385`
ts	INT	Unix timestamp when the message is delivered to Kafka.	`"ts":1668651053`
isDdl	BOOLEAN	Whether the message contains DDL. Values: true: Yes. false: No.	`"isDdl":true`
type	STRING	Operation type. Values: INIT: Full replication. INSERT: INSERT operation. DELETE: DELETE operation. UPDATE: UPDATE operation. DDL: DDL operation.	`"type":"INIT"`
database	STRING	The database to which the data belongs.	`"database":"database_name"`
table	STRING	The table to which the data belongs. If the DDL statement corresponds to a non-table object, this field takes the value `null`.	`"table":"table_name"`
mysqlType	JSON	Reserved field.	`"mysqlType": null`
sqlType	JSON	Source PostgreSQL data types for each field.	`"sqlType": {"id": "NUMBER","shipping_type":"varchar2(50)"}`
pkNames	ARRAY	Primary key names for the record. Values: If the record is a DDL record, the value is null. If the record is an INIT or DML record, the value is the primary key name of the record.	`"pkNames": ["id", "uid"]`
data	ARRAY[JSON]	PostgreSQL data delivered to Kafka, stored as a JSON array. Full replication scenario (type = INIT): Stores full data delivered from PostgreSQL to Kafka. Incremental replication scenario: Stores data change details. INSERT: Values inserted into each field. UPDATE: Values after the update operation. DELETE: Values deleted from each field. DDL: Table structure after the DDL operation.	`"data": [{ "name": "jl", "phone": "(737)1234787", "email": "caicai@yahoo.edu", "address": "zhejiang", "country": "china" }]`
old	ARRAY[JSON]	Previous values for incremental changes from PostgreSQL to Kafka. UPDATE: Values before the update operation. DDL: Table structure before the DDL operation. For other operations, the value is `null`.	`"old": [{ "name": "someone", "phone": "(737)1234787", "email": "someone@example.com", "address": "somewhere", "country": "china" }]`
sql	STRING	If the current message is an incremental DDL operation, this field records the corresponding SQL statement. For other operations, the value is `null`.	`"sql":"create table sbtest1(id int primary key,name varchar(20))"`
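
As a rough illustration of how a consumer might read these fields, the sketch below parses one message, skips duplicates by `id`, and computes the gap between `es` and `ts`. It makes several assumptions: key names follow the Field column of the table, `es` and `ts` are in seconds (as the example values suggest), and the function name `summarize_message` is hypothetical. It operates on a message string only and does not connect to Kafka.

```python
import json
from typing import Dict, Optional, Set


def summarize_message(raw: str, seen_ids: Set[int]) -> Optional[Dict]:
    """Parse one message and return a small summary, or None for a duplicate id."""
    msg = json.loads(raw)
    if msg["id"] in seen_ids:
        return None  # this id was already consumed
    seen_ids.add(msg["id"])
    rows = msg.get("data") or []  # data may be null, so fall back to an empty list
    return {
        "type": msg["type"],
        "target": f'{msg["database"]}.{msg["table"]}',
        "row_count": len(rows),
        # Assumes es and ts are Unix timestamps in seconds.
        "lag_seconds": msg["ts"] - msg["es"],
        "ddl_sql": msg["sql"] if msg.get("isDdl") else None,
    }
```

The caller keeps `seen_ids` across messages. An in-memory set is enough for a demonstration, while a long-running consumer would need a bounded or persistent store.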

Appendix 2: Precheck items

Check item	What NineData checks
Source data source connection check	Checks the gateway status of the source data source, whether the instance is reachable, and whether the username and password are correct.
Target data source connection check	Checks the gateway status of the target data source, whether the instance is reachable, and whether the username and password are correct.
Source database privilege check	Checks whether the source database account has the required permissions.
Target database privilege check	Checks whether the Kafka account has access to the topic.
Target database data existence check	Checks whether the topic already contains data.
Source database wal_level check	Checks whether the source data source `wal_level` is set to `logical`.
Source database max_wal_senders check	Checks whether `max_wal_senders` meets the replication connection requirements.
Source database max_replication_slots check	Checks whether `max_replication_slots` meets the replication slot requirements.

Next steps

Data Replication overview
