Postgres CDC with Airbyte on AWS RDS
Before starting, You Need to Understand
what is the CDC?
it's change data capture technic for not full-sync. imagine you have 1 Tera byte of DB data. and you want sync to another destination like BigQuery or S3 or anywhere you want.
if you full-sync, you nee to copy 1 Tera each time. it's very slow and inefficient
Physical Replication & Logical Replication
setting CDC is Logical Replication
for that, you need to change the configuration of PostgreSQL logical_replication
to 1
after that, you can be set like this
Write-Ahead Log
THIS IS MOST IMPORTANT IF YOU DO NOT READ THIS CAREFULLY YOUR DB COULD SHUTDOWN
after you make cdc_slot and set identity, the publication
DB engine will make WAL(write-ahead log) for you UNTIL YOU CONSUME.
NO MATTER DB RESTARTS OR NOT, THEY WILL KEEP MAKE AND MAINTAIN WAL
so if you do not consume this, you should remove the slot.
Otherwise, the DB disk will be filled with WAL, and DB will shut down soon.
Max Wal Senders
max_wal_senders is also a PostgreSQL configuration parameter.
It controls the maximum consumer of DB for replication. The default on RDS is 35. so you may not need to modify.
Identity, Publication, Slot
Identity
you should set identify
To detect rows added or deleted, update
Publication
publication is a set of identities. You have to mention what table you will publish.
slot
the slot is like a ticket for access to the publication
CDC clients (such as airbyte) will decide what DB or what publication they subscribe
airbyte & bigquery setup tip
when multiple databases are set to big query I recommend making connections N times for prefix name
becuase if you just connect A,B,C database to bigquery B destination there could be conflict DB name.
so set N times according to how much you have a slot,
also, make destination N times for prefix (if you want distinguish)
Check List
# install Postgres on Mac
reference: psql: command not found Mac
download PostgreSQL https://postgresapp.com/downloads.html
add this to your
.zprofile
Postgres CDC with Airbyte on AWS RDS
Airbyte is an open-source data movement infrastructure for building extract and load (EL) data pipelines. It is designed for versatility, scalability, and ease-of-use.
Many common databases support writing all record changes to log files for the purpose of replication. A consumer of these log files (such as Airbyte) can read these logs while keeping track of the current position within the logs in order to read all record changes coming from
DELETE
/INSERT
/UPDATE
statements.
Quick StartHere is an outline of the minimum required steps to configure a Postgres connector:
Create a dedicated read-only Postgres user with permissions for replicating data
Create a new Postgres source in the Airbyte UI using
xmin
system column(Airbyte Cloud Only) Allow inbound traffic from Airbyte IPs
These are the additional steps required (after following the quick start) to configure your Postgres source using CDC:
Provide additional
REPLICATION
permissions to read-only userEnable logical replication on your Postgres database
Create a replication slot on your Postgres database
Create publication and replication identities for each Postgres table
Enable CDC replication in the Airbyte UI
access DB use account
master_user_do_not_use
on AWSrun
check schema
SELECT schema_name FROM information_schema.schemata;
grant read-only for all schema
> you have to repeat for all schema you also may not able to GRAND, because default setting already gives you 'USAGE', 'SELECT';
Step 2: Provide additional permissions to read-only user
RDS not working ALTER REPLICATION!
verify
grant to user
verify
Step 3: Enable logical replication on your Postgres database
if you use native postgreSQL
To enable logical replication on bare metal, VMs (EC2/GCE/etc), or Docker, configure the following parameters in the postgresql.conf file for your Postgres database:
if you use AWS RDS
AWS RDS not able to access postgresql.conf. you should use parameter group

search rds.logical_replication
default value is 0

change to 1

modify your DB parameter group from default
-> <YOUR_PARAMETER_GROUP>


wal_level -> logical_replication
debezium document - PostgreSQL on Amazon RDS
no need to setup max_wal_senders
& max_replication_slots
apply parameter group on RDS
Step 4: Create a replication slot on your Postgres database
verify
Step 5: Create publication and replication identities for each Postgres table (need root user)
1. Add the replication identity (the method of distinguishing between rows) for each table you want to replicate:
Create a shell script (e.g.,
set_replica_identity.sh
):
Make the script executable:
Run the Script:
you need input password twice when run script (I don't know why :D )
2. Create the Postgres publication. You should include all tables you want to replicate as part of the publication:
For all (RECOMMENDED)
For Specific Table (NOT RECOMMENDED)
Create a shell script (e.g., create_publication.sh
):
create_publication.sh
):verify
when you want to add CDC table automatically
(create database not working)
Step-1 grand more permission (RUN SQL!)
this script based on you made
custom_admin_role
and use user nameairbyte_cdc
Last updated
Was this helpful?