5 Engineering Lessons from Replicating Amazon RDS Postgres
Replicating a managed database service like Amazon RDS for PostgreSQL is not as simple as pointing pg_dump at it and calling it a day. The managed environment, designed for stability and security, imposes a unique set of constraints that require non-trivial engineering solutions. For the past three months, since SerenDB's founding in September 2025, we've been working 12-hour days to build a high-performance, open-source database replication tool for SerenAI agentic backend services. In this post, we'll distill our experience into five key technical takeaways for engineers working with Amazon's managed RDS for PostgreSQL service.
Lesson 1: State Dumps Require a Compatibility Layer
Standard, trusty tooling like pg_dumpall generates a complete snapshot of a self-hosted cluster, but that snapshot is incompatible with Amazon RDS. Attempting to restore it verbatim fails because RDS restricts superuser-only commands (ALTER ROLE ... SUPERUSER), grants of privileged roles (GRANT pg_checkpoint), and modifications to certain GUCs (ALTER ROLE ... SET log_statement).
The engineering challenge is to transform this incompatible state dump into a portable format. We solved this by building a multi-pass sanitization pipeline that parses the SQL dump and comments out non-portable commands. It’s not just a simple filter; it's a state-aware parser that has to understand context.
// From: src/migration/dump.rs

// This is one of several sanitization passes. It specifically targets GRANT
// statements for default roles that are restricted on RDS, and also handles
// cases where the grantor is an internal RDS admin role.

pub fn remove_restricted_role_grants(path: &str) -> Result<()> {
    // A manually curated list of roles that RDS prohibits granting.
    const RESTRICTED_ROLES: &[&str] = &[
        "pg_checkpoint", "pg_read_all_data", "pg_write_all_data",
        // ... and 11 more restricted roles
    ];

    // Internal roles that cannot act as grantors on other systems.
    const RESTRICTED_GRANTORS: &[&str] = &["rdsadmin", "rds_superuser"];

    // The implementation iterates through the file, checking each line against
    // the blocklists and commenting out matches, while preserving valid statements.
    // ...
}
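The elided body boils down to a line-oriented filter over the dump. As a simplified sketch of the idea (not the exact implementation, which also tracks statements spanning multiple lines), one pass could look like this:

// A simplified, hypothetical version of a single pass: comment out any GRANT of
// a restricted role, or any GRANT issued by an internal RDS grantor, and leave
// everything else untouched.
fn comment_out_restricted_grants(dump: &str) -> String {
    const RESTRICTED_ROLES: &[&str] = &["pg_checkpoint", "pg_read_all_data", "pg_write_all_data"];
    const RESTRICTED_GRANTORS: &[&str] = &["rdsadmin", "rds_superuser"];

    let mut out = String::with_capacity(dump.len());
    for line in dump.lines() {
        let lower = line.trim().to_ascii_lowercase();
        let is_grant = lower.starts_with("grant ");
        let restricted_role = RESTRICTED_ROLES.iter().any(|role| lower.contains(role));
        let restricted_grantor = RESTRICTED_GRANTORS
            .iter()
            .any(|grantor| lower.contains(&format!("granted by {grantor}")));

        if is_grant && (restricted_role || restricted_grantor) {
            // Neutralize the statement but keep it visible for auditability.
            out.push_str("-- ");
        }
        out.push_str(line);
        out.push('\n');
    }
    out
}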
This approach is essentially a compatibility layer for database state, allowing us to treat AWS RDS as just another PostgreSQL instance, despite its underlying limitations.
Lesson 2: Statically-Linked TLS is a Deployment Superpower
AWS RDS enforces SSL/TLS, making the choice of a TLS library critical. We started with native-tls, which dynamically links against the host system's OpenSSL libraries. This created immediate CI/CD and deployment headaches: builds would fail on runners that didn't have a compatible version of OpenSSL installed, and it complicated producing portable, statically-linked binaries for distribution.
We migrated to rustls for two primary reasons:
- Build Portability: rustls is a pure Rust implementation, which allowed us to compile a single, dependency-free binary that runs consistently across different Linux distributions and container environments (e.g., Alpine, Debian).
- Security and Control: rustls provides memory safety guarantees and offers a more explicit, verifiable API for certificate handling, which is crucial when dealing with customer data.
// From: src/postgres/connection.rs

// The `rustls` configuration is more verbose, but it's explicit and portable.
// We explicitly build our root store from the roots shipped in `webpki-roots`.
let mut root_store = RootCertStore::empty();
root_store.extend(webpki_roots::TLS_SERVER_ROOTS.iter().cloned());

let mut client_config = ClientConfig::builder()
    .with_root_certificates(root_store)
    .with_no_client_auth();

// Note the explicit use of the `dangerous()` API to allow self-signed certs in
// test environments. This makes the security trade-off obvious in the code.
if allow_self_signed {
    client_config
        .dangerous()
        .set_certificate_verifier(Arc::new(DangerAcceptInvalidCerts));
}

let tls = MakeRustlsConnect::new(client_config);
let (client, connection) = tokio_postgres::connect(&connection_string, tls).await?;
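The DangerAcceptInvalidCerts type referenced above is our own verifier; rustls deliberately does not ship one, so the opt-out has to be spelled out by hand. A minimal sketch of such a verifier, assuming the rustls 0.23 danger API (our actual implementation may differ in details), looks like this:

use rustls::client::danger::{HandshakeSignatureValid, ServerCertVerified, ServerCertVerifier};
use rustls::pki_types::{CertificateDer, ServerName, UnixTime};
use rustls::{DigitallySignedStruct, Error, SignatureScheme};

// Accepts any server certificate. Only installed when `allow_self_signed` is set,
// and only intended for test environments.
#[derive(Debug)]
struct DangerAcceptInvalidCerts;

impl ServerCertVerifier for DangerAcceptInvalidCerts {
    fn verify_server_cert(
        &self,
        _end_entity: &CertificateDer<'_>,
        _intermediates: &[CertificateDer<'_>],
        _server_name: &ServerName<'_>,
        _ocsp_response: &[u8],
        _now: UnixTime,
    ) -> Result<ServerCertVerified, Error> {
        // No verification at all: this is exactly what the "dangerous" API exists to flag.
        Ok(ServerCertVerified::assertion())
    }

    fn verify_tls12_signature(
        &self,
        _message: &[u8],
        _cert: &CertificateDer<'_>,
        _dss: &DigitallySignedStruct,
    ) -> Result<HandshakeSignatureValid, Error> {
        Ok(HandshakeSignatureValid::assertion())
    }

    fn verify_tls13_signature(
        &self,
        _message: &[u8],
        _cert: &CertificateDer<'_>,
        _dss: &DigitallySignedStruct,
    ) -> Result<HandshakeSignatureValid, Error> {
        Ok(HandshakeSignatureValid::assertion())
    }

    fn supported_verify_schemes(&self) -> Vec<SignatureScheme> {
        vec![
            SignatureScheme::RSA_PKCS1_SHA256,
            SignatureScheme::RSA_PSS_SHA256,
            SignatureScheme::ECDSA_NISTP256_SHA256,
            SignatureScheme::ED25519,
        ]
    }
}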
Lesson 3: The Network is Unreliable (Especially in the AWS Cloud)
Long-running database replications often involve periods where the connection is idle from the client's perspective. On AWS, network components such as Elastic Load Balancers (and even some NAT gateways) enforce idle connection timeouts (e.g., 350 seconds for an NLB). These services will silently drop idle TCP connections, causing the replication to fail with a cryptic "connection reset by peer" error hours into the process.
The solution is to operate under the assumption that the network is unreliable and proactively keep the connection alive. We implemented TCP keepalives directly in our connection logic. This is a network-layer fix for a problem that manifests as a database-layer failure. Keep-alives save lives!
// From: src/postgres/connection.rs

/// Automatically adds keepalive parameters to a PostgreSQL connection string
/// to prevent idle connection timeouts from AWS network infrastructure.
pub fn add_keepalive_params(connection_string: &str) -> String {
    // ... function checks if keepalive params already exist ...
    let mut params = Vec::new();
    // Enable keepalives at the TCP level.
    if needs_keepalives { params.push("keepalives=1"); }
    // Send the first probe after 60 seconds of inactivity.
    if needs_idle { params.push("keepalives_idle=60"); }
    // Send subsequent probes every 10 seconds.
    if needs_interval { params.push("keepalives_interval=10"); }
    // ... function appends params to the connection string ...
}
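The helper's contract is easiest to see from a usage sketch. This test is illustrative rather than copied from our suite, and the hostname is made up:

#[cfg(test)]
mod keepalive_tests {
    use super::add_keepalive_params;

    #[test]
    fn appends_keepalive_params_when_absent() {
        let raw = "host=example.rds.amazonaws.com port=5432 user=replicator dbname=prod";
        let patched = add_keepalive_params(raw);

        // TCP keepalives are enabled and tuned well below the ~350-second idle
        // timeouts enforced by AWS NLBs and NAT gateways.
        assert!(patched.contains("keepalives=1"));
        assert!(patched.contains("keepalives_idle=60"));
        assert!(patched.contains("keepalives_interval=10"));
    }
}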
Lesson 4: Abstract Away Cloud-Specific Error Noise
Error handling is about creating useful abstractions. A raw tokio-postgres error might tell you Connection refused, but it won't tell you why. In an AWS RDS context, the "why" is often specific to the cloud environment: a misconfigured security group, a bad AWS IAM policy, or connecting to the wrong endpoint.
We built a diagnostic layer on top of the raw database driver errors. This layer inspects the error message and provides actionable, RDS-specific advice. This transforms a generic network error into a specific, solvable problem for the user.
// From: src/postgres/connection.rs

// This is a snippet from our error mapping logic.
.map_err(|e| {
    let error_msg = e.to_string();

    if error_msg.contains("no pg_hba.conf entry") {
        anyhow::anyhow!(
            "Access denied: No pg_hba.conf entry for host.\n\n\
             On AWS RDS, this often means your security group is blocking the connection, \
             or you are not connecting to the private IP from within the VPC."
        )
    } else if error_msg.contains("Connection refused") {
        anyhow::anyhow!(
            "Connection refused: Unable to reach database server.\n\n\
             Please check:\n\
             - The RDS instance endpoint and port are correct.\n\
             - The instance's Security Group allows inbound traffic from your IP.\n\
             - The instance is in a public subnet or you are connecting from within the VPC."
        )
    } else {
        // Fallback for other errors
        anyhow::anyhow!("Failed to connect to database: {}", error_msg)
    }
})?;
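Matching on message text keeps the mapping simple, though it is somewhat brittle across driver versions. Where the failure surfaces as a genuine Postgres error, the SQLSTATE code offers a steadier key; here's a hedged sketch of that variant (not what ships today):

use tokio_postgres::error::SqlState;

// A variant of the same idea keyed on SQLSTATE instead of message text.
// SQLSTATE 28000 (invalid_authorization_specification) is what Postgres raises
// for pg_hba.conf rejections, which on RDS usually points at security groups
// or connecting from outside the VPC.
fn rds_hint(e: &tokio_postgres::Error) -> Option<&'static str> {
    let db_err = e.as_db_error()?;
    if db_err.code() == &SqlState::INVALID_AUTHORIZATION_SPECIFICATION {
        Some("Check the RDS security group and whether you are connecting from inside the VPC.")
    } else {
        None
    }
}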
Lesson 5: Reverse-Engineering the Managed Service Internals of AWS RDS
As we dug deeper, we realized that simply filtering a fixed list of commands wasn't enough: every time we patched one incompatibility, another exception would surface. RDS has its own internal metadata and objects that are not part of standard PostgreSQL, and these need to be identified and handled with surgical precision.
This is essentially a process of reverse-engineering. For example, we discovered that RDS uses its own internal tablespaces like rds_temp_tablespace, which don't exist on other PostgreSQL instances. We also found that pg_dumpall would try to include the internal rdsadmin database. Our sanitization logic had to be extended to pattern-match and exclude these RDS-specific constructs.
// From: src/migration/dump.rs

/// Comments out tablespace-related statements, including RDS-specific ones.
pub fn remove_tablespace_statements(path: &str) -> Result<()> {
    // ...
    for line in content.lines() {
        let lower_trimmed = line.trim().to_ascii_lowercase();

        // Pattern-match for various ways RDS might reference its internal tablespaces.
        let references_rds_tablespace = lower_trimmed.contains("'rds_")
            || lower_trimmed.contains("\"rds_")
            || lower_trimmed.contains("tablespace rds_");

        if is_create_tablespace || references_rds_tablespace {
            // Comment out the entire line if it matches.
            updated.push_str("-- ");
            updated.push_str(line);
            updated.push('\n');
            modified = true;
        } else {
            updated.push_str(line);
            updated.push('\n');
        }
    }
    // ...
}
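The rdsadmin exclusion works the same way. A hedged sketch of that pass (our actual implementation differs in the details) could look like this:

// Hypothetical pass: comment out `CREATE DATABASE rdsadmin` and everything in
// the dump's `\connect rdsadmin` section, so the restore never touches it.
pub fn remove_rdsadmin_database(dump: &str) -> String {
    let mut out = String::with_capacity(dump.len());
    let mut in_rdsadmin_section = false;

    for line in dump.lines() {
        let lower = line.trim().to_ascii_lowercase();

        // `\connect` lines mark the start of each per-database section in a
        // pg_dumpall script; track whether we are inside the rdsadmin one.
        if lower.starts_with("\\connect ") {
            in_rdsadmin_section = lower.contains("rdsadmin");
        }

        let creates_rdsadmin = lower.starts_with("create database rdsadmin");

        if in_rdsadmin_section || creates_rdsadmin {
            // Comment the line out rather than dropping it, keeping the dump auditable.
            out.push_str("-- ");
        }
        out.push_str(line);
        out.push('\n');
    }
    out
}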
This is an ongoing effort. With every new RDS version, there's a risk of new internal objects or commands that will require us to update our compatibility layer. We stand at the ready.
Lessons Hard Learned
Working with a managed service like RDS is a constant exercise in navigating abstractions. The key takeaway is that you cannot treat it as a black box. You have to build tools that are aware of the managed environment's specific constraints and behaviors. For us, this meant building a portable, dependency-free binary with robust networking, creating a compatibility layer for database state, and wrapping it all in an abstraction that provides actionable, context-aware diagnostics.
Explore and fork the replication code:
- Database Replication Repository: https://github.com/serendb/database-replicator
- SerenDB Website: https://serendb.com

About Taariq Lewis
Exploring how to make developers faster and more productive with AI agents