Data Replication Methods

Data replication takes data from your source database, business application, API, file storage, etc., and copies it into your destination data warehouse or transactional database. After identifying the data you want to bring in, you need to determine how to replicate it for meeting your business needs.

Choosing the right method

Given that the data replication method you choose will impact your data, we support various replication methods to give you as much flexibility as possible. The table below contains a high-level look at each of Etlworks' Replication Methods and compares their pros and cons.

Method Pros: Cons:

Change Data Capture (CDC)

Uses database redo [transaction] log to track changes in the source.
  • fast
  • no polling from database tables, uses database redo log instead
  • supports deletes
  • supports [almost] real-time replication
  • currently supports Postgres, MySQL, SQL Server, DB2, Oracle, and MongoDB
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database

Change Data Tracking (CT)

Synchronous tracking mechanism, in which the changes information will be available directly once the DML change is committed
  • fast
  • no polling from database tables
  • supports deletes
  • supports [almost] real-time replication
  • Microsoft SQL Server only
  • requires extra setup in the source database

High Watermark

Uses a designated field, typically a TIMESTAMP, to track changes in the source.
  • fast
  • no extra moving parts
  • works for all data sources, including all databases, files, and APIs

Database Triggers

Uses table(s) updated by the database triggers to track changes in the source.
  • works for any source database which has triggers
  • no extra requirements for the specific version of the database or extra field in each table
  • requires adding triggers to all database tables
  • triggers can negatively impact performance

Real-time CDC with Kafka

Polls CDC events from the Kafka topic(s) to track changes in the source.
  • fast
  • no polling from database tables
  • supports deletes
  • supports real-time replication
  • complicated setup (requires Kafka, Zookeeper, Kafka Connect, and Debezium)
  • currently supports Postgres, MySQL, SQL Server, DB2, Oracle, and MongoDB
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database

Full refresh

Always polls the entire dataset from the source.
  • the simplest to setup
  • can be quite fast for the relatively small datasets (<100K records)
  • works for all data sources
  • not recommended for large datasets

Ready to start using Etlworks Integrator?

 

Start Free Trial Request Demo