Data is an integral part of most organisations — they rely on data to understand market trends and customer preferences, for example. These insights in turn drive important business decisions. But there is a caveat.
If you are already familiar with the concept of Change Data Capture and its use cases, you can skip to part 2 of this blog series, where we run a demo pipeline.
Organisations often have data arriving and subsequently being stored in different systems: databases, data lakes, data warehouses, etc. This creates a variety of needs, depending on the use case. For example:
- Where data consistency across systems is crucial, an organisation needs to ensure that its data sources do not drift out of sync.
- For a more comprehensive analysis, an organisation might need to integrate data from various sources.
- Lastly, data might first arrive in a system that is not optimally equipped to handle it. For example, high-volume transaction data might first land in a traditional database not designed for real-time analytics.
All three scenarios require a way to move data from a source system to a destination system, sometimes with a data transformation step in between. Think data pipelines.
Traditionally, organisations have been able to move data between systems via batch data replication — collecting data changes that occur over a period (like a week) and then replicating all those changes to another system in one large group at a specific time (usually during off-peak hours).
Beyond the strain that batch data replication puts on compute resources, it is simply not ideal for scenarios where data needs to be replicated in real time.
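To make the real-time limitation concrete, here is a minimal sketch of batch replication. It uses plain Python dictionaries as stand-ins for the source and destination systems; the names `write`, `run_batch` and `pending` are hypothetical and exist only for this illustration.

```python
source = {}        # stand-in for the source system
destination = {}   # stand-in for the destination system
pending = []       # changes accumulated since the last batch run


def write(key, value):
    """Write to the source and remember the change for the next batch."""
    source[key] = value
    pending.append((key, value))


def run_batch(pending_changes, target):
    """Copy every accumulated change in one go, then clear the backlog."""
    copied = len(pending_changes)
    for key, value in pending_changes:
        target[key] = value
    pending_changes.clear()
    return copied


write("order-1", "new")
write("order-2", "new")

# Until the batch job runs, the destination lags behind the source.
stale_before_batch = destination != source

copied = run_batch(pending, destination)
in_sync_after_batch = destination == source
```

Between two batch runs the destination simply does not know about new writes, which is the gap that the approaches discussed next try to close.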
One remedy for the real-time limitation of batch data replication is dual writes — writing the same data to multiple destinations, simultaneously. For example, you can write the same data to your PostgreSQL database, Redis server and Elasticsearch cluster, concurrently.
While dual writes can work, they have limitations:
- For example, what do you do when you've already written new data to your PostgreSQL database and Redis server but your Elasticsearch cluster is unavailable? Buffer the data and keep retrying until it's available? For how long?
- There is also no guarantee that the order in which entries are made in your PostgreSQL database would be preserved in Redis and Elasticsearch.
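The partial-failure problem described above can be sketched in a few lines. This is an illustration only: `Store` is a hypothetical in-memory stand-in, not a real PostgreSQL, Redis or Elasticsearch client, and `dual_write` is an assumed helper name.

```python
class Store:
    """In-memory stand-in for a real system such as a database or search index."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.rows = {}

    def write(self, key, value):
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.rows[key] = value


def dual_write(stores, key, value):
    """Write the same record to every store; return the names of stores that failed."""
    failed = []
    for store in stores:
        try:
            store.write(key, value)
        except ConnectionError:
            failed.append(store.name)
    return failed


postgres = Store("postgres")
redis = Store("redis")
elasticsearch = Store("elasticsearch", available=False)

failed = dual_write([postgres, redis, elasticsearch], "order-1", {"status": "paid"})
```

After this call, two stores hold `order-1` and one does not. The application itself now has to decide whether to retry, buffer, or roll back, which is exactly the coupling that makes dual writes hard to get right.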
Essentially, it's hard to replicate data in a decoupled, consistent and efficient manner with dual writes. We need a data replication mechanism that not only replicates data in real time, but also does so in a more decoupled, consistent and efficient way…
Enter Change Data Capture
Change Data Capture (CDC) has gained popularity as an alternative to batch data replication. It is a way to automatically identify and capture changes (events) made in a data source (often, a database) as they happen. These changes can include new entries, updates, or deletions that are then propagated in real-time, often through an event streaming platform, to downstream components, ranging from other databases to consumer applications that process them.
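As a rough sketch of what a change event might carry and how a downstream component could apply it, consider the following. The `ChangeEvent` fields and the `apply_event` helper are assumptions made for this illustration; real CDC tools define their own event formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str                # "insert", "update" or "delete"
    table: str             # table the change happened in
    key: str               # primary key of the affected row
    after: Optional[dict]  # row state after the change; None for deletes


def apply_event(replica: dict, event: ChangeEvent) -> None:
    """Apply one change event to a downstream copy keyed by (table, key)."""
    if event.op == "delete":
        replica.pop((event.table, event.key), None)
    else:
        # Inserts and updates both overwrite the row with its latest state.
        replica[(event.table, event.key)] = event.after


events = [
    ChangeEvent("insert", "orders", "1", {"status": "new"}),
    ChangeEvent("update", "orders", "1", {"status": "paid"}),
    ChangeEvent("insert", "orders", "2", {"status": "new"}),
    ChangeEvent("delete", "orders", "2", None),
]

replica = {}
for event in events:
    apply_event(replica, event)
```

Replaying the events in order leaves the replica with a single row, order `1` in the `paid` state, mirroring what the source would hold after the same four changes.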
The illustration below describes the different components of a typical CDC pipeline.
As highlighted in the illustration above, one of the key components of a CDC pipeline is the CDC mechanism. While there are different ways of implementing a CDC mechanism, the log-based approach is commonly adopted. In this approach, the CDC mechanism hooks into the database's transaction log, captures new records as they are appended to the log, and streams these records as change events to a streaming platform such as RabbitMQ Streams or Apache Kafka.
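The core idea of the log-based approach, reading only what has been appended to the log since the last read, can be sketched as follows. This is a simplified model under stated assumptions: `TransactionLog` is a hypothetical append-only list rather than a real database log, and `LogBasedCapture` polls it, whereas real mechanisms typically use the database's own replication interfaces.

```python
class TransactionLog:
    """Append-only list standing in for a database transaction log."""

    def __init__(self):
        self.entries = []

    def append(self, op, key, value=None):
        self.entries.append({"op": op, "key": key, "value": value})


class LogBasedCapture:
    """Reads log entries it has not seen yet and publishes them as change events."""

    def __init__(self, log, publish):
        self.log = log
        self.publish = publish
        self.position = 0  # index of the next log entry to read

    def poll(self):
        new_entries = self.log.entries[self.position:]
        for offset, entry in enumerate(new_entries, start=self.position):
            # Tag each event with its log position so consumers can order them.
            self.publish({"position": offset, **entry})
        self.position += len(new_entries)
        return len(new_entries)


stream = []  # stand-in for a streaming platform
log = TransactionLog()
capture = LogBasedCapture(log, stream.append)

log.append("insert", "order-1", {"status": "new"})
log.append("update", "order-1", {"status": "paid"})
first = capture.poll()   # picks up both entries
second = capture.poll()  # nothing new since the last poll

log.append("delete", "order-1")
third = capture.poll()   # picks up only the delete
```

Because the capture component tracks its position in the log, the application writing to the database does not need to know that anything is listening.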
If we revisit our example on dual writes and view it from the lens of CDC, then things would look a little different.
The CDC mechanism would hook into PostgreSQL's transaction log and stream new changes to a streaming platform. Two consumers would then subscribe to the streaming platform: one consumer would be responsible for updating Elasticsearch, while the other would update Redis. The illustration below demonstrates this new arrangement.
The new arrangement above has some benefits:
- It is a more loosely coupled approach: one downstream service being unavailable does not block writes to the database or updates to the other service
- It has a stronger potential for preserving the order of entries: because both consumers read from the same ordered stream, Elasticsearch and Redis would receive changes in the same order in most cases
- Data is propagated to the downstream components in real time or near real time
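The decoupling and ordering benefits listed above can be illustrated with a small model. It assumes a stream that retains events and consumers that each track their own offset; `Stream` and `Consumer` are hypothetical classes for this sketch, not the API of any particular streaming platform.

```python
class Stream:
    """Ordered, retained list of events standing in for a streaming platform."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)


class Consumer:
    """Reads from the stream at its own pace, tracking its own offset."""

    def __init__(self, name, stream):
        self.name = name
        self.stream = stream
        self.offset = 0
        self.available = True
        self.applied = []  # stand-in for writes to Redis or Elasticsearch

    def consume(self):
        if not self.available:
            return 0  # an outage only pauses this consumer
        pending = self.stream.events[self.offset:]
        self.applied.extend(pending)
        self.offset += len(pending)
        return len(pending)


stream = Stream()
redis_consumer = Consumer("redis", stream)
es_consumer = Consumer("elasticsearch", stream)

es_consumer.available = False  # simulate an Elasticsearch outage
stream.publish("insert order-1")
stream.publish("update order-1")
redis_consumer.consume()  # Redis keeps up despite the outage
es_consumer.consume()     # no-op while unavailable

es_consumer.available = True
stream.publish("delete order-1")
redis_consumer.consume()
es_consumer.consume()     # catches up on all three events, in order
```

In this model the outage delays the Elasticsearch consumer but does not affect the Redis consumer, and once it recovers it applies the same events in the same order.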
Beyond the benefits above, CDC also has some very practical use cases. Let’s review a few examples:
- Microservice integration: As microservice architectures become more widely adopted, CDC presents a way to continuously replicate changes between services in real time and, by extension, keep data consistent across these services.
- Data replication: As hinted earlier, data might first be written to a system that is not optimally equipped to handle it. As a result, we often find ourselves needing to replicate data from one system to another. For example, we might want to replicate data in a relational database to a caching layer like Redis or a search index like Elasticsearch. CDC presents a more consistent, decoupled and efficient alternative to dual writes for writing data to multiple destinations.
- Real-time analytics: CDC can make fresh data available to analytics platforms almost immediately. As data is generated or modified in source systems, CDC captures these changes as they happen, allowing analytics platforms to be continuously updated, so that dashboards and reports reflect current business conditions with minimal delay.
- Event-Driven Architecture: CDC can be used to detect changes in data that act as events triggering downstream processes or workflows, such as an automated customer communication when an order status changes.
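As an illustration of the event-driven use case, the sketch below reacts to an order status change by queuing a customer message. The event layout (`op`, `table`, `key`, `before`, `after`) and the `outbox` list are assumptions made for this example, not the format of any specific CDC tool.

```python
def order_status_changed(event):
    """True when an update event on the orders table changes the status column."""
    if event["table"] != "orders" or event["op"] != "update":
        return False
    return event["before"]["status"] != event["after"]["status"]


def handle(event, outbox):
    """Queue a customer message whenever an order's status changes."""
    if order_status_changed(event):
        outbox.append(f"Order {event['key']} is now {event['after']['status']}")


events = [
    {"op": "insert", "table": "orders", "key": "1",
     "before": None, "after": {"status": "new", "note": ""}},
    # The note changes but the status does not, so no message is queued.
    {"op": "update", "table": "orders", "key": "1",
     "before": {"status": "new", "note": ""},
     "after": {"status": "new", "note": "gift"}},
    # The status changes, so one message is queued.
    {"op": "update", "table": "orders", "key": "1",
     "before": {"status": "new", "note": "gift"},
     "after": {"status": "shipped", "note": "gift"}},
]

outbox = []
for event in events:
    handle(event, outbox)
```

Comparing the row state before and after the change lets the handler react only to the changes it cares about, rather than to every write on the table.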
To bring the concept of CDC to life, in part 2 of this blog series, we will implement a CDC pipeline in an e-commerce scenario. The demo CDC pipeline will highlight our first use case: replicating changes between services for data consistency in a microservice architecture.
We’d be happy to hear from you! Please leave your suggestions, questions, or feedback in the comment section or get in touch with us at contact@cloudamqp.com