Jump to Content
[go: nahoru, domu]

Data Analytics

Simplify historical data tracking in BigQuery with Datastream's append-only CDC

June 22, 2024
Etai Margolin

Product Manager

Yaara Gazit

Software Engineer

Try Gemini 1.5 models

Google's most advanced multimodal models in Vertex AI

Try it

Organizations often grapple with the need to have both an up-to-date “source of truth” as well as the ability to track the complete history of changes within their data. When managing this data within operational databases like MySQL or PostgreSQL, a common approach is to utilize change data capture (CDC) to replicate the changes to a cloud data warehouse such as BigQuery.

Datastream, Google Cloud's serverless CDC service, recently introduced a new feature called append-only mode that streamlines the process of replicating changes from your operational databases to BigQuery. This feature offers an efficient and cost-effective way to maintain historical records and track changes to operational data over time.

Understanding append-only mode

In traditional CDC-based replication, when a record in your source database is updated or deleted, the corresponding record in the destination is overwritten, making it difficult to track the history of changes. Append-only mode addresses this challenge by preserving every change as a new row in your target BigQuery table. Each row includes metadata that captures the type of change (insert, update, or delete), a unique identifier, timestamp, and other relevant information, which can be used to order and filter the data as needed.

Use cases and benefits

Append-only mode is particularly beneficial in scenarios where you need to maintain a historical record of changes. Some common use cases include:

  • Auditing and compliance: Track every modification to data for regulatory compliance or internal audits.
  • Trend analysis: Analyze historical data to identify patterns, trends, and anomalies over time.
  • Customer 360: Maintain a comprehensive view of customer interactions and preferences by tracking changes in customer data.
  • Analyzing embedding drift: With a historical record of embeddings, you can analyze how embeddings have drifted and assess the impact on your model's performance.
  • Time travel: Query your data warehouse as it was at a specific point in time, enabling historical analysis and comparisons.

Example

Suppose you store customer information in a MySQL table and need MySQL to act as your primary source of truth. Your analytics team needs to track changes to customer records to analyze behavior and preferences. With append-only mode activated, all changes to this table, such as inserts, updates, and deletes, will be recorded as new rows in the associated BigQuery table. This simplifies the process for retrieving the necessary data for analysis by the data analytics team.

Benefits of append-only mode

  • Cost efficiency: Reduces processing costs by only appending new rows instead of applying complex merge operations with existing data.

  • Improved data accuracy: Ensures a complete and accurate history of changes, minimizing the risk of data loss.

  • Real-time insights: Enables real-time analysis of changes as they occur, facilitating faster decision-making.

How to use append-only mode

You can easily enable append-only mode while creating the stream in the user interface or via the API. Datastream automatically generates BigQuery tables with the required metadata columns to allow you to monitor modifications.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/image1_OF6Bcfh.gif

Summary

Datastream's append-only mode is a valuable addition to Google Cloud's data replication and integration capabilities. By simplifying change data capture and providing a complete history of changes, it empowers businesses to gain deeper insights from their data, improve data accuracy, and streamline their data pipelines. To learn more, visit the documentation.

Posted in