Automating data quality checks in Google BigQuery

Michaël Scherding
5 min read · Feb 29, 2024


In today’s data-driven world, ensuring the quality of your data is paramount. This article outlines a comprehensive approach to automate data quality checks across multiple tables within a Google Cloud Platform (GCP) project, utilizing BigQuery, Cloud Functions, and Cloud Scheduler.

Overview of the System

The system comprises several key components:

  • Centralized test definition table: Holds configurations for checks.
  • Centralized rejection log table: Logs issues found during checks.
  • Stored Procedures: Perform the data quality checks.
  • Cloud Function (“Executor”): Reads test definitions and triggers checks.
  • Cloud Scheduler Jobs: Scheduled jobs that trigger the Cloud Function.

Step 1: Designing BigQuery tables

Centralized test definition table

Create a BigQuery table with fields that specify the project, dataset, table, and column to check, the type of check (e.g., “check_not_null”), and the check’s schedule.

Schema Example:

  • project_name: STRING
  • dataset_name: STRING
  • table_name: STRING
  • column_name: STRING
  • check_type: STRING
  • schedule: STRING

To support dashboarding on top of the centralized test definition table, so you can track the status and execution times of your data quality checks, add fields that record the last run timestamp and the last outcome of each check. This not only helps you monitor and audit the health of your data, it also makes it easy to spot issues and trends over time (a sample DDL statement follows the list below):

  • last_run: TIMESTAMP - The timestamp of the last time the check was executed.
  • last_status: STRING - The outcome of the last check ("Success", "Failure", or any other relevant status).
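
A possible DDL for this table, assuming the dedicated quality dataset described later in the article (the project name is a placeholder):

CREATE TABLE IF NOT EXISTS `my-project.quality.test_definitions` (
  project_name STRING,
  dataset_name STRING,
  table_name   STRING,
  column_name  STRING,
  check_type   STRING,    -- e.g. "check_not_null"
  schedule     STRING,    -- cron expression used by Cloud Scheduler
  last_run     TIMESTAMP, -- updated after each run
  last_status  STRING     -- "Success", "Failure", ...
);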

Centralized rejection log table

This table will record details of data quality issues.

Schema Example:

  • project_name: STRING
  • dataset_name: STRING
  • table_name: STRING
  • column_name: STRING
  • rejection_reason: STRING
  • rejected_at: TIMESTAMP
  • data: STRING

Storing the full offending row as JSON in the data column is a strategic way to capture the specifics of each data quality issue. It preserves the entire row that failed a quality check and enables detailed monitoring and analysis of rejections by dataset, table, column, and so on.
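
A matching DDL sketch for the rejection log (again assuming a quality dataset and a placeholder project name):

CREATE TABLE IF NOT EXISTS `my-project.quality.rejection_log` (
  project_name     STRING,
  dataset_name     STRING,
  table_name       STRING,
  column_name      STRING,
  rejection_reason STRING,
  rejected_at      TIMESTAMP,
  data             STRING   -- full rejected row serialized as JSON
);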

With the centralized rejection log table, creating a dashboard to monitor data quality becomes straightforward.

  • Rejection over time: Display the number of rejections over time to identify trends or patterns. This can help in spotting when particular changes lead to increased data quality issues.
  • Top offenders: Aggregate rejection data by dataset_name, table_name, or column_name to identify which areas of your data warehouse are most prone to issues (a sample query is sketched after this list). This can help prioritize efforts to improve data quality.
  • Detailed drill-downs: Use the JSON details in the data column to create drill-down reports that allow users to explore the specifics of rejected rows. This can be invaluable for debugging and rectifying source data issues.
  • Rejection reasons breakdown: Analyze the rejection_reason field to understand the most common causes of data quality issues, informing potential areas for improvement in data collection or processing pipelines.
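
As an illustration, the “top offenders” view could be fed by a simple aggregation over the rejection log (the table path is a placeholder):

SELECT
  dataset_name,
  table_name,
  column_name,
  rejection_reason,
  COUNT(*) AS rejections,
  MAX(rejected_at) AS last_seen
FROM `my-project.quality.rejection_log`
WHERE rejected_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY dataset_name, table_name, column_name, rejection_reason
ORDER BY rejections DESC
LIMIT 20;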

Step 2: Implementing stored procedures

Stored procedures in BigQuery allow you to encapsulate complex, reusable logic. A typical check tests a specified column of a table for null values and logs any offending rows to the centralized rejection log.
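
A minimal sketch of such a procedure, assuming the quality dataset layout described later in the article and dynamic SQL via EXECUTE IMMEDIATE (the project name and rejection reason text are illustrative):

-- Check a column for NULLs and log offending rows to the rejection log.
-- Table and column names cannot be bound as query parameters, so the
-- statement is assembled with FORMAT and run via EXECUTE IMMEDIATE.
CREATE OR REPLACE PROCEDURE `my-project.quality.check_not_null`(
  project_name STRING,
  dataset_name STRING,
  table_name STRING,
  column_name STRING
)
BEGIN
  EXECUTE IMMEDIATE FORMAT("""
    INSERT INTO `%s.quality.rejection_log`
      (project_name, dataset_name, table_name, column_name,
       rejection_reason, rejected_at, data)
    SELECT '%s', '%s', '%s', '%s',
           'NULL value found', CURRENT_TIMESTAMP(), TO_JSON_STRING(t)
    FROM `%s.%s.%s` AS t
    WHERE %s IS NULL
  """,
  project_name,
  project_name, dataset_name, table_name, column_name,
  project_name, dataset_name, table_name,
  column_name);
END;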

Step 3: Deploying the scheduling Cloud Function

To deploy a Cloud Function that reads the test definition table and creates the corresponding Cloud Scheduler jobs, you can implement it in Python; Cloud Functions hands the incoming HTTP request to your code as a Flask request object.
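
A minimal sketch of such a schedule_checks function, assuming the functions-framework, google-cloud-bigquery, and google-cloud-scheduler client libraries; PROJECT_ID, LOCATION, EXECUTOR_URL, and the job-naming convention are assumptions, not the article’s exact setup:

import json

import functions_framework
from google.cloud import bigquery, scheduler_v1

PROJECT_ID = "my-project"    # assumed; replace with your project
LOCATION = "europe-west1"    # assumed Cloud Scheduler region
EXECUTOR_URL = "https://<region>-<project>.cloudfunctions.net/executor"  # placeholder


@functions_framework.http
def schedule_checks(request):
    """Read test definitions and create one Cloud Scheduler job per check."""
    bq = bigquery.Client(project=PROJECT_ID)
    scheduler = scheduler_v1.CloudSchedulerClient()
    parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"

    rows = bq.query(
        f"SELECT * FROM `{PROJECT_ID}.quality.test_definitions`"
    ).result()

    created = 0
    for row in rows:
        payload = {
            "project_name": row["project_name"],
            "dataset_name": row["dataset_name"],
            "table_name": row["table_name"],
            "column_name": row["column_name"],
            "check_type": row["check_type"],
        }
        job = scheduler_v1.Job(
            name=f"{parent}/jobs/dq-{row['table_name']}-{row['column_name']}",
            schedule=row["schedule"],  # cron expression from the table
            time_zone="Etc/UTC",
            http_target=scheduler_v1.HttpTarget(
                uri=EXECUTOR_URL,
                http_method=scheduler_v1.HttpMethod.POST,
                headers={"Content-Type": "application/json"},
                body=json.dumps(payload).encode("utf-8"),
            ),
        )
        # Raises AlreadyExists if the job was created before; updating or
        # recreating existing jobs is left out of this sketch.
        scheduler.create_job(parent=parent, job=job)
        created += 1

    return (f"Created {created} scheduler jobs", 200)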

Step 4: Setting up cloud scheduler jobs

While the Cloud Function schedule_checks dynamically manages Cloud Scheduler jobs based on entries in the test definitions table, understanding how these jobs trigger the function is essential. Each job, when executed, makes an HTTP POST request to the Cloud Function endpoint with the necessary parameters extracted from the test definitions. This is illustrated in the Cloud Function code above where each job's HTTP target is constructed.

Step 5: The executor function

The executor function is designed to do the following (a Python sketch follows this list):

  1. Receive parameters from a Cloud Scheduler job via an HTTP POST request.
  2. Parse these parameters to determine the specifics of the data quality check to be performed, such as which table and column to check and what kind of check to perform (e.g., checking for null values).
  3. Call the appropriate stored procedure in BigQuery to perform the specified data quality check.
  4. Handle the response from the stored procedure, including logging any errors or issues detected during the check.
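
A minimal sketch of this executor, assuming it is deployed with functions-framework, that check_type maps directly to the stored procedure name from Step 2, and that the status write-back to test_definitions is part of the design (all project and dataset names are placeholders):

import functions_framework
from google.cloud import bigquery

PROJECT_ID = "my-project"  # assumed; replace with your project


@functions_framework.http
def executor(request):
    """Run the stored procedure matching the parameters sent by Cloud Scheduler."""
    params = request.get_json(silent=True) or {}
    check_type = params["check_type"]   # e.g. "check_not_null"
    project = params["project_name"]
    dataset = params["dataset_name"]
    table = params["table_name"]
    column = params["column_name"]

    bq = bigquery.Client(project=PROJECT_ID)

    # The procedure name is derived from check_type, e.g. quality.check_not_null.
    # In production, validate these values (for instance against
    # test_definitions) before interpolating them into SQL.
    call = (
        f"CALL `{PROJECT_ID}.quality.{check_type}`"
        f"('{project}', '{dataset}', '{table}', '{column}')"
    )

    try:
        bq.query(call).result()  # wait for the procedure to finish
        status = "Success"
    except Exception as exc:
        print(f"Check failed for {project}.{dataset}.{table}.{column}: {exc}")
        status = "Failure"

    # Record the outcome in the test definition table for dashboarding.
    bq.query(
        f"""
        UPDATE `{PROJECT_ID}.quality.test_definitions`
        SET last_run = CURRENT_TIMESTAMP(), last_status = '{status}'
        WHERE project_name = '{project}' AND dataset_name = '{dataset}'
          AND table_name = '{table}' AND column_name = '{column}'
          AND check_type = '{check_type}'
        """
    ).result()

    return (status, 200 if status == "Success" else 500)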

How it will be organized in BigQuery

This data quality control logic is designed to be project-specific within Google Cloud’s BigQuery. Each project will have a dedicated dataset named quality, containing two central elements: a table for test definitions (test_definitions) and a table for logging rejected rows (rejection_log). Additionally, the dataset will host various routines (stored procedures) for performing the quality checks.

The entire setup can be efficiently deployed across one or multiple projects using Terraform, allowing for quick and consistent implementation of data quality testing at the project level. This approach not only standardizes data quality checks across an organization but also simplifies management and scalability of data governance practices.
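
As an example of what that Terraform could look like, here is a sketch of the per-project quality dataset and the test definition table (var.project_id and the location are assumptions; rejection_log would be declared the same way):

resource "google_bigquery_dataset" "quality" {
  project    = var.project_id
  dataset_id = "quality"
  location   = "EU"  # assumed location
}

resource "google_bigquery_table" "test_definitions" {
  project             = var.project_id
  dataset_id          = google_bigquery_dataset.quality.dataset_id
  table_id            = "test_definitions"
  deletion_protection = false

  schema = jsonencode([
    { name = "project_name", type = "STRING" },
    { name = "dataset_name", type = "STRING" },
    { name = "table_name", type = "STRING" },
    { name = "column_name", type = "STRING" },
    { name = "check_type", type = "STRING" },
    { name = "schedule", type = "STRING" },
    { name = "last_run", type = "TIMESTAMP" },
    { name = "last_status", type = "STRING" },
  ])
}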

Conclusion

The executor function plays a pivotal role in the automated data quality control system. It acts as the bridge between the Cloud Scheduler jobs (which schedule and trigger data quality checks) and the BigQuery stored procedures (which perform the actual checks). By dynamically parsing parameters from incoming requests and executing the corresponding stored procedures, this function enables automated, scheduled data quality checks across multiple tables and datasets within a GCP project.

Hope it was fun, see ya 🤟.
