Enhancing data quality in BigQuery with custom stored procedures

Michaël Scherding
Feb 13, 2024


Prompt: Envision a serene Japanese garden representing software testing and quality (ok, I know it's nonsense) :)

Introduction

Ensuring data quality is essential in today’s data-driven world, especially in environments like Google BigQuery. While tools like dbt offer robust solutions for data validation, there are situations where custom approaches are necessary.

Stored procedures in BigQuery offer a powerful alternative, allowing for direct data manipulation and quality checks. This article explores how stored procedures can manage data quality by moving problematic records to a reject table, offering a hands-on example for those instances where traditional tools may not suffice. Additionally, we’ll touch on integrating these procedures with Terraform to automate and manage them efficiently, providing a streamlined approach to maintaining high-quality data in BigQuery.

Understanding stored procedures

Stored procedures in BigQuery offer a direct and efficient method for managing data quality. They encapsulate SQL commands for repeated use, automating complex validation tasks and custom logic tailored to specific quality requirements.

This capability ensures consistency, reduces manual errors, and streamlines data processing workflows. By integrating closely with BigQuery, stored procedures execute within the same environment as the data, enhancing performance and leveraging platform-specific features for error handling and logging.

This approach not only automates but also customizes data validation, providing a powerful alternative to external tools for ensuring data integrity and reliability.
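As a minimal illustration (the project, dataset, and table names here are placeholders), a procedure is defined once and then reused anywhere with a simple CALL:

```sql
-- Illustrative only: count NULLs in a given column of a given table.
CREATE OR REPLACE PROCEDURE `my_project.dq_utils.count_nulls`(
  target_table STRING,   -- fully qualified as "dataset.table"
  column_name  STRING
)
BEGIN
  -- Dynamic SQL lets one procedure serve any table/column combination.
  EXECUTE IMMEDIATE FORMAT(
    "SELECT COUNT(*) AS null_rows FROM `%s` WHERE %s IS NULL",
    target_table, column_name);
END;

-- Reuse the same logic anywhere with a single CALL.
CALL `my_project.dq_utils.count_nulls`('sales.orders', 'customer_id');
```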

Case study: a custom procedure for null values

Ensuring that critical fields in your datasets, such as customer_id, are complete and accurate is vital for data analysis and business operations. Null values in these fields can lead to inaccurate reports, faulty analyses, and missed insights. Manually checking and correcting these issues can be time-consuming and error-prone, emphasizing the need for an automated solution.

Objective

The primary goal of this stored procedure is to automate the identification and segregation of records with null values in a specified column (customer_id in this example) into a separate reject table. This process facilitates further review and corrective action by data stewards, ensuring that the primary dataset remains pristine and reliable for analysis.

Please find the code below:
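What follows is a minimal sketch of such a procedure; the reject table name (reject_table), its column layout, and the project/dataset qualifiers are illustrative assumptions:

```sql
CREATE OR REPLACE PROCEDURE `my_project.dq_utils.check_and_move_nulls_to_reject`(
  dataset_name   STRING,
  table_name     STRING,
  column_name    STRING,
  procedure_name STRING,
  delete_rows    BOOL
)
BEGIN
  DECLARE project_id STRING DEFAULT @@project_id;

  -- 1. Create the reject table (once) to hold quarantined rows.
  EXECUTE IMMEDIATE FORMAT("""
    CREATE TABLE IF NOT EXISTS `%s.reject_table` (
      project_name   STRING,
      dataset_name   STRING,
      table_name     STRING,
      procedure_name STRING,
      column_name    STRING,
      rejected_at    TIMESTAMP,
      row_content    JSON
    )""", dataset_name);

  -- 2. Copy every row where the tested column is NULL into the reject table,
  --    keeping a JSON snapshot of the full row for later review.
  EXECUTE IMMEDIATE FORMAT("""
    INSERT INTO `%s.reject_table`
    SELECT '%s', '%s', '%s', '%s', '%s', CURRENT_TIMESTAMP(), TO_JSON(t)
    FROM `%s.%s` AS t
    WHERE t.%s IS NULL""",
    dataset_name,
    project_id, dataset_name, table_name, procedure_name, column_name,
    dataset_name, table_name, column_name);

  -- 3. Optionally remove the offending rows from the source table.
  IF delete_rows THEN
    EXECUTE IMMEDIATE FORMAT(
      "DELETE FROM `%s.%s` WHERE %s IS NULL",
      dataset_name, table_name, column_name);
  END IF;
END;
```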

Implementation

The stored procedure, named check_and_move_nulls_to_reject, dynamically addresses the issue of null values in any specified column of a table. Here's a breakdown of its operation:

  1. Parameter acceptance: It starts by accepting parameters that define the dataset name, source table name, column to check for null values, and procedure name. An optional parameter allows for the deletion of null-containing rows from the source table.
  2. Reject table creation: The procedure then attempts to create a reject table if it doesn’t already exist. This table is designed to capture details about the rejected rows, including the project name, dataset name, table name, procedure name, tested column name, rejection timestamp, and a JSON representation of the data.
  3. Data segregation: Next, it executes a query to insert into the reject table any rows from the source table where the specified column contains null values. This step ensures that all records failing the data quality check are captured for review.
  4. Optional row deletion: Finally, if the delete_rows parameter is set to TRUE, the procedure also removes these null-containing rows from the source table, helping maintain the cleanliness and integrity of the dataset.

Calling from the BigQuery UI:

In the BigQuery Console, use the CALL statement with the necessary parameters. Here’s a simplified example:
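A hypothetical call, reusing the illustrative names from the sketch above (replace the project, dataset, table, and column names with your own):

```sql
-- Quarantine rows with a NULL customer_id from sales.orders
-- and delete them from the source table (delete_rows = TRUE).
CALL `my_project.dq_utils.check_and_move_nulls_to_reject`(
  'sales',              -- dataset_name
  'orders',             -- table_name
  'customer_id',        -- column_name
  'null_check_orders',  -- procedure_name, recorded in the reject table
  TRUE                  -- delete_rows
);
```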

Terraform integration

Integrating Terraform with BigQuery for managing stored procedures offers a structured, scalable approach to data quality management. Terraform, as an Infrastructure as Code (IaC) tool, allows you to define stored procedures and their related resources in code. This approach ensures modularity, allowing for reusable and easily manageable components.
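As a sketch of what this could look like with the Google provider's google_bigquery_routine resource (the dataset ID, routine ID, and SQL file path are illustrative assumptions):

```hcl
# Sketch: managing the stored procedure as code. Names and paths are illustrative.
resource "google_bigquery_dataset" "dq_utils" {
  dataset_id = "dq_utils"
  location   = "EU"
}

resource "google_bigquery_routine" "check_and_move_nulls_to_reject" {
  dataset_id   = google_bigquery_dataset.dq_utils.dataset_id
  routine_id   = "check_and_move_nulls_to_reject"
  routine_type = "PROCEDURE"
  language     = "SQL"

  # The procedure body (the BEGIN ... END block) lives in a versioned .sql file.
  definition_body = file("${path.module}/sql/check_and_move_nulls_to_reject.sql")

  arguments {
    name      = "dataset_name"
    data_type = jsonencode({ typeKind = "STRING" })
  }
  arguments {
    name      = "table_name"
    data_type = jsonencode({ typeKind = "STRING" })
  }
  arguments {
    name      = "column_name"
    data_type = jsonencode({ typeKind = "STRING" })
  }
  arguments {
    name      = "procedure_name"
    data_type = jsonencode({ typeKind = "STRING" })
  }
  arguments {
    name      = "delete_rows"
    data_type = jsonencode({ typeKind = "BOOL" })
  }
}
```

Keeping the SQL body in its own file keeps the Terraform configuration readable and lets the procedure be reviewed and versioned like any other code.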

Key benefits:

  • Modularity: Terraform enables a modular approach, where each stored procedure and its dependencies are encapsulated in separate blocks, making them easy to manage and deploy across different environments.
  • Consistency: Using Terraform guarantees that stored procedures are deployed consistently, reducing errors and discrepancies that manual deployments might introduce.
  • Version control: Storing Terraform configurations in a version control system facilitates collaboration, change tracking, and ensures that any modifications are reviewed and documented, maintaining the integrity of your data quality management process.
  • Automation: Terraform automates the deployment and management of stored procedures, significantly reducing the effort required to maintain data quality and allowing for quick updates or rollbacks as needed.

By adopting Terraform for stored procedure management in BigQuery, you create a stable, controlled environment where changes are predictable, documented, and easily reversible, ensuring that your data quality checks remain robust and reliable over time.

Conclusion

Leveraging stored procedures in BigQuery offers a powerful mechanism for automating and customizing data quality checks, ensuring the integrity and reliability of your data. Through a practical example, the check_and_move_nulls_to_reject procedure, we've seen how targeted data quality management can be implemented directly within your data warehouse, offering a proactive approach to maintaining pristine datasets.

The integration of Terraform further enhances this process by providing a structured, scalable framework for managing these stored procedures. Terraform’s Infrastructure as Code (IaC) capabilities ensure modularity, consistency, and version control, streamlining the deployment and management of your data quality solutions. This approach not only saves time and reduces the likelihood of human error but also fosters collaboration and stability across your data engineering and DevOps teams.

Adopting these practices allows organizations to build a robust data quality management framework that is both flexible and resilient. As data environments become increasingly complex and critical to business operations, the importance of such scalable, reliable solutions cannot be overstated. By embracing these advanced techniques, your team can ensure that your data remains accurate, trustworthy, and ready to drive informed decision-making across your organization.

See ya 🤘
