Monitoring Dataform assertions with BigQuery and GitHub Actions
If you’ve used Dataform to orchestrate your analytics workflows in Google Cloud, you might have discovered its Assertions feature. While these checks are helpful, Dataform doesn’t currently provide an out-of-the-box way to monitor how often assertions fail or to track their historical performance.
In the dbt world, there are many community-driven tools and UI integrations that display test (or assertion) results, but for Dataform, similar monitoring capabilities aren’t readily available. I decided to explore a quick, custom approach to fill this gap. This article outlines a side project that uses Google Cloud Platform (GCP), BigQuery, and GitHub Actions to collect and analyze Dataform assertion results at scale.
Why monitor Dataform assertions?
Quality checks are only as good as your visibility into them. If your assertions are firing off daily, how many are failing? Which assertions are most prone to errors? Has the average execution time increased over the past few weeks? These are questions you can’t easily answer if there’s no systematic way to collect and review the results.
High-level architecture
This project uses:
- Dataform API: To programmatically retrieve Dataform workflow invocations and their assertions.
- BigQuery: As a central place to store the data and create summary views.
- Python scripts: To orchestrate the entire flow — authenticate to GCP, call the Dataform API, parse the response, and insert into BigQuery.
- GitHub Actions: For CI/CD automation, enabling you to run the monitoring script on-demand or on a schedule.
Key components
1. Data extraction via Dataform API
- Workflow Invocations: Each invocation in Dataform can contain multiple actions, including assertions.
- Actions: We keep only the actions whose name indicates an assertion (i.e., the name contains “assertion”).
This logic lives in the dataform_api.py module, where list_workflow_invocations(…) fetches all runs for a given repository, and query_invocation_actions(…) returns details about the actions.
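For illustration, here is a minimal sketch of what that extraction could look like using the google-cloud-dataform client library. The function names mirror the module described above, but the filtering logic and field access are assumptions, not the project’s exact code.

```python
from google.cloud import dataform_v1beta1

client = dataform_v1beta1.DataformClient()


def list_workflow_invocations(project: str, location: str, repository: str):
    """Yield every workflow invocation recorded for a Dataform repository."""
    parent = f"projects/{project}/locations/{location}/repositories/{repository}"
    request = dataform_v1beta1.ListWorkflowInvocationsRequest(parent=parent)
    # The client returns a pager, so pagination is handled transparently.
    yield from client.list_workflow_invocations(request=request)


def query_invocation_actions(invocation_name: str):
    """Return only the assertion actions of a single workflow invocation."""
    request = dataform_v1beta1.QueryWorkflowInvocationActionsRequest(
        name=invocation_name
    )
    actions = client.query_workflow_invocation_actions(request=request)
    # Dataform typically writes assertions to a dedicated schema (e.g.
    # "dataform_assertions"), so matching "assertion" against the target
    # works; adjust the filter to your own naming conventions.
    return [
        a for a in actions
        if "assertion" in a.target.schema.lower()
        or "assertion" in a.target.name.lower()
    ]
```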
2. Storing results in BigQuery
We leverage the bigquery_client.py module to handle:
- Table creation (if it doesn’t exist).
- Appending new data (DataFrame) to the table, including columns such as Start_Time, End_Time, Action_Name, State, etc.
- View creation: We define two summary views — one for daily stats and another for assertion-level stats. These views are built with simple SQL statements (e.g., CREATE OR REPLACE VIEW …).
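A minimal sketch of that storage layer, assuming the google-cloud-bigquery client and pandas; the table ID, column schema, and view SQL below are illustrative placeholders rather than the project’s actual objects.

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.dataform_monitoring.assertion_results"  # placeholder


def append_results(df: pd.DataFrame) -> None:
    """Create the results table on first run, then append the new rows."""
    schema = [
        bigquery.SchemaField("Start_Time", "TIMESTAMP"),
        bigquery.SchemaField("End_Time", "TIMESTAMP"),
        bigquery.SchemaField("Action_Name", "STRING"),
        bigquery.SchemaField("State", "STRING"),
    ]
    client.create_table(bigquery.Table(TABLE_ID, schema=schema), exists_ok=True)
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
    client.load_table_from_dataframe(df, TABLE_ID, job_config=job_config).result()


def create_daily_view() -> None:
    """(Re)build the daily recap view on top of the raw results table."""
    client.query(f"""
        CREATE OR REPLACE VIEW `my-project.dataform_monitoring.daily_stats` AS
        SELECT
          DATE(Start_Time) AS run_date,
          COUNTIF(State = 'FAILED') AS failed,
          COUNTIF(State = 'SUCCEEDED') AS succeeded,
          AVG(TIMESTAMP_DIFF(End_Time, Start_Time, SECOND)) AS avg_duration_s
        FROM `{TABLE_ID}`
        GROUP BY run_date
    """).result()
```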
Example use cases
- Daily checks: Schedule the workflow every night to gather the latest Dataform runs and store the assertion outcomes (see the workflow sketch after this list). Use the daily recap view for at-a-glance insights.
- Alerts: If you want to get fancy, you can hook up downstream dashboards or notifications to flag spikes in failures or repeated flakiness in a specific assertion.
- Historical trends: Because the data is in BigQuery, it’s straightforward to query how your assertions have evolved over time, including average run duration.
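For the nightly schedule, a hypothetical GitHub Actions workflow could look like the following; the entry point (main.py) and the GCP_SA_KEY secret are placeholders to adapt to your own repository.

```yaml
# .github/workflows/monitor.yml (hypothetical)
name: Monitor Dataform assertions
on:
  schedule:
    - cron: "0 3 * * *"   # every night at 03:00 UTC
  workflow_dispatch:       # allows on-demand runs from the Actions tab
jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - name: Run the monitoring script
        run: |
          pip install -r requirements.txt
          python main.py
```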
Conclusion
Dataform’s assertion feature helps us ensure data quality, but without a clear monitoring layer, it’s easy to lose sight of which checks are passing or failing over time. This Dataform AssGuard project takes a simple, modular approach: automatically gather assertion data and store it in BigQuery for long-term analytics.
And yes, you can find the repo here.
See ya 🤘