Achieving KPI Consistency with SQL-based Period Over Period Analysis
In this article, we discuss the benefits and drawbacks of using SQL for Period Over Period analysis, including its gains in data consistency and its limits on flexibility.
Setting Up the Example Dataset
We’ll start by creating a fake table and populating it with random data to simulate a sales dataset.
CREATE TABLE looker_scratch.sales_by_ticket (
  store_id INTEGER NOT NULL,
  date DATE NOT NULL,
  ticket_id INTEGER NOT NULL,
  sales_amount NUMERIC NOT NULL
)
PARTITION BY date
CLUSTER BY store_id;
The table sales_by_ticket contains four columns: store_id, date, ticket_id, and sales_amount. We partition the table by date and cluster it by store_id to optimize query performance.
Next, we insert fake data into the table using random values for store_id, date, and sales_amount:
INSERT INTO looker_scratch.sales_by_ticket (store_id, date, ticket_id, sales_amount)
SELECT
  -- Generate random values for store_id between 1 and 10.
  -- FLOOR is needed because CAST rounds to the nearest integer and could otherwise yield 11.
  CAST(FLOOR(RAND() * 10) AS INT64) + 1 AS store_id,
  -- Generate random dates between today and 3 years (1,095 days) ago
  DATE_SUB(CURRENT_DATE(), INTERVAL CAST((RAND() * 1095) AS INT64) DAY) AS date,
  -- Generate unique ticket_id values for each row using ROW_NUMBER()
  CAST(ROW_NUMBER() OVER () AS INT64) AS ticket_id,
  -- Generate random values for sales_amount between 1 and 1000
  CAST(ROUND(RAND() * 999 + 1, 2) AS NUMERIC) AS sales_amount
FROM
  UNNEST(GENERATE_ARRAY(1, 100000)) AS row;
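Because the rows are random, a quick sanity check can confirm that the generated values fall in the intended ranges. The query below is only a sketch; the output column aliases are illustrative and not part of the original setup.

```sql
-- Sanity check on the generated data (aliases are illustrative)
SELECT
  COUNT(*) AS row_count,
  -- Should equal row_count after a single INSERT, since ticket_id comes from ROW_NUMBER()
  COUNT(DISTINCT ticket_id) AS distinct_tickets,
  MIN(store_id) AS min_store_id,
  MAX(store_id) AS max_store_id,
  MIN(date) AS first_date,
  MAX(date) AS last_date,
  MIN(sales_amount) AS min_sales_amount,
  MAX(sales_amount) AS max_sales_amount
FROM
  looker_scratch.sales_by_ticket;
```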
Calculating Basic KPIs
With the dataset in place, we can now calculate basic KPIs such as total sales and count of tickets for each store on each date:
SELECT
  date,
  store_id,
  COUNT(ticket_id) AS count_of_tickets,
  SUM(sales_amount) AS total_sales
FROM
  looker_scratch.sales_by_ticket
GROUP BY
  date, store_id
ORDER BY
  date, store_id;
Implementing Period Over Period Analysis
To compare these KPIs across different time periods, we’ll create a CTE (Common Table Expression) called daily_sales and then perform the Period Over Period analysis:
WITH daily_sales AS (
  -- One row per store and date; no ORDER BY here, since ordering inside a CTE has no effect on the final result
  SELECT
    date,
    store_id,
    COUNT(ticket_id) AS count_of_tickets,
    SUM(sales_amount) AS total_sales
  FROM
    looker_scratch.sales_by_ticket
  GROUP BY
    date, store_id
)
SELECT
  d.date,
  DATE_SUB(d.date, INTERVAL 1 DAY) AS previous_date,
  DATE_SUB(d.date, INTERVAL 7 DAY) AS previous_week_date,
  DATE_SUB(d.date, INTERVAL 1 MONTH) AS previous_month_date,
  DATE_SUB(d.date, INTERVAL 1 YEAR) AS previous_year_date,
  d.store_id,
  d.count_of_tickets,
  d.total_sales,
  -- Each scalar subquery looks up the same store on the shifted date
  (SELECT count_of_tickets FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 1 DAY) AND store_id = d.store_id) AS previous_day_count_of_tickets,
  (SELECT total_sales FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 1 DAY) AND store_id = d.store_id) AS previous_day_total_sales,
  (SELECT count_of_tickets FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 7 DAY) AND store_id = d.store_id) AS previous_week_count_of_tickets,
  (SELECT total_sales FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 7 DAY) AND store_id = d.store_id) AS previous_week_total_sales,
  (SELECT count_of_tickets FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 1 MONTH) AND store_id = d.store_id) AS previous_month_count_of_tickets,
  (SELECT total_sales FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 1 MONTH) AND store_id = d.store_id) AS previous_month_total_sales,
  (SELECT count_of_tickets FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 1 YEAR) AND store_id = d.store_id) AS previous_year_count_of_tickets,
  (SELECT total_sales FROM daily_sales WHERE date = DATE_SUB(d.date, INTERVAL 1 YEAR) AND store_id = d.store_id) AS previous_year_total_sales
FROM
  daily_sales d
ORDER BY
  d.date DESC, d.store_id;
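The query above uses correlated subqueries to look up, for each store and date, the ticket count and total sales from the previous day, week, month, and year. When a store has no sales on the comparison date, the corresponding columns are NULL. Note that DATE_SUB with a MONTH or YEAR interval clamps to the last day of the target month, so March 29, 30, and 31 all compare against February 28 in a non-leap year.
The final result contains one row per store and date, with the current KPIs next to their values for each comparison period.
From there, you can add more measures in your BI tool or keep integrating additional KPIs directly in SQL.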
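As one example of an additional KPI, the sketch below derives week-over-week and year-over-year change in total sales. It swaps the correlated subqueries for LEFT JOINs on the same daily_sales CTE, which is an alternative formulation rather than the query used above; the aliases pw and py and the two change columns are illustrative names.

```sql
WITH daily_sales AS (
  SELECT
    date,
    store_id,
    COUNT(ticket_id) AS count_of_tickets,
    SUM(sales_amount) AS total_sales
  FROM
    looker_scratch.sales_by_ticket
  GROUP BY
    date, store_id
)
SELECT
  d.date,
  d.store_id,
  d.total_sales,
  pw.total_sales AS previous_week_total_sales,
  py.total_sales AS previous_year_total_sales,
  -- SAFE_DIVIDE returns NULL instead of raising an error when the denominator is 0
  SAFE_DIVIDE(d.total_sales - pw.total_sales, pw.total_sales) AS week_over_week_change,
  SAFE_DIVIDE(d.total_sales - py.total_sales, py.total_sales) AS year_over_year_change
FROM
  daily_sales AS d
-- LEFT JOIN keeps days that have no matching comparison row (their previous_* columns are NULL)
LEFT JOIN daily_sales AS pw
  ON pw.store_id = d.store_id
  AND pw.date = DATE_SUB(d.date, INTERVAL 7 DAY)
LEFT JOIN daily_sales AS py
  ON py.store_id = d.store_id
  AND py.date = DATE_SUB(d.date, INTERVAL 1 YEAR)
ORDER BY
  d.date DESC, d.store_id;
```

The change columns are ratios (0.10 means +10%); multiply by 100 in the BI layer if percentages are preferred.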
Refresh strategy
Let’s explore three refresh options for our Period Over Period analysis:
- Daily full refresh with real tables
- Materialized tables
- Incremental refresh
Daily Full Refresh with Real Tables
A daily full refresh involves executing the Period Over Period analysis query every day, recalculating all values and overwriting the existing table with the new data.
Pros:
- Simple to implement and maintain
- Data is up-to-date every day
Cons:
- Can be expensive due to repeated processing of large volumes of data
- Performance can suffer if the query execution time is long
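A daily full refresh could be sketched as a single statement run by a scheduler once a day. The target table name sales_period_over_period is hypothetical, and the query is abbreviated to the previous-day comparison to keep the example short; the full analysis query would take its place.

```sql
-- Rebuild the whole result table on every run (table name is hypothetical)
CREATE OR REPLACE TABLE looker_scratch.sales_period_over_period
PARTITION BY date
CLUSTER BY store_id
AS
WITH daily_sales AS (
  SELECT
    date,
    store_id,
    COUNT(ticket_id) AS count_of_tickets,
    SUM(sales_amount) AS total_sales
  FROM
    looker_scratch.sales_by_ticket
  GROUP BY
    date, store_id
)
SELECT
  d.date,
  d.store_id,
  d.count_of_tickets,
  d.total_sales,
  pd.count_of_tickets AS previous_day_count_of_tickets,
  pd.total_sales AS previous_day_total_sales
FROM
  daily_sales AS d
LEFT JOIN daily_sales AS pd
  ON pd.store_id = d.store_id
  AND pd.date = DATE_SUB(d.date, INTERVAL 1 DAY);
```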
Materialized Tables
A materialized table is a table that stores the precomputed results of a query. The table is refreshed periodically, and the query results are stored for faster retrieval.
Pros:
- Improved query performance since the results are precomputed
- Can be more cost-effective by reducing the number of times the query runs
Cons:
- Data freshness depends on the refresh frequency
- Additional maintenance overhead for managing materialized tables
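In BigQuery, one way to approximate this option is a materialized view. The sketch below assumes that a materialized view is an acceptable stand-in for the materialized table described above; the view name daily_sales_mv and the 60-minute interval are illustrative. Because materialized views restrict the SQL they accept, the sketch precomputes only the daily aggregation and leaves the period comparisons to the query that reads it.

```sql
-- Precompute the daily aggregation (view name and refresh interval are illustrative)
CREATE MATERIALIZED VIEW looker_scratch.daily_sales_mv
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 60
)
AS
SELECT
  date,
  store_id,
  COUNT(ticket_id) AS count_of_tickets,
  SUM(sales_amount) AS total_sales
FROM
  looker_scratch.sales_by_ticket
GROUP BY
  date, store_id;
```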
Incremental Refresh
Incremental refresh involves updating only the data that has changed since the last refresh. This approach requires careful planning and data pipeline design to identify and process only the changed data.
Pros:
- Reduces the amount of data processed, potentially improving performance and cost efficiency
- Can provide more up-to-date data compared to other options
Cons:
- Requires more complex implementation and maintenance
- Potential data consistency issues if not properly designed
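A minimal sketch of an incremental refresh follows, assuming that only tickets from the last 3 days can still change. The daily_sales table, the 3-day window, and the aliases t and s are all assumptions made for illustration. Only recent dates are re-aggregated and merged into the stored daily totals.

```sql
-- Target table holding one row per store and date (hypothetical)
CREATE TABLE IF NOT EXISTS looker_scratch.daily_sales (
  date DATE NOT NULL,
  store_id INT64 NOT NULL,
  count_of_tickets INT64 NOT NULL,
  total_sales NUMERIC NOT NULL
)
PARTITION BY date
CLUSTER BY store_id;

-- Re-aggregate only the recent window and upsert it into the target
MERGE looker_scratch.daily_sales AS t
USING (
  SELECT
    date,
    store_id,
    COUNT(ticket_id) AS count_of_tickets,
    SUM(sales_amount) AS total_sales
  FROM
    looker_scratch.sales_by_ticket
  WHERE
    date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)  -- assumed change window
  GROUP BY
    date, store_id
) AS s
ON t.date = s.date AND t.store_id = s.store_id
WHEN MATCHED THEN
  UPDATE SET
    count_of_tickets = s.count_of_tickets,
    total_sales = s.total_sales
WHEN NOT MATCHED THEN
  INSERT (date, store_id, count_of_tickets, total_sales)
  VALUES (s.date, s.store_id, s.count_of_tickets, s.total_sales);
```

This illustrates the consistency risk listed above: if every ticket for a store and date is deleted from the source, the stale row stays in daily_sales unless the pipeline handles deletions separately.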
Why perform Period Over Period directly in SQL?
Performing Period Over Period analysis directly in SQL can be a good idea for several reasons:
- Performance: By calculating the Period Over Period metrics in SQL, you leverage the processing capabilities of the underlying database system. This can result in faster query execution, especially when dealing with large datasets, as opposed to retrieving raw data and processing it in an external tool or application.
- Flexibility: Writing Period Over Period calculations in SQL allows you to customize your comparisons and tailor the analysis according to your specific requirements. You can easily compare data over different time periods (e.g., daily, weekly, monthly, yearly) and adjust the intervals as needed.
- Data consistency: Calculating metrics directly in SQL ensures consistency in the data, as you retrieve and process the data using a single query. This eliminates potential discrepancies that may arise when using multiple queries or external tools for calculations.
- Reduced data transfer: When you perform Period Over Period calculations in SQL, you limit the amount of data that needs to be transferred between the database and the client application. This can lead to reduced network overhead and improved application performance, especially in cases where large datasets are involved.
- Easier integration: By including Period Over Period logic within your SQL query, you can easily integrate the results into reporting tools, BI applications, or dashboards without the need for additional data processing steps. This simplifies the overall data pipeline and reduces the complexity of your data architecture.
While there are several advantages to performing Period Over Period analysis directly in SQL, there are also some potential drawbacks to consider:
- Complexity: As you introduce more comparisons and calculations in your SQL query, it can become more complex and difficult to understand and maintain. This can be particularly challenging for team members who are not as proficient in SQL, making it harder for them to contribute to or modify the query.
- Performance degradation: In some cases, particularly with large datasets or complex calculations, running Period Over Period analysis within a single SQL query may cause performance issues due to the database system’s resource utilization. This might be more noticeable when using correlated subqueries or complex window functions.
- Limited flexibility: While SQL provides flexibility for custom comparisons, some databases might not support advanced analytical functions, which could limit your ability to perform complex Period Over Period calculations. Additionally, certain types of analysis might be easier to perform in a specialized analytical tool or programming language.
- Less interactive: When you perform Period Over Period analysis directly in SQL, your results are typically static, which means users cannot easily interact with the data to explore different time periods or drill down into the details. This can be a limitation compared to using a specialized BI tool that allows for more interactive analysis and visualization.
- Dependency on database resources: Performing Period Over Period analysis in SQL relies on the database system’s resources and capabilities. If the database is under heavy load or experiences performance issues, this could negatively impact the execution of your Period Over Period calculations.
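Regarding the performance point above, day-based comparisons can sometimes be written with a window function instead of correlated subqueries. The sketch below is one possible formulation, not a guaranteed optimization: it orders each store's rows by UNIX_DATE(date) and uses a RANGE frame that contains only the row exactly 1 or 7 days earlier, so gaps in the calendar yield NULL rather than a value from the wrong day. Month and year offsets span a variable number of days, so they do not fit this pattern.

```sql
WITH daily_sales AS (
  SELECT
    date,
    store_id,
    COUNT(ticket_id) AS count_of_tickets,
    SUM(sales_amount) AS total_sales
  FROM
    looker_scratch.sales_by_ticket
  GROUP BY
    date, store_id
)
SELECT
  date,
  store_id,
  total_sales,
  -- Frame holds at most one row: the same store exactly 1 day earlier
  SUM(total_sales) OVER (
    PARTITION BY store_id
    ORDER BY UNIX_DATE(date)
    RANGE BETWEEN 1 PRECEDING AND 1 PRECEDING
  ) AS previous_day_total_sales,
  -- Same idea for the same weekday one week earlier
  SUM(total_sales) OVER (
    PARTITION BY store_id
    ORDER BY UNIX_DATE(date)
    RANGE BETWEEN 7 PRECEDING AND 7 PRECEDING
  ) AS previous_week_total_sales
FROM
  daily_sales
ORDER BY
  date DESC, store_id;
```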
Conclusion
Years of working in business intelligence and enabling non-technical users to build their own solutions show a common pattern: the same KPIs are calculated with slightly different rules, leading to discrepancies and confusion. Implementing Period Over Period logic directly in SQL, while limiting possibilities for non-technical users, has some important advantages:
- Standardized calculation rules: By centralizing the logic in the SQL, you ensure that everyone in the company uses the same calculation rules for each KPI. This reduces inconsistencies and helps maintain data accuracy across different teams and reports.
- Better data governance: Embedding the logic within SQL queries encourages better data governance practices, as the technical team can validate and maintain the calculations, ensuring they adhere to the organization’s data standards and policies.
- Improved communication: Although having a bottleneck where the business team must discuss with the technical team to make progress might seem like a disadvantage, it can foster better communication between teams. This collaborative approach can lead to a deeper understanding of business needs and ensure the right KPIs are being tracked and calculated correctly.
- Focus on quality over quantity: By having fewer KPIs with accurate and consistent calculations, you prioritize data quality over the sheer number of KPIs. This can lead to more meaningful insights and better decision-making, as stakeholders can trust the data being presented to them.
In conclusion, while incorporating Period Over Period logic directly in SQL might limit possibilities for non-technical users, it offers significant benefits in terms of data consistency, governance, communication, and quality. By promoting collaboration between technical and business teams, you can strike a balance between flexibility and accuracy, ensuring that the most important KPIs are calculated correctly and consistently across the organization.