Why I think denormalization is (almost) always a good idea.

Michaël Scherding
Oct 19, 2022 · 4 min read


Hey all 👋,

Hope you are all doing great!

Well, today we have a MASSIVE 🤯 topic: data modeling, how to deal with it when you work in a large and complex organization, and of course why denormalization can be a simple and effective way to handle it.

Context

Data modeling is the art of creating a simplified flowchart that represents data entities, their attributes and relationships, using text and symbols to visualize the data and how it flows. Data models provide a blueprint (or a single source of truth… yes, I love buzzwords) for data management and analytics teams, helping them document data requirements and prepare future development. Overall, data modeling helps an organization use its data effectively to meet business needs.

With the explosion of the Modern Data Stack (I told you, I love buzzwords), data modeling has fallen out of fashion. That whole new ecosystem was built around going faster and delivering new insights as quickly as possible (but, most of the time, in the dirtiest way possible). The promise was tempting: go fast and skip the pain of defining the big picture (data modeling).

The problem

When you are navigating a data environment, especially in a big company, it’s hard to argue that taking the time to work on data modeling is useful. Tables, relationships and keys will not be the primary topic on the table, especially because data modeling is no longer a purely “technical area” where engineers alone discuss the best possible combination. Today, modeling is tightly linked with the business: you have to connect business and modeling to build the best approach and, at the same time, the most valuable solution for current usages and, of course, the upcoming ones.

But in big companies, finding dedicated time is hard, especially if you want to get all the right people around the same table, and especially for topics that, at first glance, bring no direct return. And I know I will not make a lot of friends with my next move, but the Agile iterative approach is not really helpful for doing deep design work first: we want results as fast as possible.

I am taking a shortcut here, but most of the time you will end up with messy results, and you will get feedback like:

  • why are my reports so slow?
  • why do the KPIs look weird and incorrect?
  • why does the same dimension live in so many different places?
  • why is my platform so expensive?

By then it’s already too late, and congratulations, you have reached the top of the Data Spaghetti Mountain. You can add as many new tools as you want; you will only be piling wrong decisions on top of each other.

The solution

The more I work on data projects, the more I keep arriving at the same conclusion: data modeling is almost always the best answer to 80% of the trouble you will face. You can add more power, more machines, more capacity, more tools, more and more… but modeling is always the better approach.

But how do you deal with modeling when:

  • you do not have the time to work on it
  • no one wants to think about the best possible star schema
  • no one wants to maintain the right version of it
  • and you have to deliver results fast

This is where denormalization is a good option. Denormalization is a way to improve the read performance of a database at the expense of write performance. In the past, storage was really expensive (hello Oracle!), and it was logical to limit redundancy. But today storage is not a big deal, and the focus has shifted to managing compute effectively. Powerful services (such as BigQuery’s BI Engine) help a lot on the compute side, and they work better on large denormalized tables.
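To make the read/write trade-off concrete, here is a minimal sketch (table and column names are hypothetical). In a normalized star schema a customer attribute lives in exactly one row, so a write is cheap; in a denormalized wide table the same attribute is copied onto every related order row, so the same write fans out.

```sql
-- Normalized (star schema): the attribute lives in one place,
-- so updating it touches a single row in the dimension table.
UPDATE dim_customer
SET    segment = 'premium'
WHERE  customer_id = 42;

-- Denormalized: the same attribute is repeated on every order row,
-- so the same update now touches thousands of rows. That is the
-- write cost we accept in exchange for join-free reads.
UPDATE orders_wide
SET    customer_segment = 'premium'
WHERE  customer_id = 42;
```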

And this is where I think (especially when we do not have the dedicated time to build elegant solutions) denormalization is (almost) always a good idea. Maintaining denormalized tables is much simpler and less challenging than playing with a star schema.

We should think about having a few massive denormalized tables that answer the majority of the questions, rather than letting engineers reinvent the wheel again and again and play with over-complex data layers. We should have fewer data owners and let them focus on just a couple of denormalized tables.
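In practice this can be as simple as one scheduled statement owned by one team: the star-schema joins are paid once, at build time, and everyone downstream queries the result. A minimal sketch in BigQuery-flavoured SQL (dataset and table names are hypothetical):

```sql
-- Rebuilt on a schedule (e.g. nightly) by the single owning team.
-- The joins are resolved once here, not in every dashboard query.
CREATE OR REPLACE TABLE analytics.orders_wide AS
SELECT
  f.order_id,
  f.order_date,
  f.amount,
  c.customer_id,
  c.customer_name,
  c.segment      AS customer_segment,
  p.product_id,
  p.product_name,
  p.category     AS product_category
FROM warehouse.fact_orders  AS f
JOIN warehouse.dim_customer AS c ON c.customer_id = f.customer_id
JOIN warehouse.dim_product  AS p ON p.product_id  = f.product_id;
```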

Denormalized, business-oriented tables help you to:

  • stop duplicating the same information across many tables
  • know where to find the information, since there are fewer tables
  • avoid unnecessary joins and keep SQL simple (see the sketch below)
  • get high read performance, so dashboards finally work at the speed of light
  • keep maintenance and updates easy
  • make it easier to define an owner
(Image: héhé, thanks Chad Sanderson)
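To illustrate the “keep SQL simple” point from the list above, here is the kind of query a dashboard ends up running in each world (same hypothetical tables as before). The wide table turns a three-way join into a plain scan, which is exactly the shape that engines like BI Engine accelerate well.

```sql
-- Star schema: every dashboard tile repeats the same joins.
SELECT p.category, c.segment, SUM(f.amount) AS revenue
FROM warehouse.fact_orders  AS f
JOIN warehouse.dim_customer AS c ON c.customer_id = f.customer_id
JOIN warehouse.dim_product  AS p ON p.product_id  = f.product_id
WHERE f.order_date >= '2022-01-01'
GROUP BY p.category, c.segment;

-- Denormalized wide table: same answer, no joins to get wrong.
SELECT product_category, customer_segment, SUM(amount) AS revenue
FROM analytics.orders_wide
WHERE order_date >= '2022-01-01'
GROUP BY product_category, customer_segment;
```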

Next steps

Please find below some great sources of inspiration for your studies:

See you soon and take care 🤟
