Why you should build your data lake (when it’s possible) on BigQuery!

Michaël Scherding
4 min read · Jan 26, 2022


Hi everyone 👋

First of all, happy new year. Hope everything is fine for you and your family.

Well, today I want to discuss which service is best for building your data lake. First, let's dive into some general concepts.

What is a Data Lake?

A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

source

This general definition highlights three main points:

  • structured data (rows and columns)
  • semi-structured (.csv, logs, xml, json…)
  • unstructured (emails, documents, pictures…)

Keep these three points in mind; they will be critical for the discussion that follows.

What is BigQuery?

BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform as a Service (PaaS) that supports querying using ANSI SQL. It also has built-in machine learning capabilities. BigQuery provides external access to Google’s Dremel technology, a scalable, interactive ad hoc query system for analysis of nested data. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

source

As you can read, BigQuery is a powerful PaaS. As a data warehouse, it is a perfect match for analytics over large amounts of data queried in SQL. So far, then, BigQuery seems perfect for data warehouse topics and not really for a data lake purpose. But… since January 6, 2022, you can store and analyze semi-structured data in BigQuery. The game changer: by ingesting semi-structured data as a JSON data type, BigQuery can encode and process each JSON field independently. So from now on, BigQuery can handle the first two of the three kinds of data a data lake can store…
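To make this concrete, here is a minimal sketch of the JSON data type in action (the dataset and table names are hypothetical, for illustration only):

```sql
-- Hypothetical dataset/table names for illustration.
CREATE TABLE my_lake.raw_events (
  event_id INT64,
  payload JSON  -- semi-structured data; each field is encoded independently
);

INSERT INTO my_lake.raw_events (event_id, payload)
VALUES (1, JSON '{"user": "alice", "device": {"os": "android", "version": 12}}');

-- Fields can be addressed directly with dot notation...
SELECT payload.device.os FROM my_lake.raw_events;

-- ...or extracted as typed scalars.
SELECT JSON_VALUE(payload, '$.user') AS user_name
FROM my_lake.raw_events;
```

Because each field is processed independently, you only pay to scan the fields you actually query, instead of parsing a whole JSON string every time.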

So why use BigQuery as a Data Lake?

Well, first of all, I'm a huge fan of simplicity. This is why I love serverless solutions: I want to focus on business problems, not technical bottlenecks. BigQuery (like many other data warehouse solutions) helps a lot to process massive amounts of data in the simplest way possible. As a SQL lover, it's a great service for finding and querying data as fast as possible. On top of that, BigQuery can help you with some great features:

  • time travel, which lets you retrieve your data as it was up to 7 days in the past if you break something
  • automatic long-term (cold) storage pricing after 90 days without any updates
  • really simple to manage with Terraform (project creation, dataset creation, access management…)
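As an example of the first feature, time travel is just a clause on the query (again, the table name here is hypothetical):

```sql
-- FOR SYSTEM_TIME AS OF reads the table as it was at a past timestamp
-- (up to 7 days back by default) -- handy after an accidental overwrite.
SELECT *
FROM my_lake.raw_events
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);
```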

In the meantime, if you want to use solutions like dbt or Dataform to transform your data, the starting point will always be a data warehouse. So, to move fast with the simplest design possible, you can start with a basic design pattern like this:

You will find:

  • Data Lake: raw data, no transformation applied. Most of the time organized by data source
  • Staging: simple transformations or cleaning (null handling, classification…). Most of the time still organized by data source, and usually managed as views rather than tables
  • Data Warehouse: transformed data, joins, new columns… Most of the time organized by business topic
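The staging layer above can be sketched as a view over the raw layer (all names are hypothetical, for illustration):

```sql
-- A staging view that cleans the raw layer: typed columns extracted
-- from the JSON payload, null rows filtered out. Managed as a view,
-- not a table, so it always reflects the latest raw data.
CREATE VIEW staging.events_clean AS
SELECT
  event_id,
  JSON_VALUE(payload, '$.user') AS user_name,
  JSON_VALUE(payload, '$.device.os') AS device_os
FROM my_lake.raw_events
WHERE payload IS NOT NULL;
```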

Last but not least

This pattern suits most analytics / business intelligence projects. If you are working on this kind of project, building your data lake directly in BigQuery can help you move fast and keep things clean. By the way, you can also start thinking about machine learning, especially if you use BigQuery's ML features.

If you are in a position where more than 90% of your data is structured or semi-structured, in my view you can jump directly to BigQuery as a data lake.

Be careful: BigQuery is great and powerful, but there are some limits to keep in mind; you can find an overview here. In my opinion, those limits still accommodate a LOT of use cases, except for really specific streaming topics (IoT, gaming, finance).

That’s all, thanks for reading and stay safe!

Michaël
