Understanding Airbyte and its efficient data ingestion

Michaël Scherding
4 min read · Jul 26, 2024


Introduction

Airbyte is a versatile data integration platform that simplifies the process of data ingestion. In this article, we’ll focus on how Airbyte handles data ingestion from Google Cloud Storage (GCS) to BigQuery, with a particular emphasis on its non-intrusive approach to managing source files and its effective use of BigQuery tables.

Unlike many standard ingestion solutions that modify or move files within the source bucket, Airbyte takes a different approach. It does not alter the files in GCS nor does it move them. Instead, Airbyte reads the data and manages the ingestion process using dedicated tables within BigQuery. This method ensures the integrity and stability of your source data while providing robust data management capabilities.

Key BigQuery tables managed by Airbyte

Airbyte uses three primary tables in BigQuery to manage the ingestion process: logs, raw, and states. The three tables share a set of common schema elements but serve different purposes, and their schemas differ accordingly. Below is a comparison:

Common schema elements

  • Record UUID (_airbyte_raw_id): A unique identifier for each record.
  • Job Start Timestamp (_airbyte_job_started_at): Indicates when the ingestion job started.
  • Slice Start Timestamp (_airbyte_slice_started_at): Indicates when each data slice started (relevant in incremental mode).
  • Extract Timestamp (_airbyte_extracted_at): The exact time the data was extracted from the source.
  • Load Timestamp (_airbyte_loaded_at): The time the data was ingested into BigQuery.
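To make these columns concrete, here is a minimal sketch of what a row carrying this metadata might look like. The values are purely illustrative (not actual Airbyte output), and the payload in `_airbyte_data` is a made-up example:

```python
import json
import uuid

# Hypothetical example of the metadata columns shared by the logs, raw,
# and states tables. All values below are illustrative.
record = {
    "_airbyte_raw_id": str(uuid.uuid4()),                # unique record UUID
    "_airbyte_job_started_at": "2024-07-26T08:00:00Z",   # ingestion job start
    "_airbyte_slice_started_at": "2024-07-26T08:00:05Z", # slice start (incremental mode)
    "_airbyte_extracted_at": "2024-07-26T08:00:10Z",     # extraction time at the source
    "_airbyte_loaded_at": "2024-07-26T08:00:15Z",        # load time into BigQuery
    "_airbyte_data": json.dumps({"id": 1, "name": "example"}),  # payload (varies per table)
}
```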

Differences in schema elements

While the _airbyte_data column exists in all three tables, its content varies according to the table’s function:

Logs Table

  • Content: Contains detailed log information about the ingestion process, such as metadata and timestamps related to each step of the job.

States Table

  • Content: Stores state information, which includes data necessary for resuming the ingestion process from the last checkpoint, such as the last successful sync timestamp.

Raw Table

  • Content: Holds the raw, unprocessed data ingested from the source, maintaining the original format and structure of the data.

Example by ingesting data from GCS to BigQuery

In this example, we will break down each step that Airbyte takes to ingest data from a Google Cloud Storage (GCS) bucket to BigQuery, focusing on how it uses the logs, raw, and states tables to efficiently manage the process and ensure only new or modified data is ingested.

When you run the ingestion job, Airbyte performs the following steps:

  1. Job initialization: Airbyte initializes the job and records the start timestamp in the logs table.
  2. State retrieval: Airbyte queries the states table to retrieve the last known state. This state includes the last modified timestamp of the files that were previously ingested.
  3. Identify new/modified files: Using the information from the states table, Airbyte identifies files in the GCS bucket that have been added or modified since the last sync. This is done by comparing the last modified timestamp of each file with the timestamp stored in the states table.
  4. Extract data: Airbyte reads the data from these new or modified files. It does not move or alter the files in the bucket.
  5. Store raw data: The extracted data is stored in the raw table in BigQuery.
  6. Log the process: Airbyte logs each step of the process in the logs table. The log entries include details about the job, such as the start time, slice start time, extraction time, and load time.
  7. Update the state: Once the data is successfully loaded into BigQuery, Airbyte updates the states table with the latest state: the last modified timestamps of the successfully ingested files, which serve as the checkpoint for the next sync.
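The file-selection logic at the heart of steps 2-4 can be sketched in a few lines. This is a simplified reimplementation of the idea, not Airbyte's actual code; the function names and the `files` mapping (file name → last-modified timestamp) are my own assumptions:

```python
from datetime import datetime
from typing import Optional

def parse_ts(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp like '2023-07-20T12:00:00Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def select_files(files: dict, cursor: Optional[str]) -> list:
    """Return names of files added or modified since the stored cursor.

    `files` maps file name -> last-modified timestamp; `cursor` is the
    timestamp retrieved from the states table (None on the first sync).
    """
    if cursor is None:
        return sorted(files)  # first sync: ingest everything
    threshold = parse_ts(cursor)
    return sorted(name for name, ts in files.items() if parse_ts(ts) > threshold)

def next_cursor(files: dict, selected: list) -> str:
    """New cursor: the latest last-modified timestamp among ingested files."""
    return max(files[name] for name in selected)
```

On the first run `cursor` is `None`, so every file is selected; afterwards only files whose last-modified timestamp is strictly newer than the checkpoint are picked up.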

Example in action

Let’s say we have the following scenario:

  1. Initial ingestion: Airbyte runs an initial ingestion job and reads three files from the GCS bucket:
  • file1.csv (last modified: 2023-07-20T12:00:00Z)
  • file2.csv (last modified: 2023-07-21T12:00:00Z)
  • file3.csv (last modified: 2023-07-22T12:00:00Z)

These files are read and their data is stored in the raw table. The states table is updated with the timestamp of the most recently modified file (2023-07-22T12:00:00Z).

2. Subsequent ingestion: After a few days, a new file (file4.csv) is added, and file2.csv is modified:

  • file2.csv (new last modified: 2023-07-25T12:00:00Z)
  • file4.csv (last modified: 2023-07-26T12:00:00Z)

Airbyte queries the states table and identifies file2.csv and file4.csv as new/modified files. These files are read, and their data is stored in the raw table. The states table is updated with the latest timestamp (2023-07-26T12:00:00Z).
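The second sync can be replayed as a small self-contained snippet. The bucket listing and cursor value below mirror the scenario above; the variable names are mine, and the comparison logic is a simplification of what the connector does:

```python
from datetime import datetime

def ts(s: str) -> datetime:
    """Parse an RFC 3339 timestamp like '2023-07-22T12:00:00Z'."""
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

# Checkpoint after the initial sync: the latest last-modified timestamp seen.
cursor = "2023-07-22T12:00:00Z"

# Bucket listing at the time of the second sync.
listing = {
    "file1.csv": "2023-07-20T12:00:00Z",
    "file2.csv": "2023-07-25T12:00:00Z",  # modified since the first sync
    "file3.csv": "2023-07-22T12:00:00Z",
    "file4.csv": "2023-07-26T12:00:00Z",  # new file
}

# Only files strictly newer than the checkpoint are ingested.
to_ingest = sorted(name for name, t in listing.items() if ts(t) > ts(cursor))
new_cursor = max(listing[name] for name in to_ingest)
```

Note that file3.csv, whose timestamp exactly equals the checkpoint, is not re-ingested: only strictly newer files are selected, and the cursor advances to 2023-07-26T12:00:00Z.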

Throughout the process, the logs table captures detailed information about the job’s progress, providing a comprehensive audit trail.

Conclusion

Airbyte’s efficient data ingestion process ensures that only new or modified data is ingested, maintaining data integrity and optimizing performance. By leveraging BigQuery tables (logs, raw, and states), Airbyte provides a robust framework for managing and monitoring data ingestion. This approach is ideal for scenarios where you need to manage small to medium data workflows efficiently, simplifying your data stack and stabilizing your operations.

Airbyte’s approach to data ingestion offers several significant advantages, especially in how it handles data from Google Cloud Storage (GCS) to BigQuery. By utilizing BigQuery tables for logs and state management, Airbyte avoids the common pitfalls of moving files between different states or folders in the bucket. The GCS bucket stays in a quiet, stable state, free of the constant churn of file movements that can introduce complexity and errors.

By managing everything within BigQuery, Airbyte simplifies the entire data ingestion process. The use of dedicated tables for logs, raw data, and states means that all the metadata, processing information, and state tracking are centralized and easily accessible. This centralized approach provides a clear audit trail and makes it straightforward to monitor and manage the ingestion process.

Take care 🤟
