Automating BigQuery exports to FTP

Michaël Scherding
5 min read · Apr 15, 2024


Introduction

In the fast-paced world of data-driven decision-making, seamless data integration and automation are crucial for efficiency and reliability. Organizations increasingly depend on powerful data analysis tools like Google BigQuery, but often need to export this data securely and automatically for use in external systems.

This article outlines a robust methodology for automating the export of datasets from BigQuery to external FTP servers using a combination of Google Cloud’s advanced technologies. We will walk through the process from initial testing to full-scale operational deployment, demonstrating how to integrate continuous integration and deployment practices to streamline your data workflows effectively. Let’s dive into the detailed setup and explore how these integrations can enhance your organizational data strategies.

Part 1: Understanding the code and testing in Google Colab

Before diving into the Python script that automates the export of BigQuery data to an FTP server, it’s crucial to understand the configuration setup that drives this process. This setup involves a config.json file, which plays a key role in maintaining the flexibility and modularity of our script.

Configuration file explained (config.json)

The config.json file contains key parameters that define how our script interacts with various Google Cloud services and ensures seamless operation across different environments. Here’s what it typically includes:

BigQuery settings:

  • BIGQUERY_PROJECT: The Google Cloud project ID containing the BigQuery dataset.
  • BIGQUERY_DATASET: The specific dataset within the project where the table is located.
  • TABLE_NAME: Name of the table from which data is extracted.
  • DATE_COLUMN: The column in the table that is used to filter records by date.

Google Cloud Storage (GCS) settings:

  • GCS_PROJECT: The project ID for the Google Cloud Storage where files will be uploaded.
  • GCS_BUCKET: The specific GCS bucket used for storing files.
  • BASE_GCS_PATH: The base path in the GCS bucket where files will be organized.
  • UPLOAD_FOLDER: Subdirectory under the base path for successfully processed files.
  • ERROR_FOLDER: Subdirectory for storing logs or files that failed to process correctly.

File naming and SFTP settings:

  • FILENAME_PREFIX: A base name for generated files, which helps in organizing and identifying files easily.
  • FILE_EXTENSION: The format of the file, typically .csv for data files.
  • SFTP_DIRECTORY: The path on the SFTP server where files will be uploaded.
  • SCHEDULE: A cron expression defining when the export should run (picked up later by Cloud Scheduler).

This configuration approach lets you change dataset or storage details without modifying the script, and it also scales well: new parameters or services can be added with minimal changes. Here is an example config.json:

{
  "BIGQUERY_PROJECT": "my-bigquery-project-123",
  "BIGQUERY_DATASET": "analytics_data",
  "TABLE_NAME": "user_activity",
  "DATE_COLUMN": "activity_date",
  "GCS_PROJECT": "my-gcs-project-456",
  "GCS_BUCKET": "analytics_data_bucket",
  "BASE_GCS_PATH": "data/",
  "UPLOAD_FOLDER": "processed/",
  "ERROR_FOLDER": "failed/",
  "FILENAME_PREFIX": "daily_activity_export",
  "FILE_EXTENSION": ".csv",
  "SFTP_DIRECTORY": "path/to/sftp/upload/",
  "SCHEDULE": "0 0 * * *"
}

Script overview

With a solid configuration foundation, we can now focus on the script itself. Developed and tested in Google Colab, this Python script is structured around a few functions, each handling a specific part of the extraction, processing, and transfer workflow (a condensed sketch follows the list below):

  • Data fetching: Connects to Google BigQuery and retrieves data with a query that filters rows by the current date, so only the most relevant data is processed.
  • Data processing and validation: Checks whether the resulting DataFrame from BigQuery is empty to avoid processing non-existent data, which saves resources and processing time.
  • File management: Converts the data to CSV and uploads it to Google Cloud Storage. This step acts as a buffer and ensures data integrity is maintained before the file is sent to the FTP server.
  • Secure file transfer: Uses SFTP (SSH File Transfer Protocol) to safely transfer the processed file to the specified server, relying on the Paramiko library to manage the SFTP connection and file transfer securely.
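
Below is a condensed sketch of this script, assuming the config.json shown above and the google-cloud-bigquery, google-cloud-storage, pandas, and paramiko packages. The function name export_daily_extract and the hard-coded SFTP credentials are illustrative placeholders, not part of the original code.

import io
import json
from datetime import date

import paramiko
from google.cloud import bigquery, storage


def export_daily_extract(config, sftp_host, sftp_user, sftp_password):
    # 1. Fetch today's rows from BigQuery.
    bq = bigquery.Client(project=config["BIGQUERY_PROJECT"])
    query = f"""
        SELECT *
        FROM `{config['BIGQUERY_PROJECT']}.{config['BIGQUERY_DATASET']}.{config['TABLE_NAME']}`
        WHERE DATE({config['DATE_COLUMN']}) = CURRENT_DATE()
    """
    df = bq.query(query).to_dataframe()

    # 2. Stop early if there is nothing to export.
    if df.empty:
        print("No rows for today, nothing to export.")
        return

    # 3. Serialize to CSV and stage the file in GCS as a buffer.
    filename = f"{config['FILENAME_PREFIX']}_{date.today():%Y%m%d}{config['FILE_EXTENSION']}"
    csv_bytes = df.to_csv(index=False).encode("utf-8")
    gcs = storage.Client(project=config["GCS_PROJECT"])
    blob = gcs.bucket(config["GCS_BUCKET"]).blob(
        config["BASE_GCS_PATH"] + config["UPLOAD_FOLDER"] + filename
    )
    blob.upload_from_string(csv_bytes, content_type="text/csv")

    # 4. Push the same file to the SFTP server with Paramiko.
    transport = paramiko.Transport((sftp_host, 22))
    transport.connect(username=sftp_user, password=sftp_password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        sftp.putfo(io.BytesIO(csv_bytes), config["SFTP_DIRECTORY"] + filename)
    finally:
        sftp.close()
        transport.close()


if __name__ == "__main__":
    with open("config.json") as f:
        cfg = json.load(f)
    # Placeholder credentials; in practice, load them from Secret Manager or environment variables.
    export_daily_extract(cfg, "sftp.example.com", "export_user", "***")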

Overall process

Part 2: Scaling with GitHub, Docker, Cloud Run, and Cloud Scheduler

As we move from a prototype tested in Google Colab to a full-scale production system, it’s essential to establish a robust infrastructure that can handle multiple configurations and ensure consistent deployments. This can be achieved by using GitHub for source control, GitHub Actions for CI/CD, Docker for containerization, and Google Cloud Run and Cloud Scheduler for execution and scheduling. The process I follow is outlined below.

(Basic) Repository organization and configuration management

/your-repo
├── src/                    # All source code files, including the main Python script
│   └── main.py             # Main script to be executed
├── config/                 # Directory for configuration files
│   └── config.json         # Central configuration file defining all operational parameters
├── Dockerfile              # Dockerfile for building the Docker image
└── .github/
    └── workflows/          # GitHub Actions workflow definitions
        └── main.yml        # Main CI/CD pipeline for deploying to Google Cloud Run

Central configuration management:

  • config.json includes all the necessary parameters (see above) for each data file that needs to be exported: BigQuery table names, SFTP paths, file naming conventions, and so on.
  • Each entry in config.json also includes a cron schedule expression that dictates how frequently the task should run, so every export task can have its own schedule while still being managed centrally (an illustrative multi-entry example follows this list).
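
One way to structure this, sketched below with purely illustrative names and values, is a JSON array in which each entry is a self-contained export job carrying its own cron schedule:

[
  {
    "EXPORT_NAME": "daily_activity_export",
    "BIGQUERY_PROJECT": "my-bigquery-project-123",
    "BIGQUERY_DATASET": "analytics_data",
    "TABLE_NAME": "user_activity",
    "SFTP_DIRECTORY": "path/to/sftp/upload/",
    "SCHEDULE": "0 0 * * *"
  },
  {
    "EXPORT_NAME": "weekly_orders_export",
    "BIGQUERY_PROJECT": "my-bigquery-project-123",
    "BIGQUERY_DATASET": "sales_data",
    "TABLE_NAME": "orders",
    "SFTP_DIRECTORY": "path/to/sftp/orders/",
    "SCHEDULE": "0 7 * * 1"
  }
]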

Dockerization and CI/CD workflow

The application is containerized using Docker, encapsulating it in a portable environment that includes all necessary dependencies. This ensures consistency across different execution environments.
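
A minimal Dockerfile sketch might look like the following; the requirements.txt file and the HTTP entry point in main.py are assumptions, since the article does not show them:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and its configuration.
COPY src/ ./src/
COPY config/ ./config/

# Cloud Run sends HTTP requests, so main.py is expected to start a small
# HTTP server listening on $PORT (e.g. Flask or functions-framework).
CMD ["python", "src/main.py"]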

CI/CD with GitHub Actions

  • GitHub Actions is used to automate the deployment of the Docker container to Google Cloud Run whenever changes are pushed to the repository.
  • The CI/CD pipeline in .github/workflows/main.yml builds the Docker image, pushes it to Google Container Registry (GCR), and deploys it to Cloud Run (a trimmed-down example follows this list).
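
A trimmed-down main.yml could look like the sketch below; the project ID, region, image and service names, and the GCP_SA_KEY secret are placeholders, not values from the article:

name: deploy

on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Authenticate to Google Cloud with a service account key stored as a repository secret.
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - uses: google-github-actions/setup-gcloud@v2

      - name: Build and push the Docker image to GCR
        run: |
          gcloud auth configure-docker
          docker build -t gcr.io/my-gcs-project-456/bq-ftp-export:${{ github.sha }} .
          docker push gcr.io/my-gcs-project-456/bq-ftp-export:${{ github.sha }}

      - name: Deploy to Cloud Run
        run: |
          gcloud run deploy bq-ftp-export \
            --image gcr.io/my-gcs-project-456/bq-ftp-export:${{ github.sha }} \
            --region europe-west1 \
            --no-allow-unauthenticated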

Scheduling with Cloud Scheduler

Dynamic scheduling logic:

  • For each entry in config.json, a corresponding Cloud Scheduler job is created or updated by the CI/CD pipeline to trigger the Cloud Run service at the specified times.
  • This setup allows for dynamic scheduling: changes to schedules in config.json are automatically propagated to Cloud Scheduler through the CI/CD process.
  • Cloud Scheduler triggers the Cloud Run instances via HTTP requests, passing the relevant configuration details in the request body so that each run is configured correctly according to its config.json entry (an illustrative command follows this list).
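
For each entry, the pipeline could run a gcloud command along these lines; the job name, Cloud Run URL, service account, and schedule shown here are placeholders:

# Create the job (use "gcloud scheduler jobs update http" with the same flags if it already exists).
gcloud scheduler jobs create http daily-activity-export \
  --location=europe-west1 \
  --schedule="0 0 * * *" \
  --uri="https://bq-ftp-export-xxxxx-ew.a.run.app/run" \
  --http-method=POST \
  --message-body='{"EXPORT_NAME": "daily_activity_export"}' \
  --oidc-service-account-email=scheduler-invoker@my-gcs-project-456.iam.gserviceaccount.com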

Conclusion

By integrating GitHub for version control, GitHub Actions for continuous integration and deployment, Docker for containerization, and Google Cloud Run and Cloud Scheduler for execution and scheduling, the system is not only more robust and scalable but also easier to manage and adapt to changing requirements. This architecture ensures that data export tasks run reliably, securely, and on time, without manual intervention, and that they can easily be monitored and modified from a central repository.

See ya ✌️
