Building an ETL Pipeline with BigQuery and Dataflow
A comprehensive guide to building an ETL process using GCP services

Overview
Following our previous post about BigQuery, let’s explore how to build an ETL (Extract, Transform, Load) pipeline using BigQuery and Dataflow. Our workflow will follow this path:
MongoDB → BigQuery → Google Sheets
GCP Resources Used
- Dataflow
- BigQuery
- Cloud Scheduler
- Cloud Functions
- Cloud Storage
What is Google Cloud Dataflow?
Google Cloud Dataflow is a fully managed streaming and batch data processing service provided by GCP.
It is built on Apache Beam and provides a unified model for building, deploying, and running data processing pipelines for both real-time and batch use cases; a minimal pipeline sketch follows the feature list below.
Key Features
- Fully Managed Service
- Apache Beam SDK based
- Auto Scaling
- Integration capabilities
- Real-Time and Batch Processing
- Code Portability
- Debugging and Monitoring
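To make the Beam model concrete, here is a minimal batch pipeline sketch in Python. This is not code from the repositories below; the project ID, region, and bucket paths are placeholder values.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket values.
options = PipelineOptions(
    runner="DataflowRunner",       # use "DirectRunner" to test locally
    project="my-gcp-project",
    region="asia-northeast3",
    temp_location="gs://my-bucket/tmp",
)

# Read text files from GCS, transform each line, and write the results back.
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Uppercase" >> beam.Map(str.upper)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```

The same code runs unchanged on your laptop with DirectRunner or at scale on Dataflow, which is the code-portability point made above.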
What is Google Cloud Functions?
Google Cloud Functions (GCF) is a serverless computing service that lets developers run code in response to cloud events without managing servers; a minimal sketch follows the feature list below.
Key Features
- Serverless architecture
- Event-driven
- Scalability
- Multiple language support (Node.js, Python, Go, Java, .NET, Ruby, PHP)
- GCP Services Integration
- Stateless execution
- Pay-as-you-go pricing
- Security features
- Local Development & Testing capabilities
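For reference, an HTTP-triggered function is only a few lines of Python with the Functions Framework. The function name and response below are illustrative, not from the post's repositories.

```python
# main.py — a minimal HTTP-triggered Cloud Function sketch.
import functions_framework


@functions_framework.http
def hello_etl(request):
    """Handles an HTTP request, e.g. one sent by Cloud Scheduler."""
    name = request.args.get("name", "world")
    return f"Hello, {name}!", 200
```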
What is Google Cloud Scheduler?
Cloud Scheduler is a fully managed cron job scheduling service, similar to Linux cron but with added cloud capabilities; a job-creation sketch follows the feature list below.
Key Features
- Fully Managed Service
- Supports HTTP, Pub/Sub, App Engine calls
- Cron syntax
- Retry on Failure
- Time Zone Support
- Monitoring and Logging
- Security
- Integration with other GCP services
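Jobs can be created from the console, gcloud, or the client libraries. Below is a sketch using the google-cloud-scheduler Python client; the project, region, schedule, and target URL are all placeholders.

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path("my-gcp-project", "asia-northeast3")

job = scheduler_v1.Job(
    name=f"{parent}/jobs/trigger-etl-function",
    schedule="0 1 * * *",            # every day at 01:00
    time_zone="Asia/Seoul",
    http_target=scheduler_v1.HttpTarget(
        uri="https://example-function-url",   # hypothetical Cloud Function URL
        http_method=scheduler_v1.HttpMethod.POST,
    ),
)
client.create_job(parent=parent, job=job)
```

For functions that do not allow unauthenticated access, the scheduler job should call the target with a service account that has the Cloud Functions Invoker role.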
ETL Workflow Implementation
The implementation follows this sequence (a sketch of step 2 follows the list):
1. Cloud Scheduler triggers an HTTP Cloud Function on a cron schedule.
2. The Cloud Function launches a Dataflow job based on the MongoDB to BigQuery template, loading MongoDB collections into BigQuery.
3. A second scheduled Cloud Function queries BigQuery and writes the results to Google Sheets.
4. Cloud Storage holds the archived source code for both functions.
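As a sketch of step 2, a Cloud Function can launch the Google-provided MongoDB to BigQuery flex template through the Dataflow REST API. All connection values below are placeholders, the template path and parameter names should be checked against the current Dataflow documentation, and the actual repository code may differ.

```python
from googleapiclient.discovery import build


def launch_mongodb_to_bigquery(request):
    dataflow = build("dataflow", "v1b3")
    body = {
        "launchParameter": {
            "jobName": "mdb-to-bq-job",
            # Google-provided flex template (verify path for your region)
            "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/MongoDB_to_BigQuery",
            "parameters": {
                "mongoDbUri": "mongodb+srv://user:pass@cluster/",  # placeholder
                "database": "my_database",
                "collection": "my_collection",
                "outputTableSpec": "my-project:my_dataset.my_table",
            },
        }
    }
    dataflow.projects().locations().flexTemplates().launch(
        projectId="my-gcp-project",
        location="asia-northeast3",
        body=body,
    ).execute()
    return "Dataflow job launched", 200
```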
Infrastructure as Code (IaC) Implementation
The pipeline can be built using two IaC tools:
Terraform Implementation
Source code available at: terraform-infra-gcp
```
somaz-bigquery-project/
├── README.md
├── bigquery-to-google-sheet
│   ├── bigquery.json
│   ├── main.py
│   ├── main.py.all_data
│   ├── main.py.date
│   ├── main.py.single_db
│   ├── main.py.time_UTC
│   └── requirements.txt
├── bigquery.tfvars
├── cloud-storage.tf
├── locals.tf
├── mongodb-bigquery-googlesheet-workflow.tf
├── mongodb-to-bigquery
│   ├── MongoDB_to_BigQuery
│   ├── main.py
│   ├── main.py.local
│   ├── main.py.single_db
│   └── requirement.txt
├── provider.tf
├── terraform-backend.tf
└── variables.tf
```
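For orientation, here is a hedged sketch of what bigquery-to-google-sheet/main.py might do: query BigQuery and write the rows to a sheet via the Sheets API. The query, spreadsheet ID, and range are hypothetical, and it assumes the bigquery.json in the tree above is a service-account key; see the repository for the actual implementation.

```python
from google.cloud import bigquery
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Assumption: bigquery.json is a service-account key with Sheets access.
creds = service_account.Credentials.from_service_account_file(
    "bigquery.json",
    scopes=["https://www.googleapis.com/auth/spreadsheets"],
)

# Run a placeholder query and flatten the rows into a 2D list.
bq = bigquery.Client()
rows = bq.query("SELECT name, value FROM `my_dataset.my_table`").result()
values = [[row["name"], row["value"]] for row in rows]

# Overwrite the target range in the sheet with the query results.
sheets = build("sheets", "v4", credentials=creds)
sheets.spreadsheets().values().update(
    spreadsheetId="SPREADSHEET_ID",       # placeholder
    range="Sheet1!A1",
    valueInputOption="RAW",
    body={"values": values},
).execute()
```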
Pulumi Implementation
Source code available at: pulumi-study
```
bigdata-flow-functions/
├── Pulumi.yaml
├── README.md
├── __main__.py
├── bigquery-to-google-sheet
│   ├── bigquery.json
│   ├── main.py
│   ├── main.py.all_data
│   ├── main.py.date
│   ├── main.py.single_db
│   ├── main.py.time_UTC
│   └── requirements.txt
├── bq_dataset.py
├── bq_sheet_archive.py
├── bq_sheet_function.py
├── config.py
├── mdb_bq_archive.py
├── mdb_bq_function.py
├── mongodb-to-bigquery
│   ├── main.py
│   ├── main.py.local
│   ├── main.py.single_db
│   └── requirements.txt
├── requirements.txt
├── scheduler_manager.py
├── storage.py
└── utils.py
```
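As with the Terraform version, the Pulumi program wires a Cloud Storage bucket, the Cloud Functions, and the Scheduler jobs together. The sketch below shows the general shape in pulumi_gcp; resource names, runtime, and paths are illustrative, and the real __main__.py factors this logic into the modules listed above.

```python
import pulumi
import pulumi_gcp as gcp

# Bucket that holds the zipped function source (cf. storage.py above).
bucket = gcp.storage.Bucket("functions-bucket", location="US")

source = gcp.storage.BucketObject(
    "mdb-bq-source",
    bucket=bucket.name,
    source=pulumi.FileArchive("./mongodb-to-bigquery"),  # zipped on deploy
)

# HTTP-triggered function (cf. mdb_bq_function.py above).
function = gcp.cloudfunctions.Function(
    "mdb-bq-function",
    runtime="python310",
    entry_point="main",          # assumes main.py exposes main(request)
    source_archive_bucket=bucket.name,
    source_archive_object=source.name,
    trigger_http=True,
    available_memory_mb=256,
)

# Scheduler job that calls the function daily (cf. scheduler_manager.py).
job = gcp.cloudscheduler.Job(
    "mdb-bq-schedule",
    schedule="0 1 * * *",
    time_zone="Asia/Seoul",
    http_target=gcp.cloudscheduler.JobHttpTargetArgs(
        uri=function.https_trigger_url,
        http_method="POST",
    ),
)

pulumi.export("function_url", function.https_trigger_url)
```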