GSC → BigQuery Pipeline
60 customers, 53M+ rows, solving GSC's data retention limitation
The Problem
The agency manages SEO for 60 clients and faced a critical operational problem: Google Search Console retains only 16 months of data. For an agency tracking long-term performance, this meant:
- No historical trend analysis beyond 16 months
- Manual data exports prone to gaps and errors
- No cross-client benchmarking capability
- Reporting limited to GSC’s native interface
Technical Challenges
GSC API Data Suppression: When querying the GSC API with multiple dimensions, a significant share of the data is suppressed for privacy reasons:
- 65% of clicks missing when combining dimensions (e.g., page + query)
- 22% of impressions missing
- The suppression is more severe in the API than the GSC UI suggests (the sketch after this list shows one way to measure the gap)
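To make the suppression concrete, here is a minimal sketch of how the gap can be measured: query one day's clicks with only the date dimension, then again broken down by page and query, and compare the sums. The property URL, date, and service-account key file are placeholders, not values from the production system.

```python
# Minimal sketch: measure suppression by comparing a date-only total
# against the page+query breakdown of the same day. The property URL,
# date, and key file path are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)  # hypothetical key file
gsc = build("searchconsole", "v1", credentials=creds)

def clicks_total(site_url: str, day: str, dimensions: list[str]) -> int:
    """Sum clicks for one day across all pages of the API response."""
    total, start_row = 0, 0
    while True:
        resp = gsc.searchanalytics().query(siteUrl=site_url, body={
            "startDate": day, "endDate": day,
            "dimensions": dimensions,
            "rowLimit": 25000, "startRow": start_row,
        }).execute()
        rows = resp.get("rows", [])
        total += sum(r["clicks"] for r in rows)
        if len(rows) < 25000:
            return total
        start_row += 25000

site, day = "sc-domain:example.com", "2025-09-01"
by_date = clicks_total(site, day, ["date"])
by_dims = clicks_total(site, day, ["date", "page", "query"])
print(f"clicks suppressed in breakdown: {1 - by_dims / by_date:.0%}")
```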
Scale Requirements:
- 60 customers with varying property configurations
- 16 months of historical backfill needed
- Daily incremental updates
- Zero manual intervention required
The Solution
Platform Evolution
The project went through three iterations—a valuable lesson in matching tools to requirements:
Version 1: Make.com
├── Problem: Rate limits, limited error handling, hard to debug
└── Outcome: Worked for ~10 customers, broke at scale
Version 2: n8n
├── Problem: Better than Make.com, but still limited control
└── Outcome: Could handle more, but reliability issues persisted
Version 3: Cloud Run (Final)
├── Full Python control
├── Proper retry/backoff
├── Resume from failure
├── Hash-based deduplication
└── Outcome: Production-grade, handles all 60 customers reliably
Key Learning: The client wanted self-maintainability (no-code), but production data infrastructure requires engineering ownership. The “build it so they can maintain it” approach failed; the “build it right” approach succeeded.
Architecture
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Cloud Scheduler │────▶│ Cloud Run │────▶│ BigQuery │
│ (Daily trigger) │ │ (Python service)│ │ (Data warehouse)│
└──────────────────┘ └──────────────────┘ └──────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ GSC API │ │ Sheets API │ │ Secret Mgr │
│ (data) │ │ (config) │ │ (creds) │
└─────────────┘ └─────────────┘ └─────────────┘
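As a sketch of what the Cloud Run box can look like in code: a small Flask service that Cloud Scheduler triggers daily over HTTP. `load_customer_config` and `ingest_customer` are hypothetical stand-ins for the real ingestion modules, stubbed here so the sketch runs on its own.

```python
# Sketch of the Cloud Run entrypoint: a small Flask app that Cloud
# Scheduler triggers daily over HTTP. load_customer_config and
# ingest_customer are hypothetical stand-ins for the real modules.
import os
from flask import Flask

app = Flask(__name__)

def load_customer_config() -> list[dict]:
    # Placeholder: the real service reads the Google Sheet described below.
    return [{"customer_id": "demo", "gsc_property": "sc-domain:example.com"}]

def ingest_customer(customer: dict) -> int:
    # Placeholder: the real service pulls GSC data and loads BigQuery.
    return 0

@app.post("/run")
def run_ingestion():
    # Customers are processed independently so one failure cannot
    # block the remaining ones; outcomes are recorded per customer.
    results = {}
    for customer in load_customer_config():
        try:
            results[customer["customer_id"]] = ingest_customer(customer)
        except Exception as exc:
            results[customer["customer_id"]] = f"failed: {exc}"
    return results  # Flask serializes the dict to JSON

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```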
Data Architecture (3-Layer)
┌─────────────────────────────────────────────────────────────┐
│ STAGING LAYER │
│ stg_gsc_by_date │ stg_gsc_by_dimensions │
│ (daily totals) │ (page/query combinations) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ FACT LAYER │
│ fct_gsc_performance │
│ - Unified fact table │
│ - is_complete flag for reconciliation │
│ - Hash-based deduplication │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ MART LAYER │
│ mart_gsc_reporting │ mart_gsc_pages │
│ (customer dashboards) │ (page-level analysis) │
└─────────────────────────────────────────────────────────────┘
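A sketch of the staging-to-fact promotion, assuming illustrative table and column names based on the diagram above. The row hash is computed in BigQuery with SHA256 over the natural-key columns, and MERGE makes re-runs idempotent: rows whose hash already exists in the fact table are skipped. The is_complete flag would then be set by a follow-up reconciliation that compares fact totals against the date-level staging table.

```python
# Sketch of staging-to-fact promotion with hash-based deduplication.
# Table and column names are illustrative. SHA256 over the natural key
# makes re-runs idempotent: already-loaded rows fail the NOT MATCHED test.
from google.cloud import bigquery

client = bigquery.Client()

MERGE_SQL = """
MERGE `project.dataset.fct_gsc_performance` AS fct
USING (
  SELECT
    customer_id, date, page, query, clicks, impressions,
    TO_HEX(SHA256(CONCAT(
      customer_id, '|', CAST(date AS STRING), '|',
      IFNULL(page, ''), '|', IFNULL(query, '')
    ))) AS row_hash
  FROM `project.dataset.stg_gsc_by_dimensions`
) AS stg
ON fct.row_hash = stg.row_hash
WHEN NOT MATCHED THEN
  INSERT (customer_id, date, page, query, clicks, impressions, row_hash)
  VALUES (stg.customer_id, stg.date, stg.page, stg.query,
          stg.clicks, stg.impressions, stg.row_hash)
"""

client.query(MERGE_SQL).result()  # blocks until the merge completes
```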
Reliability Features
| Feature | Implementation | Purpose |
|---|---|---|
| Retry/Backoff | Exponential backoff with jitter | Handle GSC API rate limits |
| Resume Support | Run-level state tracking | Recover from partial failures |
| Hash Deduplication | SHA256 on key columns | Prevent duplicate rows |
| Run Logging | Detailed run_log table | Debugging and auditing |
| Smart Date Ranges | Gap detection + incremental | Minimize API calls |
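Two of the features above, sketched in Python: exponential backoff with full jitter around GSC API calls, and date-gap detection for smart date ranges. Attempt counts, delay caps, and table names are illustrative, not the production values.

```python
# Sketches of two reliability features: exponential backoff with full
# jitter, and gap detection for smart date ranges. Attempt counts, delay
# caps, and table names are illustrative, not the production values.
import random
import time

from google.cloud import bigquery
from googleapiclient.errors import HttpError

def with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn(); on a retryable GSC API error, back off with full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except HttpError as err:
            if err.resp.status not in (429, 500, 502, 503):
                raise                                # not retryable
            if attempt == max_attempts - 1:
                raise                                # retries exhausted
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))     # full jitter

def missing_dates(client: bigquery.Client, customer_id: str) -> list:
    """Return dates in the expected window that have no fact rows yet."""
    sql = """
    SELECT day
    FROM UNNEST(GENERATE_DATE_ARRAY(
      DATE_SUB(CURRENT_DATE(), INTERVAL 16 MONTH),
      DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)  -- GSC data lags ~2 days
    )) AS day
    LEFT JOIN `project.dataset.fct_gsc_performance` f
      ON f.date = day AND f.customer_id = @customer_id
    WHERE f.date IS NULL
    ORDER BY day
    """
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("customer_id", "STRING", customer_id)])
    return [row.day for row in client.query(sql, job_config=job_config).result()]
```

Fetching only the missing dates is what keeps API usage minimal: a normal daily run requests one or two recent days, while a failed run is automatically healed on the next pass.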
Configuration Management
Single Google Sheet drives all 60 customers:
| Column | Purpose |
|---|---|
| customer_id | Unique identifier |
| gsc_property | URL or domain |
| timezone | Date normalization |
| backfill_start | Historical start date |
| active | Enable/disable |
Zero-code customer onboarding: Add a row, system picks it up automatically.
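A sketch of how such a config loader can look, assuming the five columns above live in a sheet tab named "config" (a hypothetical name). Each sheet row becomes a customer dict, and inactive rows are filtered out.

```python
# Sketch of the config loader, assuming the five columns above live in a
# tab named "config" (hypothetical). Inactive customers are filtered out.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)  # hypothetical key file
sheets = build("sheets", "v4", credentials=creds)

def load_customer_config(spreadsheet_id: str) -> list[dict]:
    """Read the config sheet and return one dict per active customer."""
    values = sheets.spreadsheets().values().get(
        spreadsheetId=spreadsheet_id, range="config!A:E"
    ).execute().get("values", [])
    header, rows = values[0], values[1:]
    customers = [dict(zip(header, row)) for row in rows]
    return [c for c in customers if c.get("active", "").upper() == "TRUE"]
```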
Results
Production Metrics
| Metric | Value |
|---|---|
| Total rows inserted | 53,300,000+ |
| Daily ingestion | ~70,000 rows |
| Customers served | 60 |
| Historical depth | 16 months |
| In production since | Sep 2025 |
Business Value Delivered
| Outcome | Impact |
|---|---|
| Historical data preserved | SEO data secured indefinitely |
| 60 customers automated | All clients served through single infrastructure |
| Zero manual exports | Eliminated time-consuming data handling |
| Dashboard-ready data | Powers Looker Studio reporting for all clients |
Operational Benefits
- Before: Data older than 16 months was lost forever → After: Complete historical record available for all clients
- Before: Manual exports, CSV wrangling, inconsistent data formats → After: Automated daily ingestion, standardized data model
- Before: Each new client required custom setup → After: Add one row to a config sheet and the system automatically includes the new client
Technology Stack
| Layer | Technology |
|---|---|
| Compute | Google Cloud Run |
| Scheduling | Cloud Scheduler |
| Data Warehouse | BigQuery |
| Secrets | Secret Manager |
| Configuration | Google Sheets API |
| Source API | Google Search Console API |
| Visualization | Looker Studio |
Impact
This pipeline changed how the agency operates:
- Historical data is now an asset. SEO strategies can reference 16+ months of trends instead of losing data to GSC’s rolling window.
- 60 clients, one system. Adding a new customer takes one row in a config sheet—no custom setup, no additional infrastructure.
- Analyst time freed up. Hours previously spent on manual exports now go toward strategy and client work.
The agency can now pitch long-term SEO tracking as a differentiator—something competitors relying on native GSC can’t offer.
Lessons Learned
- No-code platforms have ceilings. Make.com and n8n work for simple workflows but break under production data engineering requirements.
- GSC API suppression is underestimated. The 65% click suppression on multi-dimension queries is a fundamental data quality issue that most GSC tools ignore.
- Configuration-driven architecture scales. One config sheet managing 60 customers is operationally superior to per-customer deployments.
- Hash deduplication is non-negotiable. Without it, re-runs and retries create duplicate data that corrupts dashboards.
- AI-assisted development accelerates delivery. The entire production system was built using Claude Code—dramatically faster than traditional development for this type of data engineering work.
Future Roadmap
- Enhanced suppression handling: Complement GSC data with Google Analytics and third-party ranking tools
- Anomaly alerting: Automatic notifications for significant ranking changes
- Competitive intelligence: Integrate Ahrefs Share of Voice data
Want to discuss data pipelines?
Need to build production data infrastructure that actually stays running? I help teams move from fragile no-code automations to reliable cloud systems. Get in touch.