Data Engineering · SEO · BigQuery · GCP

GSC → BigQuery Pipeline

60 customers, 53M+ rows, solving GSC's data retention limitation

Client: Copenhagen-based SEO agency
Industry: Digital Marketing
Period: September 2025 - Present
Role: Independent Consultant

The Problem

The agency manages SEO for 60 clients. They faced a critical operational problem:

Google Search Console retains only 16 months of data. For an SEO agency tracking long-term performance, this meant:

  • No historical trend analysis beyond 16 months
  • Manual data exports prone to gaps and errors
  • No cross-client benchmarking capability
  • Reporting limited to GSC’s native interface

Technical Challenges

GSC API Data Suppression: When querying GSC API with multiple dimensions, significant data is suppressed for privacy:

  • 65% of clicks missing when combining dimensions
  • 22% of impressions missing
  • Suppression is more aggressive in the API than in the GSC UI
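To make the suppression concrete, here is a minimal sketch of the request body sent to the Search Analytics API's `searchanalytics.query` endpoint. The `build_gsc_query` helper and the dates are illustrative, not the production code: requesting many dimensions at once triggers the privacy thresholds, while a coarse date-only query returns near-complete totals.

```python
from datetime import date


def build_gsc_query(start: date, end: date, dimensions: list[str],
                    row_limit: int = 25000, start_row: int = 0) -> dict:
    """Build a Search Analytics query body for one page of results.

    Combining dimensions like query + page + country + device triggers
    the privacy suppression described above; querying fewer dimensions
    returns more complete click and impression totals.
    """
    return {
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        "dimensions": dimensions,
        "rowLimit": row_limit,   # the API caps a single page at 25,000 rows
        "startRow": start_row,   # paginate by advancing startRow
    }


# A detailed (heavily suppressed) vs. a coarse (near-complete) query
detailed = build_gsc_query(date(2025, 9, 1), date(2025, 9, 30),
                           ["query", "page", "country", "device"])
coarse = build_gsc_query(date(2025, 9, 1), date(2025, 9, 30), ["date"])
```

Comparing the totals returned by the two query shapes is how the 65%/22% suppression figures above can be measured per property.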

Scale Requirements:

  • 60 customers with varying property configurations
  • 16 months of historical backfill needed
  • Daily incremental updates
  • Zero manual intervention required
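The "daily incremental + backfill" requirement boils down to computing which dates are still missing for each customer. A hypothetical sketch (the helper name and the 2-day lag are assumptions; GSC data typically finalizes a few days late):

```python
from datetime import date, timedelta


def dates_to_fetch(loaded: set[date], backfill_start: date,
                   today: date, lag_days: int = 2) -> list[date]:
    """Return the dates still missing from the warehouse, oldest first.

    GSC data finalizes a couple of days late, so the window stops
    lag_days before today. The same logic covers daily increments,
    historical backfill, and gaps left by partial failures.
    """
    end = today - timedelta(days=lag_days)
    missing = []
    d = backfill_start
    while d <= end:
        if d not in loaded:
            missing.append(d)
        d += timedelta(days=1)
    return missing
```

Driving ingestion from this helper means a fresh customer backfills automatically and an interrupted run simply resumes, with no separate code paths.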

The Solution

Platform Evolution

The project went through three iterations - a valuable lesson in matching tools to requirements:

Platform evolution from Make.com to n8n to Cloud Run

Key Learning: Client wanted self-maintainability (no-code). Production data infrastructure requires engineering ownership. The “build it so they can maintain it” approach failed; the “build it right” approach succeeded.

Architecture

Pipeline architecture: Cloud Scheduler triggers Cloud Run, which queries GSC API, Sheets API, and Secret Manager, then loads into BigQuery

Data Architecture (3-Layer)

3-layer data architecture: Staging tables flow into unified Fact table, then into Mart views for reporting

Reliability Features

| Feature | Implementation | Purpose |
| --- | --- | --- |
| Retry/Backoff | Exponential backoff with jitter | Handle GSC API rate limits |
| Resume Support | Run-level state tracking | Recover from partial failures |
| Hash Deduplication | SHA256 on key columns | Prevent duplicate rows |
| Run Logging | Detailed run_log table | Debugging and auditing |
| Smart Date Ranges | Gap detection + incremental | Minimize API calls |
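The retry behavior in the first row can be sketched as exponential backoff with full jitter. This is a simplified illustration, not the production implementation; the `sleep` parameter exists so the logic is testable without waiting:

```python
import random
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                 retryable=(Exception,), sleep=time.sleep):
    """Call fn(), retrying transient failures with capped exponential
    backoff plus full jitter (random sleep up to the backoff ceiling)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the run log
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Full jitter spreads retries out over time, which matters when 60 customers hit the same GSC API quota from one service.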

Configuration Management

A single Google Sheet drives the configuration for all 60 customers:

| Column | Purpose |
| --- | --- |
| customer_id | Unique identifier |
| gsc_property | URL or domain property |
| timezone | Date normalization |
| backfill_start | Historical start date |
| active | Enable/disable flag |

Zero-code customer onboarding: Add a row, system picks it up automatically.
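A minimal sketch of how such a sheet row could be parsed into a typed config; the `CustomerConfig` shape mirrors the columns above, but the names and the TRUE/FALSE convention are illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CustomerConfig:
    customer_id: str
    gsc_property: str
    timezone: str
    backfill_start: date
    active: bool


def parse_config_rows(rows: list[list[str]]) -> list[CustomerConfig]:
    """Turn raw sheet rows into configs, keeping only active customers."""
    configs = []
    for customer_id, gsc_property, tz, backfill_start, active in rows:
        if active.strip().upper() != "TRUE":
            continue  # disabled customers are skipped, not deleted
        configs.append(CustomerConfig(
            customer_id=customer_id,
            gsc_property=gsc_property,
            timezone=tz,
            backfill_start=date.fromisoformat(backfill_start),
            active=True,
        ))
    return configs
```

Validating rows up front like this turns a typo in the sheet into a clear parse error rather than a silent pipeline failure.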


Results

Production Metrics

| Metric | Value |
| --- | --- |
| Total rows inserted | 53,300,000+ |
| Daily ingestion | ~70,000 rows |
| Customers served | 60 |
| Historical depth | 16+ months |
| Status | In production since Sep 2025 |

Business Value Delivered

| Outcome | Impact |
| --- | --- |
| Historical data preserved | SEO data secured indefinitely |
| 60 customers automated | All clients served through a single infrastructure |
| Zero manual exports | Eliminated time-consuming data handling |
| Dashboard-ready data | Powers Looker Studio reporting for all clients |

Operational Benefits

  • Before: Data older than 16 months was lost forever

  • After: Complete historical record available for all clients

  • Before: Manual exports, CSV wrangling, inconsistent data formats

  • After: Automated daily ingestion, standardized data model

  • Before: Each new client required custom setup

  • After: Add one row to a config sheet - system automatically includes new client


Technology Stack

| Layer | Technology |
| --- | --- |
| Compute | Google Cloud Run |
| Scheduling | Cloud Scheduler |
| Data warehouse | BigQuery |
| Secrets | Secret Manager |
| Configuration | Google Sheets API |
| Source API | Google Search Console API |
| Visualization | Looker Studio |

Impact

This pipeline changed how the agency operates:

  • Historical data is now an asset. SEO strategies can reference 16+ months of trends instead of losing data to GSC’s rolling window.
  • 60 clients, one system. Adding a new customer takes one row in a config sheet - no custom setup, no additional infrastructure.
  • Analyst time freed up. Hours previously spent on manual exports now go toward strategy and client work.

The agency can now pitch long-term SEO tracking as a differentiator - something competitors relying on native GSC can’t offer.


Lessons Learned

  1. No-code platforms have ceilings. Make.com and n8n work for simple workflows but break under production data engineering requirements.

  2. GSC API suppression is underestimated. The 65% click suppression on multi-dimension queries is a fundamental data quality issue that most GSC tools ignore.

  3. Configuration-driven architecture scales. One config sheet managing 60 customers is operationally superior to per-customer deployments.

  4. Hash deduplication is non-negotiable. Without it, re-runs and retries create duplicate data that corrupts dashboards.

  5. AI-assisted development accelerates delivery. The entire production system was built using Claude Code - dramatically faster than traditional development for this type of data engineering work.
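The deduplication idea in lesson 4 can be sketched as a stable SHA256 key over the identifying columns. This is a simplified illustration; the actual key columns in production may differ:

```python
import hashlib


def row_hash(row: dict, key_columns: list[str]) -> str:
    """SHA256 over the key columns, used as a stable deduplication key.

    A separator that cannot occur in the values avoids collisions
    like ("ab", "c") vs ("a", "bc").
    """
    payload = "\x1f".join(str(row[col]) for col in key_columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the hash is deterministic, a re-run produces the same keys as the original run, so inserts can be made idempotent (for example via a BigQuery MERGE on the hash column) and retries never create duplicate rows.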


Future Roadmap

  • Enhanced suppression handling: Complement GSC data with Google Analytics and third-party ranking tools
  • Anomaly alerting: Automatic notifications for significant ranking changes
  • Competitive intelligence: Integrate Ahrefs Share of Voice data

Want to discuss data pipelines?

Need to build production data infrastructure that actually stays running? I help teams move from fragile no-code automations to reliable cloud systems. Get in touch.