GSC → BigQuery Pipeline

60 customers, 53M+ rows, solving GSC's data retention limitation

Data Engineering · SEO · BigQuery · GCP
Client: Copenhagen-based SEO agency
Period: September 2025 — Present
Role: Independent Consultant
Key Impact:
53.3M+ rows accumulated | 70,000+ rows ingested daily | 60 customers on single infrastructure | 16+ months historical data preserved
Google Cloud Run · BigQuery · Cloud Scheduler · Google Search Console API

The Problem

The agency manages SEO for 60 clients. They faced a critical operational problem:

Google Search Console only retains 16 months of data. For an SEO agency tracking long-term performance, this meant:

  • No historical trend analysis beyond 16 months
  • Manual data exports prone to gaps and errors
  • No cross-client benchmarking capability
  • Reporting limited to GSC’s native interface

Technical Challenges

GSC API Data Suppression: When querying GSC API with multiple dimensions, significant data is suppressed for privacy:

  • 65% of clicks missing when combining dimensions
  • 22% of impressions missing
  • The suppression is more severe in the API than what the GSC UI shows (see the measurement sketch below)
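
One way to quantify this gap is to pull the same date window twice, once with only the date dimension (near-complete totals) and once with page/query dimensions (suppressed detail), and compare the click sums. A minimal sketch, assuming the google-api-python-client Search Console client and already-authorized credentials; the site URL and dates are placeholders:

```python
from googleapiclient.discovery import build

def click_suppression_ratio(creds, site_url, start_date, end_date):
    """Estimate how many clicks go missing when page/query dimensions are requested."""
    gsc = build("searchconsole", "v1", credentials=creds)

    def total_clicks(dimensions):
        clicks, start_row = 0.0, 0
        while True:
            resp = gsc.searchanalytics().query(
                siteUrl=site_url,
                body={
                    "startDate": start_date,
                    "endDate": end_date,
                    "dimensions": dimensions,
                    "rowLimit": 25000,
                    "startRow": start_row,
                },
            ).execute()
            rows = resp.get("rows", [])
            clicks += sum(r["clicks"] for r in rows)
            if len(rows) < 25000:
                break
            start_row += 25000
        return clicks

    by_date = total_clicks(["date"])                    # near-complete daily totals
    by_dims = total_clicks(["date", "page", "query"])   # privacy-suppressed detail rows
    return 1 - by_dims / by_date if by_date else 0.0
```

This is the same comparison that motivates keeping both a date-only staging table and a dimensioned one in the architecture below: the date-level totals act as the reconciliation baseline.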

Scale Requirements:

  • 60 customers with varying property configurations
  • 16 months of historical backfill needed
  • Daily incremental updates
  • Zero manual intervention required

The Solution

Platform Evolution

The project went through three iterations—a valuable lesson in matching tools to requirements:

Version 1: Make.com
├── Problem: Rate limits, limited error handling, hard to debug
└── Outcome: Worked for ~10 customers, broke at scale

Version 2: n8n
├── Problem: Better than Make.com, but still limited control
└── Outcome: Could handle more, but reliability issues persisted

Version 3: Cloud Run (Final)
├── Full Python control
├── Proper retry/backoff
├── Resume from failure
├── Hash-based deduplication
└── Outcome: Production-grade, handles 60 customers reliably

Key Learning: The client initially wanted self-maintainability (no-code), but production data infrastructure requires engineering ownership. The “build it so they can maintain it” approach failed; the “build it right” approach succeeded.

Architecture

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Cloud Scheduler │────▶│    Cloud Run     │────▶│     BigQuery     │
│  (Daily trigger) │     │  (Python service)│     │  (Data warehouse)│
└──────────────────┘     └──────────────────┘     └──────────────────┘

                    ┌─────────────┼─────────────┐
                    ▼             ▼             ▼
            ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
            │ GSC API     │ │ Sheets API  │ │ Secret Mgr  │
            │ (data)      │ │ (config)    │ │ (creds)     │
            └─────────────┘ └─────────────┘ └─────────────┘
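
The Scheduler-to-Cloud-Run handoff is just an authenticated HTTP call into the Python service. A minimal sketch of what that entrypoint could look like, assuming a small Flask app; the route, query parameter, and run_pipeline helper are illustrative rather than the actual implementation:

```python
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/run", methods=["POST"])
def run():
    # Cloud Scheduler hits this endpoint once per day (OIDC-authenticated invoker).
    customer_id = request.args.get("customer_id")   # optional: re-run a single customer
    result = run_pipeline(customer_id=customer_id)  # hypothetical orchestration entrypoint
    return jsonify(result), 200

def run_pipeline(customer_id=None):
    # 1. Load customer config from the Google Sheet
    # 2. Determine missing date ranges per customer (gap detection)
    # 3. Pull GSC data, hash rows, load into staging, merge into the fact table
    return {"status": "ok", "customer_id": customer_id}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```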

Data Architecture (3-Layer)

┌─────────────────────────────────────────────────────────────┐
│                     STAGING LAYER                           │
│  stg_gsc_by_date        │  stg_gsc_by_dimensions           │
│  (daily totals)         │  (page/query combinations)       │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│                       FACT LAYER                            │
│  fct_gsc_performance                                        │
│  - Unified fact table                                       │
│  - is_complete flag for reconciliation                      │
│  - Hash-based deduplication                                 │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│                       MART LAYER                            │
│  mart_gsc_reporting     │  mart_gsc_pages                  │
│  (customer dashboards)  │  (page-level analysis)           │
└─────────────────────────────────────────────────────────────┘
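
A sketch of the staging-to-fact promotion with hash-based deduplication, expressed as a BigQuery MERGE issued from the Python service. The dataset and column names follow the tables above but are illustrative, and the google-cloud-bigquery client is assumed:

```python
from google.cloud import bigquery

MERGE_SQL = """
MERGE `project.dataset.fct_gsc_performance` AS fct
USING (
  SELECT
    customer_id, date, page, query, clicks, impressions, ctr, position,
    TO_HEX(SHA256(CONCAT(customer_id, '|', CAST(date AS STRING), '|',
                         IFNULL(page, ''), '|', IFNULL(query, '')))) AS row_hash
  FROM `project.dataset.stg_gsc_by_dimensions`
  WHERE date BETWEEN @start_date AND @end_date
) AS stg
ON fct.row_hash = stg.row_hash
WHEN NOT MATCHED THEN
  INSERT (customer_id, date, page, query, clicks, impressions, ctr, position, row_hash)
  VALUES (customer_id, date, page, query, clicks, impressions, ctr, position, row_hash)
"""

def promote_to_fact(client: bigquery.Client, start_date: str, end_date: str):
    """Re-runnable promotion: the hash key makes retries and backfills idempotent."""
    job = client.query(
        MERGE_SQL,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
                bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
            ]
        ),
    )
    job.result()  # wait for the merge to complete
```

The production table additionally carries the is_complete flag noted above, used to reconcile dimensioned rows against the date-level totals.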

Reliability Features

Feature            | Implementation                  | Purpose
-------------------|---------------------------------|-------------------------------
Retry/Backoff      | Exponential backoff with jitter | Handle GSC API rate limits
Resume Support     | Run-level state tracking        | Recover from partial failures
Hash Deduplication | SHA256 on key columns           | Prevent duplicate rows
Run Logging        | Detailed run_log table          | Debugging and auditing
Smart Date Ranges  | Gap detection + incremental     | Minimize API calls
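
The retry behaviour in the table above can be wrapped around every GSC API call as a small decorator. A minimal sketch, assuming googleapiclient's HttpError for rate-limit responses; the attempt counts and delays are illustrative:

```python
import random
import time
from functools import wraps

from googleapiclient.errors import HttpError

def with_backoff(max_attempts=5, base_delay=2.0, max_delay=60.0):
    """Retry transient GSC API failures with exponential backoff and full jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except HttpError as err:
                    retryable = err.resp.status in (429, 500, 503)
                    if not retryable or attempt == max_attempts:
                        raise
                    # Exponential backoff with jitter spreads retries across customers.
                    delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
                    time.sleep(random.uniform(0, delay))
        return wrapper
    return decorator
```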

Configuration Management

Single Google Sheet drives all 60 customers:

Column         | Purpose
---------------|-----------------------
customer_id    | Unique identifier
gsc_property   | URL or domain
timezone       | Date normalization
backfill_start | Historical start date
active         | Enable/disable

Zero-code customer onboarding: Add a row, system picks it up automatically.
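
A sketch of what that config read could look like with the Sheets API v4 client; the spreadsheet ID, tab name, and truthy-value convention are placeholders, and service-account credentials are assumed:

```python
from googleapiclient.discovery import build

CONFIG_RANGE = "customers!A:E"  # customer_id, gsc_property, timezone, backfill_start, active

def load_customer_config(creds, spreadsheet_id):
    """Read the config sheet and return the active customers as dicts."""
    sheets = build("sheets", "v4", credentials=creds)
    values = sheets.spreadsheets().values().get(
        spreadsheetId=spreadsheet_id, range=CONFIG_RANGE
    ).execute().get("values", [])

    header, rows = values[0], values[1:]
    customers = [dict(zip(header, row)) for row in rows]
    # Only customers flagged active are processed; new rows are picked up on the next run.
    return [c for c in customers if c.get("active", "").strip().lower() in ("true", "yes", "1")]
```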


Results

Production Metrics

Metric              | Value
--------------------|------------------------------
Total rows inserted | 53,300,000+
Daily ingestion     | ~70,000 rows
Customers served    | 60
Historical depth    | 16 months
Uptime              | In production since Sep 2025

Business Value Delivered

Outcome                   | Impact
--------------------------|--------------------------------------------------
Historical data preserved | SEO data secured indefinitely
60 customers automated    | All clients served through single infrastructure
Zero manual exports       | Eliminated time-consuming data handling
Dashboard-ready data      | Powers Looker Studio reporting for all clients

Operational Benefits

  • Before: Data older than 16 months was lost forever
    After: Complete historical record available for all clients

  • Before: Manual exports, CSV wrangling, inconsistent data formats
    After: Automated daily ingestion, standardized data model

  • Before: Each new client required custom setup
    After: Add one row to the config sheet; the system automatically includes the new client


Technology Stack

Layer          | Technology
---------------|----------------------------
Compute        | Google Cloud Run
Scheduling     | Cloud Scheduler
Data Warehouse | BigQuery
Secrets        | Secret Manager
Configuration  | Google Sheets API
Source API     | Google Search Console API
Visualization  | Looker Studio

Impact

This pipeline changed how the agency operates:

  • Historical data is now an asset. SEO strategies can reference 16+ months of trends instead of losing data to GSC’s rolling window.
  • 60 clients, one system. Adding a new customer takes one row in a config sheet—no custom setup, no additional infrastructure.
  • Analyst time freed up. Hours previously spent on manual exports now go toward strategy and client work.

The agency can now pitch long-term SEO tracking as a differentiator—something competitors relying on native GSC can’t offer.


Lessons Learned

  1. No-code platforms have ceilings. Make.com and n8n work for simple workflows but break under production data engineering requirements.

  2. GSC API suppression is underestimated. The 65% click suppression on multi-dimension queries is a fundamental data quality issue that most GSC tools ignore.

  3. Configuration-driven architecture scales. One config sheet managing 60 customers is operationally superior to per-customer deployments.

  4. Hash deduplication is non-negotiable. Without it, re-runs and retries create duplicate data that corrupts dashboards.

  5. AI-assisted development accelerates delivery. The entire production system was built using Claude Code—dramatically faster than traditional development for this type of data engineering work.


Future Roadmap

  • Enhanced suppression handling: Complement GSC data with Google Analytics and third-party ranking tools
  • Anomaly alerting: Automatic notifications for significant ranking changes
  • Competitive intelligence: Integrate Ahrefs Share of Voice data

Want to discuss data pipelines?

Need to build production data infrastructure that actually stays running? I help teams move from fragile no-code automations to reliable cloud systems. Get in touch.
