GSC → BigQuery Pipeline
60 customers, 53M+ rows, solving GSC's data retention limitation
The Problem
The agency manages SEO for 60 clients and faced a critical operational problem: Google Search Console retains only 16 months of data. For an agency tracking long-term performance, this meant:
- No historical trend analysis beyond 16 months
- Manual data exports prone to gaps and errors
- No cross-client benchmarking capability
- Reporting limited to GSC’s native interface
Technical Challenges
GSC API Data Suppression: When querying the GSC API with multiple dimensions, a significant share of the data is suppressed for privacy reasons:
- 65% of clicks missing when combining dimensions (e.g., page + query)
- 22% of impressions missing
- The suppression is more severe in the API than the GSC UI suggests (the sketch after this list shows one way to measure the gap)
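To make the suppression concrete, here is a minimal sketch of how the gap can be measured: query one day's clicks with only the date dimension, then again broken down by page and query, and compare the sums. The property URL, date, and service-account key file are placeholders, not values from the production system.

```python
# Minimal sketch: measure suppression by comparing a date-only total
# against the page+query breakdown of the same day. The property URL,
# date, and key file path are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)  # hypothetical key file
gsc = build("searchconsole", "v1", credentials=creds)

def clicks_total(site_url: str, day: str, dimensions: list[str]) -> int:
    """Sum clicks for one day across all pages of the API response."""
    total, start_row = 0, 0
    while True:
        resp = gsc.searchanalytics().query(siteUrl=site_url, body={
            "startDate": day, "endDate": day,
            "dimensions": dimensions,
            "rowLimit": 25000, "startRow": start_row,
        }).execute()
        rows = resp.get("rows", [])
        total += sum(r["clicks"] for r in rows)
        if len(rows) < 25000:
            return total
        start_row += 25000

site, day = "sc-domain:example.com", "2025-09-01"
by_date = clicks_total(site, day, ["date"])
by_dims = clicks_total(site, day, ["date", "page", "query"])
print(f"clicks suppressed in breakdown: {1 - by_dims / by_date:.0%}")
```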
Scale Requirements:
- 60 customers with varying property configurations
- 16 months of historical backfill needed
- Daily incremental updates
- Zero manual intervention required
The Solution
Platform Evolution
The project went through three iterations—a valuable lesson in matching tools to requirements:
Version 1: Make.com
├── Problem: Rate limits, limited error handling, hard to debug
└── Outcome: Worked for ~10 customers, broke at scale
Version 2: n8n
├── Problem: Better than Make.com, but still limited control
└── Outcome: Could handle more, but reliability issues persisted
Version 3: Cloud Run (Final)
├── Full Python control
├── Proper retry/backoff
├── Resume from failure
├── Hash-based deduplication
└── Outcome: Production-grade, handles all 60 customers reliably
Key Learning: The client wanted self-maintainability (no-code), but production data infrastructure requires engineering ownership. The “build it so they can maintain it” approach failed; the “build it right” approach succeeded.
Architecture
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Cloud Scheduler │────▶│ Cloud Run │────▶│ BigQuery │
│ (Daily trigger) │ │ (Python service)│ │ (Data warehouse)│
└──────────────────┘ └──────────────────┘ └──────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ GSC API │ │ Sheets API │ │ Secret Mgr │
│ (data) │ │ (config) │ │ (creds) │
└─────────────┘ └─────────────┘ └─────────────┘
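As a sketch of what the Cloud Run box can look like in code: a small Flask service that Cloud Scheduler triggers daily over HTTP. `load_customer_config` and `ingest_customer` are hypothetical stand-ins for the real ingestion modules, stubbed here so the sketch runs on its own.

```python
# Sketch of the Cloud Run entrypoint: a small Flask app that Cloud
# Scheduler triggers daily over HTTP. load_customer_config and
# ingest_customer are hypothetical stand-ins for the real modules.
import os
from flask import Flask

app = Flask(__name__)

def load_customer_config() -> list[dict]:
    # Placeholder: the real service reads the Google Sheet described below.
    return [{"customer_id": "demo", "gsc_property": "sc-domain:example.com"}]

def ingest_customer(customer: dict) -> int:
    # Placeholder: the real service pulls GSC data and loads BigQuery.
    return 0

@app.post("/run")
def run_ingestion():
    # Customers are processed independently so one failure cannot
    # block the remaining ones; outcomes are recorded per customer.
    results = {}
    for customer in load_customer_config():
        try:
            results[customer["customer_id"]] = ingest_customer(customer)
        except Exception as exc:
            results[customer["customer_id"]] = f"failed: {exc}"
    return results  # Flask serializes the dict to JSON

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```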
Data Architecture (3-Layer)
┌─────────────────────────────────────────────────────────────┐
│ STAGING LAYER │
│ stg_gsc_by_date │ stg_gsc_by_dimensions │
│ (daily totals) │ (page/query combinations) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ FACT LAYER │
│ fct_gsc_performance │
│ - Unified fact table │
│ - is_complete flag for reconciliation │
│ - Hash-based deduplication │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ MART LAYER │
│ mart_gsc_reporting │ mart_gsc_pages │
│ (customer dashboards) │ (page-level analysis) │
└─────────────────────────────────────────────────────────────┘
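A sketch of the staging-to-fact promotion, assuming illustrative table and column names based on the diagram above. The row hash is computed in BigQuery with SHA256 over the natural-key columns, and MERGE makes re-runs idempotent: rows whose hash already exists in the fact table are skipped. The is_complete flag would then be set by a follow-up reconciliation that compares fact totals against the date-level staging table.

```python
# Sketch of staging-to-fact promotion with hash-based deduplication.
# Table and column names are illustrative. SHA256 over the natural key
# makes re-runs idempotent: already-loaded rows fail the NOT MATCHED test.
from google.cloud import bigquery

client = bigquery.Client()

MERGE_SQL = """
MERGE `project.dataset.fct_gsc_performance` AS fct
USING (
  SELECT
    customer_id, date, page, query, clicks, impressions,
    TO_HEX(SHA256(CONCAT(
      customer_id, '|', CAST(date AS STRING), '|',
      IFNULL(page, ''), '|', IFNULL(query, '')
    ))) AS row_hash
  FROM `project.dataset.stg_gsc_by_dimensions`
) AS stg
ON fct.row_hash = stg.row_hash
WHEN NOT MATCHED THEN
  INSERT (customer_id, date, page, query, clicks, impressions, row_hash)
  VALUES (stg.customer_id, stg.date, stg.page, stg.query,
          stg.clicks, stg.impressions, stg.row_hash)
"""

client.query(MERGE_SQL).result()  # blocks until the merge completes
```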
Reliability Features
| Feature | Implementation | Purpose |
|---|---|---|
| Retry/Backoff | Exponential backoff with jitter | Handle GSC API rate limits |
| Resume Support | Run-level state tracking | Recover from partial failures |
| Hash Deduplication | SHA256 on key columns | Prevent duplicate rows |
| Run Logging | Detailed run_log table | Debugging and auditing |
| Smart Date Ranges | Gap detection + incremental | Minimize API calls |
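Two of the features above, sketched in Python: exponential backoff with full jitter around GSC API calls, and date-gap detection for smart date ranges. Attempt counts, delay caps, and table names are illustrative, not the production values.

```python
# Sketches of two reliability features: exponential backoff with full
# jitter, and gap detection for smart date ranges. Attempt counts, delay
# caps, and table names are illustrative, not the production values.
import random
import time

from google.cloud import bigquery
from googleapiclient.errors import HttpError

def with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn(); on a retryable GSC API error, back off with full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except HttpError as err:
            if err.resp.status not in (429, 500, 502, 503):
                raise                                # not retryable
            if attempt == max_attempts - 1:
                raise                                # retries exhausted
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))     # full jitter

def missing_dates(client: bigquery.Client, customer_id: str) -> list:
    """Return dates in the expected window that have no fact rows yet."""
    sql = """
    SELECT day
    FROM UNNEST(GENERATE_DATE_ARRAY(
      DATE_SUB(CURRENT_DATE(), INTERVAL 16 MONTH),
      DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)  -- GSC data lags ~2 days
    )) AS day
    LEFT JOIN `project.dataset.fct_gsc_performance` f
      ON f.date = day AND f.customer_id = @customer_id
    WHERE f.date IS NULL
    ORDER BY day
    """
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("customer_id", "STRING", customer_id)])
    return [row.day for row in client.query(sql, job_config=job_config).result()]
```

Fetching only the missing dates is what keeps API usage minimal: a normal daily run requests one or two recent days, while a failed run is automatically healed on the next pass.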
Configuration Management
Single Google Sheet drives all 60 customers:
| Column | Purpose |
|---|---|
| customer_id | Unique identifier |
| gsc_property | URL or domain |
| timezone | Date normalization |
| backfill_start | Historical start date |
| active | Enable/disable |
Zero-code customer onboarding: Add a row, system picks it up automatically.
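A sketch of how such a config loader can look, assuming the five columns above live in a sheet tab named "config" (a hypothetical name). Each sheet row becomes a customer dict, and inactive rows are filtered out.

```python
# Sketch of the config loader, assuming the five columns above live in a
# tab named "config" (hypothetical). Inactive customers are filtered out.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)  # hypothetical key file
sheets = build("sheets", "v4", credentials=creds)

def load_customer_config(spreadsheet_id: str) -> list[dict]:
    """Read the config sheet and return one dict per active customer."""
    values = sheets.spreadsheets().values().get(
        spreadsheetId=spreadsheet_id, range="config!A:E"
    ).execute().get("values", [])
    header, rows = values[0], values[1:]
    customers = [dict(zip(header, row)) for row in rows]
    return [c for c in customers if c.get("active", "").upper() == "TRUE"]
```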
Results
Production Metrics
| Metric | Value |
|---|---|
| Total rows inserted | 53,300,000+ |
| Daily ingestion | ~70,000 rows |
| Customers served | 60 |
| Historical depth | 16 months |
| In production since | Sep 2025 |
Business Value Delivered
| Outcome | Impact |
|---|---|
| Historical data preserved | SEO data secured indefinitely |
| 60 customers automated | All clients served through single infrastructure |
| Zero manual exports | Eliminated time-consuming data handling |
| Dashboard-ready data | Powers Looker Studio reporting for all clients |
Operational Benefits
- Before: Data older than 16 months was lost forever → After: Complete historical record available for all clients
- Before: Manual exports, CSV wrangling, inconsistent data formats → After: Automated daily ingestion, standardized data model
- Before: Each new client required custom setup → After: Add one row to a config sheet and the system automatically includes the new client
Technology Stack
| Layer | Technology |
|---|---|
| Compute | Google Cloud Run |
| Scheduling | Cloud Scheduler |
| Data Warehouse | BigQuery |
| Secrets | Secret Manager |
| Configuration | Google Sheets API |
| Source API | Google Search Console API |
| Visualization | Looker Studio |
Impact
This pipeline changed how the agency operates:
- Historical data is now an asset. SEO strategies can reference 16+ months of trends instead of losing data to GSC’s rolling window.
- 60 clients, one system. Adding a new customer takes one row in a config sheet—no custom setup, no additional infrastructure.
- Analyst time freed up. Hours previously spent on manual exports now go toward strategy and client work.
The agency can now pitch long-term SEO tracking as a differentiator—something competitors relying on native GSC can’t offer.
Lessons Learned
- No-code platforms have ceilings. Make.com and n8n work for simple workflows but break under production data engineering requirements.
- GSC API suppression is underestimated. The 65% click suppression on multi-dimension queries is a fundamental data quality issue that most GSC tools ignore.
- Configuration-driven architecture scales. One config sheet managing 60 customers is operationally superior to per-customer deployments.
- Hash deduplication is non-negotiable. Without it, re-runs and retries create duplicate data that corrupts dashboards.
- AI-assisted development accelerates delivery. The entire production system was built using Claude Code—dramatically faster than traditional development for this type of data engineering work.
Future Roadmap
- Enhanced suppression handling: Complement GSC data with Google Analytics and third-party ranking tools
- Anomaly alerting: Automatic notifications for significant ranking changes
- Competitive intelligence: Integrate Ahrefs Share of Voice data
Want to discuss data pipelines?
Need to build production data infrastructure that actually stays running? I help teams move from fragile no-code automations to reliable cloud systems. Get in touch.