GSC → BigQuery Pipeline
60 customers, 53M+ rows, solving GSC's data retention limitation
- Client: Copenhagen-based SEO agency
- Industry: Digital Marketing
- Period: September 2025 - Present
- Role: Independent Consultant
The Problem
The agency manages SEO for 60 clients and faced a critical operational constraint: Google Search Console retains only 16 months of data. For an SEO agency tracking long-term performance, this meant:
- No historical trend analysis beyond 16 months
- Manual data exports prone to gaps and errors
- No cross-client benchmarking capability
- Reporting limited to GSC’s native interface
Technical Challenges
GSC API Data Suppression: When querying GSC API with multiple dimensions, significant data is suppressed for privacy:
- 65% of clicks missing when combining dimensions
- 22% of impressions missing
- Suppression via the API is significantly worse than what the GSC UI shows
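The suppression rates above were measured by comparing a totals-only query against the row sum of the same query run with dimensions. A minimal sketch of that comparison (the helper name and example numbers are illustrative, not from the production pipeline):

```python
def suppression_pct(total: int, dimensioned_sum: int) -> float:
    """Share of a metric missing from a dimensioned GSC API response,
    relative to the totals-only response for the same date range."""
    if total == 0:
        return 0.0
    return round(100 * (total - dimensioned_sum) / total, 1)

# Hypothetical example: a totals-only query reports 10,000 clicks,
# but summing the rows of a query/page/date breakdown yields 3,500.
print(suppression_pct(total=10_000, dimensioned_sum=3_500))  # 65.0
```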
Scale Requirements:
- 60 customers with varying property configurations
- 16 months of historical backfill needed
- Daily incremental updates
- Zero manual intervention required
The Solution
Platform Evolution
The project went through three iterations - from no-code automation platforms (Make.com and n8n) to a custom service on Cloud Run - a valuable lesson in matching tools to requirements:
Key Learning: The client wanted self-maintainability (no-code), but production data infrastructure requires engineering ownership. The “build it so they can maintain it” approach failed; the “build it right” approach succeeded.
Architecture
Data Architecture (3-Layer)
Reliability Features
| Feature | Implementation | Purpose |
|---|---|---|
| Retry/Backoff | Exponential backoff with jitter | Handle GSC API rate limits |
| Resume Support | Run-level state tracking | Recover from partial failures |
| Hash Deduplication | SHA256 on key columns | Prevent duplicate rows |
| Run Logging | Detailed run_log table | Debugging and auditing |
| Smart Date Ranges | Gap detection + incremental | Minimize API calls |
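The retry/backoff behaviour in the first row can be sketched as follows. This is a minimal illustration, not the production code; the exception class and parameter defaults are assumptions standing in for real GSC API error handling:

```python
import random
import time

class TransientAPIError(Exception):
    """Stand-in for a retryable error (e.g. HTTP 429/503 from the GSC API)."""

def with_backoff(call, max_retries=5, base=1.0, cap=60.0):
    """Retry `call` on transient errors, sleeping an exponentially
    growing, jittered delay between attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except TransientAPIError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the run log
            # Full jitter: random delay up to the capped exponential bound,
            # spreading retries so 60 customers don't hammer the API in sync.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```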
Configuration Management
Single Google Sheet drives all 60 customers:
| Column | Purpose |
|---|---|
| customer_id | Unique identifier |
| gsc_property | URL or domain |
| timezone | Date normalization |
| backfill_start | Historical start date |
| active | Enable/disable |
Zero-code customer onboarding: Add a row, and the system picks it up automatically on the next run.
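The onboarding logic can be sketched as a filter over the config sheet's rows, assuming they arrive as dicts (e.g. fetched via the Sheets API). The accepted values for `active` are illustrative:

```python
def load_customers(rows):
    """Turn raw config-sheet rows into the list of active customers
    the pipeline should process. Keys match the sheet's columns."""
    active = []
    for row in rows:
        # Disabled customers are skipped on each run, not deleted.
        if str(row.get("active", "")).strip().upper() not in ("TRUE", "1", "YES"):
            continue
        active.append({
            "customer_id": row["customer_id"],
            "gsc_property": row["gsc_property"],        # URL or sc-domain:
            "timezone": row.get("timezone", "UTC"),     # for date normalization
            "backfill_start": row["backfill_start"],    # historical start date
        })
    return active
```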
Results
Production Metrics
| Metric | Value |
|---|---|
| Total rows inserted | 53,300,000+ |
| Daily ingestion | ~70,000 rows |
| Customers served | 60 |
| Historical depth | 16+ months |
| In production | Since September 2025 |
Business Value Delivered
| Outcome | Impact |
|---|---|
| Historical data preserved | SEO data secured indefinitely |
| 60 customers automated | All clients served through single infrastructure |
| Zero manual exports | Eliminated time-consuming data handling |
| Dashboard-ready data | Powers Looker Studio reporting for all clients |
Operational Benefits
- Before: Data older than 16 months was lost forever
- After: Complete historical record available for all clients
- Before: Manual exports, CSV wrangling, inconsistent data formats
- After: Automated daily ingestion, standardized data model
- Before: Each new client required custom setup
- After: Add one row to a config sheet - system automatically includes new client
Technology Stack
| Layer | Technology |
|---|---|
| Compute | Google Cloud Run |
| Scheduling | Cloud Scheduler |
| Data Warehouse | BigQuery |
| Secrets | Secret Manager |
| Configuration | Google Sheets API |
| Source API | Google Search Console API |
| Visualization | Looker Studio |
Impact
This pipeline changed how the agency operates:
- Historical data is now an asset. SEO strategies can reference 16+ months of trends instead of losing data to GSC’s rolling window.
- 60 clients, one system. Adding a new customer takes one row in a config sheet - no custom setup, no additional infrastructure.
- Analyst time freed up. Hours previously spent on manual exports now go toward strategy and client work.
The agency can now pitch long-term SEO tracking as a differentiator - something competitors relying on native GSC can’t offer.
Lessons Learned
- No-code platforms have ceilings. Make.com and n8n work for simple workflows but break under production data engineering requirements.
- GSC API suppression is underestimated. The 65% click suppression on multi-dimension queries is a fundamental data quality issue that most GSC tools ignore.
- Configuration-driven architecture scales. One config sheet managing 60 customers is operationally superior to per-customer deployments.
- Hash deduplication is non-negotiable. Without it, re-runs and retries create duplicate data that corrupts dashboards.
- AI-assisted development accelerates delivery. The entire production system was built using Claude Code - dramatically faster than traditional development for this type of data engineering work.
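The hash-deduplication lesson can be sketched as a SHA256 digest over a row's key columns, so re-runs and retries produce identical hashes and are skipped on insert. The column list here is illustrative; the pipeline's actual key columns are defined by its schema:

```python
import hashlib

# Hypothetical key columns; duplicates are rows that agree on all of these.
KEY_COLUMNS = ("customer_id", "date", "query", "page", "device", "country")

def row_hash(row: dict) -> str:
    """SHA256 over the key columns: identical rows from a re-run or
    retry hash to the same value."""
    key = "|".join(str(row.get(col, "")) for col in KEY_COLUMNS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def dedupe(rows):
    """Keep only the first occurrence of each key-column combination."""
    seen, unique = set(), []
    for row in rows:
        h = row_hash(row)
        if h not in seen:
            seen.add(h)
            unique.append(row)
    return unique
```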
Future Roadmap
- Enhanced suppression handling: Complement GSC data with Google Analytics and third-party ranking tools
- Anomaly alerting: Automatic notifications for significant ranking changes
- Competitive intelligence: Integrate Ahrefs Share of Voice data
Want to discuss data pipelines?
Need to build production data infrastructure that actually stays running? I help teams move from fragile no-code automations to reliable cloud systems. Get in touch.