
Fault-Isolated
Environmental
Data Platform

Raw telemetry to BigQuery using idempotent micro-batches, immutable Parquet, and workflow-driven quality gates.

2.8M+ · Raw Rows Ingested
~6h · TSI Sensor Lag
1.23 MB · 7-Day Query Scan
98.8% · Scan Reduction
Measured Optimization · 2026-03-10

98.8% less data scanned

Partitioning by DATE(ts) and clustering by native_sensor_id reduced a representative 7-day PM2.5 dry-run scan from 102.95 MB (unpartitioned copy) to 1.23 MB on sensors.tsi_raw_materialized.

Before · Unpartitioned scan: 102.95 MB
After · Partitioned + clustered: 1.23 MB

BigQuery bills per byte scanned. This directly reduces query costs at scale.
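This layout can be declared directly in BigQuery DDL. The sketch below assumes the table and column names shown on this page (`ts`, `native_sensor_id`, `pm2_5`); the unpartitioned source table name is illustrative:

```sql
-- Sketch: materialize the table with the partition + cluster layout above.
CREATE TABLE IF NOT EXISTS `sensors.tsi_raw_materialized`
PARTITION BY DATE(ts)           -- prunes scans to the requested date range
CLUSTER BY native_sensor_id     -- co-locates each sensor's rows in a partition
AS
SELECT ts, native_sensor_id, pm2_5
FROM `sensors.tsi_raw`;         -- hypothetical unpartitioned source table
```

A `bq query --dry_run` against the new table reports bytes that would be scanned without running the query, which is how the before/after numbers above can be measured.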

System Design

System Topology & Fault Boundaries

End-to-end pipeline from raw sensor data to analytics-ready views, with explicit failure isolation at every layer.

Control Plane
GitHub Actions · Scheduled + Manual Orchestration
GitHub Actions · CI/CD + Verification

Data Plane
Cloud Run · Stateless Extract · Schema Coercion · Idempotent
→ GCS Parquet · Immutable Raw · Append Only · Replay
→ BigQuery · Partitioned by DATE(ts) · Clustered by native_sensor_id
→ Validation Gate · Freshness + Quality · Threshold Checks
   ├─ Published · Analytics Views
   └─ Blocked · Teams Alert
Engineering Decisions

Architecture & Operational Controls

1. Stateless Compute

Cloud Run extraction jobs coerce key fields at ingestion, reducing downstream BigQuery INT/FLOAT casting errors.
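The same defensive pattern is available at query time via `SAFE_CAST`, which returns NULL instead of failing the query on a bad value. The landing table and `raw_value` column below are hypothetical, used only to illustrate the coercion the Cloud Run jobs perform upstream:

```sql
-- Sketch: tolerate malformed numeric strings instead of erroring the query.
SELECT
  SAFE_CAST(raw_value AS FLOAT64) AS value   -- NULL on failure, not an error
FROM `sensors.tsi_landing`                   -- hypothetical landing table
WHERE SAFE_CAST(raw_value AS FLOAT64) IS NOT NULL;
```

Coercing once at ingestion keeps this defensive casting out of every downstream analytics query.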

2. Immutable Raw Layer

Parquet files in GCS are written to dated source partitions, enabling deterministic replay for historical backfills.
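Because raw Parquet lands under dated prefixes, replaying one day is a single load from that prefix. The bucket name and prefix layout below are illustrative, not the project's actual paths:

```sql
-- Sketch: deterministic replay of one source date from the immutable raw layer.
LOAD DATA INTO `sensors.tsi_raw_staging`      -- hypothetical staging table
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://env-raw-example/tsi/dt=2026-03-09/*.parquet']  -- dated prefix
);
```

Since the raw files are append-only and never rewritten, the same prefix always replays to the same rows.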

3. Idempotent Writes

Data is materialized via DATE(ts) partitions, enabling safe retries without data duplication.
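One way to make a partition-scoped write idempotent is to rebuild the whole `DATE(ts)` partition on every run: delete, then insert, inside one script. A minimal sketch, assuming a staging table named `sensors.tsi_raw_staging` and an illustrative replay date:

```sql
-- Sketch: idempotent re-materialization of a single DATE(ts) partition.
-- Running this script twice yields identical partition contents.
DECLARE target_date DATE DEFAULT DATE '2026-03-09';  -- illustrative date

DELETE FROM `sensors.tsi_raw_materialized`
WHERE DATE(ts) = target_date;                        -- clear the partition

INSERT INTO `sensors.tsi_raw_materialized`
SELECT *
FROM `sensors.tsi_raw_staging`
WHERE DATE(ts) = target_date;                        -- rebuild it in full
```

A failed run can simply be retried: the delete step makes duplicates impossible.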

4. Validation Gate

Freshness, quality, and row-threshold assertions gate every publish decision, making each release check repeatable.

Design Tradeoffs

Scalability & Tradeoffs

Every architectural decision is a deliberate tradeoff — here's what was chosen and why.

01 · Ingestion Model
Micro-batch vs Streaming

Scheduled micro-batch via GitHub Actions (with optional Cloud Scheduler). Streaming was intentionally avoided to reduce operational complexity, simplify failure domains, and keep partition replay predictable.

02 · Storage Design
Schema Evolution

Long-format schema (ts, sensor_id, metric, value) allows immediate ingestion of new hardware metrics without structural table migrations.
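In a long-format table, a new hardware metric is just new rows, never a new column. The timestamp, sensor id, and values below are illustrative:

```sql
-- Sketch: two metrics from one sensor in long format; adding 'co2_ppm'
-- required no ALTER TABLE, only new rows with a new metric name.
SELECT TIMESTAMP '2026-03-09 12:00:00 UTC' AS ts,
       'tsi-0042' AS sensor_id, 'pm2_5' AS metric, 8.1 AS value
UNION ALL
SELECT TIMESTAMP '2026-03-09 12:00:00 UTC',
       'tsi-0042', 'co2_ppm', 412.0;
```

When analytics views need wide form, selected metrics can be reshaped with BigQuery's `PIVOT` operator rather than by migrating the base table.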

What's Next
Event-Driven Ingestion
Pub/Sub + Eventarc to replace cron triggers and reduce latency.
Anomaly Detection
Z-score rolling baselines in BigQuery ML to flag degrading sensors.
Expanded dbt Migration
Grow dbt coverage from compile/test validation into full transformation execution.
Technologies

Tech Stack

Production-grade tools spanning ingest, storage, transformation, and orchestration.

Google Cloud · Infrastructure
BigQuery · Data Warehouse
Cloud Storage · Parquet Store
Cloud Run · Serverless Compute
dbt Core · Validation / Migration
Python · Ingestion Scripts
GitHub Actions · CI/CD
Parquet · Columnar Format
SQL · Analytics Queries
Cloud Scheduler · Optional Trigger Path
Carto · Spatial Analytics
Deepnote · Analytics Notebooks
Data Quality Gate

Pipeline Validation

Every publish is gated on freshness, quality, and threshold assertions.

pipeline_validation.sql
SELECT
  COUNT(*) AS row_count,                 -- row-threshold check
  COUNTIF(pm2_5 IS NULL) AS null_count,  -- quality check
  MAX(ts) AS latest_ts                   -- freshness check
FROM `sensors.tsi_raw_materialized`
WHERE DATE(ts) >=
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

> quality checks passed
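These metrics become a hard gate when wrapped in BigQuery's `ASSERT` statement, which fails the script (and therefore the publish step) when the condition is false. The threshold below is illustrative, not the project's actual setting:

```sql
-- Sketch: abort the publish when yesterday's row count is below threshold.
ASSERT (
  SELECT COUNT(*)
  FROM `sensors.tsi_raw_materialized`
  WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
) > 1000                                   -- illustrative row threshold
AS 'Row count below publish threshold';
```

A failed assertion surfaces its message in the job error, which the workflow can forward as the Teams alert on the blocked path.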