Fault-Isolated Environmental Data Platform
Raw telemetry to BigQuery using idempotent micro-batches, immutable Parquet, and workflow-driven quality gates.
98.8% less data scanned
Partitioning by DATE(ts) and clustering by native_sensor_id reduced a representative 7-day PM2.5 dry-run scan from 102.95 MB (unpartitioned copy) to 1.23 MB on sensors.tsi_raw_materialized.
BigQuery bills per byte scanned, so partition pruning translates directly into lower query costs at scale.
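The pruning above only kicks in when queries filter on the partitioning column. A minimal sketch of the table shape and a pruned query, assuming the column types (this DDL is not from the source; only the table and column names are):

```sql
-- Sketch: partition by ingestion date, cluster by sensor ID
-- (column types are assumptions).
CREATE TABLE IF NOT EXISTS `sensors.tsi_raw_materialized` (
  ts               TIMESTAMP NOT NULL,
  native_sensor_id STRING    NOT NULL,
  pm2_5            FLOAT64
)
PARTITION BY DATE(ts)
CLUSTER BY native_sensor_id;

-- A 7-day dry-run like the one measured: the DATE(ts) predicate
-- lets BigQuery prune to 7 daily partitions instead of scanning
-- the whole table; clustering further narrows bytes per partition.
SELECT native_sensor_id, AVG(pm2_5) AS avg_pm2_5
FROM `sensors.tsi_raw_materialized`
WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY native_sensor_id;
```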
System Topology & Fault Boundaries
End-to-end pipeline from raw sensor data to analytics-ready views, with explicit failure isolation at every layer.
Schema Coercion
Append Only
Clustered ID
Threshold Checks
Architecture & Operational Controls
Cloud Run extraction jobs coerce key fields at ingestion, reducing downstream BigQuery INT/FLOAT casting errors.
Parquet files in GCS are written to dated source partitions, enabling deterministic replay for historical backfills.
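Because each micro-batch lands in a dated source partition, a historical backfill is just re-loading that day's files. A sketch of the replay, with a hypothetical bucket, path layout, and staging table (none of these names appear in the source):

```sql
-- Replaying 2024-01-01 re-loads exactly the Parquet files written
-- for that day's source partition, making the backfill deterministic.
-- Bucket, path convention, and staging table are illustrative.
LOAD DATA INTO `sensors.tsi_raw_staging`
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://example-telemetry-bucket/tsi/dt=2024-01-01/*.parquet']
);
```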
Data is materialized via DATE(ts) partitions, enabling safe retries without data duplication.
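One way to make retry-safe materialization concrete is a MERGE scoped to a single day's partition, so re-running the same micro-batch cannot duplicate rows. A sketch; `sensors.tsi_raw_staging` and the `@batch_date` parameter are assumptions, not names from the source:

```sql
-- Idempotent publish for one DATE(ts) partition: rows already present
-- are updated in place, new rows are inserted, so a retry is a no-op.
MERGE `sensors.tsi_raw_materialized` AS t
USING (
  SELECT *
  FROM `sensors.tsi_raw_staging`
  WHERE DATE(ts) = @batch_date       -- one partition per micro-batch
) AS s
ON  t.native_sensor_id = s.native_sensor_id
AND t.ts = s.ts
WHEN MATCHED THEN
  UPDATE SET pm2_5 = s.pm2_5
WHEN NOT MATCHED THEN
  INSERT (ts, native_sensor_id, pm2_5)
  VALUES (s.ts, s.native_sensor_id, s.pm2_5);
```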
Freshness, quality, and row-threshold checks gate every publish decision with repeatable assertions.
Scalability & Tradeoffs
Every architectural decision is a deliberate tradeoff — here's what was chosen and why.
Scheduled micro-batch via GitHub Actions (with optional Cloud Scheduler). Streaming intentionally avoided to reduce operational complexity, simplify failure domains, and make partition-replay predictable.
Long-format schema (ts, sensor_id, metric, value) allows immediate ingestion of new hardware metrics without structural table migrations.
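A long-format row is one (ts, sensor_id, metric, value) tuple, so a new hardware metric is just a new value in the metric column rather than an ALTER TABLE. A sketch under assumed types and an assumed table name (`sensors.readings_long` is illustrative):

```sql
-- Long format: new metrics require no structural migration, only new rows.
CREATE TABLE IF NOT EXISTS `sensors.readings_long` (
  ts        TIMESTAMP NOT NULL,
  sensor_id STRING    NOT NULL,
  metric    STRING    NOT NULL,   -- e.g. 'pm2_5', or a metric added later
  value     FLOAT64
)
PARTITION BY DATE(ts)
CLUSTER BY sensor_id;

-- Wide shapes can still be derived on read when a consumer wants columns:
SELECT ts, sensor_id,
       MAX(IF(metric = 'pm2_5', value, NULL)) AS pm2_5
FROM `sensors.readings_long`
GROUP BY ts, sensor_id;
```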
Tech Stack
Production-grade tools spanning ingest, storage, transformation, and orchestration.
Pipeline Validation
Every publish is gated on freshness, quality, and threshold assertions.
SELECT
  COUNT(*) AS row_count,
  COUNTIF(pm2_5 IS NULL) AS null_count,
  MAX(ts) AS latest_ts
FROM `sensors.tsi_raw_materialized`
WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
> quality checks passed
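Counts like these can gate a publish directly in SQL: BigQuery's ASSERT statement aborts the script when its condition is false, which fails the workflow step. A sketch with an illustrative threshold (the 1000-row floor is an assumption, not a value from the source):

```sql
-- Block the publish if the last day's data is missing or sparse.
ASSERT (
  SELECT COUNT(*)
  FROM `sensors.tsi_raw_materialized`
  WHERE DATE(ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
) >= 1000 AS 'row-threshold check failed';
```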