BigQuery Job Monitoring & ML-based Query Classification
This solution builds a passive BigQuery job monitoring and ML-based classification system.It continuously collects BigQuery job metadata, enriches it with reservation capacity context, derives meaningful features, and trains a BigQuery ML model to identify good vs. problematic jobs.
- No production queries are modified
- No jobs are killed or throttled
- 100% metadata-driven and read-only
- Fully automated using scheduled queries
This solution consist of two Phases,
- No production queries are modified
- No jobs are killed or throttled
- 100% metadata-driven and read-only
- Fully automated using scheduled queries
What Phase 1 Does
- Collects BigQuery job execution metadata on a schedule
- Captures slot reservation capacity information
- Derives job-level features such as runtime, slot usage, SQL patterns, and time attributes
- Applies rule-based classification to label jobs as good or problematic
- Trains a BigQuery ML model to learn query behavior patterns
- Retrains the model daily to stay accurate as workloads evolve
Key Components
- bq_job_history – raw BigQuery job execution data
- bq_reservation_capacity – slot reservation capacity snapshot
- bq_job_features – enriched, ML-ready job feature dataset
- bq_job_risk_model – BigQuery ML model for job risk prediction
Outcome of Phase 1
- Clear visibility into which jobs are good or bad
- Reservation-aware slot usage analysis
- ML-based risk scoring available directly in SQL
- Zero production impact (no query interception, no throttling)
Phase 1 establishes the intelligence layer, but does not take action.
Phase 1 establishes the intelligence layer, but does not take action.
Phase 2 – Alerting & Operational Awareness
Phase 2 adds real-time awareness and notifications using the ML intelligence built in Phase 1.
Phase 2 adds real-time awareness and notifications using the ML intelligence built in Phase 1.
What Phase 2 Does- Detects job completion events in near real time
- Scores new jobs using the trained ML model
- Sends Slack/Teams/Email alerts when a job is predicted to be risky
- Provides operators with context to investigate or take action
- Detects job completion events in near real time
- Scores new jobs using the trained ML model
- Sends Slack/Teams/Email alerts when a job is predicted to be risky
- Provides operators with context to investigate or take action
Key Components
- Cloud Logging sink to capture BigQuery job completion events
- Pub/Sub topic to stream job events
- Cloud Function to:
- Look up job features
- Run
ML.PREDICT - Apply probability thresholds
- Send formatted alerts to Slack
- Slack/Teams/Email channel for centralized visibility
- Cloud Logging sink to capture BigQuery job completion events
- Pub/Sub topic to stream job events
- Cloud Function to:
- Look up job features
- Run
ML.PREDICT - Apply probability thresholds
- Send formatted alerts to Slack
- Slack/Teams/Email channel for centralized visibility
Outcome of Phase 2- Near real-time alerts for problematic BigQuery jobs
- Reduced time to detect inefficient or expensive queries
- Actionable insights delivered where teams already work (Slack)
- Near real-time alerts for problematic BigQuery jobs
- Reduced time to detect inefficient or expensive queries
- Actionable insights delivered where teams already work (Slack)
1. Create Monitoring dataset
bq_monitoring) to store all monitoring tables, derived features, and ML models. Keeps monitoring and analytics artifacts clearly separated from business and analytical datasets.BigQuery UI → Explorer → Project database-xxxx → ⋮ → Create datasetDataset ID: bq_monitoringLocation: US (multi-region)
2. Create job history table (schema only)
CREATE TABLE IF NOT EXISTS `database-xxx.bq_monitoring.bq_job_history`
PARTITION BY DATE(creation_time)CLUSTER BY reservation_id, user_email ASSELECT job_id, project_id, user_email, creation_time, start_time, end_time, TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_sec, total_bytes_processed, total_slot_ms, reservation_id, job_type, statement_type, query, error_resultFROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECTWHERE 1 = 0;
3. Scheduled log for bg_job_history
last_seen). Ensures near-real-time visibility into BigQuery usage with minimal processing overhead.INSERT INTO `database-xxx.bq_monitoring.bq_job_history`( job_id, project_id, user_email, creation_time, start_time, end_time, runtime_sec, total_bytes_processed, total_slot_ms, reservation_id, job_type, statement_type, query, error_result)WITH last_seen AS ( SELECT IFNULL(MAX(creation_time), TIMESTAMP('1970-01-01')) AS max_ct FROM `database-xxx.bq_monitoring.bq_job_history`)SELECT job_id, project_id, user_email, creation_time, start_time, end_time, TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_sec, total_bytes_processed, total_slot_ms, reservation_id, job_type, statement_type, query, error_resultFROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, last_seenWHERE job_type = 'QUERY' AND creation_time > last_seen.max_ct;
Schedule it:
Click Schedule
Frequency: every 15 minutes
Location: Multi-region → US
UNCHECK “Set destination table for query results”
Save
4. Create reservation capacity snapshot table
5. Schedule refresh for bq_reservation_capacity
Use the same SQL as step 4.
Schedule frequency: every 1 hour
Location: US
UNCHECK destination table
Save
6. Create feature table bq_job_features
label_problematic) to each job.7. Schedule refresh for bq_job_features
Use the same SQL as step 6.
Schedule frequency: every 4 hours
Location: US
UNCHECK destination table
Save
8. Initial ML Training
9. Schedule daily ML retraining
Use the exact same SQL as step 8 and schedule it:
Frequency: Daily (choose a quiet time)
Location: US
UNCHECK destination table
Save
Now the model auto-relearns from the latest data every day.
10. How to use it (see good vs bad jobs)
Direct label (no ML) - what our rule think is good or bad?
SELECT
creation_time, job_id, reservation_id, runtime_sec, total_slot_ms, slot_capacity, slot_usage_ratio, label_problematic
FROM `database-xxx.bq_monitoring.bq_job_features`ORDER BY creation_time DESCLIMIT 50;
ML prediction example for recent jobs:
SELECT *FROM ML.PREDICT( MODEL `database-xxx.bq_monitoring.bq_job_risk_model`, ( SELECT label_problematic, runtime_sec, total_bytes_processed, total_slot_ms, total_slot_seconds, COALESCE(slot_capacity, 0) AS slot_capacity_feature, COALESCE(slot_usage_ratio, 0) AS slot_usage_ratio_feature, job_type, statement_type, has_select_star, has_cross_join, has_where_clause, hour_of_day, day_of_week
FROM `database-xxx.bq_monitoring.bq_job_features` ORDER BY creation_time DESC LIMIT 50
));
we’ll see:
predicted_label
predicted_probability
we’ll see:
predicted_label
predicted_probability
per job, on top of our rule-based labels.
====================== I will create solution for phase-2 later ===================================
Comments
Post a Comment