BigQuery Job Monitoring & ML-based Query Classification

BigQuery Job Monitoring & ML-based Query Classification - Phase 1

BigQuery Job Monitoring & ML-based Query Classification

This solution builds a passive BigQuery job monitoring and ML-based classification system.
It continuously collects BigQuery job metadata, enriches it with reservation capacity context, derives meaningful features, and trains a BigQuery ML model to identify good vs. problematic jobs.

No production queries are modified
No jobs are killed or throttled
100% metadata-driven and read-only
Fully automated using scheduled queries
This solution consist of two Phases,

What Phase 1 Does

Collects BigQuery job execution metadata on a schedule
Captures slot reservation capacity information
Derives job-level features such as runtime, slot usage, SQL patterns, and time attributes
Applies rule-based classification to label jobs as good or problematic
Trains a BigQuery ML model to learn query behavior patterns
Retrains the model daily to stay accurate as workloads evolve

Key Components

bq_job_history – raw BigQuery job execution data
bq_reservation_capacity – slot reservation capacity snapshot
bq_job_features – enriched, ML-ready job feature dataset
bq_job_risk_model – BigQuery ML model for job risk prediction

Outcome of Phase 1

Clear visibility into which jobs are good or bad
Reservation-aware slot usage analysis
ML-based risk scoring available directly in SQL
Zero production impact (no query interception, no throttling)

Phase 1 establishes the intelligence layer, but does not take action.

Phase 2 – Alerting & Operational Awareness

Phase 2 adds real-time awareness and notifications using the ML intelligence built in Phase 1.

What Phase 2 Does
Detects job completion events in near real time
Scores new jobs using the trained ML model
Sends Slack/Teams/Email alerts when a job is predicted to be risky
Provides operators with context to investigate or take action

Key Components
Cloud Logging sink to capture BigQuery job completion events
Pub/Sub topic to stream job events
Cloud Function to:
Look up job features
Run `ML.PREDICT`
Apply probability thresholds
Send formatted alerts to Slack
Slack/Teams/Email channel for centralized visibility

Outcome of Phase 2
       Near real-time alerts for problematic BigQuery jobs
       Reduced time to detect inefficient or expensive queries
       Actionable insights delivered where teams already work (Slack)

1. Create Monitoring dataset

Creates a dedicated BigQuery dataset (bq_monitoring) to store all monitoring tables, derived features, and ML models. Keeps monitoring and analytics artifacts clearly separated from business and analytical datasets.

BigQuery UI → Explorer → Project database-xxxx → ⋮ → Create dataset
Dataset ID: bq_monitoring
Location: US (multi-region)

2. Create job history table (schema only)

Defines the schema for capturing BigQuery query job metadata such as runtime, slot usage, bytes processed, SQL text, and errors. Acts as the long-term, append-only history of all query activity.

CREATE TABLE IF NOT EXISTS `database-xxx.bq_monitoring.bq_job_history`

PARTITION BY DATE(creation_time)
CLUSTER BY reservation_id, user_email AS
SELECT
job_id,
project_id,
user_email,
creation_time,
start_time,
end_time,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_sec,
total_bytes_processed,
total_slot_ms,
reservation_id,
job_type,
statement_type,
query,
error_result
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
1 = 0;

3. Scheduled log for bg_job_history

Runs every 15 minutes to ingest only newly completed query jobs using a watermark (last_seen). Ensures near-real-time visibility into BigQuery usage with minimal processing overhead.

INSERT INTO `database-xxx.bq_monitoring.bq_job_history`
(
job_id, project_id, user_email,
creation_time, start_time, end_time,
runtime_sec, total_bytes_processed,
total_slot_ms, reservation_id, job_type,
statement_type, query, error_result
)
WITH last_seen AS (
SELECT IFNULL(MAX(creation_time), TIMESTAMP('1970-01-01')) AS max_ct
FROM `database-xxx.bq_monitoring.bq_job_history`
)
SELECT
job_id,
project_id,
user_email,
creation_time,
start_time,
end_time,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_sec,
total_bytes_processed,
total_slot_ms,
reservation_id,
job_type,
statement_type,
query,
error_result
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
last_seen
WHERE
job_type = 'QUERY'
AND creation_time > last_seen.max_ct;

Schedule it:
       Click Schedule
        Frequency: every 15 minutes
       Location: Multi-region → US
        UNCHECK “Set destination table for query results”
       Save

4. Create reservation capacity snapshot table

Captures the current state of BigQuery reservations including committed slots and autoscaling limits. Provides capacity context needed to interpret job slot usage correctly.

CREATE OR REPLACE TABLE `database-xxx.bq_monitoring.bq_reservation_capacity` AS
SELECT
reservation_name,
slot_capacity, -- committed slots
autoscale.current_slots AS autoscale_current_slots,
autoscale.max_slots AS autoscale_max_slots,
CURRENT_TIMESTAMP() AS snapshot_time
FROM
`region-us`.INFORMATION_SCHEMA.RESERVATIONS;

5. Schedule refresh for bq_reservation_capacity

Refreshes reservation slot information hourly to reflect capacity changes over time. Keeps slot allocation data aligned with actual reservation configuration.

Use the same SQL as step 4.
       Schedule frequency: every 1 hour
       Location: US
       UNCHECK destination table
       Save

6. Create feature table bq_job_features

Joins job history with reservation capacity and derives analytical features such as slot usage ratio, SQL patterns, and time-based attributes. Also applies rule-based logic to assign an initial good/bad label (label_problematic) to each job.

CREATE OR REPLACE TABLE `database-xxx.bq_monitoring.bq_job_features` AS
SELECT
  j.job_id,
  j.project_id,
  j.user_email,
  j.reservation_id,
  j.creation_time,
  j.runtime_sec,
  j.total_bytes_processed,
  j.total_slot_ms,
  SAFE_DIVIDE(j.total_slot_ms, 1000.0) AS total_slot_seconds,

  r.slot_capacity,
  r.autoscale_current_slots,
  r.autoscale_max_slots,
  SAFE_DIVIDE(SAFE_DIVIDE(j.total_slot_ms, 1000.0), r.slot_capacity) AS slot_usage_ratio,

  j.job_type,
  j.statement_type,
  REGEXP_CONTAINS(LOWER(j.query), r'\bselect\s+\*\b') AS has_select_star,
  REGEXP_CONTAINS(LOWER(j.query), r'\bcross\s+join\b') AS has_cross_join,
  REGEXP_CONTAINS(LOWER(j.query), r'\bwhere\b') AS has_where_clause,
  EXTRACT(HOUR FROM j.creation_time) AS hour_of_day,
  EXTRACT(DAYOFWEEK FROM j.creation_time) AS day_of_week,

  -- Label for ML: 1 = “bad/problematic”, 0 = “good/normal”

  IF(
    j.runtime_sec > 600                                  -- > 10 minutes
    OR j.total_bytes_processed > 5 * 10^12              -- > 5 TB
    OR j.total_slot_ms > 2 * 10^7                       -- heavy slot use
    OR SAFE_DIVIDE(SAFE_DIVIDE(j.total_slot_ms, 1000.0), r.slot_capacity) > 0.4
    , 1, 0

  ) AS label_problematic
FROM `database-xxx.bq_monitoring.bq_job_history` j
LEFT JOIN `database-xxx.bq_monitoring.bq_reservation_capacity` r
  ON j.reservation_id = r.reservation_name;

7. Schedule refresh for bq_job_features

Rebuilds the feature table every 4 hours to incorporate new jobs and updated reservation data. Ensures downstream analysis and ML training always operate on fresh data.

Use the same SQL as step 6.
       Schedule frequency: every 4 hours
       Location: US
       UNCHECK destination table
       Save

8. Initial ML Training

Trains a BigQuery ML logistic regression model using historical job features and rule-based labels. Allows the system to learn behavioral patterns beyond static thresholds.

CREATE OR REPLACE MODEL `database-xxx.bq_monitoring.bq_job_risk_model`
OPTIONS(
model_type = 'logistic_reg',
input_label_cols = ['label_problematic']
) AS

SELECT
label_problematic,
-- core numeric features

runtime_sec,
total_bytes_processed,
total_slot_ms,
total_slot_seconds,

-- capacity features (safe, no all-NULL columns)

COALESCE(slot_capacity, 0) AS slot_capacity_feature,
COALESCE(slot_usage_ratio, 0) AS slot_usage_ratio_feature,

-- categorical / pattern features

job_type,
statement_type,
has_select_star,
has_cross_join,
has_where_clause,
hour_of_day,
day_of_week

FROM `database-xxx.bq_monitoring.bq_job_features`
WHERE creation_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);

9. Schedule daily ML retraining

Retrains the ML model daily with the latest job data. Keeps predictions accurate as workloads, users, and query behavior evolve.

Use the exact same SQL as step 8 and schedule it:
       Frequency: Daily (choose a quiet time)
       Location: US
       UNCHECK destination table
       Save
Now the model auto-relearns from the latest data every day.

10. How to use it (see good vs bad jobs)

Provides SQL queries to view rule-based labels and ML predictions for recent jobs. Enables operators to identify problematic queries, assess confidence scores, and analyze trends by reservation or user.

Direct label (no ML) - what our rule think is good or bad?

SELECT

creation_time,
job_id,
reservation_id,
runtime_sec,
total_slot_ms,
slot_capacity,
slot_usage_ratio,
label_problematic

FROM `database-xxx.bq_monitoring.bq_job_features`
ORDER BY creation_time DESC
LIMIT 50;

ML prediction example for recent jobs:

SELECT
*
FROM ML.PREDICT(
MODEL `database-xxx.bq_monitoring.bq_job_risk_model`,
(
SELECT
label_problematic,
runtime_sec,
total_bytes_processed,
total_slot_ms,
total_slot_seconds,
COALESCE(slot_capacity, 0) AS slot_capacity_feature,
COALESCE(slot_usage_ratio, 0) AS slot_usage_ratio_feature,
job_type,
statement_type,
has_select_star,
has_cross_join,
has_where_clause,
hour_of_day,
day_of_week

FROM `database-xxx.bq_monitoring.bq_job_features`
ORDER BY creation_time DESC
LIMIT 50

)
);

we’ll see:
`predicted_label`
predicted_probability
per job, on top of our rule-based labels.

====================== I will create solution for phase-2 later ===================================

bala.ramamurthy

Search This Blog