Skip to main content

BigQuery Job Monitoring & ML-based Query Classification - Phase 1


BigQuery Job Monitoring & ML-based Query Classification

This solution builds a passive BigQuery job monitoring and ML-based classification system.
It continuously collects BigQuery job metadata, enriches it with reservation capacity context, derives meaningful features, and trains a BigQuery ML model to identify good vs. problematic jobs.

  • No production queries are modified
  • No jobs are killed or throttled
  • 100% metadata-driven and read-only
  • Fully automated using scheduled queries
This solution consist of two Phases,

What Phase 1 Does

  • Collects BigQuery job execution metadata on a schedule
  • Captures slot reservation capacity information
  • Derives job-level features such as runtime, slot usage, SQL patterns, and time attributes
  • Applies rule-based classification to label jobs as good or problematic
  • Trains a BigQuery ML model to learn query behavior patterns
  • Retrains the model daily to stay accurate as workloads evolve

Key Components

  • bq_job_history – raw BigQuery job execution data
  • bq_reservation_capacity – slot reservation capacity snapshot
  • bq_job_features – enriched, ML-ready job feature dataset
  • bq_job_risk_model – BigQuery ML model for job risk prediction

Outcome of Phase 1

  • Clear visibility into which jobs are good or bad
  • Reservation-aware slot usage analysis
  • ML-based risk scoring available directly in SQL
  • Zero production impact (no query interception, no throttling)

Phase 1 establishes the intelligence layer, but does not take action.

Phase 2 – Alerting & Operational Awareness

Phase 2 adds real-time awareness and notifications using the ML intelligence built in Phase 1.

What Phase 2 Does
  • Detects job completion events in near real time
  • Scores new jobs using the trained ML model
  • Sends Slack/Teams/Email alerts when a job is predicted to be risky
  • Provides operators with context to investigate or take action

Key Components
  • Cloud Logging sink to capture BigQuery job completion events
  • Pub/Sub topic to stream job events
  • Cloud Function to:
    • Look up job features
    • Run ML.PREDICT
    • Apply probability thresholds
    • Send formatted alerts to Slack
  • Slack/Teams/Email channel for centralized visibility

Outcome of Phase 2
  •         Near real-time alerts for problematic BigQuery jobs
  •         Reduced time to detect inefficient or expensive queries
  •         Actionable insights delivered where teams already work (Slack)

1. Create Monitoring dataset

Creates a dedicated BigQuery dataset (bq_monitoring) to store all monitoring tables, derived features, and ML models. Keeps monitoring and analytics artifacts clearly separated from business and analytical datasets.

BigQuery UI → Explorer → Project database-xxxx → ⋮ → Create dataset
Dataset ID: bq_monitoring
Location: US (multi-region)

2. Create job history table (schema only)

Defines the schema for capturing BigQuery query job metadata such as runtime, slot usage, bytes processed, SQL text, and errors. Acts as the long-term, append-only history of all query activity.

CREATE TABLE IF NOT EXISTS `database-xxx.bq_monitoring.bq_job_history`

PARTITION BY DATE(creation_time)
CLUSTER BY reservation_id, user_email AS
SELECT
job_id,
project_id,
user_email,
creation_time,
start_time,
end_time,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_sec,
total_bytes_processed,
total_slot_ms,
reservation_id,
job_type,
statement_type,
query,
error_result
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
1 = 0;

3. Scheduled log for bg_job_history

Runs every 15 minutes to ingest only newly completed query jobs using a watermark (last_seen). Ensures near-real-time visibility into BigQuery usage with minimal processing overhead.

INSERT INTO `database-xxx.bq_monitoring.bq_job_history`
(
job_id, project_id, user_email,
creation_time, start_time, end_time,
runtime_sec, total_bytes_processed,
total_slot_ms, reservation_id, job_type,
statement_type, query, error_result
)
WITH last_seen AS (
SELECT IFNULL(MAX(creation_time), TIMESTAMP('1970-01-01')) AS max_ct
FROM `database-xxx.bq_monitoring.bq_job_history`
)
SELECT
job_id,
project_id,
user_email,
creation_time,
start_time,
end_time,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_sec,
total_bytes_processed,
total_slot_ms,
reservation_id,
job_type,
statement_type,
query,
error_result
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
last_seen
WHERE
job_type = 'QUERY'
AND creation_time > last_seen.max_ct;

Schedule it:

        Click Schedule
        Frequency: every 15 minutes
        Location: Multi-region → US
        UNCHECK “Set destination table for query results”
        Save

4. Create reservation capacity snapshot table

Captures the current state of BigQuery reservations including committed slots and autoscaling limits. Provides capacity context needed to interpret job slot usage correctly.

CREATE OR REPLACE TABLE `database-xxx.bq_monitoring.bq_reservation_capacity` AS
SELECT
reservation_name,
slot_capacity, -- committed slots
autoscale.current_slots AS autoscale_current_slots,
autoscale.max_slots AS autoscale_max_slots,
CURRENT_TIMESTAMP() AS snapshot_time
FROM
`region-us`.INFORMATION_SCHEMA.RESERVATIONS;

5. Schedule refresh for bq_reservation_capacity

Refreshes reservation slot information hourly to reflect capacity changes over time. Keeps slot allocation data aligned with actual reservation configuration.

Use the same SQL as step 4.
        Schedule frequency: every 1 hour
        Location: US
        UNCHECK destination table
        Save

6. Create feature table bq_job_features

Joins job history with reservation capacity and derives analytical features such as slot usage ratio, SQL patterns, and time-based attributes. Also applies rule-based logic to assign an initial good/bad label (label_problematic) to each job.

CREATE OR REPLACE TABLE `database-xxx.bq_monitoring.bq_job_features` AS
SELECT
j.job_id,
j.project_id,
j.user_email,
j.reservation_id,
j.creation_time,
j.runtime_sec,
j.total_bytes_processed,
j.total_slot_ms,
SAFE_DIVIDE(j.total_slot_ms, 1000.0) AS total_slot_seconds,

r.slot_capacity,
r.autoscale_current_slots,
r.autoscale_max_slots,
SAFE_DIVIDE(SAFE_DIVIDE(j.total_slot_ms, 1000.0), r.slot_capacity) AS slot_usage_ratio,

j.job_type,
j.statement_type,
REGEXP_CONTAINS(LOWER(j.query), r'\bselect\s+\*\b') AS has_select_star,
REGEXP_CONTAINS(LOWER(j.query), r'\bcross\s+join\b') AS has_cross_join,
REGEXP_CONTAINS(LOWER(j.query), r'\bwhere\b') AS has_where_clause,
EXTRACT(HOUR FROM j.creation_time) AS hour_of_day,
EXTRACT(DAYOFWEEK FROM j.creation_time) AS day_of_week,

-- Label for ML: 1 = “bad/problematic”, 0 = “good/normal”

IF(
j.runtime_sec > 600 -- > 10 minutes
OR j.total_bytes_processed > 5 * 10^12 -- > 5 TB
OR j.total_slot_ms > 2 * 10^7 -- heavy slot use
OR SAFE_DIVIDE(SAFE_DIVIDE(j.total_slot_ms, 1000.0), r.slot_capacity) > 0.4
, 1, 0

) AS label_problematic
FROM `database-xxx.bq_monitoring.bq_job_history` j
LEFT JOIN `database-xxx.bq_monitoring.bq_reservation_capacity` r
ON j.reservation_id = r.reservation_name;

7. Schedule refresh for bq_job_features

Rebuilds the feature table every 4 hours to incorporate new jobs and updated reservation data. Ensures downstream analysis and ML training always operate on fresh data.

Use the same SQL as step 6.
        Schedule frequency: every 4 hours
        Location: US
        UNCHECK destination table
        Save

8. Initial ML Training

Trains a BigQuery ML logistic regression model using historical job features and rule-based labels. Allows the system to learn behavioral patterns beyond static thresholds.

CREATE OR REPLACE MODEL `database-xxx.bq_monitoring.bq_job_risk_model`
OPTIONS(
model_type = 'logistic_reg',
input_label_cols = ['label_problematic']
) AS

SELECT
label_problematic,
-- core numeric features

runtime_sec,
total_bytes_processed,
total_slot_ms,
total_slot_seconds,

-- capacity features (safe, no all-NULL columns)

COALESCE(slot_capacity, 0) AS slot_capacity_feature,
COALESCE(slot_usage_ratio, 0) AS slot_usage_ratio_feature,

-- categorical / pattern features

job_type,
statement_type,
has_select_star,
has_cross_join,
has_where_clause,
hour_of_day,
day_of_week

FROM `database-xxx.bq_monitoring.bq_job_features`
WHERE creation_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);

9. Schedule daily ML retraining

Retrains the ML model daily with the latest job data. Keeps predictions accurate as workloads, users, and query behavior evolve.

Use the exact same SQL as step 8 and schedule it:
        Frequency: Daily (choose a quiet time)
        Location: US
        UNCHECK destination table
        Save
Now the model auto-relearns from the latest data every day.

10. How to use it (see good vs bad jobs)

Provides SQL queries to view rule-based labels and ML predictions for recent jobs. Enables operators to identify problematic queries, assess confidence scores, and analyze trends by reservation or user.

Direct label (no ML) - what our rule think is good or bad?

SELECT

creation_time,
job_id,
reservation_id,
runtime_sec,
total_slot_ms,
slot_capacity,
slot_usage_ratio,
label_problematic

FROM `database-xxx.bq_monitoring.bq_job_features`
ORDER BY creation_time DESC
LIMIT 50;

ML prediction example for recent jobs:

SELECT
*
FROM ML.PREDICT(
MODEL `database-xxx.bq_monitoring.bq_job_risk_model`,
(
SELECT
label_problematic,
runtime_sec,
total_bytes_processed,
total_slot_ms,
total_slot_seconds,
COALESCE(slot_capacity, 0) AS slot_capacity_feature,
COALESCE(slot_usage_ratio, 0) AS slot_usage_ratio_feature,
job_type,
statement_type,
has_select_star,
has_cross_join,
has_where_clause,
hour_of_day,
day_of_week

FROM `database-xxx.bq_monitoring.bq_job_features`
ORDER BY creation_time DESC
LIMIT 50

)
); 

we’ll see:

  • predicted_label

  • predicted_probability

per job, on top of our rule-based labels. 


====================== I will create solution for phase-2 later ===================================


 

Comments

Popular posts from this blog

Security Considerations for Multi-Cluster Cloud Architecture (HA EKS with Databases)

Security Considerations for Multi-Cluster Cloud Architecture (HA EKS with Databases) Running a highly available multi-cluster EKS architecture brings powerful benefits—zero downtime, disaster recovery, and global scalability. But it also multiplies your security challenges. Securing a single EKS cluster is already complex. Add multiple clusters across regions, databases with sensitive data, and cross-cluster communication, and the attack surface grows significantly. One misconfigured security group or exposed secret can compromise your entire infrastructure. This guide covers essential security considerations for multi-cluster architectures: network isolation, encryption, IAM management, secrets handling, and incident response. We'll focus on practical measures that protect your infrastructure without sacrificing performance or availability. Let's build a secure, highly available system. 1. Network Security & Isolation VPC Architecture Separate VPCs per cluster or use share...
Claude Certified Architect – Learning Resources & Playbook If you’re interested in understanding the latest Claude architecture, patterns, and enterprise AI design practices , here are curated resources to get you started. This is especially useful for teams working on: AI Enablement platforms LLM integrations (Claude, OpenAI, etc.) Secure enterprise AI architecture (APIs, gateways, governance)  🎓 1. Certification & Training Access Certification Access Request 👉 https://anthropic.skilljar.com/claude-certified-architect-foundations-access-request Official Training Portal (Courses) 👉 https://anthropic.skilljar.com/ 💡 Recommended starting point for structured learning and certification path. 📚 2. Hands-On Learning (Cookbook) Claude Cookbook (GitHub) 👉 https://github.com/anthropics/anthropic-cookbook Includes: Prompt engineering patterns API usage examples Real-world implementation scenarios 💡 Best for developers building integrations (Cloud Run, APIs, Agents, e...