Databricks Logs Explained: Where to Look When Things Break, From Driver to Delta

Author: Aishwarya Manoharan

20 May, 2026

Introduction

A Databricks job fails… or worse, it runs but performs poorly.

You open the workspace and face a familiar question:

Where do you start?

Driver logs? Spark UI? Executor logs? Query history?

Without a clear approach, it’s easy to jump between tabs and waste time chasing symptoms instead of root causes.

This guide provides a structured, layer-by-layer approach to Databricks logs, so you know exactly where to look, what each log tells you, and how it connects to real debugging scenarios and certification concepts.

It walks through each log type with:

  • What to look for
  • What it tells us
  • Exactly where to find it in the Databricks UI

The Mental Model: Debugging Top-Down

Before diving into individual logs, anchor yourself in this hierarchy:

Cluster → Driver → Executor → Stage → Task → Table (Delta)

  1. Cluster: Infrastructure and lifecycle events
  2. Driver: Job orchestration and failures
  3. Executor: Parallel task execution and resource issues
  4. Stage: Shuffle boundaries and data distribution
  5. Task: Fine-grained execution and skew
  6. Table (Delta): Data operations and history

Rule of Thumb

  • Debug top-down (start broad, narrow down)
  • Optimize bottom-up (fix root causes at task level)

Core Log Types with Deep Interpretation

1. Cluster Event Logs

Scope: Cluster lifecycle

Where to find (UI path): Compute → Cluster → Event Log tab

What we see in the logs and how to interpret it

Cluster start / terminate events

  • Cluster starting → resources being provisioned
  • Cluster terminated → job finished OR failure OR idle timeout
  • Unexpected termination → check policies, spot/preemptible loss, or failures

Autoscaling actions (scale up / down)

  • Scaling up (adding workers) → workload requires more parallelism
  • Scaling down (removing workers) → cluster is underutilized
  • Frequent scale up/down → unstable workload or poor partitioning
  • No scale up despite load → autoscaling limits or misconfiguration

Init script execution

  • Success → environment correctly configured
  • Failure → dependency/setup issue (libraries, mounts, configs)
  • Long execution time → slowing cluster startup

Errors during cluster setup

  • Library install failure → dependency mismatch
  • Node allocation failure → cloud capacity or quota issue
  • Permission errors → IAM / role misconfiguration
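The event log checks above can also be applied to an exported list of events. The sketch below is a minimal illustration. It assumes each event is a dict with a millisecond `timestamp` and a `type` string, modeled on the Databricks Clusters API events response. The type names used (`RESIZING`, `NODES_LOST`, and so on) should be verified against the workspace's own output. The function name and the flapping thresholds are hypothetical.

```python
from collections import Counter

# Event types that usually deserve a closer look. These names are taken from
# the Clusters API event types as an assumption; verify them in your workspace.
TROUBLE_TYPES = {"NODES_LOST", "DRIVER_NOT_RESPONDING", "DRIVER_UNAVAILABLE", "SPARK_EXCEPTION"}


def summarize_events(events, flap_window_ms=10 * 60 * 1000, flap_threshold=4):
    """Count event types and flag frequent resizing inside a time window."""
    counts = Counter(e["type"] for e in events)
    resize_times = sorted(e["timestamp"] for e in events if e["type"] == "RESIZING")

    # Flapping: flap_threshold resize events that all fall within flap_window_ms.
    flapping = False
    for i in range(len(resize_times) - flap_threshold + 1):
        if resize_times[i + flap_threshold - 1] - resize_times[i] <= flap_window_ms:
            flapping = True
            break

    trouble = sorted(t for t in counts if t in TROUBLE_TYPES)
    return {"counts": dict(counts), "flapping": flapping, "trouble": trouble}
```

A `flapping` result of `True` corresponds to the "frequent scale up/down" pattern, and anything listed under `trouble` points to node loss or driver health rather than workload shape.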

2. Driver Logs

Scope: Job orchestration

Where to find (UI path): Compute → Cluster → Driver Logs

OR

Workflows → Job → Run → Driver Logs

What we see in the logs and how to interpret it

SparkContext initialization

  • Successful init → cluster ready for execution
  • Failure → configuration issue or incompatible settings

Query planning and execution coordination

  • Logical/physical plan generation → Spark deciding execution strategy
  • Long planning time → complex query or large schema

Exceptions and stack traces

  • NullPointerException / AnalysisException → code or schema issue
  • Job aborted due to stage failure → a task exhausted its retries; continue in the executor and stage views
  • Repeated failures → systemic issue, not transient

Broadcast join behavior

  • Broadcast created → small table copied to every executor so the join avoids a shuffle
  • Broadcast too large → failure or fallback to shuffle join

Driver OutOfMemory (OOM)

  • Large collect() or toPandas() → data pulled to driver
  • Large broadcast → exceeds driver memory
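When a driver log is long, a quick text scan can surface the patterns listed above before we read the full stack trace. The sketch below is one possible approach. The matched substrings are standard JVM and Spark error text. The hint wording and the function name are this example's own, and the list of patterns is far from exhaustive.

```python
import re

# Pattern -> hint. The patterns match well-known JVM/Spark error text; the
# hints are shorthand for the interpretations listed in this section.
PATTERNS = {
    r"java\.lang\.OutOfMemoryError": "driver OOM: look for collect()/toPandas() or a large broadcast",
    r"AnalysisException": "code or schema issue (missing column, bad table reference)",
    r"NullPointerException": "code issue: a null value reached code that does not handle it",
    r"Job aborted due to stage failure": "a stage failed: continue in the executor and stage views",
}


def triage_driver_log(text):
    """Return (line_number, hint) for each log line that matches a known pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, hint in PATTERNS.items():
            if re.search(pattern, line):
                hits.append((lineno, hint))
                break  # report at most one hint per line
    return hits
```

The function takes the log contents as a string, so it works on text copied from the Driver Logs tab or read from a delivered log file.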

3. Executor Logs

Scope: Worker nodes

Where to find (UI path): Compute → Cluster → Spark UI → Executors → stdout / stderr

OR

Workflows → Job → Run → Spark UI → Executors → Logs

What we see in the logs and how to interpret it

Task execution logs

  • Normal execution → tasks distributed properly
  • Repeated retries → instability or skew

Memory usage and GC (Garbage Collection)

  • Frequent GC → memory pressure
  • Long GC pauses → heap too small for the working set, or too many large objects kept alive

Spill to disk (very important)

  • Spill occurs → memory insufficient for operation
  • Heavy spill → performance degradation
  • No spill → workload fits in memory

Shuffle operations

  • Shuffle read/write → data redistribution across nodes
  • Large shuffle → expensive joins/aggregations

Executor failures

  • Executor lost → node crash or resource exhaustion
  • Fetch failures → shuffle data unavailable

4. Stage Logs (Spark UI)

Scope: Stage-level execution

Where to find (UI path): Workflows → Job → Run → Spark UI → Stages tab

What we see in the logs and how to interpret it

Shuffle read size

  • Large read → heavy dependency on previous stage
  • Skewed read → uneven data distribution

Shuffle write size

  • Large write → expensive transformation (join/groupBy)
  • Small write → efficient stage

Stage duration

  • Long duration → bottleneck stage
  • Short duration → efficient processing

Task distribution within stage

  • Even distribution → balanced workload
  • Uneven distribution → data skew

Stage retries

  • Retry occurred → transient failure or instability
  • Multiple retries → deeper issue (data or infra)
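To find the bottleneck stage without clicking through every row, the same numbers can be ranked programmatically. The sketch below assumes a list of stage dicts shaped like the output of Spark's monitoring REST API stages endpoint, with the fields `stageId`, `executorRunTime` (milliseconds), `shuffleReadBytes`, `shuffleWriteBytes`, and `diskBytesSpilled`. The function name and report keys are hypothetical.

```python
def rank_stages(stages, top_n=3):
    """Return the top_n stages by executor run time, with shuffle and spill info."""
    ranked = sorted(stages, key=lambda s: s["executorRunTime"], reverse=True)
    report = []
    for s in ranked[:top_n]:
        report.append({
            "stageId": s["stageId"],
            "run_time_s": s["executorRunTime"] / 1000,        # ms -> seconds
            "shuffle_read_mb": s["shuffleReadBytes"] / 1024**2,
            "shuffle_write_mb": s["shuffleWriteBytes"] / 1024**2,
            "spilled": s["diskBytesSpilled"] > 0,              # any disk spill at all
        })
    return report
```

A stage that ranks first on run time and also shows large shuffle volume or spill is usually the one worth opening at the task level.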

5. Task Logs

Scope: Individual tasks

Where to find (UI path): Spark UI → Stages → Select Stage → Tasks

What we see in the logs and how to interpret it

Task execution time

  • Uniform times → balanced partitions
  • One task much slower → skew

Input size

  • Large input → heavy partition
  • Uneven input → skew

Output size

  • Large output → data expansion
  • Small output → filtering or aggregation

Spill (memory → disk)

  • Spill present → memory insufficient
  • Heavy spill → tuning needed (memory, partitions)

Locality level

  • PROCESS_LOCAL / NODE_LOCAL → data is already on the executor or node, efficient execution
  • RACK_LOCAL / ANY → data is read from another node, network overhead
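The "one task much slower" check can be turned into a number by comparing the slowest task with the median. The sketch below is a rough heuristic, not a Spark feature. The function names and the threshold of 3 are assumptions, and the same ratio works for task durations or input sizes copied from the Tasks table.

```python
from statistics import median


def skew_ratio(values):
    """Ratio of the largest value to the median; about 1.0 means balanced."""
    if not values:
        raise ValueError("need at least one task metric")
    m = median(values)
    if m <= 0:
        # Avoid division by zero when most tasks report 0.
        return float("inf") if max(values) > 0 else 1.0
    return max(values) / m


def is_skewed(values, threshold=3.0):
    """Flag skew when the slowest or largest task exceeds threshold x the median."""
    return skew_ratio(values) >= threshold
```

For example, task durations of 10, 11, 9, 10, and 95 seconds give a ratio of 9.5, which points to a single oversized partition rather than a generally slow stage.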

6. SQL Query History

Scope: SQL queries

Where to find (UI path): SQL Warehouses → Query History

OR

SQL Editor → Query History

What we see in the logs and how to interpret it

Query execution time

  • Long time → candidate for tuning; open the query profile to see where the time goes
  • Short time → the query itself is not the bottleneck

Query plan

  • Simple plan → efficient execution
  • Complex plan → multiple joins/aggregations

Photon usage

  • Photon used → query ran on the vectorized Photon engine
  • Photon not used → Photon is disabled or the query uses unsupported operations; possible missed optimization

7. Delta Table History

Scope: Table-level operations

Where to find: DESCRIBE HISTORY table_name

What we see in the logs and how to interpret it

Write operations

  • Frequent small writes → small file problem
  • Batched writes → efficient ingestion

MERGE operations

  • Frequent merges → upsert-heavy workload
  • Large merges → performance cost

OPTIMIZE operations

  • Regular optimize → good file compaction
  • Missing optimize → degraded read performance

VACUUM operations

  • Performed → storage cleanup
  • Not performed → storage bloat
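The history output can be summarized in a few lines once it is in Python. The sketch below assumes the rows have already been converted to dicts, for example with `[r.asDict() for r in spark.sql("DESCRIBE HISTORY table_name").collect()]`, and uses only the `operation` and `timestamp` columns. The function name, the seven-day staleness window, and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_history(rows, now, optimize_max_age=timedelta(days=7)):
    """Count operations and flag tables with no recent OPTIMIZE or any VACUUM."""
    operations = Counter(r["operation"] for r in rows)
    optimize_times = [r["timestamp"] for r in rows if r["operation"] == "OPTIMIZE"]
    last_optimize = max(optimize_times) if optimize_times else None
    return {
        "operations": dict(operations),
        "last_optimize": last_optimize,
        "optimize_stale": last_optimize is None or now - last_optimize > optimize_max_age,
        # Delta records vacuum as "VACUUM START" / "VACUUM END".
        "vacuum_seen": any(r["operation"].startswith("VACUUM") for r in rows),
    }


# Made-up sample rows for illustration only.
sample = [
    {"operation": "WRITE", "timestamp": datetime(2026, 5, 19)},
    {"operation": "WRITE", "timestamp": datetime(2026, 5, 18)},
    {"operation": "MERGE", "timestamp": datetime(2026, 5, 17)},
    {"operation": "OPTIMIZE", "timestamp": datetime(2026, 5, 1)},
]
summary = summarize_history(sample, now=datetime(2026, 5, 20))
```

Many `WRITE` entries combined with `optimize_stale` being true is the small file pattern described above.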

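8. Cluster Metrics (Ganglia)

Scope: Cluster resource usage

Where to find (UI path): Compute → Cluster → Metrics tab. On Databricks Runtime 13 and above this tab shows the built-in compute metrics; Ganglia appears only on older runtimes.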

What we see in the logs and how to interpret it

CPU usage

  • High CPU → compute-bound workload
  • Low CPU → underutilization

Memory usage

  • High memory → risk of spill/OOM
  • Low memory → over-provisioned cluster

Network I/O

  • High network → heavy shuffle
  • Low network → minimal data movement

9. Audit Logs

Scope: Workspace-level activity

Where to find: Admin Console → Audit Logs

OR

Cloud Storage (log delivery)

What we see in the logs and how to interpret it

User actions

  • Frequent access → active usage
  • Unexpected access → potential security issue

Permission changes

  • Changes detected → governance activity
  • Unauthorized changes → security risk

10. Streaming Query Logs

Scope: Structured Streaming

Where to find: Notebook → query.lastProgress (where query is the running StreamingQuery handle)

OR

Spark UI → Structured Streaming tab

What we see in the logs and how to interpret it

Input rows per second

  • High input → heavy ingestion rate
  • Increasing input → growing load

Processed rows per second

  • Matches input → system keeping up
  • Lower than input → backlog forming

Batch duration

  • Increasing duration → system under stress
  • Stable duration → healthy pipeline

Latency

  • High latency → delayed processing
  • Low latency → near real-time
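The input-versus-processed comparison can be automated over the progress records a streaming query exposes. The sketch below assumes a list of progress dicts, such as the ones returned by `query.recentProgress`, and reads the fields `batchId`, `inputRowsPerSecond`, `processedRowsPerSecond`, and `durationMs.triggerExecution`. The function name and the 0.9 tolerance are hypothetical choices.

```python
def check_progress(progress_list, lag_tolerance=0.9):
    """Flag batches that process slower than they ingest, and rising batch durations."""
    lagging = []
    for p in progress_list:
        inp = p.get("inputRowsPerSecond") or 0.0
        proc = p.get("processedRowsPerSecond") or 0.0
        # Backlog forms when the processing rate falls below the input rate.
        if inp > 0 and proc < inp * lag_tolerance:
            lagging.append(p["batchId"])

    durations = [p["durationMs"]["triggerExecution"] for p in progress_list]
    # "Rising" here means every batch took longer than the one before it.
    rising = len(durations) >= 3 and all(b > a for a, b in zip(durations, durations[1:]))
    return {"lagging_batches": lagging, "duration_rising": rising}
```

Several lagging batches together with rising durations suggest the pipeline is under stress and will not catch up without more resources or a smaller batch size.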

Debugging Scenarios: Putting It All Together

Real-world issues are rarely obvious. The following scenarios show how to use these logs together to diagnose common but tricky problems.

Autoscaling Not Working as Expected

Symptom

Our job is slow, and we expect Databricks to add more workers, but it doesn’t.

Think of it like this:

Autoscaling should bring in more “workers” when there’s too much work. If it doesn’t, our job stays slow because not enough machines are helping.

Where to look: Compute → Cluster → Event Log tab

What to look for and what it tells us

No “scaling up” events

We don’t see messages about adding workers, which means autoscaling is not being triggered.

Possible reasons:

  • Max workers limit already reached
  • Not enough pending tasks (Spark doesn’t think it needs more workers)

Frequent scale up and scale down

Workers are added and removed repeatedly.

  • Workload is unstable or uneven
  • Often caused by poor partitioning or bursty jobs

Scaling happens too late

Workers are added, but only after the job is already slow.

  • Autoscaling is reacting, but too slowly

What we check next (practical steps)

1. Check cluster limits (very first step)

Go to: Compute → Cluster → Configuration

Look at:

  • Min workers
  • Max workers

If max workers is already reached, autoscaling cannot scale further.

If min workers is too low, scaling may start too late.

2. Check if there are enough tasks to trigger scaling

Go to: Spark UI → Stages → Tasks

If we see only a few tasks running:

  • Spark does not need more executors

Fix: increase the number of partitions (for example, with repartition() or a higher spark.sql.shuffle.partitions)

3. Check task parallelism vs cluster size

If we have:

  • 10 tasks
  • 20 workers

At least half the workers will sit idle, because each task occupies a single core.

Autoscaling will not add workers, because there is no pending work for them to pick up.
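The 10-task, 20-worker example can be checked with simple arithmetic. The sketch below assumes one task per core (Spark's default of one CPU per task) and treats `cores_per_worker` as a value read from the cluster configuration. The function name is hypothetical.

```python
def idle_slot_fraction(num_tasks, num_workers, cores_per_worker):
    """Fraction of task slots left idle, assuming one task per core."""
    slots = num_workers * cores_per_worker
    busy = min(num_tasks, slots)  # tasks beyond the slot count simply queue
    return (slots - busy) / slots


# 10 tasks on 20 single-core workers: half the slots are idle.
print(idle_slot_fraction(10, 20, 1))  # 0.5
# The same 10 tasks on 20 four-core workers leave 70 of 80 slots idle.
print(idle_slot_fraction(10, 20, 4))  # 0.875
```

The more cores each worker has, the worse the picture gets for a job that exposes only a handful of tasks.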

4. Check workload pattern (spiky vs steady)

If tasks appear in bursts:

  • Autoscaling may scale up and immediately scale down

Fix:

  • Improve partitioning
  • Avoid uneven workloads

5. Check stage behavior (hidden bottleneck)

Go to: Spark UI → Stages

If one stage is slow but not parallel:

  • Autoscaling cannot help

What this tells us

Autoscaling depends on how much parallel work Spark can see.

If our job doesn’t expose enough parallelism, or if limits are too tight, scaling won’t behave the way we expect.
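The first few checks in this scenario can be condensed into a small decision helper. This is only a sketch of the checklist above, not a model of how Databricks actually decides to scale. The function name, its parameters, and the returned messages are all hypothetical, and one task per core is assumed.

```python
def diagnose_autoscaling(current_workers, max_workers, cores_per_worker, runnable_tasks):
    """Walk the checklist: limits first, then available parallelism."""
    slots = current_workers * cores_per_worker
    if current_workers >= max_workers:
        return "at max workers: raise the limit or reduce the work"
    if runnable_tasks <= slots:
        return "not enough parallel tasks: increase partitions before expecting scale-up"
    return "tasks exceed slots and headroom exists: check the event log for delayed or failed scale-up"
```

For instance, `diagnose_autoscaling(2, 8, 4, 6)` reports a parallelism problem, because six tasks fit within the eight slots that two four-core workers already provide.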

Final Takeaway

Each log answers a different question, but only if we interpret it correctly.

  • Cluster tells us if infrastructure is healthy
  • Driver tells us why the job failed
  • Executors tell us how work is executed
  • Stages and Tasks tell us where performance breaks
  • Delta tells us what happened to our data

Debug top-down. Optimize bottom-up.