Databricks Logs Explained: Where to Look When Things Break From Driver to Delta

Author: Aishwarya Manoharan

20 May, 2026

Introduction

A Databricks job fails… or worse, it runs but performs poorly.

You open the workspace and face a familiar question:

Where do you start?

Driver logs? Spark UI? Executor logs? Query history?

Without a clear approach, it’s easy to jump between tabs and waste time chasing symptoms instead of root causes.

This guide provides a structured, layer-by-layer approach to Databricks logs, so you know exactly where to look, what each log tells you, and how it connects to real debugging scenarios and certification concepts.

It walks through each log type with:

What to look for
What it tells us
Exactly where to find it in the Databricks UI

The Mental Model: Debugging Top-Down

Before diving into individual logs, anchor yourself in this hierarchy:

Cluster → Driver → Executor → Stage → Task → Table (Delta)

Cluster: Infrastructure and lifecycle events
Driver: Job orchestration and failures
Executor: Parallel task execution and resource issues
Stage: Shuffle boundaries and data distribution
Task: Fine-grained execution and skew
Table (Delta): Data operations and history

Rule of Thumb

Debug top-down (start broad, narrow down)
Optimize bottom-up (fix root causes at task level)

Core Log Types with Deep Interpretation

1. Cluster Event Logs

Scope: Cluster lifecycle

Where to find (UI path): Compute → Cluster → Event Log tab

What we see in the logs and how to interpret it

Cluster start / terminate events

Cluster starting → resources being provisioned
Cluster terminated → job finished OR failure OR idle timeout
Unexpected termination → check policies, spot/preemptible loss, or failures

Autoscaling actions (scale up / down)

Scaling up (adding workers) → workload requires more parallelism
Scaling down (removing workers) → cluster is underutilized
Frequent scale up/down → unstable workload or poor partitioning
No scale up despite load → autoscaling limits or misconfiguration

Init script execution

Success → environment correctly configured
Failure → dependency/setup issue (libraries, mounts, configs)
Long execution time → slowing cluster startup

Errors during cluster setup

Library install failure → dependency mismatch
Node allocation failure → cloud capacity or quota issue
Permission errors → IAM / role misconfiguration

2. Driver Logs

Scope: Job orchestration

Where to find (UI path): Compute → Cluster → Driver Logs

Workflows → Job → Run → Driver Logs

What we see in the logs and how to interpret it

SparkContext initialization

Successful init → cluster ready for execution
Failure → configuration issue or incompatible settings

Query planning and execution coordination

Logical/physical plan generation → Spark deciding execution strategy
Long planning time → complex query or large schema

Exceptions and stack traces

NullPointer / AnalysisException → code or schema issue
Job aborted → failure in execution stage
Repeated failures → systemic issue, not transient

Broadcast join behavior

Broadcast created → small table optimized for join
Broadcast too large → failure or fallback to shuffle join

Driver OutOfMemory (OOM)

Large collect() or toPandas() → data pulled to driver
Large broadcast → exceeds driver memory

3. Executor Logs

Scope: Worker nodes

Where to find (UI path): Compute → Cluster → Executors → stdout / stderr

Spark UI → Executors → Logs

What we see in the logs and how to interpret it

Task execution logs

Normal execution → tasks distributed properly
Repeated retries → instability or skew

Memory usage and GC (Garbage Collection)

Frequent GC → memory pressure
Long GC pauses → inefficient memory allocation

Spill to disk (very important)

Spill occurs → memory insufficient for operation
Heavy spill → performance degradation
No spill → workload fits in memory

Shuffle operations

Shuffle read/write → data redistribution across nodes
Large shuffle → expensive joins/aggregations

Executor failures

Executor lost → node crash or resource exhaustion
Fetch failures → shuffle data unavailable

4. Stage Logs (Spark UI)

Scope: Stage-level execution

Where to find (UI path): Workflows → Job → Run → Spark UI → Stages tab

What we see in the logs and how to interpret it

Shuffle read size

Large read → heavy dependency on previous stage
Skewed read → uneven data distribution

Shuffle write size

Large write → expensive transformation (join/groupBy)
Small write → efficient stage

Stage duration

Long duration → bottleneck stage
Short duration → efficient processing

Task distribution within stage

Even distribution → balanced workload
Uneven distribution → data skew

Stage retries

Retry occurred → transient failure or instability
Multiple retries → deeper issue (data or infra)

5. Task Logs

Scope: Individual tasks

Where to find (UI path): Spark UI → Stages → Select Stage → Tasks

What we see in the logs and how to interpret it

Task execution time

Uniform times → balanced partitions
One task much slower → skew

Input size

Large input → heavy partition
Uneven input → skew

Output size

Large output → data expansion
Small output → filtering or aggregation

Spill (memory → disk)

Spill present → memory insufficient
Heavy spill → tuning needed (memory, partitions)

Locality level

Data-local → efficient execution
Remote reads → network overhead

6. SQL Query History

Scope: SQL queries

Where to find (UI path): SQL Warehouses → Query History

SQL Editor → Query History

What we see in the logs and how to interpret it

Query execution time

Long time → inefficient query
Short time → optimized execution

Query plan

Simple plan → efficient execution
Complex plan → multiple joins/aggregations

Photon usage

Photon enabled → optimized engine
Photon not used → missed optimization opportunity

7. Delta Table History

Scope: Table-level operations

Where to find: DESCRIBE HISTORY table_name

What we see in the logs and how to interpret it

Write operations

Frequent small writes → small file problem
Batched writes → efficient ingestion

MERGE operations

Frequent merges → upsert-heavy workload
Large merges → performance cost

OPTIMIZE operations

Regular optimize → good file compaction
Missing optimize → degraded read performance

VACUUM operations

Performed → storage cleanup
Not performed → storage bloat

8. Ganglia Metrics

Scope: Cluster resource usage

Where to find (UI path): Compute → Cluster → Metrics tab

What we see in the logs and how to interpret it

CPU usage

High CPU → compute-bound workload
Low CPU → underutilization

Memory usage

High memory → risk of spill/OOM
Low memory → over-provisioned cluster

Network I/O

High network → heavy shuffle
Low network → minimal data movement

9. Audit Logs

Scope: Workspace-level activity

Where to find: Admin Console → Audit Logs

Cloud Storage (log delivery)

What we see in the logs and how to interpret it

User actions

Frequent access → active usage
Unexpected access → potential security issue

Permission changes

Changes detected → governance activity
Unauthorized changes → security risk

10. Streaming Query Logs

Scope: Structured Streaming

Where to find: Notebook → query.lastProgress

Spark UI → Streaming tab

What we see in the logs and how to interpret it

Input rows per second

High input → heavy ingestion rate
Increasing input → growing load

Processed rows per second

Matches input → system keeping up
Lower than input → backlog forming

Batch duration

Increasing duration → system under stress
Stable duration → healthy pipeline

Latency

High latency → delayed processing
Low latency → near real-time

Debugging Scenarios: Putting It All Together

Real-world issues are rarely obvious. The following scenarios show how to use these logs together to diagnose common but tricky problems.

Autoscaling Not Working as Expected

Symptom

Our job is slow, and we expect Databricks to add more workers, but it doesn’t.

Think of it like this:

Autoscaling should bring in more “workers” when there’s too much work. If it doesn’t, our job stays slow because not enough machines are helping.

Where to look: Compute → Cluster → Event Log tab

What to look for and what it tells us

No “scaling up” events

We don’t see messages about adding workers.

Autoscaling is not being triggered

Possible reasons:

Max workers limit already reached
Not enough pending tasks (Spark doesn’t think it needs more workers)

Frequent scale up and scale down

Workers are added and removed repeatedly.

Workload is unstable or uneven
Often caused by poor partitioning or bursty jobs

Scaling happens too late

Workers are added, but only after the job is already slow.

Autoscaling is reacting, but too slowly

What we check next (practical steps)

1. Check cluster limits (very first step)

Go to: Compute → Cluster → Configuration

Look at:

Min workers
Max workers

If max workers is already reached, autoscaling cannot scale further.

If min workers is too low, scaling may start too late.

2. Check if there are enough tasks to trigger scaling

Go to: Spark UI → Stages → Tasks

If we see only a few tasks running:

Spark does not need more executors

Fix: increase partitions (for example, repartition)

3. Check task parallelism vs cluster size

If we have:

10 tasks
20 workers

Half the cluster will sit idle.

Autoscaling will not scale up because it is not needed.

4. Check workload pattern (spiky vs steady)

If tasks appear in bursts:

Autoscaling may scale up and immediately scale down

Fix:

Improve partitioning
Avoid uneven workloads

5. Check stage behavior (hidden bottleneck)

Go to: Spark UI → Stages

If one stage is slow but not parallel:

Autoscaling cannot help

What this tells us

Autoscaling depends on how much parallel work Spark can see.

If our job doesn’t expose enough parallelism, or if limits are too tight, scaling won’t behave the way we expect.

Final Takeaway

Each log answers a different question, but only if we interpret it correctly.

Cluster tells us if infrastructure is healthy
Driver tells us why the job failed
Executors tell us how work is executed
Stages and Tasks tell us where performance breaks
Delta tells us what happened to our data

Debug top-down. Optimize bottom-up.

Databricks Logs Explained: Where to Look When Things Break From Driver to Delta

Databricks Logs Explained: Where to Look When Things Break From Driver to Delta

Author: Aishwarya Manoharan

Introduction

Where do you start?

The Mental Model: Debugging Top-Down

Cluster → Driver → Executor → Stage → Task → Table (Delta)

Rule of Thumb

Core Log Types with Deep Interpretation

1. Cluster Event Logs

Scope: Cluster lifecycle

Where to find (UI path): Compute → Cluster → Event Log tab<img fetchpriority="high" fetchpriority="high" decoding="async" class="aligncenter size-full wp-image-81980" src="https://xorbix.com/wp-content/uploads/2026/05/image1-1.png" alt="" width="1478" height="826" />

What we see in the logs and how to interpret it

2. Driver Logs

Scope: Job orchestration

Where to find (UI path): Compute → Cluster → Driver Logs

Workflows → Job → Run → Driver Logs

What we see in the logs and how to interpret it

3. Executor Logs

Scope: Worker nodes

Where to find (UI path): Compute → Cluster → Executors → stdout / stderr

Spark UI → Executors → Logs

What we see in the logs and how to interpret it

4. Stage Logs (Spark UI)

Scope: Stage-level execution

Where to find (UI path): Workflows → Job → Run → Spark UI → Stages tab

What we see in the logs and how to interpret it

5. Task Logs

Scope: Individual tasks

Where to find (UI path): Spark UI → Stages → Select Stage → Tasks

What we see in the logs and how to interpret it

6. SQL Query History

Scope: SQL queries

Where to find (UI path): SQL Warehouses → Query History

SQL Editor → Query History

What we see in the logs and how to interpret it

7. Delta Table History

Scope: Table-level operations

Where to find: DESCRIBE HISTORY table_name

What we see in the logs and how to interpret it

8. Ganglia Metrics

Scope: Cluster resource usage

Where to find (UI path): Compute → Cluster → Metrics tab

What we see in the logs and how to interpret it

9. Audit Logs

Scope: Workspace-level activity

Where to find: Admin Console → Audit Logs

Cloud Storage (log delivery)

What we see in the logs and how to interpret it

10. Streaming Query Logs

Scope: Structured Streaming

Where to find: Notebook → query.lastProgress

Spark UI → Streaming tab

What we see in the logs and how to interpret it

Debugging Scenarios: Putting It All Together

Autoscaling Not Working as Expected

What to look for and what it tells us

Frequent scale up and scale down

Scaling happens too late

Final Takeaway

Where to find (UI path): Compute → Cluster → Event Log tab