Topic Overview
Monitoring Layer
Track pipeline health using Lakeflow Jobs run history (duration trends, failure rates) and the DAG task graph (identify upstream blockers). Set alerts for failures and SLA misses. Compare current run times against historical baselines to detect regression.
Spark UI Diagnostics
The Spark UI provides stage-level metrics for every Spark job. Key metrics: task duration distribution, shuffle read/write sizes, spill (memory and disk). Use them to identify data skew, excessive shuffling, and memory pressure before they cause production failures.
Cluster Diagnostics
Common cluster-level issues include startup failures (library conflicts, init script errors), out-of-memory (OOM) errors (executor or driver), and library version conflicts. Each has specific diagnosis patterns and fixes.
Unity Catalog Governance
Unity Catalog provides fine-grained access control (GRANT/REVOKE/DENY), column-level masking, row-level security filters, and ABAC (Attribute-Based Access Control) policies. All governance is centrally managed and auditable.
Monitoring & Job Performance
Lakeflow Jobs Run History
Accessible in the Jobs UI → select job → Runs tab. Shows all past runs with start time, duration, trigger, and status (Succeeded/Failed/Running/Skipped). Click any run to drill into task-level details. Use the history to spot:
- Which tasks consistently fail
- Which runs are slowest
- Whether duration is increasing over time
Comparing Against Historical Baseline
The run history view shows duration for each run over time. If the current run takes 2× longer than previous runs, it signals a performance regression. Common causes:
- Data volume growth
- A new skewed partition
- A dropped broadcast hint
- A cluster configuration change
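As a sketch, the baseline comparison is a simple ratio check over past run durations. The function name and the 2× threshold below are illustrative choices, not a Databricks API:

```python
# Flag a performance regression by comparing the latest run's duration
# against the median of recent runs. Threshold is illustrative.
from statistics import median

def is_regression(durations_min, factor=2.0):
    """durations_min: run durations in minutes, oldest first."""
    *history, current = durations_min
    baseline = median(history)
    return current >= factor * baseline

# A job that normally takes ~10 minutes suddenly takes 25:
print(is_regression([9, 11, 10, 12, 25]))  # True
```

Using the median rather than the mean keeps one earlier outlier run from inflating the baseline.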
DAG Task Graph for Upstream Blockers
In the Jobs UI, the DAG visualization shows task statuses in real time. If Task C is waiting, check its upstream tasks (A and B).
- Task "Waiting" → upstream dependency not complete
- Task "Failed" → blocks all downstream tasks
- Use the DAG to quickly pinpoint bottlenecks
Alerts and SLA Monitoring
Configure job-level alerts (email or webhook) for: job started, job succeeded, job failed, duration exceeds SLA threshold. SLA alerts notify when a job runs longer than expected; critical for time-sensitive pipelines feeding dashboards or downstream systems.
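As a sketch, this alerting can be expressed in the job's JSON settings. The `email_notifications` and `health` fields below follow the shape of the Databricks Jobs API (2.1) as documented, but the address and the 3600-second threshold are placeholder values:

```json
{
  "email_notifications": {
    "on_failure": ["data-team@example.com"]
  },
  "health": {
    "rules": [
      { "metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600 }
    ]
  }
}
```

The health rule fires the duration-warning notification when a run exceeds the threshold, even if it eventually succeeds.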
Spark UI & Troubleshooting
Jobs and Stages View
Spark UI shows all jobs (triggered by actions like count(), write()) and their stages (separated by shuffle boundaries). Each stage shows: number of tasks, duration, input/output size, shuffle read/write size. Click a stage to see per-task metrics.
Stage-Level Task Metrics
For each stage, the task summary shows Min/25th/Median/75th/Max values for: task duration, shuffle read size, shuffle write size, spill (memory), spill (disk). These distributions reveal performance problems. When Max >> Median for task duration or shuffle read, data skew is present.
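The "Max >> Median" heuristic can be sketched in a few lines of plain Python. The 10× ratio is an illustrative rule of thumb, not a Spark constant:

```python
# Apply the stage-summary skew check to a list of per-task durations
# (seconds). Ratio threshold is an illustrative rule of thumb.
from statistics import median

def looks_skewed(task_durations, ratio=10.0):
    return max(task_durations) >= ratio * median(task_durations)

balanced = [5.0] * 200
skewed = [5.0] * 199 + [300.0]   # 199 fast tasks, 1 straggler
print(looks_skewed(balanced))  # False
print(looks_skewed(skewed))    # True
```

The same check applies to shuffle read sizes: a single task reading far more than the median points at a hot key.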
Identifying Data Skew in Spark UI
Pattern: 199 tasks complete in seconds, 1 task takes minutes. Max shuffle read = 5GB, Median = 50MB.
Fix: Confirm AQE skew handling is active (spark.sql.adaptive.skewJoin.enabled=true, default in Databricks). Alternatively: salt the skewed join key or repartition before join.
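Salting can be illustrated without Spark: in plain Python, the skewed (large) side appends a random suffix to the hot key, and the small side replicates each key once per suffix so the join keys still align. `NUM_SALTS` and the key names are illustrative:

```python
# Sketch of key salting for a skewed join. The large side spreads one
# hot key across NUM_SALTS buckets; the small side is replicated once
# per bucket so every salted key still finds its match.
import random

NUM_SALTS = 8

def salt_key(key):
    # Large (skewed) side: append a random salt suffix.
    return f"{key}#{random.randrange(NUM_SALTS)}"

def explode_key(key):
    # Small side: emit every possible salted variant of the key.
    return [f"{key}#{i}" for i in range(NUM_SALTS)]

large = [salt_key("hot_customer") for _ in range(1000)]
small = explode_key("hot_customer")

# Every salted large-side key has a small-side match, but the 1000 rows
# are now spread over up to NUM_SALTS join partitions.
print(len(set(large) - set(small)))  # 0
```

The cost is replicating the small side NUM_SALTS times, which is why salting suits joins where one side is modest in size.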
| Issue | Spark UI Indicator | Root Cause | Fix |
|---|---|---|---|
| Data Skew | Max task time >> Median; Max shuffle read >> Median | Uneven data distribution (hot key) | AQE skew handling, salting, repartition |
| Disk Spilling | Spill (disk) column has large values | Partition too large for executor memory | Increase executor memory, more shuffle partitions, filter earlier |
| Excessive Shuffle | Shuffle read/write sizes very large | Joins/groupBy without optimization | Broadcast small tables, reduce shuffle.partitions, pre-aggregate |
| Slow Stage Overall | All tasks slow (not just one) | Insufficient parallelism or large data volume | Increase cluster size, increase shuffle.partitions |
Cluster Startup Failures
Common causes:
- Init script error → check init script logs in cluster event log
- Library installation failure โ check library logs, version conflicts
- Misconfigured Spark config โ check cluster configuration JSON
- IAM/permission issue โ cluster cannot access cloud resources
Check cluster event log → look for ERROR entries.
Out-of-Memory (OOM) Errors
Executor OOM: java.lang.OutOfMemoryError: Java heap space or GC overhead limit exceeded in task logs.
Causes: partition too large for executor memory, data skew concentrating rows on one executor, caching too much data.
Fixes: increase spark.executor.memory, increase shuffle partitions, filter earlier, avoid wide transformations on large data.
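The fixes above map to Spark configuration settings. A hedged example (the values are illustrative starting points, not recommendations; tune against your workload):

```
spark.executor.memory 16g
spark.sql.shuffle.partitions 400
```

Raising shuffle partitions shrinks each partition so it fits in executor memory; raising executor memory helps only if the node type has headroom.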
Driver OOM
collect() or toPandas() on large DataFrames pulls all data to the driver → driver runs out of memory.
Symptom: driver node fails, job lost.
Fix: Never collect large DataFrames to driver. Use display() with limits, write to Delta instead, or increase spark.driver.memory.
Library Conflicts
Multiple libraries with incompatible versions.
Symptoms: ClassNotFoundException, NoSuchMethodError, import errors.
Fix: check installed libraries in cluster UI, use cluster-scoped vs job-scoped installations, pin specific versions in requirements.txt for reproducibility.
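For example, a pinned requirements.txt (package names and versions here are purely illustrative) keeps every cluster resolving the same dependency set:

```
pandas==2.1.4
pyarrow==14.0.2
requests==2.31.0
```

Unpinned ranges are where version conflicts creep in between cluster restarts.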
Unity Catalog & Governance
3-Level Namespace
catalog.schema.table → every object in Unity Catalog lives in this hierarchy.
- Metastore (top level, one per region)
- Catalog → Schema (database) → Tables/Views/Volumes/Functions
- Permissions granted at any level and inherited downward
Managed vs External Tables
Managed: CREATE TABLE catalog.schema.table (...)
External: CREATE TABLE catalog.schema.table (...) LOCATION 's3://bucket/path'
Drop managed = data deleted. Drop external = metadata only, data stays.
GRANT Statement
GRANT privilege ON securable TO principal
Must GRANT at each level: USE CATALOG + USE SCHEMA + SELECT TABLE.
GRANT SELECT ON TABLE catalog.schema.orders TO `data-analysts`;
GRANT USE CATALOG ON CATALOG my_catalog TO `group`;
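Putting the three levels together, a minimal grant sequence for querying one table might look like this (catalog, schema, and group names are illustrative):

```sql
-- All three levels are required before `data-analysts` can query the table
GRANT USE CATALOG ON CATALOG prod TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `data-analysts`;
GRANT SELECT ON TABLE prod.sales.orders TO `data-analysts`;
```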
REVOKE Statement
REVOKE privilege ON securable FROM principal
Removes a previously granted privilege. Does not deny โ just removes the grant.
REVOKE SELECT ON TABLE catalog.schema.orders FROM `data-analysts`;
DENY Statement
DENY privilege ON securable TO principal
Explicitly prohibits access even if a GRANT exists at another level. Stronger than not granting: used to create exceptions within broadly granted access.
DENY wins over GRANT.
Privilege Hierarchy
Principals (users, groups, service principals) inherit permissions at lower levels from grants at higher levels.
GRANT ALL PRIVILEGES ON CATALOG dev TO `dev-team`; gives all permissions on everything in the dev catalog.
Grant USE CATALOG + USE SCHEMA before granting table-level privileges.
| Privilege | Applies To | What It Allows |
|---|---|---|
| USE CATALOG | Catalog | Access objects within the catalog |
| USE SCHEMA | Schema | Access objects within the schema |
| SELECT | Table/View | Read data |
| MODIFY | Table | INSERT, UPDATE, DELETE, MERGE |
| CREATE TABLE | Schema | Create new tables in the schema |
| ALL PRIVILEGES | Any | All applicable privileges |
| READ VOLUME | Volume | Read files from a Unity Catalog Volume |
| WRITE VOLUME | Volume | Write files to a Unity Catalog Volume |
Column-Level Masking
Mask sensitive column values based on user group. Example: show full SSN to HR group, mask to XXX-XX-#### for all others.
Implemented with a masking function: ALTER TABLE table_name ALTER COLUMN ssn SET MASK function_name. The function returns different values based on current_user() or is_member('group').
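A minimal sketch, assuming a main.hr.employees table with an ssn column; the function body and all object names are illustrative, while the DDL shape follows Unity Catalog column masks:

```sql
-- HR members see the full SSN; everyone else sees only the last 4 digits
CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('hr') THEN ssn
  ELSE CONCAT('XXX-XX-', RIGHT(ssn, 4))
END;

ALTER TABLE main.hr.employees ALTER COLUMN ssn SET MASK ssn_mask;
```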
Row-Level Security (RLS)
Filter which rows a user can see based on identity or group membership. Example: sales reps can only see their own region's data.
Implemented via a row filter function: ALTER TABLE SET ROW FILTER function_name ON (region_col). Filter checks current_user() or group membership and applies a WHERE clause.
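A minimal sketch, assuming an orders table with a region column; names and the group check are illustrative, while the DDL shape follows Unity Catalog row filters:

```sql
-- Members of `global-sales` see every row; everyone else sees only US rows
CREATE OR REPLACE FUNCTION region_filter(region STRING)
RETURNS BOOLEAN
RETURN is_member('global-sales') OR region = 'US';

ALTER TABLE prod.sales.orders SET ROW FILTER region_filter ON (region);
```

The filter function is evaluated per row at query time, so no query rewrites are needed on the consumer side.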
Unity Catalog ABAC Policies
Centralized policy-based access control at the Unity Catalog level. ABAC policies define row-level filtering and column masking rules that apply across all queries on protected tables, without modifying query code.
Policies reference user attributes (group membership, user properties) to dynamically control data visibility at query time. More scalable than per-table grants for complex access patterns.
Practice Quiz (10 Questions)
- A job hits java.lang.OutOfMemoryError: Java heap space in executor logs during a large groupBy aggregation. Which action is MOST likely to resolve this?
- An engineer grants a user access to a table (prod.sales.orders) by running GRANT SELECT ON TABLE prod.sales.orders TO user. The user still cannot query the table. What is MOST likely missing?
- You must guarantee that analyst-group can never access data in the prod.finance.payroll table, even if a broader GRANT ALL PRIVILEGES ON SCHEMA prod.finance exists for their group. Which statement achieves this?
- After running VACUUM on a Delta table with the default 7-day retention, what happens to data files written 3 days ago?
Memory Hooks
- Executor OOM: increase spark.executor.memory + more shuffle partitions.
- Driver OOM: you're collecting too much data to the driver → use write() instead of collect(). Never use collect() on large DataFrames.
- Table access needs SELECT plus GRANT USE CATALOG ON CATALOG ... and GRANT USE SCHEMA ON SCHEMA .... All three required for a user to query a table in Unity Catalog.
- Column masks and row filters are functions that check current_user() or is_member().
Flashcards
- Data skew fix: (1) AQE skew join handling (default: spark.sql.adaptive.skewJoin.enabled=true). (2) Salt the skewed key. (3) Repartition before the join. (4) Broadcast join if one side is small.
- Executor OOM fix: increase spark.executor.memory, increase shuffle partitions, filter earlier.
- Driver OOM: collect() or toPandas() pulls too much data. Fix: write to Delta instead. Increase spark.driver.memory as a last resort.
- Three grants for table access: (1) GRANT USE CATALOG ON CATALOG catalog_name, (2) GRANT USE SCHEMA ON SCHEMA catalog.schema_name, (3) GRANT SELECT ON TABLE catalog.schema.table_name. All three must be granted; USE CATALOG and USE SCHEMA are often forgotten.
- REVOKE: removes a grant (the principal no longer has that access). DENY: explicitly blocks access even when a broader GRANT exists; DENY wins over GRANT. Use DENY to create exceptions within broadly granted roles.
- Column masking: same rows, different values (ALTER TABLE ... ALTER COLUMN col SET MASK function). Row-level security: different users see different rows (ALTER TABLE ... SET ROW FILTER function ON (col)). Both use functions referencing current_user() or is_member().
- ABAC vs GRANT-based: GRANT-based means explicit per-table/per-user grants; ABAC scales better for complex, org-wide access patterns.
(1) Open the cluster event log.
(2) Look for ERROR entries at startup time.
(3) Check init script logs (most common cause).
(4) Check library installation logs (version conflict).
(5) Verify IAM role/permissions for cloud resource access.
(6) Check Spark configuration for typos or invalid values.
Sudden spike: one-time event → check that run's Spark UI for skew, spill, or shuffle explosion.
Gradual increase: data growth or query regression.
Click the slow run → open Spark UI → check stages.
Study Advisor
Start with the Spark UI skew pattern (Max >> Median = skew). Learn the 3 required Unity Catalog grants for table access (USE CATALOG + USE SCHEMA + SELECT). Understand job cluster monitoring in Lakeflow Jobs run history.
Study OOM diagnosis (executor vs driver), cluster startup failure steps (event log → init script → library logs), and Unity Catalog privilege types (GRANT, REVOKE, DENY; DENY wins). Learn column masking vs row-level security (what vs which rows).
Master Unity Catalog ABAC policies (centralized, attribute-based, scales for org-wide governance), GRANT hierarchy (permissions cascade down catalog → schema → table), and Spark UI disk spill diagnosis (Spill (disk) column in stage metrics).
High-yield: Skew = Max>>Median in Spark UI; AQE skew handling = default on; 3 grants for table access; DENY > GRANT (DENY wins); Column masking = same rows/different values; Row filter = different rows; External table drop = data stays; ABAC = attribute-based centralized policies.
Skew: Max>>Median → AQE or salt. OOM executor: increase memory+partitions. OOM driver: stop collecting! UC grants: USE CATALOG+USE SCHEMA+SELECT. DENY wins over GRANT. Column mask=different values. Row filter=different rows. ABAC=centralized attribute-based. External table drop=data stays. Run history=compare to baseline.