FlashGenius
Databricks DEA Exam Prep · Topic 5 of 5

Monitoring, Troubleshooting & Governance

Spark UI · Job Monitoring · Cluster Diagnostics · Unity Catalog · GRANT/REVOKE · Row-Level Security · ABAC

Topic Overview

Production data engineering requires monitoring job performance, diagnosing bottlenecks in the Spark UI, and enforcing data governance through Unity Catalog. The DEA exam tests all three areas as a unified production readiness domain.

Monitoring Layer

Track pipeline health using Lakeflow Jobs run history (duration trends, failure rates) and the DAG task graph (identify upstream blockers). Set alerts for failures and SLA misses. Compare current run times against historical baselines to detect regression.

Spark UI Diagnostics

The Spark UI provides stage-level metrics for every Spark job. Key metrics: task duration distribution, shuffle read/write sizes, spill (memory and disk). Used to identify data skew, excessive shuffling, and memory pressure before they cause production failures.

Cluster Diagnostics

Common cluster-level issues include startup failures (library conflicts, init script errors), out-of-memory (OOM) errors (executor or driver), and library version conflicts. Each has specific diagnosis patterns and fixes.

Unity Catalog Governance

Unity Catalog provides fine-grained access control (GRANT/REVOKE/DENY), column-level masking, row-level security filters, and ABAC (Attribute-Based Access Control) policies. All governance is centrally managed and auditable.

Monitoring & Job Performance

Lakeflow Jobs Run History

Accessible in the Jobs UI → select job → Runs tab. Shows all past runs with start time, duration, trigger, and status (Succeeded/Failed/Running/Skipped). Click any run to drill into task-level details.

  • Which tasks consistently fail
  • Which runs are slowest
  • Whether duration is increasing over time

Comparing Against Historical Baseline

The run history view shows duration for each run over time. If the current run takes 2× longer than previous runs, that signals a performance regression. Common causes:

  • Data volume growth
  • A new skewed partition
  • A dropped broadcast hint
  • A cluster configuration change
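A baseline check like this is easy to script against run durations pulled from the Jobs API; the sketch below uses plain Python with a hypothetical `is_regression` helper and an illustrative 2× threshold (not a Databricks feature):

```python
from statistics import median

def is_regression(current_minutes, history_minutes, factor=2.0):
    """Flag a run whose duration exceeds `factor` times the historical median."""
    return current_minutes > factor * median(history_minutes)

# Previous runs hovered around 45 minutes; today's run took 180.
history = [42, 45, 48, 44, 46, 43]
print(is_regression(180, history))  # True: well past 2x the ~44.5-minute median
print(is_regression(46, history))   # False: within the normal band
```

Using the median rather than the mean keeps one earlier outlier run from distorting the baseline.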

DAG Task Graph for Upstream Blockers

In the Jobs UI, the DAG visualization shows task statuses in real time. If Task C is waiting, check its upstream tasks (A and B).

  • Task "Waiting" → upstream dependency not complete
  • Task "Failed" → blocks all downstream tasks
  • Use the DAG to quickly pinpoint bottlenecks
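The upstream-blocker walk can be sketched as a small graph traversal. This is a toy model with hypothetical `deps`/`status` dictionaries, not a Jobs API call:

```python
def root_blockers(task, deps, status):
    """Walk upstream from a stuck task to the task(s) actually causing the delay.

    deps maps task -> upstream tasks; status maps task -> state. An incomplete
    upstream task whose own parents are all done is a root blocker.
    """
    blockers = set()
    for up in deps.get(task, []):
        if status[up] == "Succeeded":
            continue
        upstream = root_blockers(up, deps, status)
        blockers |= upstream if upstream else {up}
    return blockers

# Task C waits on A and B; B itself waits on A, and A has failed.
deps = {"C": ["A", "B"], "B": ["A"]}
status = {"A": "Failed", "B": "Waiting", "C": "Waiting"}
print(root_blockers("C", deps, status))  # {'A'}: the failed task blocking everything
```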

Alerts and SLA Monitoring

Configure job-level alerts (email or webhook) for: job started, job succeeded, job failed, duration exceeds SLA threshold. SLA alerts notify when a job runs longer than expected, which is critical for time-sensitive pipelines feeding dashboards or downstream systems.

Spark UI & Troubleshooting

Spark UI Key Views

Jobs and Stages View

Spark UI shows all jobs (triggered by actions like count(), write()) and their stages (separated by shuffle boundaries). Each stage shows: number of tasks, duration, input/output size, shuffle read/write size. Click a stage to see per-task metrics.

Stage-Level Task Metrics

For each stage, the task summary shows Min/25th/Median/75th/Max values for: task duration, shuffle read size, shuffle write size, spill (memory), spill (disk). These distributions reveal performance problems. When Max >> Median for task duration or shuffle read, data skew is present.
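The Max >> Median heuristic is simple to express in code; the `looks_skewed` helper and its 5× ratio below are illustrative assumptions, not Spark defaults:

```python
from statistics import median

def looks_skewed(task_values, ratio=5.0):
    """Max >> Median heuristic over a stage's per-task metrics.

    task_values can be task durations or shuffle read sizes; the 5x ratio
    is an illustrative threshold, not a Spark default.
    """
    return max(task_values) > ratio * median(task_values)

durations = [3] * 199 + [540]    # 199 tasks finish in ~3s, one takes 9 minutes
print(looks_skewed(durations))   # True: the classic single-hot-task pattern
print(looks_skewed([3] * 200))   # False: uniformly fast tasks
```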

Identifying Data Skew in Spark UI

Pattern: 199 tasks complete in seconds, 1 task takes minutes. Max shuffle read = 5GB, Median = 50MB.

Fix: Confirm AQE skew handling is active (spark.sql.adaptive.skewJoin.enabled=true, default in Databricks). Alternatively: salt the skewed join key or repartition before join.
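Salting can be sketched in plain Python to show how a single hot key is spread across several buckets. This is illustrative only; in PySpark you would add the salt column to the fact side and explode the dimension side with every salt value:

```python
import random
from collections import Counter

def salted_key(key, num_salts=8):
    """Append a random salt so a single hot key hashes to several partitions.

    The other side of the join must be exploded with every salt value so all
    pairs still match; num_salts=8 is an illustrative choice.
    """
    return f"{key}#{random.randrange(num_salts)}"

random.seed(0)
rows = ["hot_customer"] * 1000                 # one key dominates the input
buckets = Counter(salted_key(k) for k in rows)
print(len(buckets))                            # 8 salted variants instead of 1
print(max(buckets.values()) < len(rows))       # no single bucket holds everything
```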

Skew vs Spill vs Shuffle Comparison
Issue | Spark UI Indicator | Root Cause | Fix
Data Skew | Max task time >> Median; Max shuffle read >> Median | Uneven data distribution (hot key) | AQE skew handling, salting, repartition
Disk Spilling | Spill (disk) column has large values | Partition too large for executor memory | Increase executor memory, more shuffle partitions, filter earlier
Excessive Shuffle | Shuffle read/write sizes very large | Joins/groupBy without optimization | Broadcast small tables, reduce shuffle.partitions, pre-aggregate
Slow Stage Overall | All tasks slow (not just one) | Insufficient parallelism or large data volume | Increase cluster size, increase shuffle.partitions

Cluster Diagnostics

Cluster Startup Failures

Common causes:

  • Init script error: check the init script logs in the cluster event log
  • Library installation failure: check library logs for version conflicts
  • Misconfigured Spark config: check the cluster configuration JSON
  • IAM/permission issue: the cluster cannot access cloud resources

Check the cluster event log and look for ERROR entries.

Out-of-Memory (OOM) Errors

Executor OOM: java.lang.OutOfMemoryError: Java heap space or GC overhead limit exceeded in task logs.

Causes: partition too large, collecting too much data to executor.

Fixes: increase spark.executor.memory, increase shuffle partitions, filter earlier, avoid wide transformations on large data.
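As a sketch, the config-side fixes map to settings like these; the values are placeholders to tune, not recommendations:

```python
# Illustrative executor-OOM mitigations; values are placeholders to tune,
# not recommendations. Set on the cluster or via spark.conf.set(...).
oom_mitigations = {
    "spark.executor.memory": "16g",         # larger heap per executor
    "spark.sql.shuffle.partitions": "800",  # smaller partitions per task
}
# Config alone is not the whole fix: also filter and project early so less
# data reaches the shuffle, and avoid wide transformations on huge inputs.
print(sorted(oom_mitigations))
```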

Driver OOM

collect() or toPandas() on large DataFrames pulls all data to the driver, and the driver runs out of memory.

Symptom: driver node fails, job lost.

Fix: never collect large DataFrames to the driver. Use display() with limits, write results to Delta instead, or increase spark.driver.memory as a last resort.

Library Conflicts

Multiple libraries with incompatible versions.

Symptoms: ClassNotFoundException, NoSuchMethodError, import errors.

Fix: check the installed libraries in the cluster UI, prefer job-scoped over cluster-scoped installations when jobs need different versions, and pin specific versions in requirements.txt for reproducibility.

Unity Catalog & Governance

Unity Catalog Hierarchy

3-Level Namespace

catalog.schema.table: every object in Unity Catalog lives in this three-level hierarchy.

  • Metastore (top level, one per region)
  • Catalog → Schema (database) → Tables/Views/Volumes/Functions
  • Permissions granted at any level are inherited downward

Managed vs External Tables

Managed: CREATE TABLE catalog.schema.table (...)

External: CREATE TABLE catalog.schema.table (...) LOCATION 's3://bucket/path'

Drop managed = data deleted. Drop external = metadata only, data stays.

Access Control: GRANT / REVOKE / DENY

GRANT Statement

GRANT privilege ON securable TO principal

Access must be granted at each level: USE CATALOG + USE SCHEMA + SELECT on the table.

  • GRANT SELECT ON TABLE catalog.schema.orders TO `data-analysts`;
  • GRANT USE CATALOG ON CATALOG my_catalog TO `group`;
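The cascade requirement can be modeled as a toy privilege check. The `can_select` helper and its tuple format are a simulation for study purposes, not a Unity Catalog API:

```python
def can_select(grants, catalog, schema, table, principal):
    """Toy model of the cascade rule: querying a table needs USE CATALOG,
    USE SCHEMA, and SELECT. grants is a set of (privilege, securable,
    principal) tuples; this is a simulation, not a Unity Catalog API."""
    needed = [
        ("USE CATALOG", catalog),
        ("USE SCHEMA", f"{catalog}.{schema}"),
        ("SELECT", f"{catalog}.{schema}.{table}"),
    ]
    return all((priv, obj, principal) in grants for priv, obj in needed)

grants = {
    ("USE CATALOG", "prod", "data-analysts"),
    ("SELECT", "prod.sales.orders", "data-analysts"),
}
print(can_select(grants, "prod", "sales", "orders", "data-analysts"))  # False: USE SCHEMA missing
grants.add(("USE SCHEMA", "prod.sales", "data-analysts"))
print(can_select(grants, "prod", "sales", "orders", "data-analysts"))  # True: all three present
```

This is exactly the trap in the common exam scenario: SELECT was granted but the query still fails because one of the USE grants is missing.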

REVOKE Statement

REVOKE privilege ON securable FROM principal

Removes a previously granted privilege. Does not deny; it just removes the grant.

REVOKE SELECT ON TABLE catalog.schema.orders FROM `data-analysts`;

DENY Statement

DENY privilege ON securable TO principal

Explicitly prohibits access even if a GRANT exists at another level. Stronger than not granting: used to create exceptions within broadly granted access.

DENY wins over GRANT.
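The DENY-wins precedence can be sketched as a toy resolver; the data structures here are hypothetical and this is not the actual Unity Catalog engine:

```python
def effective_access(grants, denies, privilege, chain, principal):
    """DENY-wins resolution: a DENY anywhere on the securable chain blocks
    access regardless of grants at other levels. Toy model; chain lists the
    hierarchy, e.g. ["prod", "prod.finance", "prod.finance.payroll"]."""
    if any((privilege, obj, principal) in denies for obj in chain):
        return False
    return any((privilege, obj, principal) in grants for obj in chain)

chain = ["prod", "prod.finance", "prod.finance.payroll"]
grants = {("SELECT", "prod.finance", "analyst-group")}          # broad schema grant
denies = {("SELECT", "prod.finance.payroll", "analyst-group")}  # targeted exception
print(effective_access(grants, denies, "SELECT", chain, "analyst-group"))  # False: DENY wins
print(effective_access(grants, set(), "SELECT", chain, "analyst-group"))   # True without the DENY
```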

Privilege Hierarchy

Principals (users, groups, service principals) inherit permissions at lower levels from grants at higher levels.

GRANT ALL PRIVILEGES ON CATALOG dev TO `dev-team`; gives all permissions on everything in the dev catalog.

Grant USE CATALOG + USE SCHEMA before granting table-level privileges.

Privileges Reference
Privilege | Applies To | What It Allows
USE CATALOG | Catalog | Access objects within the catalog
USE SCHEMA | Schema | Access objects within the schema
SELECT | Table/View | Read data
MODIFY | Table | INSERT, UPDATE, DELETE, MERGE
CREATE TABLE | Schema | Create new tables in the schema
ALL PRIVILEGES | Any | All applicable privileges
READ VOLUME | Volume | Read files from a Unity Catalog Volume
WRITE VOLUME | Volume | Write files to a Unity Catalog Volume

Row & Column Security

Column-Level Masking

Mask sensitive column values based on user group. Example: show full SSN to HR group, mask to XXX-XX-#### for all others.

Implemented with a masking function: ALTER TABLE t ALTER COLUMN ssn SET MASK mask_fn. The function returns different values based on current_user() or is_member('group').
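The branching a mask function performs can be sketched in plain Python. `mask_ssn` is hypothetical; in Databricks the logic would live in the SQL or Python function named by the mask clause, with is_member('hr') in place of the set test:

```python
def mask_ssn(ssn, user_groups):
    """HR sees the full SSN; everyone else sees XXX-XX-<last four>.

    In Databricks this branch would live in the function referenced by the
    mask clause, using is_member('hr') instead of a plain set test.
    """
    if "hr" in user_groups:
        return ssn
    return "XXX-XX-" + ssn[-4:]

print(mask_ssn("123-45-6789", {"hr"}))        # 123-45-6789
print(mask_ssn("123-45-6789", {"analysts"}))  # XXX-XX-6789
```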

Row-Level Security (RLS)

Filter which rows a user can see based on identity or group membership. Example: sales reps can only see their own region's data.

Implemented via a row filter function: ALTER TABLE t SET ROW FILTER filter_fn ON (region_col). The filter checks current_user() or group membership and behaves like a WHERE clause.
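A row filter's effect can be sketched in plain Python. `apply_row_filter` is hypothetical; in Databricks the predicate lives in the function applied by the row filter clause, with the user's region derived from current_user() or is_member():

```python
def apply_row_filter(rows, user_region):
    """Each rep sees only their region's rows; the predicate acts like a
    WHERE clause appended to every query against the table."""
    return [row for row in rows if row["region"] == user_region]

rows = [
    {"region": "EMEA", "amount": 100},
    {"region": "AMER", "amount": 250},
    {"region": "EMEA", "amount": 75},
]
print(apply_row_filter(rows, "EMEA"))  # the two EMEA rows only
```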

Unity Catalog ABAC Policies

Centralized policy-based access control at the Unity Catalog level. ABAC policies define row-level filtering and column masking rules that apply across all queries on protected tables, without modifying query code.

Policies reference user attributes (group membership, user properties) to dynamically control data visibility at query time. More scalable than per-table grants for complex access patterns.

Practice Quiz: 10 Questions

Question 1 of 10
In the Spark UI, a data engineer observes that a stage has 200 tasks. The task duration summary shows: Min=2s, Median=3s, Max=9 minutes. The shuffle read summary shows: Median=80MB, Max=8GB. This pattern MOST likely indicates:
Question 2 of 10
A data engineer receives a java.lang.OutOfMemoryError: Java heap space error in executor logs during a large groupBy aggregation. Which action is MOST likely to resolve this?
Question 3 of 10
A team member needs SELECT access to a table in Unity Catalog (prod.sales.orders). The engineer runs GRANT SELECT ON TABLE prod.sales.orders TO user. The user still cannot query the table. What is MOST likely missing?
Question 4 of 10
Which Unity Catalog feature allows different users to see different values in the same column โ€” for example, full credit card numbers for finance users and masked values for all others?
Question 5 of 10
A Lakeflow Job that normally completes in 45 minutes takes 3 hours today. In the Jobs UI run history, all previous runs completed in 40–50 minutes. Where should the data engineer look FIRST to diagnose the issue?
Question 6 of 10
A data engineer wants to ensure that users in the analyst-group can never access data in the prod.finance.payroll table, even if a broader GRANT ALL PRIVILEGES ON SCHEMA prod.finance exists for their group. Which statement achieves this?
Question 7 of 10
When running VACUUM on a Delta table with the default 7-day retention, what happens to data files written 3 days ago?
Question 8 of 10
A data engineer drops an external table registered in Unity Catalog. What happens to the underlying data files?
Question 9 of 10
Unity Catalog ABAC policies differ from standard GRANT-based access control primarily because:
Question 10 of 10
A data engineer notices disk spilling in a Spark stage processing a large shuffle join. Which TWO actions would MOST directly reduce spilling?

Memory Hooks

๐Ÿ” Skew Pattern in Spark UI
"Max >> Median = Skew; All slow = Something else"
Data skew: max task duration and max shuffle read are much larger than median. Fix: AQE skew handling (on by default). If all tasks are slow, it's a different problem (too few partitions, cluster too small).
💾 OOM = Wrong Place to Look
"Executor OOM → executor memory. Driver OOM → stop collecting!"
Executor OOM: increase spark.executor.memory + more shuffle partitions. Driver OOM: you're collecting too much data to the driver; use write() instead of collect(). Never use collect() on large DataFrames.
🔑 GRANT Cascade Rule
"USE CATALOG + USE SCHEMA + SELECT = minimum for table read"
Granting SELECT on a table isn't enough. User also needs: GRANT USE CATALOG ON CATALOG ... and GRANT USE SCHEMA ON SCHEMA .... All three required for a user to query a table in Unity Catalog.
🎭 DENY vs REVOKE
"REVOKE removes a gift; DENY slams the door"
REVOKE: removes a previously granted privilege (no access if nothing else grants it). DENY: explicitly blocks access even if a broader GRANT exists. DENY wins over GRANT; use it to create exceptions.
🔒 Column Mask vs Row Filter
"Mask = what you see; Filter = which rows you see"
Column masking: same rows, different column values per user group (mask SSN). Row-level security: same table, different rows per user (filter by region). Both implemented as functions referencing current_user() or is_member().
📈 Run History = Your SLA Baseline
"Today took 3 hours? Compare to last 30 runs"
Lakeflow Jobs run history shows duration per run. Compare current vs historical average. Sudden spike = investigate that specific run's Spark UI and DAG task graph. Gradual increase = data growth or query regression.

Flashcards

Flashcard 1
Data skew: how to identify in Spark UI and 3 ways to fix
Identify: stage task summary shows Max task duration >> Median; Max shuffle read >> Median.

Fix: (1) AQE skew join handling (on by default: spark.sql.adaptive.skewJoin.enabled=true). (2) Salt the skewed key. (3) Repartition before join. (4) Broadcast join if one side is small.
Flashcard 2
Executor OOM vs Driver OOM โ€” causes and fixes
Executor OOM: partition data > executor heap. Fix: increase spark.executor.memory, increase shuffle partitions, filter earlier.

Driver OOM: collect() or toPandas() pulls too much data. Fix: write to Delta instead. Increase spark.driver.memory as last resort.
Flashcard 3
Unity Catalog minimum grants to allow a user to query a table
3 grants required:

(1) GRANT USE CATALOG ON CATALOG catalog_name
(2) GRANT USE SCHEMA ON SCHEMA catalog.schema_name
(3) GRANT SELECT ON TABLE catalog.schema.table_name

All three must be granted. USE CATALOG and USE SCHEMA are often forgotten.
Flashcard 4
GRANT vs REVOKE vs DENY in Unity Catalog
GRANT: gives access (additive).
REVOKE: removes a grant (no longer has that access).
DENY: explicitly blocks access even when a broader GRANT exists; DENY wins over GRANT.

Use DENY to create exceptions within broadly granted roles.
Flashcard 5
Column masking vs row-level security in Unity Catalog
Column masking: all rows visible, column values differ by user group. ALTER TABLE t ALTER COLUMN col SET MASK mask_fn.

Row-level security: different users see different rows. ALTER TABLE t SET ROW FILTER filter_fn ON (col).

Both use functions referencing current_user() or is_member().
Flashcard 6
What is Unity Catalog ABAC and how does it differ from GRANT-based control?
ABAC: Attribute-Based Access Control. Centralized policies apply row-level filtering and column masking based on user attributes (group, role). Applied at query time automatically, with no code changes needed.

vs GRANT-based: explicit per-table/per-user grants. ABAC scales better for complex, org-wide access patterns.
Flashcard 7
Cluster startup failure diagnosis steps
(1) Open cluster event log in Databricks UI.
(2) Look for ERROR entries at startup time.
(3) Check init script logs (most common cause).
(4) Check library installation logs (version conflict).
(5) Verify IAM role/permissions for cloud resource access.
(6) Check Spark configuration for typos or invalid values.
Flashcard 8
How to use Lakeflow Jobs run history to detect performance regression
Jobs UI → select job → Runs tab. Compare current run duration against historical runs.

Sudden spike: a one-time event; check that run's Spark UI for skew, spill, or shuffle explosion.
Gradual increase: data growth or query regression.

Click the slow run → open Spark UI → check stages.

Study Advisor

Beginner Path: Start Here

Start with the Spark UI skew pattern (Max >> Median = skew). Learn the 3 required Unity Catalog grants for table access (USE CATALOG + USE SCHEMA + SELECT). Understand job cluster monitoring in Lakeflow Jobs run history.

Intermediate Path: Build Depth

Study OOM diagnosis (executor vs driver), cluster startup failure steps (event log → init script → library logs), and Unity Catalog privilege types (GRANT, REVOKE, DENY; DENY wins). Learn column masking vs row-level security (what vs which rows).

Advanced Path: Go Deep

Master Unity Catalog ABAC policies (centralized, attribute-based, scales for org-wide governance), GRANT hierarchy (permissions cascade down catalog → schema → table), and Spark UI disk spill diagnosis (the Spill (disk) column in stage metrics).

Exam Focus: High Yield

High-yield: Skew = Max>>Median in Spark UI; AQE skew handling = default on; 3 grants for table access; DENY > GRANT (DENY wins); Column masking = same rows/different values; Row filter = different rows; External table drop = data stays; ABAC = attribute-based centralized policies.

Quick Review: Last Minute

Skew: Max>>Medianโ†’AQE or salt. OOM executor: increase memory+partitions. OOM driver: stop collecting! UC grants: USE CATALOG+USE SCHEMA+SELECT. DENY wins over GRANT. Column mask=different values. Row filter=different rows. ABAC=centralized attribute-based. External table drop=data stays. Run history=compare to baseline.