Topic Overview
Monitoring Layer
Track pipeline health using Lakeflow Jobs run history (duration trends, failure rates) and the DAG task graph (identify upstream blockers). Set alerts for failures and SLA misses. Compare current run times against historical baselines to detect regression.
Spark UI Diagnostics
The Spark UI provides stage-level metrics for every Spark job. Key metrics: task duration distribution, shuffle read/write sizes, spill (memory and disk). Use them to identify data skew, excessive shuffling, and memory pressure before they cause production failures.
Cluster Diagnostics
Common cluster-level issues include startup failures (library conflicts, init script errors), out-of-memory (OOM) errors (executor or driver), and library version conflicts. Each has specific diagnosis patterns and fixes.
Unity Catalog Governance
Unity Catalog provides fine-grained access control (GRANT/REVOKE/DENY), column-level masking, row-level security filters, and ABAC (Attribute-Based Access Control) policies. All governance is centrally managed and auditable.
Monitoring & Job Performance
Lakeflow Jobs Run History
Accessible in the Jobs UI → select job → Runs tab. Shows all past runs with start time, duration, trigger, and status (Succeeded/Failed/Running/Skipped). Click any run to drill into task-level details. Use the history to spot:
- Which tasks consistently fail
- Which runs are slowest
- Whether duration is increasing over time
Comparing Against Historical Baseline
The run history view shows duration for each run over time. If the current run takes 2× longer than previous runs, it signals a performance regression. Common causes:
- Data volume growth
- A new skewed partition
- A dropped broadcast hint
- A cluster configuration change
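As a sketch, the baseline comparison is a simple ratio check over past run durations. The function name and the 2× threshold below are illustrative choices, not a Databricks API:

```python
# Flag a performance regression by comparing the latest run's duration
# against the median of recent runs. Threshold is illustrative.
from statistics import median

def is_regression(durations_min, factor=2.0):
    """durations_min: run durations in minutes, oldest first."""
    *history, current = durations_min
    baseline = median(history)
    return current >= factor * baseline

# A job that normally takes ~10 minutes suddenly takes 25:
print(is_regression([9, 11, 10, 12, 25]))  # True
```

Using the median rather than the mean keeps one earlier outlier run from inflating the baseline.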
DAG Task Graph for Upstream Blockers
In the Jobs UI, the DAG visualization shows task statuses in real time. If Task C is waiting, check its upstream tasks (A and B).
- Task "Waiting" → upstream dependency not complete
- Task "Failed" → blocks all downstream tasks
- Use the DAG to quickly pinpoint bottlenecks
Alerts and SLA Monitoring
Configure job-level alerts (email or webhook) for: job started, job succeeded, job failed, duration exceeds SLA threshold. SLA alerts notify when a job runs longer than expected; critical for time-sensitive pipelines feeding dashboards or downstream systems.
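As a sketch, this alerting can be expressed in the job's JSON settings. The `email_notifications` and `health` fields below follow the shape of the Databricks Jobs API (2.1) as documented, but the address and the 3600-second threshold are placeholder values:

```json
{
  "email_notifications": {
    "on_failure": ["data-team@example.com"]
  },
  "health": {
    "rules": [
      { "metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 3600 }
    ]
  }
}
```

The health rule fires the duration-warning notification when a run exceeds the threshold, even if it eventually succeeds.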
Spark UI & Troubleshooting
Jobs and Stages View
Spark UI shows all jobs (triggered by actions like count(), write()) and their stages (separated by shuffle boundaries). Each stage shows: number of tasks, duration, input/output size, shuffle read/write size. Click a stage to see per-task metrics.
Stage-Level Task Metrics
For each stage, the task summary shows Min/25th/Median/75th/Max values for: task duration, shuffle read size, shuffle write size, spill (memory), spill (disk). These distributions reveal performance problems. When Max >> Median for task duration or shuffle read, data skew is present.
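The "Max >> Median" heuristic can be sketched in a few lines of plain Python. The 10× ratio is an illustrative rule of thumb, not a Spark constant:

```python
# Apply the stage-summary skew check to a list of per-task durations
# (seconds). Ratio threshold is an illustrative rule of thumb.
from statistics import median

def looks_skewed(task_durations, ratio=10.0):
    return max(task_durations) >= ratio * median(task_durations)

balanced = [5.0] * 200
skewed = [5.0] * 199 + [300.0]   # 199 fast tasks, 1 straggler
print(looks_skewed(balanced))  # False
print(looks_skewed(skewed))    # True
```

The same check applies to shuffle read sizes: a single task reading far more than the median points at a hot key.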
Identifying Data Skew in Spark UI
Pattern: 199 tasks complete in seconds, 1 task takes minutes. Max shuffle read = 5GB, Median = 50MB.
Fix: Confirm AQE skew handling is active (spark.sql.adaptive.skewJoin.enabled=true, default in Databricks). Alternatively: salt the skewed join key or repartition before join.
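Salting can be illustrated without Spark: in plain Python, the skewed (large) side appends a random suffix to the hot key, and the small side replicates each key once per suffix so the join keys still align. `NUM_SALTS` and the key names are illustrative:

```python
# Sketch of key salting for a skewed join. The large side spreads one
# hot key across NUM_SALTS buckets; the small side is replicated once
# per bucket so every salted key still finds its match.
import random

NUM_SALTS = 8

def salt_key(key):
    # Large (skewed) side: append a random salt suffix.
    return f"{key}#{random.randrange(NUM_SALTS)}"

def explode_key(key):
    # Small side: emit every possible salted variant of the key.
    return [f"{key}#{i}" for i in range(NUM_SALTS)]

large = [salt_key("hot_customer") for _ in range(1000)]
small = explode_key("hot_customer")

# Every salted large-side key has a small-side match, but the 1000 rows
# are now spread over up to NUM_SALTS join partitions.
print(len(set(large) - set(small)))  # 0
```

The cost is replicating the small side NUM_SALTS times, which is why salting suits joins where one side is modest in size.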
| Issue | Spark UI Indicator | Root Cause | Fix |
|---|---|---|---|
| Data Skew | Max task time >> Median; Max shuffle read >> Median | Uneven data distribution (hot key) | AQE skew handling, salting, repartition |
| Disk Spilling | Spill (disk) column has large values | Partition too large for executor memory | Increase executor memory, more shuffle partitions, filter earlier |
| Excessive Shuffle | Shuffle read/write sizes very large | Joins/groupBy without optimization | Broadcast small tables, reduce shuffle.partitions, pre-aggregate |
| Slow Stage Overall | All tasks slow (not just one) | Insufficient parallelism or large data volume | Increase cluster size, increase shuffle.partitions |
Cluster Startup Failures
Common causes:
- Init script error → check init script logs in cluster event log
- Library installation failure โ check library logs, version conflicts
- Misconfigured Spark config โ check cluster configuration JSON
- IAM/permission issue โ cluster cannot access cloud resources
Check cluster event log → look for ERROR entries.
Out-of-Memory (OOM) Errors
Executor OOM: java.lang.OutOfMemoryError: Java heap space or GC overhead limit exceeded in task logs.
Causes: partition too large for executor memory, data skew concentrating rows on one executor, caching too much data.
Fixes: increase spark.executor.memory, increase shuffle partitions, filter earlier, avoid wide transformations on large data.
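The fixes above map to Spark configuration settings. A hedged example (the values are illustrative starting points, not recommendations; tune against your workload):

```
spark.executor.memory 16g
spark.sql.shuffle.partitions 400
```

Raising shuffle partitions shrinks each partition so it fits in executor memory; raising executor memory helps only if the node type has headroom.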
Driver OOM
collect() or toPandas() on large DataFrames pulls all data to the driver → driver runs out of memory.
Symptom: driver node fails, job lost.
Fix: Never collect large DataFrames to driver. Use display() with limits, write to Delta instead, or increase spark.driver.memory.
Library Conflicts
Multiple libraries with incompatible versions.
Symptoms: ClassNotFoundException, NoSuchMethodError, import errors.
Fix: check installed libraries in cluster UI, use cluster-scoped vs job-scoped installations, pin specific versions in requirements.txt for reproducibility.
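For example, a pinned requirements.txt (package names and versions here are purely illustrative) keeps every cluster resolving the same dependency set:

```
pandas==2.1.4
pyarrow==14.0.2
requests==2.31.0
```

Unpinned ranges are where version conflicts creep in between cluster restarts.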
Unity Catalog & Governance
3-Level Namespace
catalog.schema.table → every object in Unity Catalog lives in this hierarchy.
- Metastore (top level, one per region)
- Catalog → Schema (database) → Tables/Views/Volumes/Functions
- Permissions granted at any level and inherited downward
Managed vs External Tables
Managed: CREATE TABLE catalog.schema.table (...)
External: CREATE TABLE catalog.schema.table (...) LOCATION 's3://bucket/path'
Drop managed = data deleted. Drop external = metadata only, data stays.
GRANT Statement
GRANT privilege ON securable TO principal
Must GRANT at each level: USE CATALOG + USE SCHEMA + SELECT TABLE.
GRANT SELECT ON TABLE catalog.schema.orders TO `data-analysts`;
GRANT USE CATALOG ON CATALOG my_catalog TO `group`;
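Putting the three levels together, a minimal grant sequence for querying one table might look like this (catalog, schema, and group names are illustrative):

```sql
-- All three levels are required before `data-analysts` can query the table
GRANT USE CATALOG ON CATALOG prod TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `data-analysts`;
GRANT SELECT ON TABLE prod.sales.orders TO `data-analysts`;
```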
REVOKE Statement
REVOKE privilege ON securable FROM principal
Removes a previously granted privilege. Does not deny โ just removes the grant.
REVOKE SELECT ON TABLE catalog.schema.orders FROM `data-analysts`;
DENY Statement
DENY privilege ON securable TO principal
Explicitly prohibits access even if a GRANT exists at another level. Stronger than not granting: used to create exceptions within broadly granted access.
DENY wins over GRANT.
Privilege Hierarchy
Principals (users, groups, service principals) inherit permissions at lower levels from grants at higher levels.
GRANT ALL PRIVILEGES ON CATALOG dev TO `dev-team`; gives all permissions on everything in the dev catalog.
Grant USE CATALOG + USE SCHEMA before granting table-level privileges.
| Privilege | Applies To | What It Allows |
|---|---|---|
| USE CATALOG | Catalog | Access objects within the catalog |
| USE SCHEMA | Schema | Access objects within the schema |
| SELECT | Table/View | Read data |
| MODIFY | Table | INSERT, UPDATE, DELETE, MERGE |
| CREATE TABLE | Schema | Create new tables in the schema |
| ALL PRIVILEGES | Any | All applicable privileges |
| READ VOLUME | Volume | Read files from a Unity Catalog Volume |
| WRITE VOLUME | Volume | Write files to a Unity Catalog Volume |
Column-Level Masking
Mask sensitive column values based on user group. Example: show full SSN to HR group, mask to XXX-XX-#### for all others.
Implemented with a masking function: ALTER TABLE table_name ALTER COLUMN ssn SET MASK function_name. The function returns different values based on current_user() or is_member('group').
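A minimal sketch, assuming a main.hr.employees table with an ssn column; the function body and all object names are illustrative, while the DDL shape follows Unity Catalog column masks:

```sql
-- HR members see the full SSN; everyone else sees only the last 4 digits
CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('hr') THEN ssn
  ELSE CONCAT('XXX-XX-', RIGHT(ssn, 4))
END;

ALTER TABLE main.hr.employees ALTER COLUMN ssn SET MASK ssn_mask;
```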
Row-Level Security (RLS)
Filter which rows a user can see based on identity or group membership. Example: sales reps can only see their own region's data.
Implemented via a row filter function: ALTER TABLE SET ROW FILTER function_name ON (region_col). Filter checks current_user() or group membership and applies a WHERE clause.
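A minimal sketch, assuming an orders table with a region column; names and the group check are illustrative, while the DDL shape follows Unity Catalog row filters:

```sql
-- Members of `global-sales` see every row; everyone else sees only US rows
CREATE OR REPLACE FUNCTION region_filter(region STRING)
RETURNS BOOLEAN
RETURN is_member('global-sales') OR region = 'US';

ALTER TABLE prod.sales.orders SET ROW FILTER region_filter ON (region);
```

The filter function is evaluated per row at query time, so no query rewrites are needed on the consumer side.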
Unity Catalog ABAC Policies
Centralized policy-based access control at the Unity Catalog level. ABAC policies define row-level filtering and column masking rules that apply across all queries on protected tables, without modifying query code.
Policies reference user attributes (group membership, user properties) to dynamically control data visibility at query time. More scalable than per-table grants for complex access patterns.
Practice Quiz (10 Questions)
- A job hits java.lang.OutOfMemoryError: Java heap space in executor logs during a large groupBy aggregation. Which action is MOST likely to resolve this?
- An engineer grants a user access to a table (prod.sales.orders) by running GRANT SELECT ON TABLE prod.sales.orders TO user. The user still cannot query the table. What is MOST likely missing?
- You must guarantee that analyst-group can never access data in the prod.finance.payroll table, even if a broader GRANT ALL PRIVILEGES ON SCHEMA prod.finance exists for their group. Which statement achieves this?
- After running VACUUM on a Delta table with the default 7-day retention, what happens to data files written 3 days ago?
Memory Hooks
- Executor OOM: increase spark.executor.memory + more shuffle partitions.
- Driver OOM: you're collecting too much data to the driver → use write() instead of collect(). Never use collect() on large DataFrames.
- Table access needs SELECT plus GRANT USE CATALOG ON CATALOG ... and GRANT USE SCHEMA ON SCHEMA .... All three required for a user to query a table in Unity Catalog.
- Column masks and row filters are functions that check current_user() or is_member().
Flashcards
- Data skew fix: (1) AQE skew join handling (default: spark.sql.adaptive.skewJoin.enabled=true). (2) Salt the skewed key. (3) Repartition before the join. (4) Broadcast join if one side is small.
- Executor OOM fix: increase spark.executor.memory, increase shuffle partitions, filter earlier.
- Driver OOM: collect() or toPandas() pulls too much data. Fix: write to Delta instead. Increase spark.driver.memory as a last resort.
- Three grants for table access: (1) GRANT USE CATALOG ON CATALOG catalog_name, (2) GRANT USE SCHEMA ON SCHEMA catalog.schema_name, (3) GRANT SELECT ON TABLE catalog.schema.table_name. All three must be granted; USE CATALOG and USE SCHEMA are often forgotten.
- REVOKE: removes a grant (the principal no longer has that access). DENY: explicitly blocks access even when a broader GRANT exists; DENY wins over GRANT. Use DENY to create exceptions within broadly granted roles.
- Column masking: same rows, different values (ALTER TABLE ... ALTER COLUMN col SET MASK function). Row-level security: different users see different rows (ALTER TABLE ... SET ROW FILTER function ON (col)). Both use functions referencing current_user() or is_member().
- ABAC vs GRANT-based: GRANT-based means explicit per-table/per-user grants; ABAC scales better for complex, org-wide access patterns.
(1) Open the cluster event log.
(2) Look for ERROR entries at startup time.
(3) Check init script logs (most common cause).
(4) Check library installation logs (version conflict).
(5) Verify IAM role/permissions for cloud resource access.
(6) Check Spark configuration for typos or invalid values.
Sudden spike: one-time event → check that run's Spark UI for skew, spill, or shuffle explosion.
Gradual increase: data growth or query regression.
Click the slow run → open Spark UI → check stages.
Study Advisor
Start with the Spark UI skew pattern (Max >> Median = skew). Learn the 3 required Unity Catalog grants for table access (USE CATALOG + USE SCHEMA + SELECT). Understand job cluster monitoring in Lakeflow Jobs run history.
Study OOM diagnosis (executor vs driver), cluster startup failure steps (event log → init script → library logs), and Unity Catalog privilege types (GRANT, REVOKE, DENY; DENY wins). Learn column masking vs row-level security (what vs which rows).
Master Unity Catalog ABAC policies (centralized, attribute-based, scales for org-wide governance), GRANT hierarchy (permissions cascade down catalog → schema → table), and Spark UI disk spill diagnosis (Spill (disk) column in stage metrics).
High-yield: Skew = Max>>Median in Spark UI; AQE skew handling = default on; 3 grants for table access; DENY > GRANT (DENY wins); Column masking = same rows/different values; Row filter = different rows; External table drop = data stays; ABAC = attribute-based centralized policies.
Skew: Max>>Median → AQE or salt. OOM executor: increase memory+partitions. OOM driver: stop collecting! UC grants: USE CATALOG+USE SCHEMA+SELECT. DENY wins over GRANT. Column mask=different values. Row filter=different rows. ABAC=centralized attribute-based. External table drop=data stays. Run history=compare to baseline.