PySpark/SQL · Joins · Aggregations · Spark Tuning · Gold Layer Objects · Data Quality
Data transformation is the core of the Silver and Gold layers in the Medallion architecture. The DEA exam tests PySpark/SQL transformation skills, Spark performance tuning, and understanding of Gold layer object types for BI and analytics consumers.
Bronze (raw ingestion) → Silver (clean, typed, deduplicated) → Gold (aggregated, business-ready). Each layer adds value. Silver transformations: clean nulls, standardize types, deduplicate. Gold transformations: joins, aggregations, business logic, serving layer objects.
PySpark and Spark SQL run on the same Spark engine and produce identical results. PySpark uses the Python DataFrame API (df.filter(), df.groupBy()); Spark SQL uses SQL syntax (spark.sql("SELECT ...") or the %sql magic). Use whichever is clearer: they interoperate freely.
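A minimal sketch of the interop, assuming a silver.events table with an amount column:

```python
from pyspark.sql import functions as F

# `spark` is the session Databricks notebooks provide automatically.
df = spark.read.table("silver.events")

# PySpark DataFrame API
high_value = df.filter(F.col("amount") > 100)

# Equivalent Spark SQL via a temp view
df.createOrReplaceTempView("events_v")
high_value_sql = spark.sql("SELECT * FROM events_v WHERE amount > 100")
```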
The DEA exam specifically tests four Gold layer object types: (1) Tables (materialized Delta), (2) Views (virtual, no data storage), (3) Materialized Views (auto-refreshed pre-computed results), (4) Streaming Tables (continuously updated from a streaming source via DLT).
Data quality checks ensure reliable downstream consumption. The exam covers: constraint-based checks (ALTER TABLE ADD CONSTRAINT), Delta Live Tables expectations (warn, drop, fail), and validation patterns (null checks, referential integrity, range checks).
| Operation | PySpark | SQL Equivalent |
|---|---|---|
| Add column | df.withColumn("new", expr) | SELECT *, expr AS new |
| Drop column | df.drop("col") | SELECT * EXCEPT (col) |
| Rename column | df.withColumnRenamed("old","new") | SELECT col AS new |
| Filter rows | df.filter(col("age") > 18) | WHERE age > 18 |
| Explode array | df.withColumn("item", explode("items")) | SELECT explode(items) AS item |
| Split string | split(col("name"), " ") | split(name, ' ') |
| Cast type | col("id").cast("integer") | CAST(id AS INT) |
| Handle nulls | df.fillna({"col": 0}) | COALESCE(col, 0) |
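A hedged sketch chaining these operations; the table and column names (bronze.events, age, items, name, ts, raw_payload, amount) are illustrative assumptions:

```python
from pyspark.sql import functions as F

df = spark.read.table("bronze.events")  # assumed source table

cleaned = (
    df.withColumn("age", F.col("age").cast("integer"))       # cast type
      .withColumn("item", F.explode("items"))                # explode array
      .withColumn("name_parts", F.split(F.col("name"), " ")) # split string
      .withColumnRenamed("ts", "event_ts")                   # rename column
      .drop("raw_payload")                                   # drop column
      .filter(F.col("age") > 18)                             # filter rows
      .fillna({"amount": 0})                                 # handle nulls
)
```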
Inner join: returns only rows where the join key exists in BOTH DataFrames. df1.join(df2, "key", "inner"). Most common; use when you only want matched records.
Left join: returns all rows from the left DataFrame, with NULLs for unmatched right-side columns. df1.join(df2, "key", "left"). Use when preserving all left-side records (e.g., all orders, even those without matching customers).
Broadcast join: sends the smaller DataFrame to all executors to avoid a shuffle. df1.join(broadcast(df2), "key"). Use when one DataFrame is small (under spark.sql.autoBroadcastJoinThreshold, default 10MB). Eliminates the expensive shuffle for small dimension tables.
Cross join: Cartesian product, pairing every row from the left with every row from the right. df1.crossJoin(df2). Use only when intentional: it produces rows_left × rows_right rows. Rarely needed and can cause out-of-memory errors. All four are sketched below.
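A minimal sketch of all four joins, assuming silver.orders and silver.customers tables that share a customer_id key:

```python
from pyspark.sql.functions import broadcast

orders = spark.read.table("silver.orders")        # assumed fact table
customers = spark.read.table("silver.customers")  # assumed dimension table

inner = orders.join(customers, "customer_id", "inner")    # matched rows only
left = orders.join(customers, "customer_id", "left")      # keep all orders
bcast = orders.join(broadcast(customers), "customer_id")  # shuffle-free; small dim table
cross = orders.crossJoin(customers)                       # cartesian product; use deliberately
```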
SQL UNION deduplicates; in PySpark, df1.union(df2) does NOT deduplicate by default (it behaves like UNION ALL). Use dropDuplicates() explicitly after union when you need deduplication. Always be explicit about intent.
df.dropDuplicates() removes exact duplicate rows. df.dropDuplicates(["key_col"]) keeps the first occurrence per key. df.distinct() is the same as dropDuplicates() on all columns. Use MERGE INTO for upsert-based deduplication in Delta tables.
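A short sketch of explicit deduplication, assuming df1 and df2 share a schema that includes an event_id column:

```python
combined = df1.union(df2)                        # behaves like SQL UNION ALL
deduped = combined.dropDuplicates()              # now equivalent to SQL UNION
per_key = combined.dropDuplicates(["event_id"])  # first occurrence per event_id
```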
df.groupBy("col").agg(count("*"), avg("amount"), approx_count_distinct("user_id")). SQL: GROUP BY col. Key functions: count, sum, avg, min, max, approx_count_distinct (use instead of countDistinct for large datasets; much faster), summary() for descriptive stats.
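A sketch of a Gold-style aggregation, assuming a silver.sales table with order_date, amount, and user_id columns:

```python
from pyspark.sql import functions as F

daily = (
    spark.read.table("silver.sales")  # assumed source table
         .groupBy("order_date")
         .agg(
             F.count("*").alias("orders"),
             F.sum("amount").alias("revenue"),
             F.avg("amount").alias("avg_order"),
             F.approx_count_distinct("user_id").alias("unique_users"),
         )
)
```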
| Parameter | Default | What It Controls | When to Tune |
|---|---|---|---|
| spark.sql.shuffle.partitions | 200 | Number of partitions after a shuffle (join/groupBy) | Reduce for small data (try 8–16); increase for large data |
| spark.default.parallelism | 2× CPU cores | Default parallelism for RDD operations | Match to cluster size |
| spark.executor.memory | 1g | Memory per executor for task execution | Increase if OOM errors on executors |
| spark.driver.memory | 1g | Memory for the driver process | Increase if collecting large results to driver |
| spark.sql.autoBroadcastJoinThreshold | 10MB | Max table size for auto-broadcast joins | Increase if small dim tables aren't being broadcast; set to -1 to disable |
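A sketch of tuning at session scope; the values are illustrative. Memory and parallelism settings cannot be changed at runtime, so they appear only as comments:

```python
# Illustrative values for a small dataset; `spark` is the active session.
spark.conf.set("spark.sql.shuffle.partitions", "16")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # raise to 50MB

# These must be set in the cluster's Spark config before startup:
#   spark.executor.memory 8g
#   spark.driver.memory 8g
#   spark.default.parallelism 64
```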
Data skew: when one partition has significantly more data than others, one slow task stalls the entire stage. Signs in the Spark UI: most tasks finish in seconds while one takes minutes; max shuffle read >> median shuffle read. Fix: Adaptive Query Execution (AQE) with skew join handling, salting (sketched below), or repartitioning.
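A minimal salting sketch, assuming fact and dimension tables joined on a skewed "key" column; each dimension row is replicated N times so the skewed key's rows spread across N tasks instead of one:

```python
from pyspark.sql import functions as F

N = 8  # number of salt buckets; tune to the degree of skew
facts = spark.read.table("silver.facts").withColumn(
    "salt", (F.rand() * N).cast("int"))                     # random salt per fact row
dims = spark.read.table("silver.dims").withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)])))  # replicate each dim row N times

joined = facts.join(dims, ["key", "salt"]).drop("salt")
```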
Shuffle: data movement across the network between stages (caused by joins, groupBy, orderBy). The most expensive operation in Spark. Minimize shuffles by using broadcast joins for small tables, tuning shuffle.partitions appropriately, and using Delta Lake Z-ORDER to pre-sort data.
Disk spill: occurs when executor memory is insufficient to hold partition data, so Spark spills to disk, massively degrading performance. Signs: a slow stage with large values in the "Spill (Memory)" column of the Spark UI. Fix: increase executor memory, reduce partition size (more partitions), or reduce data earlier with filters.
Databricks has AQE enabled by default. Dynamically re-plans query execution based on runtime statistics. Key features: automatically coalesces small shuffle partitions, handles skew joins by splitting oversized partitions, converts sort-merge joins to broadcast joins when possible at runtime.
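For reference, the relevant AQE flags (already on by default in Databricks):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split oversized skewed partitions
```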
| Object Type | Storage | Refresh | Use Case | Created With |
|---|---|---|---|---|
| Table (Delta) | Materialized on disk | Manual write/append/merge | Full historical data, write-optimized | CREATE TABLE AS SELECT or DLT |
| View | No storage (virtual) | Always current (re-runs query) | Reusable query logic, security filters | CREATE VIEW AS SELECT |
| Materialized View | Materialized on disk | Automatically refreshed by DLT | Pre-computed aggregations for BI speed | Delta Live Tables |
| Streaming Table | Materialized on disk | Continuously updated (streaming) | Real-time dashboards, low-latency BI | Delta Live Tables (streaming) |
Standard view: CREATE VIEW schema.view_name AS SELECT ... persists in the catalog, stores no data, and re-executes the underlying query each time. Temp view: CREATE TEMP VIEW name AS SELECT ... is session-scoped and disappears when the session ends. Global temp view: CREATE GLOBAL TEMP VIEW is available across sessions on the same cluster, accessed as global_temp.view_name.
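A sketch of the three scopes, with assumed schema and view names:

```python
spark.sql("CREATE OR REPLACE VIEW analytics.active_users AS SELECT * FROM silver.users WHERE active = true")
spark.sql("CREATE OR REPLACE TEMP VIEW tmp_users AS SELECT * FROM silver.users")
spark.sql("CREATE OR REPLACE GLOBAL TEMP VIEW g_users AS SELECT * FROM silver.users")

spark.table("global_temp.g_users")  # global temp views resolve via the global_temp schema
```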
Pre-compute and store query results. DLT automatically refreshes when upstream data changes. Faster query performance than regular views (no re-computation at query time). Defined in DLT pipeline as @dlt.table with COMMENT and TBLPROPERTIES. Results stored in Delta format.
Continuously updated from a streaming source. Defined in DLT with @dlt.table using spark.readStream. Append-only by default. Updated incrementally as new data arrives. Ideal for real-time dashboards, low-latency reporting, and continuously updated aggregates.
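A hedged DLT sketch covering both object types; the source tables (silver.orders, silver.events) and function names are assumptions:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Pre-computed daily revenue for BI")  # batch read => materialized view
def daily_revenue():
    return (
        spark.read.table("silver.orders")
             .groupBy("order_date")
             .agg(F.sum("amount").alias("revenue"))
    )

@dlt.table(comment="Continuously updated events")  # streaming read => streaming table
def live_events():
    return spark.readStream.table("silver.events")
```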
CREATE TABLE catalog.schema.table AS SELECT ... creates a Delta table from query results. Common for building Gold layer summary tables. Use CREATE OR REPLACE TABLE to overwrite. Can add partitioning: PARTITIONED BY (date_col).
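A CTAS sketch with assumed names:

```python
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_sales
    PARTITIONED BY (order_date)
    AS SELECT order_date, SUM(amount) AS revenue
       FROM silver.orders
       GROUP BY order_date
""")
```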
ALTER TABLE t ADD CONSTRAINT constraint_name CHECK (expression). Enforced on every write: rows violating the constraint are rejected with an error. Example: ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount > 0). Use DESCRIBE EXTENDED to view constraints.
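A sketch with assumed table and column names; once added, any write that violates the constraint fails with a constraint-violation error:

```python
spark.sql("ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount > 0)")
spark.sql("DESCRIBE EXTENDED orders").show(truncate=False)  # constraints appear under Table Properties
```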
Three modes: @dlt.expect("rule", "condition") tracks violations but keeps rows (warn). @dlt.expect_or_drop("rule", "condition") drops violating rows. @dlt.expect_or_fail("rule", "condition") halts the pipeline on violation.
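A sketch stacking all three modes on one table; the rule names, conditions, and source table are assumptions:

```python
import dlt

@dlt.table
@dlt.expect("has_ts", "ts IS NOT NULL")                  # warn: record violations, keep rows
@dlt.expect_or_drop("valid_amount", "amount > 0")        # drop violating rows
@dlt.expect_or_fail("valid_id", "event_id IS NOT NULL")  # fail the update on violation
def silver_events():
    return spark.read.table("bronze.events")
```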
Standard Silver transformation: (1) Read Bronze: spark.read.table("bronze.events"). (2) Filter nulls: df.filter(col("user_id").isNotNull()). (3) Cast types: col("ts").cast("timestamp"). (4) Deduplicate: df.dropDuplicates(["event_id"]). (5) Write: df.write.format("delta").mode("append").saveAsTable("silver.events").
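The same five steps as one runnable chain (table names assumed):

```python
from pyspark.sql import functions as F

(
    spark.read.table("bronze.events")                      # (1) read Bronze
         .filter(F.col("user_id").isNotNull())             # (2) filter nulls
         .withColumn("ts", F.col("ts").cast("timestamp"))  # (3) cast types
         .dropDuplicates(["event_id"])                     # (4) deduplicate
         .write.format("delta").mode("append")             # (5) append to Silver
         .saveAsTable("silver.events")
)
```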
spark.sql.shuffle.partitions defaults to 200: fine for large data, terrible for small. Small dataset? Set to 8–16. Large dataset? May need 500+. Tune based on data size.
broadcast(df) hint, or auto-broadcast when the table is under autoBroadcastJoinThreshold (10MB default). Sends the small table to every executor, eliminating the expensive shuffle for dimension table joins.
expect = warn/track (keep bad rows). expect_or_drop = silently drop bad rows. expect_or_fail = halt pipeline on violation. Choose based on how critical data quality is.
(1) Read Bronze, (2) filter nulls and cast types, (3) dropDuplicates on key, (4) write.mode("append"). This is the canonical Silver transformation pattern.
What are the 5 key Spark tuning parameters for the DEA exam?
spark.sql.shuffle.partitions (200 default; reduce for small data). spark.default.parallelism (2× CPU cores). spark.executor.memory (executor heap). spark.driver.memory (driver heap). spark.sql.autoBroadcastJoinThreshold (10MB; max size for auto-broadcast join).
Data skew: how to identify and fix it
Identify: Spark UI shows max task time >> median task time; max shuffle read >> median. Fix: (1) Enable AQE skew join handling (default in Databricks). (2) Salt the skewed key. (3) Repartition before join. (4) Use broadcast join if one side is small.
Broadcast join: when and how to use it
Use when one table is small (under autoBroadcastJoinThreshold, default 10MB). Databricks auto-broadcasts if under the threshold. Force with df1.join(broadcast(df2), 'key'). Eliminates the shuffle by sending the small table to every executor. Best for dimension tables in star schema joins.
4 Gold layer object types and their characteristics
Table (Delta): stored, manually written. View: virtual, no storage, re-executes on query. Materialized View: stored pre-computed results, auto-refreshed by DLT. Streaming Table: continuously updated from stream via DLT. Views = fastest to create; Materialized = fastest to query.
3 DLT expectation modes and when to use each
@dlt.expect: track violations, keep all rows (warn mode, for monitoring). @dlt.expect_or_drop: silently drop violating rows, pipeline continues (for partial bad data). @dlt.expect_or_fail: stop the pipeline on any violation (for critical data quality).
What causes disk spilling in Spark and how to fix it?
Spilling: partition data exceeds executor memory → written to disk → major performance hit. Identify: Spark UI shows 'Spill (Memory)' column with large values. Fix: (1) Increase spark.executor.memory. (2) Increase shuffle partitions (smaller partitions). (3) Filter/reduce data earlier in the pipeline.
Union vs Union All in PySpark: behavior difference
SQL UNION deduplicates. SQL UNION ALL keeps duplicates. PySpark: df1.union(df2) does NOT deduplicate by default (behaves like UNION ALL). To deduplicate after union: df1.union(df2).dropDuplicates(). Always be explicit about deduplication intent.
Standard temp view vs global temp view vs permanent view
Temp view: CREATE TEMP VIEW is session-scoped, lost when the session ends. Global temp view: CREATE GLOBAL TEMP VIEW is available across sessions on the same cluster, accessed as global_temp.view_name. Permanent view: CREATE VIEW schema.v persists in Unity Catalog, available to all users with permissions.
Start with the Silver transformation pattern (filter nulls → cast types → deduplicate → append). Understand join types: inner (match both), left (keep all left), broadcast (avoid shuffle with small table). Know the 4 Gold layer object types.
Study Spark tuning parameters (shuffle.partitions, autoBroadcastJoinThreshold). Understand data skew identification (Spark UI: max task >> median). Learn DLT expectations (expect vs expect_or_drop vs expect_or_fail).
Deep dive on AQE (adaptive query execution, skew handling, dynamic broadcast), disk spilling diagnosis and fixes, materialized views vs streaming tables in DLT, constraint-based data quality, and CTAS patterns for Gold layer tables.
High-yield: shuffle.partitions=200 default (tune down for small data); broadcast join=no shuffle for small table; data skew=AQE or salt; DLT expect_or_drop=drop bad rows; Streaming Table=live via DLT; Materialized View=pre-computed/auto-refresh; Silver=filter/cast/deduplicate/append.
Joins: inner=match both, left=keep left, broadcast=no shuffle, cross=cartesian. Skew: max>>median task time. Tuning: shuffle.partitions, autoBroadcastJoinThreshold. Gold: Table/View/Materialized View/Streaming Table. DLT quality: expect(warn), expect_or_drop, expect_or_fail. Silver: filter→cast→dedup→append.