PySpark/SQL · Joins · Aggregations · Spark Tuning · Gold Layer Objects · Data Quality
Data transformation is the core of the Silver and Gold layers in the Medallion architecture. The DEA exam tests PySpark/SQL transformation skills, Spark performance tuning, and understanding of Gold layer object types for BI and analytics consumers.
Bronze (raw ingestion) → Silver (clean, typed, deduplicated) → Gold (aggregated, business-ready). Each layer adds value. Silver transformations: clean nulls, standardize types, deduplicate. Gold transformations: joins, aggregations, business logic, serving layer objects.
PySpark and Spark SQL run on the same Spark engine and produce identical results. PySpark uses the Python DataFrame API (df.filter(), df.groupBy()); Spark SQL uses SQL syntax (spark.sql("SELECT ...") or the %sql magic). Use whichever is clearer: they interoperate freely.
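A minimal sketch of the interop, assuming a silver.events table with an amount column:

```python
from pyspark.sql import functions as F

# `spark` is the session Databricks notebooks provide automatically.
df = spark.read.table("silver.events")

# PySpark DataFrame API
high_value = df.filter(F.col("amount") > 100)

# Equivalent Spark SQL via a temp view
df.createOrReplaceTempView("events_v")
high_value_sql = spark.sql("SELECT * FROM events_v WHERE amount > 100")
```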
The DEA exam specifically tests four Gold layer object types: (1) Tables (materialized Delta), (2) Views (virtual, no data storage), (3) Materialized Views (auto-refreshed pre-computed results), (4) Streaming Tables (continuously updated from a streaming source via DLT).
Data quality checks ensure reliable downstream consumption. The exam covers: constraint-based checks (ALTER TABLE ADD CONSTRAINT), Delta Live Tables expectations (warn, drop, fail), and validation patterns (null checks, referential integrity, range checks).
| Operation | PySpark | SQL Equivalent |
|---|---|---|
| Add column | df.withColumn("new", expr) | SELECT *, expr AS new |
| Drop column | df.drop("col") | SELECT * EXCEPT (col) |
| Rename column | df.withColumnRenamed("old","new") | SELECT col AS new |
| Filter rows | df.filter(col("age") > 18) | WHERE age > 18 |
| Explode array | df.withColumn("item", explode("items")) | SELECT explode(items) AS item |
| Split string | split(col("name"), " ") | split(name, ' ') |
| Cast type | col("id").cast("integer") | CAST(id AS INT) |
| Handle nulls | df.fillna({"col": 0}) | COALESCE(col, 0) |
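A hedged sketch chaining these operations; the table and column names (bronze.events, age, items, name, ts, raw_payload, amount) are illustrative assumptions:

```python
from pyspark.sql import functions as F

df = spark.read.table("bronze.events")  # assumed source table

cleaned = (
    df.withColumn("age", F.col("age").cast("integer"))       # cast type
      .withColumn("item", F.explode("items"))                # explode array
      .withColumn("name_parts", F.split(F.col("name"), " ")) # split string
      .withColumnRenamed("ts", "event_ts")                   # rename column
      .drop("raw_payload")                                   # drop column
      .filter(F.col("age") > 18)                             # filter rows
      .fillna({"amount": 0})                                 # handle nulls
)
```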
Inner join: returns only rows where the join key exists in BOTH DataFrames. df1.join(df2, "key", "inner"). Most common; use when you only want matched records.
Left join: returns all rows from the left DataFrame, with NULLs for unmatched right-side columns. df1.join(df2, "key", "left"). Use when preserving all left-side records (e.g., all orders, even those without matching customers).
Broadcast join: sends the smaller DataFrame to all executors to avoid a shuffle. df1.join(broadcast(df2), "key"). Use when one DataFrame is small (under spark.sql.autoBroadcastJoinThreshold, default 10MB). Eliminates the expensive shuffle for small dimension tables.
Cross join: Cartesian product, pairing every row from the left with every row from the right. df1.crossJoin(df2). Use only when intentional: it produces rows_left × rows_right rows. Rarely needed and can cause out-of-memory errors. All four are sketched below.
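A minimal sketch of all four joins, assuming silver.orders and silver.customers tables that share a customer_id key:

```python
from pyspark.sql.functions import broadcast

orders = spark.read.table("silver.orders")        # assumed fact table
customers = spark.read.table("silver.customers")  # assumed dimension table

inner = orders.join(customers, "customer_id", "inner")    # matched rows only
left = orders.join(customers, "customer_id", "left")      # keep all orders
bcast = orders.join(broadcast(customers), "customer_id")  # shuffle-free; small dim table
cross = orders.crossJoin(customers)                       # cartesian product; use deliberately
```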
SQL UNION deduplicates; in PySpark, df1.union(df2) does NOT deduplicate by default (it behaves like UNION ALL). Use dropDuplicates() explicitly after union when you need deduplication. Always be explicit about intent.
df.dropDuplicates() removes exact duplicate rows. df.dropDuplicates(["key_col"]) keeps the first occurrence per key. df.distinct() is the same as dropDuplicates() on all columns. Use MERGE INTO for upsert-based deduplication in Delta tables.
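A short sketch of explicit deduplication, assuming df1 and df2 share a schema that includes an event_id column:

```python
combined = df1.union(df2)                        # behaves like SQL UNION ALL
deduped = combined.dropDuplicates()              # now equivalent to SQL UNION
per_key = combined.dropDuplicates(["event_id"])  # first occurrence per event_id
```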
df.groupBy("col").agg(count("*"), avg("amount"), approx_count_distinct("user_id")). SQL: GROUP BY col. Key functions: count, sum, avg, min, max, approx_count_distinct (use instead of countDistinct for large datasets; much faster), summary() for descriptive stats.
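A sketch of a Gold-style aggregation, assuming a silver.sales table with order_date, amount, and user_id columns:

```python
from pyspark.sql import functions as F

daily = (
    spark.read.table("silver.sales")  # assumed source table
         .groupBy("order_date")
         .agg(
             F.count("*").alias("orders"),
             F.sum("amount").alias("revenue"),
             F.avg("amount").alias("avg_order"),
             F.approx_count_distinct("user_id").alias("unique_users"),
         )
)
```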
| Parameter | Default | What It Controls | When to Tune |
|---|---|---|---|
| spark.sql.shuffle.partitions | 200 | Number of partitions after a shuffle (join/groupBy) | Reduce for small data (try 8–16); increase for large data |
| spark.default.parallelism | 2× CPU cores | Default parallelism for RDD operations | Match to cluster size |
| spark.executor.memory | 1g | Memory per executor for task execution | Increase if OOM errors on executors |
| spark.driver.memory | 1g | Memory for the driver process | Increase if collecting large results to driver |
| spark.sql.autoBroadcastJoinThreshold | 10MB | Max table size for auto-broadcast joins | Increase if small dim tables aren't being broadcast; set to -1 to disable |
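A sketch of tuning at session scope; the values are illustrative. Memory and parallelism settings cannot be changed at runtime, so they appear only as comments:

```python
# Illustrative values for a small dataset; `spark` is the active session.
spark.conf.set("spark.sql.shuffle.partitions", "16")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # raise to 50MB

# These must be set in the cluster's Spark config before startup:
#   spark.executor.memory 8g
#   spark.driver.memory 8g
#   spark.default.parallelism 64
```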
Data skew: when one partition has significantly more data than others, one slow task stalls the entire stage. Signs in the Spark UI: most tasks finish in seconds while one takes minutes; max shuffle read >> median shuffle read. Fix: Adaptive Query Execution (AQE) with skew join handling, salting (sketched below), or repartitioning.
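A minimal salting sketch, assuming fact and dimension tables joined on a skewed "key" column; each dimension row is replicated N times so the skewed key's rows spread across N tasks instead of one:

```python
from pyspark.sql import functions as F

N = 8  # number of salt buckets; tune to the degree of skew
facts = spark.read.table("silver.facts").withColumn(
    "salt", (F.rand() * N).cast("int"))                     # random salt per fact row
dims = spark.read.table("silver.dims").withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)])))  # replicate each dim row N times

joined = facts.join(dims, ["key", "salt"]).drop("salt")
```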
Shuffle: data movement across the network between stages (caused by joins, groupBy, orderBy). The most expensive operation in Spark. Minimize shuffles by using broadcast joins for small tables, tuning shuffle.partitions appropriately, and using Delta Lake Z-ORDER to pre-sort data.
Disk spill: occurs when executor memory is insufficient to hold partition data, so Spark spills to disk, massively degrading performance. Signs: a slow stage with large values in the "Spill (Memory)" column of the Spark UI. Fix: increase executor memory, reduce partition size (more partitions), or reduce data earlier with filters.
Databricks has AQE enabled by default. Dynamically re-plans query execution based on runtime statistics. Key features: automatically coalesces small shuffle partitions, handles skew joins by splitting oversized partitions, converts sort-merge joins to broadcast joins when possible at runtime.
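For reference, the relevant AQE flags (already on by default in Databricks):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split oversized skewed partitions
```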
| Object Type | Storage | Refresh | Use Case | Created With |
|---|---|---|---|---|
| Table (Delta) | Materialized on disk | Manual write/append/merge | Full historical data, write-optimized | CREATE TABLE AS SELECT or DLT |
| View | No storage (virtual) | Always current (re-runs query) | Reusable query logic, security filters | CREATE VIEW AS SELECT |
| Materialized View | Materialized on disk | Automatically refreshed by DLT | Pre-computed aggregations for BI speed | Delta Live Tables |
| Streaming Table | Materialized on disk | Continuously updated (streaming) | Real-time dashboards, low-latency BI | Delta Live Tables (streaming) |
Standard view: CREATE VIEW schema.view_name AS SELECT ... persists in the catalog, stores no data, and re-executes the underlying query each time. Temp view: CREATE TEMP VIEW name AS SELECT ... is session-scoped and disappears when the session ends. Global temp view: CREATE GLOBAL TEMP VIEW is available across sessions on the same cluster, accessed as global_temp.view_name.
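A sketch of the three scopes, with assumed schema and view names:

```python
spark.sql("CREATE OR REPLACE VIEW analytics.active_users AS SELECT * FROM silver.users WHERE active = true")
spark.sql("CREATE OR REPLACE TEMP VIEW tmp_users AS SELECT * FROM silver.users")
spark.sql("CREATE OR REPLACE GLOBAL TEMP VIEW g_users AS SELECT * FROM silver.users")

spark.table("global_temp.g_users")  # global temp views resolve via the global_temp schema
```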
Pre-compute and store query results. DLT automatically refreshes when upstream data changes. Faster query performance than regular views (no re-computation at query time). Defined in DLT pipeline as @dlt.table with COMMENT and TBLPROPERTIES. Results stored in Delta format.
Continuously updated from a streaming source. Defined in DLT with @dlt.table using spark.readStream. Append-only by default. Updated incrementally as new data arrives. Ideal for real-time dashboards, low-latency reporting, and continuously updated aggregates.
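A hedged DLT sketch covering both object types; the source tables (silver.orders, silver.events) and function names are assumptions:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Pre-computed daily revenue for BI")  # batch read => materialized view
def daily_revenue():
    return (
        spark.read.table("silver.orders")
             .groupBy("order_date")
             .agg(F.sum("amount").alias("revenue"))
    )

@dlt.table(comment="Continuously updated events")  # streaming read => streaming table
def live_events():
    return spark.readStream.table("silver.events")
```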
CREATE TABLE catalog.schema.table AS SELECT ... creates a Delta table from query results. Common for building Gold layer summary tables. Use CREATE OR REPLACE TABLE to overwrite. Can add partitioning: PARTITIONED BY (date_col).
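A CTAS sketch with assumed names:

```python
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_sales
    PARTITIONED BY (order_date)
    AS SELECT order_date, SUM(amount) AS revenue
       FROM silver.orders
       GROUP BY order_date
""")
```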
ALTER TABLE t ADD CONSTRAINT constraint_name CHECK (expression). Enforced on every write: rows violating the constraint are rejected with an error. Example: ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount > 0). Use DESCRIBE EXTENDED to view constraints.
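A sketch with assumed table and column names; once added, any write that violates the constraint fails with a constraint-violation error:

```python
spark.sql("ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount > 0)")
spark.sql("DESCRIBE EXTENDED orders").show(truncate=False)  # constraints appear under Table Properties
```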
Three modes: @dlt.expect("rule", "condition") tracks violations but keeps rows (warn). @dlt.expect_or_drop("rule", "condition") drops violating rows. @dlt.expect_or_fail("rule", "condition") halts the pipeline on violation.
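A sketch stacking all three modes on one table; the rule names, conditions, and source table are assumptions:

```python
import dlt

@dlt.table
@dlt.expect("has_ts", "ts IS NOT NULL")                  # warn: record violations, keep rows
@dlt.expect_or_drop("valid_amount", "amount > 0")        # drop violating rows
@dlt.expect_or_fail("valid_id", "event_id IS NOT NULL")  # fail the update on violation
def silver_events():
    return spark.read.table("bronze.events")
```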
Standard Silver transformation: (1) Read Bronze: spark.read.table("bronze.events"). (2) Filter nulls: df.filter(col("user_id").isNotNull()). (3) Cast types: col("ts").cast("timestamp"). (4) Deduplicate: df.dropDuplicates(["event_id"]). (5) Write: df.write.format("delta").mode("append").saveAsTable("silver.events").
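The same five steps as one runnable chain (table names assumed):

```python
from pyspark.sql import functions as F

(
    spark.read.table("bronze.events")                      # (1) read Bronze
         .filter(F.col("user_id").isNotNull())             # (2) filter nulls
         .withColumn("ts", F.col("ts").cast("timestamp"))  # (3) cast types
         .dropDuplicates(["event_id"])                     # (4) deduplicate
         .write.format("delta").mode("append")             # (5) append to Silver
         .saveAsTable("silver.events")
)
```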
spark.sql.shuffle.partitions defaults to 200: fine for large data, terrible for small. Small dataset? Set to 8–16. Large dataset? May need 500+. Tune based on data size.
broadcast(df) hint, or auto-broadcast when the table is under autoBroadcastJoinThreshold (10MB default). Sends the small table to every executor, eliminating the expensive shuffle for dimension table joins.
expect = warn/track (keep bad rows). expect_or_drop = silently drop bad rows. expect_or_fail = halt pipeline on violation. Choose based on how critical data quality is.
(1) Read Bronze, (2) filter nulls and cast types, (3) dropDuplicates on key, (4) write.mode("append"). This is the canonical Silver transformation pattern.
What are the 5 key Spark tuning parameters for the DEA exam?
spark.sql.shuffle.partitions (200 default; reduce for small data). spark.default.parallelism (2× CPU cores). spark.executor.memory (executor heap). spark.driver.memory (driver heap). spark.sql.autoBroadcastJoinThreshold (10MB; max size for auto-broadcast join).
Data skew: how to identify and fix it
Identify: Spark UI shows max task time >> median task time; max shuffle read >> median. Fix: (1) Enable AQE skew join handling (default in Databricks). (2) Salt the skewed key. (3) Repartition before join. (4) Use broadcast join if one side is small.
Broadcast join: when and how to use it
Use when one table is small (under autoBroadcastJoinThreshold, default 10MB). Databricks auto-broadcasts if under the threshold. Force with df1.join(broadcast(df2), 'key'). Eliminates the shuffle by sending the small table to every executor. Best for dimension tables in star schema joins.
4 Gold layer object types and their characteristics
Table (Delta): stored, manually written. View: virtual, no storage, re-executes on query. Materialized View: stored pre-computed results, auto-refreshed by DLT. Streaming Table: continuously updated from stream via DLT. Views = fastest to create; Materialized = fastest to query.
3 DLT expectation modes and when to use each
@dlt.expect: track violations, keep all rows (warn mode, for monitoring). @dlt.expect_or_drop: silently drop violating rows, pipeline continues (for partial bad data). @dlt.expect_or_fail: stop the pipeline on any violation (for critical data quality).
What causes disk spilling in Spark and how to fix it?
Spilling: partition data exceeds executor memory → written to disk → major performance hit. Identify: Spark UI shows 'Spill (Memory)' column with large values. Fix: (1) Increase spark.executor.memory. (2) Increase shuffle partitions (smaller partitions). (3) Filter/reduce data earlier in the pipeline.
Union vs Union All in PySpark: behavior difference
SQL UNION deduplicates. SQL UNION ALL keeps duplicates. PySpark: df1.union(df2) does NOT deduplicate by default (behaves like UNION ALL). To deduplicate after union: df1.union(df2).dropDuplicates(). Always be explicit about deduplication intent.
Standard temp view vs global temp view vs permanent view
Temp view: CREATE TEMP VIEW is session-scoped, lost when the session ends. Global temp view: CREATE GLOBAL TEMP VIEW is available across sessions on the same cluster, accessed as global_temp.view_name. Permanent view: CREATE VIEW schema.v persists in Unity Catalog, available to all users with permissions.
Start with the Silver transformation pattern (filter nulls → cast types → deduplicate → append). Understand join types: inner (match both), left (keep all left), broadcast (avoid shuffle with small table). Know the 4 Gold layer object types.
Study Spark tuning parameters (shuffle.partitions, autoBroadcastJoinThreshold). Understand data skew identification (Spark UI: max task >> median). Learn DLT expectations (expect vs expect_or_drop vs expect_or_fail).
Deep dive on AQE (adaptive query execution, skew handling, dynamic broadcast), disk spilling diagnosis and fixes, materialized views vs streaming tables in DLT, constraint-based data quality, and CTAS patterns for Gold layer tables.
High-yield: shuffle.partitions=200 default (tune down for small data); broadcast join=no shuffle for small table; data skew=AQE or salt; DLT expect_or_drop=drop bad rows; Streaming Table=live via DLT; Materialized View=pre-computed/auto-refresh; Silver=filter/cast/deduplicate/append.
Joins: inner=match both, left=keep left, broadcast=no shuffle, cross=cartesian. Skew: max>>median task time. Tuning: shuffle.partitions, autoBroadcastJoinThreshold. Gold: Table/View/Materialized View/Streaming Table. DLT quality: expect(warn), expect_or_drop, expect_or_fail. Silver: filter→cast→dedup→append.