Free DB-DEA ELT with Spark SQL and Python Practice Test 2026 — Databricks Data Engineer Associate Questions
Last updated: May 2026 · Aligned with the current Databricks DB-DEA exam · 22% of the exam
This free DB-DEA ELT with Spark SQL and Python practice test covers building ELT pipelines with Spark SQL and PySpark — extracting, transforming, joining, and writing data at scale. Each question includes a detailed explanation with real-world Databricks lakehouse context — perfect for DB-DEA exam prep.
Key Topics in DB-DEA ELT with Spark SQL and Python
- Spark SQL
- PySpark
- Joins & Aggregations
- UDFs
- Higher-Order Functions
- SQL Optimization
10 Free DB-DEA ELT with Spark SQL and Python Practice Questions with Answers
Each question below includes 4 answer options, the correct answer, and a detailed explanation. These are real questions from the FlashGenius DB-DEA question bank for the ELT with Spark SQL and Python domain (22% of the exam).
Sample Question 1 — ELT with Spark SQL and Python
A retail company ingests daily sales transactions from multiple stores into a large Delta table `sales_bronze` stored in cloud object storage. The table is partitioned by `event_date` (the date the sale occurred). A downstream ELT job creates a curated table `sales_silver` by deduplicating records based on a business key `sale_id`, keeping only the latest record per `sale_id` using a window function over `event_time`.
Recently, data quality checks show that some late-arriving records for previous days are missing from `sales_silver`. The ingestion process appends new files to `sales_bronze` every day, including late-arriving records with older `event_date` values. The current `sales_silver` job runs once per day and reads only the latest partition from `sales_bronze` using a filter on `event_date` equal to yesterday.
You need to adjust the ELT logic so that late-arriving records are correctly reflected in `sales_silver` while still leveraging partition pruning on `event_date`.
Which approach is the most appropriate change?
- A. Change the `sales_silver` job to read all historical partitions from `sales_bronze` every day, recompute the full deduplication, and overwrite the entire `sales_silver` table.
- B. Modify the `sales_silver` job to read only the latest partition from `sales_bronze`, but partition `sales_silver` by ingestion date instead of `event_date` so late-arriving records are not missed.
- C. Adjust the `sales_silver` job to read a configurable sliding window of recent `event_date` partitions from `sales_bronze` (for example, the last N days), then re-run the deduplication for just that window and overwrite only the affected partitions in `sales_silver`. (Correct answer)
- D. Keep reading only yesterday’s `event_date` partition from `sales_bronze`, but add an ORDER BY `event_time` clause without LIMIT in the final write to ensure the latest late-arriving records are included in `sales_silver`.
Correct answer: C
Explanation: Option C expands the read scope to a recent window of `event_date` partitions where late-arriving records are likely to appear, then re-applies the window-based deduplication and overwrites only those affected partitions in `sales_silver`. This correctly incorporates late-arriving data while still using partition pruning and avoiding an expensive full-table recomputation.
Option A would include late-arriving data but is unnecessarily expensive for large tables because it recomputes and overwrites the entire dataset every day. Option B changes the target partitioning but still reads only yesterday’s `event_date` from `sales_bronze`, so late-arriving records for older dates are never processed. Option D adds a costly global sort but does not change which partitions are read, so it does not fix the missing late-arriving records.
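For illustration, here is a minimal PySpark sketch of this sliding-window pattern, assuming `sales_silver` is a Delta table; the lookback length and the use of Delta's `replaceWhere` option for partition-scoped overwrites are assumptions beyond the question text:

```python
import datetime

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

LOOKBACK_DAYS = 7  # configurable window in which late-arriving records are expected
cutoff = (datetime.date.today() - datetime.timedelta(days=LOOKBACK_DAYS)).isoformat()

# Read only the recent event_date partitions; the filter still allows partition pruning.
recent = spark.table("sales_bronze").filter(F.col("event_date") >= F.lit(cutoff))

# Keep the latest record per sale_id within the window.
w = Window.partitionBy("sale_id").orderBy(F.col("event_time").desc())
deduped = (
    recent.withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# Overwrite only the affected event_date partitions in sales_silver.
(
    deduped.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"event_date >= '{cutoff}'")
    .saveAsTable("sales_silver")
)
```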
Sample Question 2 — ELT with Spark SQL and Python
A team is building a nightly ELT pipeline in PySpark to aggregate sales data into a gold table. The input silver table sales_silver is partitioned by sale_date and contains about 500 GB of data per day. The final gold table sales_daily_agg will be queried mostly by sale_date and region.
They currently write the result with:
```python
result_df.write.mode("overwrite").format("parquet").save("s3://warehouse/gold/sales_daily_agg/")
```
The job is slow, and the team notices that the write stage has thousands of small output files for each run. A new engineer suggests adding:
```python
result_df = result_df.repartition(200)
```
before the write, arguing that repartitioning will always improve performance by reducing the number of small files.
Given this scenario, how should the lead data engineer respond?
- A. Agree, because repartition always reduces the number of small files and avoids shuffles when writing large DataFrames.
- B. Disagree, because repartition introduces a shuffle; its benefit depends on the data size, partition count, and downstream access patterns. (Correct answer)
- C. Agree, but only if coalesce is used instead of repartition, since coalesce always avoids shuffles and improves performance.
- D. Disagree, because the only way to control small files is to change the storage format; partitioning and repartitioning do not affect file sizes.
Correct answer: B
Explanation: repartition can change the number of partitions and therefore influence the number of output files, but it does so via a shuffle, which is an expensive wide transformation. Whether it helps depends on the current partitioning, data size, and how the table will be queried. It is not inherently always beneficial and must be used deliberately with awareness of shuffle cost and partitioning strategy.
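For illustration only (not an endorsement of either choice for this scenario), here is a sketch contrasting `repartition` and `coalesce` before a write; the input table name is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
result_df = spark.table("sales_daily_agg_staging")  # placeholder for the aggregated result

# repartition(200) triggers a full shuffle and yields roughly 200 output files;
# worthwhile only if the shuffle cost is justified by downstream access patterns.
shuffled = result_df.repartition(200)

# coalesce(50) merges existing partitions without a shuffle, but can leave
# skewed partition sizes and reduces write parallelism in the final stage.
merged = result_df.coalesce(50)

shuffled.write.mode("overwrite").format("parquet").save("s3://warehouse/gold/sales_daily_agg/")
```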
Sample Question 3 — ELT with Spark SQL and Python
A data engineer is building a silver-level ELT job in Spark SQL to combine customer profile data with web session data. The tables are:
- customers (customer_id is unique)
- sessions (customer_id is NOT unique; a customer can have many sessions)
The engineer wants a dataset with one row per customer that includes the total number of sessions. They write:
```sql
CREATE OR REPLACE TABLE customer_sessions AS
SELECT c.*, s.*
FROM customers c
JOIN sessions s
  ON c.customer_id = s.customer_id;
```
They are surprised to see multiple rows per customer in the result, having assumed that the inner join would automatically deduplicate customers.
What is the best way to correct this ELT logic while still using Spark SQL?
- A. Replace the inner join with a left join so that each customer appears at most once.
- B. Aggregate sessions first, then join: SELECT c.*, sess.total_sessions FROM customers c LEFT JOIN (SELECT customer_id, COUNT(*) AS total_sessions FROM sessions GROUP BY customer_id) sess ON c.customer_id = sess.customer_id; (Correct answer)
- C. Use a broadcast join hint on customers to force Spark to deduplicate customer_id before joining.
- D. Use DISTINCT on the join result: CREATE OR REPLACE TABLE customer_sessions AS SELECT DISTINCT c.*, s.* FROM customers c JOIN sessions s ON c.customer_id = s.customer_id;
Correct answer: B
Explanation: Joins preserve the multiplicity of matching rows from both sides; they do not deduplicate. To get one row per customer with a session count, the sessions table must first be aggregated by customer_id to a single row per customer, then joined to customers. The LEFT JOIN ensures all customers are retained even if they have zero sessions.
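The same aggregation-before-join logic from option B, formatted for readability and run here via spark.sql; the COALESCE is an optional refinement so customers with no sessions show 0 rather than NULL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE OR REPLACE TABLE customer_sessions AS
    SELECT
        c.*,
        COALESCE(sess.total_sessions, 0) AS total_sessions
    FROM customers c
    LEFT JOIN (
        SELECT customer_id, COUNT(*) AS total_sessions
        FROM sessions
        GROUP BY customer_id
    ) sess
    ON c.customer_id = sess.customer_id
""")
```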
Sample Question 4 — ELT with Spark SQL and Python
A retail company maintains a bronze table orders_bronze that receives append-only order events from multiple source systems throughout the day. Each record has:
- order_id (business key, may be retried by source)
- event_time (ingestion time in UTC)
- status (e.g., CREATED, SHIPPED, CANCELLED)
The silver table orders_silver is intended to contain the latest status per order_id and be idempotent: rerunning the daily ELT job for the same date must not create duplicates or regress statuses.
Currently, the engineer uses a full refresh pattern:
1. Read all data from orders_bronze.
2. For each order_id, select the row with the max(event_time).
3. Overwrite orders_silver with the result.
This works but is becoming too slow as data grows. The engineer proposes switching to an incremental pattern that simply appends the new day's bronze records to orders_silver without any merge logic, assuming that downstream queries can just take the latest event_time per order_id.
Given the requirement for an idempotent silver table with one latest row per order_id, how should the lead data engineer respond?
- A. Reject the proposal, because blindly appending daily records to orders_silver will break idempotency and create multiple rows per order_id unless an upsert/merge pattern is used. (Correct answer)
- B. Accept the proposal, because append-only loads are always idempotent as long as order_id is present and downstream queries can filter by max(event_time).
- C. Accept the proposal, but require repartitioning orders_silver by order_id to ensure that appends do not create duplicates.
- D. Reject the proposal, because incremental patterns cannot be used when there is a business key like order_id; full refresh is the only correct approach.
Correct answer: A
Explanation: An idempotent silver table that exposes exactly one latest row per order_id must apply upsert/merge semantics: update existing rows when a newer event arrives and insert new keys. Simply appending new events will create multiple rows per order_id and make reruns non-idempotent, pushing complexity and correctness risks to every downstream consumer.
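A minimal sketch of the upsert pattern the explanation describes, assuming orders_silver is a Delta table; the incoming-batch filter and the condition that prevents status regression are illustrative assumptions:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# New bronze events for the load window (placeholder filter for illustration).
incoming = spark.table("orders_bronze").filter("event_time >= current_date()")

# Collapse source retries: keep the latest event per order_id within the batch.
w = Window.partitionBy("order_id").orderBy(F.col("event_time").desc())
latest = incoming.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

# Upsert into the silver table; rerunning the same batch is idempotent.
silver = DeltaTable.forName(spark, "orders_silver")
(
    silver.alias("t")
    .merge(latest.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll(condition="s.event_time >= t.event_time")  # never regress status
    .whenNotMatchedInsertAll()
    .execute()
)
```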
Sample Question 5 — ELT with Spark SQL and Python
A data engineer is building an ELT pipeline in PySpark to standardize email addresses in a large customer table stored as Parquet. The requirement is to trim whitespace and convert all emails to lowercase. The engineer writes the following code:
```python
from pyspark.sql.functions import udf

def normalize_email(email):
    return email.strip().lower() if email is not None else None

normalize_email_udf = udf(normalize_email)
result_df = customers_df.withColumn("email_norm", normalize_email_udf(customers_df.email))
```
The cluster is experiencing performance issues during this step. Given that the transformation is simple string manipulation, what is the best change to improve performance while preserving functionality?
- A. Rewrite the transformation using built-in Spark SQL functions instead of a Python UDF. (Correct answer)
- B. Increase the number of shuffle partitions before applying the UDF to parallelize the computation.
- C. Cache customers_df before applying the UDF so the computation runs faster.
- D. Convert customers_df to an RDD, apply a Python map function, then convert back to a DataFrame.
Correct answer: A
Explanation: Simple string operations like trimming and lowercasing should be implemented with built-in Spark SQL/DataFrame functions (e.g., trim(), lower()) so Spark can optimize execution with code generation and avoid Python serialization overhead. Increasing shuffle partitions, caching, or converting to RDDs does not address the fundamental performance cost of the Python UDF and can make performance worse.
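A sketch of the built-in equivalent, which stays inside Spark's optimized execution path; reading `customers` by table name is a placeholder for however customers_df is created in the scenario:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
customers_df = spark.table("customers")  # placeholder for the customer table in the scenario

# trim() and lower() are native Catalyst expressions: no Python serialization,
# and NULL emails stay NULL without an explicit check.
result_df = customers_df.withColumn("email_norm", F.lower(F.trim(F.col("email"))))
```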
Sample Question 6 — ELT with Spark SQL and Python
You are designing a partitioning strategy for a large Parquet table of web events in a lakehouse. The table receives about 5 TB of new data per day and is queried primarily by analysts who filter on event_date (YYYY-MM-DD) and sometimes on user_id. There are millions of distinct user_id values per day, but only 365–730 distinct event_date values over the typical query window. Queries often restrict to a date range such as the last 7 or 30 days.
You want to optimize query performance and avoid generating too many small files. Which partitioning strategy is most appropriate for this table?
- A. Partition by user_id only, because it has the highest cardinality and distributes data most evenly.
- B. Partition by event_date only, because it has lower cardinality and aligns with common filter predicates. (Correct answer)
- C. Partition by both event_date and user_id to maximize partition pruning for all queries.
- D. Do not partition the table at all and rely entirely on Spark's automatic optimizations.
Correct answer: B
Explanation: Partitioning by event_date matches the dominant query filter and has manageable cardinality, enabling effective partition pruning without creating an excessive number of small partitions. Partitioning by user_id or by both event_date and user_id would create an enormous number of partitions and small files, while not partitioning would force scans over far more data for typical date-range queries.
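An illustrative write using date-only partitioning; the source table name and output path are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events_df = spark.table("web_events_raw")  # hypothetical source for illustration

# One directory per event_date keeps partition counts manageable and lets
# date-range filters prune partitions; user_id is deliberately left unpartitioned.
(
    events_df.write.format("parquet")
    .partitionBy("event_date")
    .mode("append")
    .save("s3://lakehouse/web_events/")
)
```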
Sample Question 7 — ELT with Spark SQL and Python
A daily ELT job loads a full snapshot of a dimension table from a relational database into a Delta table in your lakehouse. The job uses the following PySpark code to write the data:
```python
snapshot_df.write.mode("overwrite").format("delta").saveAsTable("dim_customer")
```
The job is scheduled to rerun if it fails. After a recent failure and rerun, analysts report that the dim_customer table is temporarily empty between the start and end of the job, causing downstream reports to break. You want the pipeline to be safely rerunnable without exposing an empty table to consumers.
Which change best improves the idempotency and reliability of this ELT step?
- A. Change the write mode to 'append' so that existing data is preserved while new data is written.
- B. Use an overwrite operation that targets only the affected partitions instead of the entire table.
- C. Write the snapshot to a temporary table or location, then atomically swap or overwrite the target table once the write succeeds. (Correct answer)
- D. Disable the rerun behavior so that the job cannot overwrite the table more than once per day.
Correct answer: C
Explanation: Writing the new snapshot to a temporary location and then atomically swapping or overwriting the target table ensures that consumers always see either the old complete version or the new complete version, never an empty or partially written table. Append would create duplicate snapshots, partition overwrite does not solve the transient empty state for a full snapshot, and disabling reruns does not address reliability.
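One way to sketch the staged-swap pattern in PySpark, assuming Delta tables and a hypothetical staging table name; the final CREATE OR REPLACE commits in a single operation, so readers see either the old or the new snapshot:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
snapshot_df = spark.table("source_dim_customer_snapshot")  # hypothetical snapshot source

# 1. Write the full snapshot to a staging table; a failure here never touches the target.
snapshot_df.write.mode("overwrite").format("delta").saveAsTable("dim_customer_staging")

# 2. Replace the target from the validated staging table in one atomic step.
spark.sql("CREATE OR REPLACE TABLE dim_customer AS SELECT * FROM dim_customer_staging")
```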
Sample Question 8 — ELT with Spark SQL and Python
You are joining a small reference table of 50,000 product records with a large fact table of 5 billion sales records using Spark SQL. The product table is stored as a Delta table and is frequently reused in multiple joins. The cluster has enough memory to hold the product table in executor memory.
You notice that the join currently triggers a large shuffle on the sales table. You want to reduce shuffle overhead and speed up the join. What is the best approach?
- A. Broadcast the product table in the join so that the sales table does not need to be shuffled. (Correct answer)
- B. Repartition the sales table by product_id before the join to better distribute the data.
- C. Increase the shuffle partition count so that the shuffle is more parallelized.
- D. Convert both tables to CSV format before joining to reduce file sizes and shuffle volume.
Correct answer: A
Explanation: Broadcasting the small product table enables a broadcast hash join, where the large sales table does not need to be shuffled. This is ideal when one side of the join is small enough to fit in executor memory. Repartitioning or increasing shuffle partitions does not remove the expensive shuffle, and converting to CSV would degrade performance by losing columnar optimizations.
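A minimal sketch of the broadcast hint; the table and join-key names are placeholders based on the scenario:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

products = spark.table("products")  # ~50,000 rows, small enough to broadcast
sales = spark.table("sales")        # ~5 billion rows

# Broadcast hash join: each executor receives a full copy of products,
# so the large sales table is joined in place without being shuffled.
joined = sales.join(broadcast(products), on="product_id", how="inner")
```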
Sample Question 9 — ELT with Spark SQL and Python
You need to read a large Parquet dataset from object storage into PySpark, filter on a date column, and select only a few columns for downstream processing. You want Spark to minimize I/O by reading only the necessary data.
Which coding style best enables predicate pushdown and column pruning for this workload?
- A. Read the Parquet files into an RDD, then use Python filter and map functions to select rows and columns.
- B. Read the Parquet files as a DataFrame, then apply filter() and select() using built-in column expressions. (Correct answer)
- C. Read the Parquet files as text, parse the content with a Python UDF, and then filter the parsed DataFrame.
- D. Convert the Parquet files to CSV first, then read them with inferSchema enabled and apply filters in Python.
Correct answer: B
Explanation: Using the DataFrame API with built-in filter() and select() expressions allows Spark to push filters and column selection down to the Parquet reader, so only required columns and row groups are read. RDD-based processing, reading as text with UDF parsing, or converting to CSV all bypass Parquet's columnar optimizations and increase I/O.
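An illustrative DataFrame-style read that lets Spark push the filter and the column list down to the Parquet scan; the path, date value, and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.read.parquet("s3://lakehouse/events/")   # hypothetical path
    .filter(F.col("event_date") >= "2026-01-01")   # predicate pushed to the Parquet reader
    .select("event_date", "user_id", "page_url")   # only these columns are read
)
```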
Sample Question 10 — ELT with Spark SQL and Python
You maintain a Delta-based fact table fact_orders partitioned by order_date (date). Each hour, an ELT job loads new and updated orders for that hour from a source system and applies them to fact_orders. The job currently does:
```python
incoming_df.write.mode("append").format("delta").saveAsTable("fact_orders")
```
Over time, you notice duplicate order_id values and inconsistent metrics when rerunning failed hourly jobs. You want the pipeline to be idempotent and to avoid accumulating duplicates, while still leveraging the existing partitioning by order_date.
Which change best addresses these issues?
- A. Change the write mode to 'overwrite' so that each hourly batch replaces the entire fact_orders table.
- B. Use a MERGE operation on fact_orders keyed by order_id to upsert incoming records, restricted to the affected order_date partitions. (Correct answer)
- C. Repartition incoming_df by order_id before appending so that duplicates are distributed evenly across partitions.
- D. Drop and recreate the fact_orders table each day after all hourly loads have completed.
Correct answer: B
Explanation: Using MERGE keyed by order_id provides upsert semantics: existing rows are updated and new rows are inserted, preventing duplicates and making reruns idempotent. Restricting the MERGE to the relevant order_date partitions leverages partition pruning for performance. Overwriting the entire table hourly is inefficient and risky, repartitioning does not remove duplicates, and dropping/recreating daily does not solve hourly idempotency.
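A hedged sketch of the partition-restricted MERGE; the hourly source table is hypothetical, and the order_date predicate in the ON clause is what lets Delta skip untouched partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

incoming_df = spark.table("orders_hourly_feed")  # hypothetical hourly source
incoming_df.createOrReplaceTempView("incoming_batch")

spark.sql("""
    MERGE INTO fact_orders AS t
    USING incoming_batch AS s
    ON  t.order_id = s.order_id
    AND t.order_date = s.order_date        -- restricts the scan to affected partitions
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```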
How to Study DB-DEA ELT with Spark SQL and Python
Combine these DB-DEA ELT with Spark SQL and Python practice questions with the official Databricks Academy materials and hands-on practice in a Databricks Community Edition workspace. The DB-DEA exam emphasizes applied knowledge of PySpark, Spark SQL, and Delta Lake, so always relate concepts back to real notebooks and jobs you've built.
About the Databricks DB-DEA Exam
- Questions: 45 multiple choice
- Duration: 90 minutes
- Passing score: 70%
- Cost: $200 USD
- Domains: 6 (this domain accounts for 22% of the exam)
- Validity: 2 years
Other DB-DEA Domains
Start the free DB-DEA ELT with Spark SQL and Python practice test now | 10-question quick start | All DB-DEA domains | DB-DEA Cheat Sheet