Databricks Certified Data Engineer Associate Practice Questions: ELT with Spark SQL and Python Domain
Master the ELT with Spark SQL and Python Domain
Test your knowledge in the ELT with Spark SQL and Python domain with these 10 practice questions. Each question is designed to help you prepare for the Databricks Certified Data Engineer Associate certification exam with detailed explanations to reinforce your learning.
Question 1
You are designing a storage layout for a large Delta table `web_sessions` that will be queried frequently by analysts. The table has billions of rows with the following relevant columns:

- `session_id` (very high cardinality, almost unique)
- `user_id` (high cardinality)
- `country_code` (about 200 distinct values)
- `event_date` (daily values over several years)

Typical queries filter by a single `event_date` or a small range of dates, and sometimes by `country_code`. The table is appended to daily. Which partitioning strategy is most appropriate to balance query performance against small-file problems?
Show Answer & Explanation
Correct Answer: A
Partitioning by a column that is frequently used in filters enables effective partition pruning. Since most queries filter on `event_date`, partitioning by `event_date` alone aligns with access patterns while keeping the number of partitions manageable. Including high-cardinality columns like `user_id` or `session_id` in the partitioning scheme would create many small partitions and files, harming performance.
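This layout can be sketched in Delta SQL. The DDL below is a minimal illustration, assuming the column types shown; only the low-cardinality, frequently filtered `event_date` column is used for partitioning:

```sql
-- Sketch: partition only by the column most queries filter on.
CREATE TABLE web_sessions (
  session_id   STRING,   -- near-unique: never a partition column
  user_id      STRING,   -- high cardinality: never a partition column
  country_code STRING,   -- filterable via data skipping instead
  event_date   DATE
)
USING DELTA
PARTITIONED BY (event_date);
```

Queries filtering on `event_date` then prune whole partitions, while occasional `country_code` filters can rely on Delta's file-level data skipping rather than a second partition column.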
Question 2
A data engineer is migrating an existing batch ETL job that currently uses low-level RDD transformations to perform joins and aggregations on large Parquet datasets. The job runs daily and is starting to miss its SLA as data volume grows. They are considering rewriting the logic using Spark SQL/DataFrame operations instead of RDDs. The business logic (joins, filters, aggregations) will remain the same. What is the main reason this change is likely to improve performance for this ELT workload?
Show Answer & Explanation
Correct Answer: A
Spark SQL and the DataFrame API are optimized via the Catalyst optimizer and Tungsten engine, which can apply query optimizations such as predicate pushdown, column pruning, and efficient physical plans that are not available to arbitrary RDD code. For ELT-style set-based operations on structured data, this typically yields better performance and scalability than equivalent RDD-based implementations.
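As a rough illustration, the same join-and-aggregate logic expressed with the DataFrame API might look like the sketch below. The paths, table names, and column names (`orders`, `customers`, `amount`) are assumptions for illustration, not from the question:

```python
# Sketch: set-based logic the Catalyst optimizer can rewrite (predicate
# pushdown, column pruning, join reordering) -- unlike opaque RDD lambdas.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/data/orders")        # assumed path
customers = spark.read.parquet("/data/customers")  # assumed path

daily_totals = (
    orders
    .join(customers, "customer_id")                # replaces an RDD join
    .where(F.col("order_date") == "2024-01-01")    # candidate for pushdown
    .groupBy("customer_id")                        # replaces reduceByKey-style code
    .agg(F.sum("amount").alias("total_amount"))
)
daily_totals.write.mode("overwrite").parquet("/data/daily_totals")
```

Because the logic is declared rather than hand-coded, Spark is free to choose the physical plan, which is the source of the performance gain described above.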
Question 3
You need to read a large Parquet dataset from object storage into PySpark, filter on a date column, and select only a few columns for downstream processing. You want Spark to minimize I/O by reading only the necessary data. Which coding style best enables predicate pushdown and column pruning for this workload?
Show Answer & Explanation
Correct Answer: B
Using the DataFrame API with built-in filter() and select() expressions allows Spark to push filters and column selection down to the Parquet reader, so only required columns and row groups are read. RDD-based processing, reading as text with UDF parsing, or converting to CSV all bypass Parquet's columnar optimizations and increase I/O.
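A minimal sketch of such a read is shown below; the bucket path and column names are assumptions. The key point is that `select()` and `where()` use built-in expressions, so they can be pushed into the Parquet scan:

```python
# Sketch: built-in expressions let Spark push the filter and the column
# list down to the Parquet reader, minimizing I/O.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.parquet("s3://bucket/events/")     # columnar source
    .select("event_date", "user_id", "amount")    # column pruning
    .where(F.col("event_date") >= "2024-01-01")   # predicate pushdown
)
```

Running `df.explain()` and looking for `PushedFilters` in the scan node is a quick way to confirm the pushdown actually happened; a Python UDF in the filter would prevent it.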
Question 4
A team optimizes a complex ETL pipeline that joins several dimension tables with a large fact table (1.5 TB). They decide to cache an intermediate joined DataFrame because it is used in three downstream aggregations. After deployment, the pipeline becomes slower and sometimes fails. In the Spark UI:

- The Storage tab shows the cached DataFrame with a size of 900 GB
- Only about 30% of it is cached in memory; the rest spills to disk
- The Executors tab shows high GC time and intermittent task failures due to executor OOM
- The SQL tab shows that each downstream aggregation reads the cached DataFrame once

What is the most appropriate next step based on the Spark UI evidence?
Show Answer & Explanation
Correct Answer: A
The Storage tab shows that the cached DataFrame is very large and mostly not in memory, causing disk spill and high GC, and contributing to OOM errors. The SQL tab indicates it is only reused three times, which may not justify the memory pressure. Removing the cache avoids the overhead of maintaining a massive cached dataset and can improve stability and performance.
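In code, the fix is simply to stop materializing the intermediate result. The sketch below uses assumed table and column names to show the shape of the change:

```python
# Sketch: no .cache() on the large intermediate join. Each downstream
# aggregation recomputes from the columnar sources, where Catalyst can
# apply pushdown and pruning per query -- usually cheaper than reading a
# 900 GB cache that mostly lives on disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

fact = spark.read.table("fact_sales")          # assumed table names
dim = spark.read.table("dim_product")

joined = fact.join(dim, "product_id")          # previously: joined.cache()

agg_by_day = joined.groupBy("event_date").count()
agg_by_product = joined.groupBy("product_id").count()
# If a cache is already materialized in a running session, release it:
joined.unpersist()
```

With only three consumers and a dataset far larger than executor memory, recomputation is the stable choice.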
Question 5
A retail company maintains a bronze table `orders_bronze` that receives append-only order events from multiple source systems throughout the day. Each record has:

- `order_id` (business key, may be retried by source)
- `event_time` (ingestion time in UTC)
- `status` (e.g., CREATED, SHIPPED, CANCELLED)

The silver table `orders_silver` is intended to contain the latest status per `order_id` and be idempotent: rerunning the daily ELT job for the same date must not create duplicates or regress statuses. Currently, the engineer uses a full-refresh pattern:

1. Read all data from `orders_bronze`.
2. For each `order_id`, select the row with the max `event_time`.
3. Overwrite `orders_silver` with the result.

This works but is becoming too slow as data grows. The engineer proposes switching to an incremental pattern that simply appends the new day's bronze records to `orders_silver` without any merge logic, assuming that downstream queries can just take the latest `event_time` per `order_id`. Given the requirement for an idempotent silver table with one latest row per `order_id`, how should the lead data engineer respond?
Show Answer & Explanation
Correct Answer: A
An idempotent silver table that exposes exactly one latest row per order_id must apply upsert/merge semantics: update existing rows when a newer event arrives and insert new keys. Simply appending new events will create multiple rows per order_id and make reruns non-idempotent, pushing complexity and correctness risks to every downstream consumer.
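An incremental upsert along these lines can be sketched with Delta's `MERGE INTO`. The column names come from the question; the incremental filter on `event_time` is an assumption about how the daily slice is selected:

```sql
-- Sketch of an idempotent upsert: dedupe the increment to the latest
-- event per order_id, then merge so reruns cannot regress a status.
MERGE INTO orders_silver AS t
USING (
  SELECT order_id, event_time, status
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY order_id ORDER BY event_time DESC
           ) AS rn
    FROM orders_bronze
    WHERE event_time >= current_date()   -- assumed daily increment filter
  ) d
  WHERE rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.event_time > t.event_time THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

The `s.event_time > t.event_time` guard is what makes reruns safe: replaying the same slice matches existing rows but updates nothing, so statuses never move backwards.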
Question 6
You are deduplicating a `silver_events` table so that for each `(user_id, event_type)` pair, only the latest event by `event_time` is kept. You write the following Spark SQL:

```sql
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, event_type
           ORDER BY event_time DESC
         ) AS rn
  FROM silver_events
) t
WHERE rn = 1
```

The query runs correctly but is slower than expected. A teammate suggests replacing the window function with a simple `GROUP BY user_id, event_type` and `MAX(event_time)`. Why is the window function approach still preferable in this scenario?
Show Answer & Explanation
Correct Answer: B
Using `ROW_NUMBER()` over `(user_id, event_type)` lets you select the latest full row per key while preserving all other columns. A simple `GROUP BY` with `MAX(event_time)` only returns the keys and the max timestamp, requiring an extra join back to recover the remaining columns. The benefit here is correctness and convenience, not an inherent performance advantage.
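The difference in what each approach returns can be shown with a tiny pure-Python simulation (not Spark); the rows and the extra `page` column are invented for illustration:

```python
# Pure-Python illustration of the two dedup semantics. The ROW_NUMBER()-style
# approach keeps the full latest row per key; GROUP BY ... MAX(event_time)
# keeps only the key and the timestamp.
rows = [
    {"user_id": 1, "event_type": "click", "event_time": 10, "page": "/home"},
    {"user_id": 1, "event_type": "click", "event_time": 20, "page": "/cart"},
    {"user_id": 2, "event_type": "view",  "event_time": 5,  "page": "/home"},
]

# ROW_NUMBER() OVER (PARTITION BY user_id, event_type ORDER BY event_time DESC),
# keeping rn = 1: the latest complete row per key survives.
latest = {}
for r in rows:
    key = (r["user_id"], r["event_type"])
    if key not in latest or r["event_time"] > latest[key]["event_time"]:
        latest[key] = r
deduped = list(latest.values())

# GROUP BY user_id, event_type with MAX(event_time): the "page" column is
# gone; recovering it would require a join back to the source.
grouped = {}
for r in rows:
    key = (r["user_id"], r["event_type"])
    grouped[key] = max(grouped.get(key, 0), r["event_time"])
```

Here `deduped` still carries `page` for each surviving row, while `grouped` maps keys to timestamps only, which is exactly the extra-join problem the explanation describes.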
Question 7
You are joining a small reference table of 50,000 product records with a large fact table of 5 billion sales records using Spark SQL. The product table is stored as a Delta table and is frequently reused in multiple joins. The cluster has enough memory to hold the product table in executor memory. You notice that the join currently triggers a large shuffle on the sales table. You want to reduce shuffle overhead and speed up the join. What is the best approach?
Show Answer & Explanation
Correct Answer: A
Broadcasting the small product table enables a broadcast hash join, where the large sales table does not need to be shuffled. This is ideal when one side of the join is small enough to fit in executor memory. Repartitioning or increasing shuffle partitions does not remove the expensive shuffle, and converting to CSV would degrade performance by losing columnar optimizations.
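In Spark SQL this is expressed with a join hint; the alias and column names below are assumptions matching the scenario:

```sql
-- Sketch: the BROADCAST hint ships the 50,000-row product table to every
-- executor, so the 5-billion-row sales table is never shuffled.
SELECT /*+ BROADCAST(p) */
       s.*,
       p.product_name
FROM sales s
JOIN products p
  ON s.product_id = p.product_id;
```

The DataFrame equivalent is `sales.join(broadcast(products), "product_id")` with `broadcast` imported from `pyspark.sql.functions`. Note that Spark also broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold`, so the hint matters when the small table's estimated size exceeds that threshold.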
Question 8
An ELT job creates a DataFrame `df_sales` and then uses it in three separate downstream aggregations and joins within the same job. The engineer calls `df_sales.cache()` immediately after creating it, assuming this will always improve performance. The cluster has limited memory and runs multiple concurrent jobs. What is the best practice in this situation?
Show Answer & Explanation
Correct Answer: C
Caching is beneficial when a DataFrame is reused across multiple actions and the cluster has enough memory to hold the cached data. On a memory-constrained cluster with concurrent workloads, caching large DataFrames can cause memory pressure, eviction, and extra overhead, negating any benefit. The decision to cache should be based on reuse frequency and available memory, not applied blindly.
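A deliberate-caching pattern might look like the sketch below. The storage level choice and the aggregation columns are assumptions for illustration:

```python
# Sketch: persist only when reuse justifies it, choose a level that spills
# instead of failing, and release the cache once the last consumer is done.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_sales = spark.read.table("sales")                # assumed source table

df_sales.persist(StorageLevel.MEMORY_AND_DISK)      # spills rather than OOMs

agg_by_region = df_sales.groupBy("region").count()
agg_by_product = df_sales.groupBy("product_id").count()
agg_by_day = df_sales.groupBy("order_date").count()

df_sales.unpersist()                                # free memory promptly
```

On a memory-constrained, multi-tenant cluster, the `unpersist()` call is as important as the `persist()`: it returns executor memory to concurrent jobs instead of waiting for LRU eviction.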
Question 9
A PySpark ELT job reads a large JSON file, applies several `withColumn` transformations, and then writes the result as Parquet. The engineer adds multiple `print("debug")` statements after each transformation but sees that none of the messages appear until the write operation starts. They suspect the transformations are not being applied. What explains this behavior, and what is the best way to validate intermediate results?
Show Answer & Explanation
Correct Answer: B
Spark DataFrame transformations are lazily evaluated and only executed when an action is triggered. The write operation is an action, so all prior transformations execute at that point, which is why the debug prints appear only then. To validate intermediate results, the engineer should call actions such as `show()`, `count()`, or `limit().collect()` on intermediate DataFrames to trigger execution and inspect the data earlier.
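The pattern can be sketched as follows; the path and column names are assumptions:

```python
# Sketch: transformations build a plan; small actions force execution so
# intermediate results can be inspected during development.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/data/logs.json")              # assumed path
df2 = df.withColumn("event_date", F.to_date("event_time"))

df2.show(5)          # action: runs the plan so far, prints 5 rows
print(df2.count())   # action: row count after the transformation
```

A `print("debug")` between transformations, by contrast, runs immediately on the driver and proves nothing about the data, since no Spark job has executed yet.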
Question 10
A bronze table of JSON logs is ingested daily into a data lake. Over time, new fields are added to the JSON, and some existing fields change type (e.g., string to integer). A PySpark ELT job reads the directory as a single Parquet table without any explicit schema evolution handling. Recently, the job started failing with analysis exceptions about incompatible schemas between files. What is the best way to make the pipeline more robust to these schema changes?
Show Answer & Explanation
Correct Answer: B
Schema evolution in file-based tables requires explicit handling. Enabling schema evolution options (such as mergeSchema or table-format-specific evolution features) and/or defining a consistent target schema with appropriate casting and null handling allows the pipeline to accommodate added or changed columns without analysis failures. Relying on implicit behavior, changing to CSV, or dropping files either does not solve the problem or leads to data loss.
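A sketch of the more robust read is shown below; the path and the `status_code` column are assumptions. Note that `mergeSchema` reconciles files that differ by *added* columns, while genuine type conflicts (string vs. integer) still need an explicit target schema or cast:

```python
# Sketch: merge compatible per-file schemas at read time, then normalize
# a column whose type drifted across files.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("mergeSchema", "true")                   # union added columns
    .parquet("/lake/bronze/logs/")                   # assumed path
    .withColumn("status_code",
                F.col("status_code").cast("int"))    # pin the drifting type
)
```

For recurring drift, reading with an explicitly defined target schema (or moving the table to Delta and using its schema-evolution features) is more predictable than relying on merge behavior alone.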
Ready to Accelerate Your Databricks Certified Data Engineer Associate Preparation?
Join thousands of professionals who are advancing their careers through expert certification preparation with FlashGenius.
- ✅ Unlimited practice questions across all Databricks Certified Data Engineer Associate domains
- ✅ Full-length exam simulations with real-time scoring
- ✅ AI-powered performance tracking and weak area identification
- ✅ Personalized study plans with adaptive learning
- ✅ Mobile-friendly platform for studying anywhere, anytime
- ✅ Expert explanations and study resources
About Databricks Certified Data Engineer Associate Certification
The Databricks Certified Data Engineer Associate certification validates your expertise in ELT with Spark SQL and Python and other critical domains. Our comprehensive practice questions are carefully crafted to mirror the actual exam experience and help you identify knowledge gaps before test day.
Practice Resources for Databricks DEA Certification
Strengthen your DB-DEA prep with focused practice questions across the most important exam domains.
Databricks Data Engineer Associate: Your Complete 2026 Guide
Preparing for the DB-DEA exam? This complete guide covers exam structure, key topics, study strategy, and real-world preparation tips to help you pass on your first attempt.
- ✔️ Full exam breakdown (latest blueprint)
- ✔️ Key domains and high-weight topics
- ✔️ Study roadmap + preparation strategy
- ✔️ Tips to avoid common exam mistakes