
Databricks Certified Data Engineer Associate Practice Questions: ELT with Spark SQL and Python Domain


Master the ELT with Spark SQL and Python Domain

Test your knowledge in the ELT with Spark SQL and Python domain with these 10 practice questions. Each question is designed to help you prepare for the Databricks Certified Data Engineer Associate certification exam with detailed explanations to reinforce your learning.

Question 1

You are designing a storage layout for a large Delta table `web_sessions` that will be queried frequently by analysts. The table has billions of rows with the following relevant columns:

  • `session_id` (very high cardinality, almost unique)
  • `user_id` (high cardinality)
  • `country_code` (about 200 distinct values)
  • `event_date` (daily values over several years)

Typical queries filter by a single `event_date` or a small range of dates, and sometimes by `country_code`. The table is appended to daily. Which partitioning strategy is most appropriate to balance performance and avoid small-file problems?

A) Partition the table by `event_date` only, relying on predicate pushdown and column pruning for other filters.

B) Partition the table by `session_id` to maximize parallelism and avoid data skew.

C) Partition the table by both `event_date` and `user_id` to maximize pruning for all common filters.

D) Partition the table by `country_code` only, since it has fewer distinct values than `user_id` or `session_id`.


Correct Answer: A

Explanation:

Partitioning by a column that is frequently used in filters enables effective partition pruning. Since most queries filter on `event_date`, partitioning by `event_date` alone aligns with access patterns while keeping the number of partitions manageable. Including high-cardinality columns like `user_id` or `session_id` in the partitioning scheme would create many small partitions and files, harming performance.
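A minimal DDL sketch of the chosen strategy (Delta Lake syntax; everything beyond the question's table and column names is illustrative):

```sql
-- Partition only by the low-cardinality column that queries actually filter on.
CREATE TABLE web_sessions (
  session_id   STRING,
  user_id      STRING,
  country_code STRING,
  event_date   DATE
)
USING DELTA
PARTITIONED BY (event_date);

-- A typical analyst query prunes to a single partition;
-- the country_code filter is handled by data skipping within it.
SELECT country_code, COUNT(*) AS sessions
FROM web_sessions
WHERE event_date = DATE '2024-06-01'
GROUP BY country_code;
```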

Question 2

A data engineer is migrating an existing batch ETL job that currently uses low-level RDD transformations to perform joins and aggregations on large Parquet datasets. The job runs daily and is starting to miss its SLA as data volume grows. They are considering rewriting the logic using Spark SQL/DataFrame operations instead of RDDs. The business logic (joins, filters, aggregations) will remain the same. What is the main reason this change is likely to improve performance for this ELT workload?

A) Spark SQL/DataFrame operations are optimized by Catalyst and Tungsten, which can generate efficient execution plans for set-based transformations.

B) RDD transformations always materialize intermediate results to disk, while DataFrame operations always keep all intermediate data in memory.

C) Spark SQL/DataFrame operations automatically repartition data to eliminate shuffles, whereas RDDs always require explicit shuffles.

D) Spark SQL/DataFrame operations bypass the cluster’s resource manager and therefore can use more CPU and memory than RDD-based jobs.


Correct Answer: A

Explanation:

Spark SQL and the DataFrame API are optimized via the Catalyst optimizer and Tungsten engine, which can apply query optimizations such as predicate pushdown, column pruning, and efficient physical plans that are not available to arbitrary RDD code. For ELT-style set-based operations on structured data, this typically yields better performance and scalability than equivalent RDD-based implementations.

Question 3

You need to read a large Parquet dataset from object storage into PySpark, filter on a date column, and select only a few columns for downstream processing. You want Spark to minimize I/O by reading only the necessary data. Which coding style best enables predicate pushdown and column pruning for this workload?

A) Read the Parquet files into an RDD, then use Python filter and map functions to select rows and columns.

B) Read the Parquet files as a DataFrame, then apply filter() and select() using built-in column expressions.

C) Read the Parquet files as text, parse the content with a Python UDF, and then filter the parsed DataFrame.

D) Convert the Parquet files to CSV first, then read them with inferSchema enabled and apply filters in Python.


Correct Answer: B

Explanation:

Using the DataFrame API with built-in filter() and select() expressions allows Spark to push filters and column selection down to the Parquet reader, so only required columns and row groups are read. RDD-based processing, reading as text with UDF parsing, or converting to CSV all bypass Parquet's columnar optimizations and increase I/O.

Question 4

A team optimizes a complex ETL pipeline that joins several dimension tables with a large fact table (1.5 TB). They decide to cache an intermediate joined DataFrame because it is used in three downstream aggregations. After deployment, the pipeline becomes slower and sometimes fails. In the Spark UI:

  • The Storage tab shows the cached DataFrame with a size of 900 GB
  • Only about 30% of it is cached in memory; the rest spills to disk
  • The Executors tab shows high GC time and intermittent task failures due to executor OOM
  • The SQL tab shows that each downstream aggregation reads the cached DataFrame once

What is the most appropriate next step based on the Spark UI evidence?

A) Remove the cache on the intermediate DataFrame and let each aggregation recompute it

B) Increase the cluster size so that the entire cached DataFrame fits in memory

C) Change the cache level to DISK_ONLY to avoid executor OOM errors

D) Add additional cache() calls on the downstream aggregation results to reduce recomputation


Correct Answer: A

Explanation:

The Storage tab shows that the cached DataFrame is very large and mostly not in memory, causing disk spill and high GC, and contributing to OOM errors. The SQL tab indicates it is only reused three times, which may not justify the memory pressure. Removing the cache avoids the overhead of maintaining a massive cached dataset and can improve stability and performance.

Question 5

A retail company maintains a bronze table orders_bronze that receives append-only order events from multiple source systems throughout the day. Each record has:

  • order_id (business key, may be retried by source)
  • event_time (ingestion time in UTC)
  • status (e.g., CREATED, SHIPPED, CANCELLED)

The silver table orders_silver is intended to contain the latest status per order_id and be idempotent: rerunning the daily ELT job for the same date must not create duplicates or regress statuses. Currently, the engineer uses a full refresh pattern:

1. Read all data from orders_bronze.
2. For each order_id, select the row with the max(event_time).
3. Overwrite orders_silver with the result.

This works but is becoming too slow as data grows. The engineer proposes switching to an incremental pattern that simply appends the new day's bronze records to orders_silver without any merge logic, assuming that downstream queries can just take the latest event_time per order_id. Given the requirement for an idempotent silver table with one latest row per order_id, how should the lead data engineer respond?

A) Reject the proposal, because blindly appending daily records to orders_silver will break idempotency and create multiple rows per order_id unless an upsert/merge pattern is used.

B) Accept the proposal, because append-only loads are always idempotent as long as order_id is present and downstream queries can filter by max(event_time).

C) Accept the proposal, but require repartitioning orders_silver by order_id to ensure that appends do not create duplicates.

D) Reject the proposal, because incremental patterns cannot be used when there is a business key like order_id; full refresh is the only correct approach.


Correct Answer: A

Explanation:

An idempotent silver table that exposes exactly one latest row per order_id must apply upsert/merge semantics: update existing rows when a newer event arrives and insert new keys. Simply appending new events will create multiple rows per order_id and make reruns non-idempotent, pushing complexity and correctness risks to every downstream consumer.
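A sketch of the upsert pattern (Delta Lake MERGE syntax; the staging view `daily_orders_bronze`, assumed to hold the deduplicated latest row per order_id for the day, is an illustration):

```sql
MERGE INTO orders_silver AS t
USING daily_orders_bronze AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.event_time > t.event_time THEN
  UPDATE SET t.status = s.status, t.event_time = s.event_time
WHEN NOT MATCHED THEN
  INSERT (order_id, status, event_time)
  VALUES (s.order_id, s.status, s.event_time)
```

The `s.event_time > t.event_time` guard is what makes reruns idempotent: replaying the same day's batch matches existing rows without updating them, and stale retries cannot regress a newer status.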

Question 6

You are deduplicating a `silver_events` table so that for each `(user_id, event_type)` pair, only the latest event by `event_time` is kept. You write the following Spark SQL:

```sql
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, event_type
           ORDER BY event_time DESC
         ) AS rn
  FROM silver_events
) t
WHERE rn = 1
```

The query runs correctly but is slower than expected. A teammate suggests replacing the window function with a simple `GROUP BY user_id, event_type` and `MAX(event_time)`. Why is the window function approach still preferable in this scenario?

A) Window functions are always faster than `GROUP BY` aggregations in Spark.

B) The window function allows you to keep all original columns from `silver_events` for the latest event, not just `user_id`, `event_type`, and `event_time`.

C) The window function avoids shuffles, while `GROUP BY` always causes a shuffle.

D) The window function automatically handles late-arriving data, while `GROUP BY` does not.


Correct Answer: B

Explanation:

Using `ROW_NUMBER()` over `(user_id, event_type)` lets you select the latest full row per key while preserving all other columns. A simple `GROUP BY` with `MAX(event_time)` only returns the keys and the max timestamp, requiring an extra join back to recover the remaining columns. The benefit here is correctness and convenience, not an inherent performance advantage.
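For contrast, a sketch of what the `GROUP BY` alternative would actually require to return full rows (the self-join back is the extra step the teammate's proposal omits):

```sql
SELECT e.*
FROM silver_events e
JOIN (
  SELECT user_id, event_type, MAX(event_time) AS max_time
  FROM silver_events
  GROUP BY user_id, event_type
) latest
  ON  e.user_id    = latest.user_id
  AND e.event_type = latest.event_type
  AND e.event_time = latest.max_time
```

Note that this version also returns multiple rows per key when two events tie on `event_time`, whereas `ROW_NUMBER()` guarantees exactly one.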

Question 7

You are joining a small reference table of 50,000 product records with a large fact table of 5 billion sales records using Spark SQL. The product table is stored as a Delta table and is frequently reused in multiple joins. The cluster has enough memory to hold the product table in executor memory. You notice that the join currently triggers a large shuffle on the sales table. You want to reduce shuffle overhead and speed up the join. What is the best approach?

A) Broadcast the product table in the join so that the sales table does not need to be shuffled.

B) Repartition the sales table by product_id before the join to better distribute the data.

C) Increase the shuffle partition count so that the shuffle is more parallelized.

D) Convert both tables to CSV format before joining to reduce file sizes and shuffle volume.


Correct Answer: A

Explanation:

Broadcasting the small product table enables a broadcast hash join, where the large sales table does not need to be shuffled. This is ideal when one side of the join is small enough to fit in executor memory. Repartitioning or increasing shuffle partitions does not remove the expensive shuffle, and converting to CSV would degrade performance by losing columnar optimizations.
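A sketch of the broadcast hint in Spark SQL (table and column names are illustrative; in PySpark the equivalent is wrapping the small DataFrame in `broadcast()` from `pyspark.sql.functions`):

```sql
-- Hint Spark to ship the small products table to every executor,
-- turning the join into a broadcast hash join with no shuffle of sales.
SELECT /*+ BROADCAST(p) */
       s.*,
       p.product_name
FROM sales s
JOIN products p
  ON s.product_id = p.product_id
```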

Question 8

An ELT job creates a DataFrame df_sales and then uses it in three separate downstream aggregations and joins within the same job. The engineer calls df_sales.cache() immediately after creating it, assuming this will always improve performance. The cluster has limited memory and runs multiple concurrent jobs. What is the best practice in this situation?

A) Keep df_sales.cache(), because caching always improves performance when a DataFrame is used more than once, regardless of memory constraints.

B) Remove df_sales.cache(), because caching is never beneficial in Spark; recomputing is always cheaper.

C) Cache df_sales only if it is reused multiple times in actions and there is sufficient memory; otherwise, avoid caching to prevent memory pressure and eviction.

D) Replace cache() with persist(StorageLevel.DISK_ONLY), because disk-only persistence is always faster than recomputation.


Correct Answer: C

Explanation:

Caching is beneficial when a DataFrame is reused across multiple actions and the cluster has enough memory to hold the cached data. On a memory-constrained cluster with concurrent workloads, caching large DataFrames can cause memory pressure, eviction, and extra overhead, negating any benefit. The decision to cache should be based on reuse frequency and available memory, not applied blindly.

Question 9

A PySpark ELT job reads a large JSON file, applies several withColumn transformations, and then writes the result as Parquet. The engineer adds multiple print("debug") statements after each transformation but sees that none of the messages appear until the write operation starts. They suspect the transformations are not being applied. What explains this behavior and what is the best way to validate intermediate results?

A) Spark executes each transformation immediately, but print statements are buffered; the engineer should flush stdout after each print.

B) Spark uses lazy evaluation, so transformations are only executed when an action (like write, show, or count) is called; the engineer should use actions like df.show() or df.limit(10).collect() on intermediate DataFrames.

C) Spark only executes transformations when a cache() is called; the engineer should cache each intermediate DataFrame to force execution.

D) Spark defers execution until the driver script finishes; the engineer must wait for the entire script to complete before any output appears.


Correct Answer: B

Explanation:

Spark DataFrame transformations are lazily evaluated and only executed when an action is triggered. The write operation is an action, so all prior transformations execute at that point, which is why debug prints appear then. To validate intermediate results, the engineer should call actions such as show(), count(), or limit().collect() on intermediate DataFrames to trigger execution and inspection earlier.

Question 10

A bronze table of JSON logs is ingested daily into a data lake. Over time, new fields are added to the JSON, and some existing fields change type (e.g., string to integer). A PySpark ELT job reads the directory as a single Parquet table without any explicit schema evolution handling. Recently, the job started failing with analysis exceptions about incompatible schemas between files. What is the best way to make the pipeline more robust to these schema changes?

A) Ignore the schema differences and rely on Spark's schema-on-read to automatically merge all file schemas without configuration.

B) Explicitly enable schema evolution features for the table format (for example, using options like mergeSchema or table-specific evolution settings) and/or define a target schema with appropriate casting and null handling.

C) Convert all JSON logs to CSV before writing to Parquet, because CSV does not have schema and will avoid schema evolution issues.

D) Drop any files that do not match the original schema so that the job can continue using a fixed schema.


Correct Answer: B

Explanation:

Schema evolution in file-based tables requires explicit handling. Enabling schema evolution options (such as mergeSchema or table-format-specific evolution features) and/or defining a consistent target schema with appropriate casting and null handling allows the pipeline to accommodate added or changed columns without analysis failures. Relying on implicit behavior, changing to CSV, or dropping files either does not solve the problem or leads to data loss.

Ready to Accelerate Your Databricks Certified Data Engineer Associate Preparation?

Join thousands of professionals who are advancing their careers through expert certification preparation with FlashGenius.

  • ✅ Unlimited practice questions across all Databricks Certified Data Engineer Associate domains
  • ✅ Full-length exam simulations with real-time scoring
  • ✅ AI-powered performance tracking and weak area identification
  • ✅ Personalized study plans with adaptive learning
  • ✅ Mobile-friendly platform for studying anywhere, anytime
  • ✅ Expert explanations and study resources

About Databricks Certified Data Engineer Associate Certification

The Databricks Certified Data Engineer Associate certification validates your expertise in ELT with Spark SQL and Python, among other critical domains. Our comprehensive practice questions are carefully crafted to mirror the actual exam experience and help you identify knowledge gaps before test day.

Practice Resources for Databricks DEA Certification

Strengthen your DB-DEA prep with focused practice questions across the most important exam domains.

Recommended Guide

Databricks Data Engineer Associate: Your Complete 2026 Guide

Preparing for the DB-DEA exam? This complete guide covers exam structure, key topics, study strategy, and real-world preparation tips to help you pass on your first attempt.

  • ✔️ Full exam breakdown (latest blueprint)
  • ✔️ Key domains and high-weight topics
  • ✔️ Study roadmap + preparation strategy
  • ✔️ Tips to avoid common exam mistakes