Free DB-DEA Quick Start Practice Test — 10 Questions Across All 6 Databricks Domains

Last updated: May 2026 · Aligned with the current Databricks Data Engineer Associate exam

This free DB-DEA quick start practice test draws 10 questions from across all 6 Databricks Certified Data Engineer Associate domains. Use it for a fast readiness check before diving into per-domain study.

10 Free DB-DEA Mixed-Domain Practice Questions

Sample Question 1 — Data Governance and Security

A retail company is building a centralized reporting database that consolidates customer orders, including names, email addresses, and partial payment card details. The security architect has already enabled encryption at rest for the database storage and TLS for all connections. Compliance asks what additional control is most important to reduce the risk of inappropriate access to this sensitive data by internal analysts. What should the architect recommend next?

  A. Rely on the existing encryption controls and focus only on performance tuning for the reporting queries
  B. Implement role-based access control so analysts only see the minimum data necessary for their job functions (Correct answer)
  C. Disable audit logging to avoid storing potentially sensitive metadata about user access patterns
  D. Create a single shared database account for all analysts to simplify credential management

Correct answer: B

Explanation: Encryption at rest and in transit protects data from certain threats but does not prevent over-privileged internal access. Implementing role-based access control aligned to job functions enforces least privilege so analysts only see the minimum data needed, directly reducing the risk of inappropriate internal access.
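
To see what option B looks like in practice, here is a minimal sketch of least-privilege grants using Unity Catalog SQL from a notebook; the catalog, schema, table, and group names are hypothetical.

```python
# Minimal sketch of least-privilege access: grant an analyst group only the
# objects it needs. All object and group names below are illustrative.
spark.sql("GRANT USE CATALOG ON CATALOG reporting TO `order_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA reporting.sales TO `order_analysts`")
spark.sql("GRANT SELECT ON TABLE reporting.sales.orders_summary TO `order_analysts`")
```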

Sample Question 2 — Data Governance and Security

A healthcare analytics team wants to share a dataset with an external research partner. The dataset includes patient demographics, diagnosis codes, and a persistent patient ID that can be linked back to the hospital’s systems. The partner needs to perform longitudinal analysis over time but must not be able to identify individual patients. Which approach best balances analytical usefulness with privacy requirements?

  A. Provide the full dataset as-is since the partner is under contract and uses secure networks
  B. Replace patient IDs with randomly generated tokens that only the hospital can map back to real identities (Correct answer)
  C. Mask all diagnosis codes with generic placeholders such as 'Condition A', 'Condition B', etc.
  D. Remove only names and addresses and keep the original patient IDs for easier analysis

Correct answer: B

Explanation: Replacing patient IDs with randomly generated tokens that only the hospital can reverse is pseudonymization. It preserves the ability to track patients over time for longitudinal analysis while preventing the external partner from directly identifying individuals, achieving a good balance between utility and privacy.
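
As a rough illustration of option B, the sketch below replaces patient IDs with random tokens and keeps the mapping table only on the hospital side; all table and column names are assumptions made for the example.

```python
from pyspark.sql import functions as F

# Source events keyed by the real patient ID (illustrative table/column names).
events = spark.table("hospital.raw.patient_events")

# Token map kept only inside the hospital: real ID -> random token.
token_map = (events.select("patient_id").distinct()
             .withColumn("patient_token", F.expr("uuid()")))
token_map.write.mode("overwrite").saveAsTable("hospital.private.patient_token_map")

# Dataset shared with the research partner: tokens instead of identifiers.
shared = (events.join(token_map, "patient_id")
          .drop("patient_id", "patient_name", "address"))
shared.write.mode("overwrite").saveAsTable("hospital.shared.research_events")
```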

Sample Question 3 — Databricks Lakehouse Platform

A data engineering team has a notebook that loads raw files into Delta tables every night at 2:00 AM. No users need to interact with the cluster during the run, and the team wants to avoid paying for compute when the pipeline is idle. What is the most appropriate way to run this workload?

  A. Attach the notebook to an all-purpose cluster that remains available for developers
  B. Create a Databricks Job that runs the notebook on job compute (Correct answer)
  C. Run the notebook from a SQL warehouse because it starts quickly for scheduled tasks
  D. Store the notebook in Databricks Repos so it can execute automatically each night

Correct answer: B

Explanation: Job compute is the best fit for scheduled production workloads because the cluster is created for the job run and terminated automatically when the run finishes. That matches the team's need for an unattended nightly pipeline with no idle compute cost.
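
As a sketch of option B, a job definition along these lines (shown as the payload you might send to the Jobs API, expressed as a Python dict) runs the notebook nightly on job compute that exists only for the duration of the run; the notebook path, node type, and runtime version are illustrative assumptions.

```python
# Illustrative job definition: nightly run at 2:00 AM on job compute that is
# created for the run and terminated afterward. Field values are assumptions.
job_spec = {
    "name": "nightly_bronze_ingest",
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    "tasks": [
        {
            "task_key": "ingest_raw_files",
            "notebook_task": {"notebook_path": "/Repos/data-eng/pipelines/ingest_raw"},
            "new_cluster": {  # job compute rather than an all-purpose cluster
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}
```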

Sample Question 4 — Databricks Lakehouse Platform

A notebook connects to an external database using a username and password currently written directly in the code. During a security review, the team is told to remove credentials from notebooks while keeping the pipeline functional. What should the team do next?

  A. Move the credentials into a separate notebook and import that notebook into the pipeline
  B. Store the credentials in a Databricks secret and reference the secret from the notebook (Correct answer)
  C. Save the credentials in a workspace folder with restricted permissions
  D. Attach the notebook to a SQL warehouse so the credentials are no longer visible in code

Correct answer: B

Explanation: Databricks secret management is the appropriate place to store sensitive credentials. The notebook can reference the secret at runtime without embedding the username and password directly in code.
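
A minimal sketch of option B, assuming a secret scope named jdbc-creds already exists; the scope, key, and connection values are illustrative.

```python
# Fetch credentials from a Databricks secret scope at runtime; nothing sensitive
# is hard-coded in the notebook. Scope, keys, and connection details are illustrative.
user = dbutils.secrets.get(scope="jdbc-creds", key="db-username")
password = dbutils.secrets.get(scope="jdbc-creds", key="db-password")

orders_df = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db.example.com:5432/sales")
             .option("dbtable", "public.orders")
             .option("user", user)
             .option("password", password)
             .load())
```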

Sample Question 5 — Delta Lake and Data Management

A data engineering team currently stores fact tables as plain Parquet files in cloud object storage. Analysts complain that concurrent ETL jobs sometimes leave the data in an inconsistent state, causing reports to show partial updates. The team wants to keep using the data lake but ensure transactional consistency during writes. What should they do?

  A. Wrap all Parquet writes in a single job so that only one writer runs at a time
  B. Convert the Parquet data to Delta tables to leverage ACID transactions via the transaction log (Correct answer)
  C. Increase the file size of Parquet files so that each partition has fewer files
  D. Create a Gold layer on top of the existing Parquet files without changing the format

Correct answer: B

Explanation: Plain Parquet files do not provide ACID guarantees, so concurrent writers can leave data in a partially updated state. Converting the Parquet data to Delta tables adds a transaction log and ACID transactions, ensuring that each write is atomic, consistent, isolated, and durable while still using the data lake storage.
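
Option B can be done in place; a minimal sketch, assuming the Parquet data lives at an illustrative path and is partitioned by an event_date column:

```python
# Convert the existing Parquet directory to Delta in place. This writes a
# _delta_log alongside the data and enables ACID transactions. The path and
# partition column are illustrative assumptions.
spark.sql("""
  CONVERT TO DELTA parquet.`s3://datalake/facts/sales_orders`
  PARTITIONED BY (event_date DATE)
""")
```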

Sample Question 6 — Delta Lake and Data Management

A streaming pipeline writes JSON events from a web application into a Bronze Delta table. The source system recently added a new optional field, but the write job now fails because the incoming schema no longer matches the table schema. The team wants to allow the new column while still rejecting incompatible type changes. What configuration should they use on the Delta table?

  A. Disable schema enforcement so that all schema changes are accepted automatically
  B. Enable schema evolution so that new columns can be added under controlled conditions (Correct answer)
  C. Manually recreate the table with the new schema every time the source changes
  D. Partition the table by the new column so that the schema mismatch is ignored

Correct answer: B

Explanation: Enabling schema evolution on the Delta table allows Delta to automatically add new columns when the source schema evolves, while still enforcing the existing schema and rejecting incompatible type changes. This fits a Bronze streaming ingestion pattern where new optional fields appear over time.
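
A minimal sketch of option B, with Auto Loader feeding the Bronze table and the `mergeSchema` write option enabling column additions while incompatible type changes still fail; paths and table names are illustrative.

```python
# Illustrative Bronze ingestion with schema evolution enabled on the write.
events_df = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .option("cloudFiles.schemaLocation", "s3://datalake/_schemas/web_events")
             .load("s3://datalake/raw/web_events"))

(events_df.writeStream
    .format("delta")
    .option("mergeSchema", "true")   # allow new columns; type conflicts still fail
    .option("checkpointLocation", "s3://datalake/_checkpoints/bronze_web_events")
    .outputMode("append")
    .toTable("bronze.web_events"))
```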

Sample Question 7 — ELT with Spark SQL and Python

A retail company ingests daily sales transactions from multiple stores into a large Delta table `sales_bronze` stored in cloud object storage. The table is partitioned by `event_date` (the date the sale occurred). A downstream ELT job creates a curated table `sales_silver` by deduplicating records based on a business key `sale_id`, keeping only the latest record per `sale_id` using a window function over `event_time`. Recently, data quality checks show that some late-arriving records for previous days are missing from `sales_silver`. The ingestion process appends new files to `sales_bronze` every day, including late-arriving records with older `event_date` values. The current `sales_silver` job runs once per day and reads only the latest partition from `sales_bronze` using a filter on `event_date` equal to yesterday. You need to adjust the ELT logic so that late-arriving records are correctly reflected in `sales_silver` while still leveraging partition pruning on `event_date`. Which approach is the most appropriate change?

  A. Change the `sales_silver` job to read all historical partitions from `sales_bronze` every day, recompute the full deduplication, and overwrite the entire `sales_silver` table.
  B. Modify the `sales_silver` job to read only the latest partition from `sales_bronze`, but partition `sales_silver` by ingestion date instead of `event_date` so late-arriving records are not missed.
  C. Adjust the `sales_silver` job to read a configurable sliding window of recent `event_date` partitions from `sales_bronze` (for example, the last N days), then re-run the deduplication for just that window and overwrite only the affected partitions in `sales_silver`. (Correct answer)
  D. Keep reading only yesterday’s `event_date` partition from `sales_bronze`, but add an ORDER BY `event_time` clause without LIMIT in the final write to ensure the latest late-arriving records are included in `sales_silver`.

Correct answer: C

Explanation: Option C expands the read scope to a recent window of `event_date` partitions where late-arriving records are likely to appear, then re-applies the window-based deduplication and overwrites only those affected partitions in `sales_silver`. This correctly incorporates late-arriving data while still using partition pruning and avoiding an expensive full-table recomputation. Option A would include late-arriving data but is unnecessarily expensive for large tables because it recomputes and overwrites the entire dataset every day. Option B changes the target partitioning but still reads only yesterday’s `event_date` from `sales_bronze`, so late-arriving records for older dates are never processed. Option D adds a costly global sort but does not change which partitions are read, so it does not fix the missing late-arriving records.
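
A rough PySpark sketch of option C, assuming a 7-day lookback; the table and column names follow the question, while the lookback value is an assumption to be tuned to how late records actually arrive.

```python
from datetime import date, timedelta
from pyspark.sql import Window, functions as F

# Re-read only a recent window of event_date partitions (partition pruning still applies).
lookback_days = 7  # illustrative; tune to the expected lateness of records
cutoff = (date.today() - timedelta(days=lookback_days)).isoformat()

recent = spark.table("sales_bronze").where(F.col("event_date") >= F.lit(cutoff))

# Keep only the latest record per sale_id within the window.
w = Window.partitionBy("sale_id").orderBy(F.col("event_time").desc())
deduped = (recent
           .withColumn("rn", F.row_number().over(w))
           .where("rn = 1")
           .drop("rn"))

# Overwrite only the affected event_date partitions in sales_silver.
(deduped.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"event_date >= '{cutoff}'")
    .saveAsTable("sales_silver"))
```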

Sample Question 8 — ELT with Spark SQL and Python

A team is building a nightly ELT pipeline in PySpark to aggregate sales data into a gold table. The input silver table `sales_silver` is partitioned by `sale_date` and contains about 500 GB of data per day. The final gold table `sales_daily_agg` will be queried mostly by `sale_date` and `region`. They currently write the result with `result_df.write.mode("overwrite").format("parquet").save("s3://warehouse/gold/sales_daily_agg/")`. The job is slow, and the team notices that the write stage produces thousands of small output files on each run. A new engineer suggests adding `result_df = result_df.repartition(200)` before the write, arguing that repartitioning will always improve performance by reducing the number of small files. Given this scenario, how should the lead data engineer respond?

  A. Agree, because repartition always reduces the number of small files and avoids shuffles when writing large DataFrames.
  B. Disagree, because repartition introduces a shuffle; its benefit depends on the data size, partition count, and downstream access patterns. (Correct answer)
  C. Agree, but only if coalesce is used instead of repartition, since coalesce always avoids shuffles and improves performance.
  D. Disagree, because the only way to control small files is to change the storage format; partitioning and repartitioning do not affect file sizes.

Correct answer: B

Explanation: `repartition` can change the number of partitions and therefore the number of output files, but it does so via a shuffle, which is an expensive wide transformation. Whether it helps depends on the current partitioning, the data size, and how the table will be queried, so it is not always beneficial and should be used deliberately, with the shuffle cost and partitioning strategy in mind.
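
If the team does decide to reshape the output, one hedged alternative is to repartition by the columns the gold table is queried by and accept the shuffle, rather than picking an arbitrary count like 200; the path follows the question and the layout choice is an assumption.

```python
# Illustrative alternative: repartition by the query/partition columns before the
# write (this still triggers a shuffle), so each output partition gets fewer,
# larger files. coalesce(n) is a cheaper option when only reducing file counts.
result_df = result_df.repartition("sale_date", "region")

(result_df.write
    .mode("overwrite")
    .partitionBy("sale_date", "region")
    .format("parquet")
    .save("s3://warehouse/gold/sales_daily_agg/"))
```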

Sample Question 9 — Incremental Data Processing

A data engineering team is migrating a nightly batch pipeline from full refresh to incremental processing. The source is an orders table with 50 million rows and a stable updated_at column that is set on every insert and update. The target is a large fact_orders table in the warehouse. The team wants to reduce runtime and cost while ensuring no orders are missed. Which approach best uses updated_at as a high-water-mark for an incremental load?

  A. On each run, select rows from the source where updated_at > last_successful_run_timestamp and MERGE them into fact_orders. (Correct answer)
  B. On each run, select rows from the source where updated_at >= current_run_timestamp and MERGE them into fact_orders.
  C. On each run, select all rows from the source and MERGE them into fact_orders, but rely on MERGE to skip unchanged rows.
  D. On each run, select rows from the source where updated_at IS NOT NULL and append them to fact_orders without MERGE.

Correct answer: A

Explanation: Using updated_at > last_successful_run_timestamp reads only rows changed since the last successful run and applies them via MERGE, which correctly handles inserts and updates without scanning the full table. Option B compares to the current run time and will usually return no rows. Option C degenerates to a full refresh scan. Option D repeatedly reprocesses the entire table and appends duplicates while not handling updates.
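
A minimal sketch of option A in Spark SQL, assuming `order_id` is the business key and that the last successful run timestamp is tracked by the pipeline; both are assumptions made for illustration.

```python
# High-water-mark incremental load: read only rows changed since the last
# successful run, then MERGE them into the target. The timestamp literal and
# the order_id join key are illustrative.
last_successful_run_ts = "2026-05-01 02:00:00"

spark.sql(f"""
  MERGE INTO fact_orders AS t
  USING (
    SELECT * FROM orders
    WHERE updated_at > TIMESTAMP '{last_successful_run_ts}'
  ) AS s
  ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```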

Sample Question 10 — Incremental Data Processing

A small reference table country_codes has about 250 rows and changes a few times per month. It is used in joins by many downstream fact tables. The current pipeline performs a full refresh of country_codes each night in under 5 seconds. A new engineer proposes building a complex incremental pipeline using CDC and MERGE to update only changed rows. The team wants a simple, reliable solution. What should they do?

  A. Keep the nightly full refresh of country_codes, because the table is small and fast to reload. (Correct answer)
  B. Implement an incremental CDC-based MERGE pipeline to avoid ever re-reading the full table.
  C. Switch to a weekly full refresh schedule and drop the nightly load entirely.
  D. Stop maintaining country_codes in the warehouse and query it directly from the source system.

Correct answer: A

Explanation: For a tiny, slowly changing reference table, a nightly full refresh that completes in seconds is simpler and safer than adding CDC and MERGE complexity. Option B adds unnecessary operational overhead. Option C risks stale data for downstream consumers. Option D breaks the central warehouse model and can cause consistency and access issues.
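
For completeness, option A can stay as simple as a single statement per night; a sketch with illustrative source and target names:

```python
# Nightly full refresh of the small reference table: atomically replace its
# contents from the source extract. Catalog and schema names are illustrative.
spark.sql("""
  CREATE OR REPLACE TABLE warehouse.ref.country_codes AS
  SELECT * FROM staging.ref.country_codes_extract
""")
```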

The 6 DB-DEA Exam Domains

Start the free DB-DEA quick practice test now | All DB-DEA domains | DB-DEA Cheat Sheet