Free DB-DEA Quick Start Practice Test — 10 Questions Across All 6 Databricks Domains

Last updated: May 2026 · Aligned with the current Databricks Data Engineer Associate exam

This free DB-DEA quick start practice test draws 10 questions from across all 6 Databricks Certified Data Engineer Associate domains. Use it for a fast readiness check before diving into per-domain study.

10 Free DB-DEA Mixed-Domain Practice Questions

Sample Question 1 — Data Governance and Security

A retail company is building a centralized reporting database that consolidates customer orders, including names, email addresses, and partial payment card details. The security architect has already enabled encryption at rest for the database storage and TLS for all connections. Compliance asks what additional control is most important to reduce the risk of inappropriate access to this sensitive data by internal analysts. What should the architect recommend next?

  A. Rely on the existing encryption controls and focus only on performance tuning for the reporting queries
  B. Implement role-based access control so analysts only see the minimum data necessary for their job functions (Correct answer)
  C. Disable audit logging to avoid storing potentially sensitive metadata about user access patterns
  D. Create a single shared database account for all analysts to simplify credential management

Correct answer: B

Explanation: Encryption at rest and in transit protects data from certain threats but does not prevent over-privileged internal access. Implementing role-based access control aligned to job functions enforces least privilege so analysts only see the minimum data needed, directly reducing the risk of inappropriate internal access.
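
To see what option B looks like in practice, here is a minimal sketch of least-privilege grants using Unity Catalog SQL from a notebook; the catalog, schema, table, and group names are hypothetical.

```python
# Minimal sketch of least-privilege access: grant an analyst group only the
# objects it needs. All object and group names below are illustrative.
spark.sql("GRANT USE CATALOG ON CATALOG reporting TO `order_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA reporting.sales TO `order_analysts`")
spark.sql("GRANT SELECT ON TABLE reporting.sales.orders_summary TO `order_analysts`")
```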

Sample Question 2 — Data Governance and Security

A healthcare analytics team wants to share a dataset with an external research partner. The dataset includes patient demographics, diagnosis codes, and a persistent patient ID that can be linked back to the hospital’s systems. The partner needs to perform longitudinal analysis over time but must not be able to identify individual patients. Which approach best balances analytical usefulness with privacy requirements?

  A. Provide the full dataset as-is since the partner is under contract and uses secure networks
  B. Replace patient IDs with randomly generated tokens that only the hospital can map back to real identities (Correct answer)
  C. Mask all diagnosis codes with generic placeholders such as 'Condition A', 'Condition B', etc.
  D. Remove only names and addresses and keep the original patient IDs for easier analysis

Correct answer: B

Explanation: Replacing patient IDs with randomly generated tokens that only the hospital can reverse is pseudonymization. It preserves the ability to track patients over time for longitudinal analysis while preventing the external partner from directly identifying individuals, achieving a good balance between utility and privacy.
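
As a rough illustration of option B, the sketch below replaces patient IDs with random tokens and keeps the mapping table only on the hospital side; all table and column names are assumptions made for the example.

```python
from pyspark.sql import functions as F

# Source events keyed by the real patient ID (illustrative table/column names).
events = spark.table("hospital.raw.patient_events")

# Token map kept only inside the hospital: real ID -> random token.
token_map = (events.select("patient_id").distinct()
             .withColumn("patient_token", F.expr("uuid()")))
token_map.write.mode("overwrite").saveAsTable("hospital.private.patient_token_map")

# Dataset shared with the research partner: tokens instead of identifiers.
shared = (events.join(token_map, "patient_id")
          .drop("patient_id", "patient_name", "address"))
shared.write.mode("overwrite").saveAsTable("hospital.shared.research_events")
```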

Sample Question 3 — Databricks Lakehouse Platform

A data engineering team has a notebook that loads raw files into Delta tables every night at 2:00 AM. No users need to interact with the cluster during the run, and the team wants to avoid paying for compute when the pipeline is idle. What is the most appropriate way to run this workload?

  A. Attach the notebook to an all-purpose cluster that remains available for developers
  B. Create a Databricks Job that runs the notebook on job compute (Correct answer)
  C. Run the notebook from a SQL warehouse because it starts quickly for scheduled tasks
  D. Store the notebook in Databricks Repos so it can execute automatically each night

Correct answer: B

Explanation: Job compute is the best fit for scheduled production workloads because the cluster is created for the job run and terminated automatically when the run finishes. That matches the team's need for an unattended nightly pipeline with no idle compute cost.
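
As a sketch of option B, a job definition along these lines (shown as the payload you might send to the Jobs API, expressed as a Python dict) runs the notebook nightly on job compute that exists only for the duration of the run; the notebook path, node type, and runtime version are illustrative assumptions.

```python
# Illustrative job definition: nightly run at 2:00 AM on job compute that is
# created for the run and terminated afterward. Field values are assumptions.
job_spec = {
    "name": "nightly_bronze_ingest",
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    "tasks": [
        {
            "task_key": "ingest_raw_files",
            "notebook_task": {"notebook_path": "/Repos/data-eng/pipelines/ingest_raw"},
            "new_cluster": {  # job compute rather than an all-purpose cluster
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}
```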

Sample Question 4 — Databricks Lakehouse Platform

A notebook connects to an external database using a username and password currently written directly in the code. During a security review, the team is told to remove credentials from notebooks while keeping the pipeline functional. What should the team do next?

  A. Move the credentials into a separate notebook and import that notebook into the pipeline
  B. Store the credentials in a Databricks secret and reference the secret from the notebook (Correct answer)
  C. Save the credentials in a workspace folder with restricted permissions
  D. Attach the notebook to a SQL warehouse so the credentials are no longer visible in code

Correct answer: B

Explanation: Databricks secret management is the appropriate place to store sensitive credentials. The notebook can reference the secret at runtime without embedding the username and password directly in code.
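
A minimal sketch of option B, assuming a secret scope named jdbc-creds already exists; the scope, key, and connection values are illustrative.

```python
# Fetch credentials from a Databricks secret scope at runtime; nothing sensitive
# is hard-coded in the notebook. Scope, keys, and connection details are illustrative.
user = dbutils.secrets.get(scope="jdbc-creds", key="db-username")
password = dbutils.secrets.get(scope="jdbc-creds", key="db-password")

orders_df = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db.example.com:5432/sales")
             .option("dbtable", "public.orders")
             .option("user", user)
             .option("password", password)
             .load())
```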

Sample Question 5 — Delta Lake and Data Management

A data engineering team currently stores fact tables as plain Parquet files in cloud object storage. Analysts complain that concurrent ETL jobs sometimes leave the data in an inconsistent state, causing reports to show partial updates. The team wants to keep using the data lake but ensure transactional consistency during writes. What should they do?

  A. Wrap all Parquet writes in a single job so that only one writer runs at a time
  B. Convert the Parquet data to Delta tables to leverage ACID transactions via the transaction log (Correct answer)
  C. Increase the file size of Parquet files so that each partition has fewer files
  D. Create a Gold layer on top of the existing Parquet files without changing the format

Correct answer: B

Explanation: Plain Parquet files do not provide ACID guarantees, so concurrent writers can leave data in a partially updated state. Converting the Parquet data to Delta tables adds a transaction log and ACID transactions, ensuring that each write is atomic, consistent, isolated, and durable while still using the data lake storage.
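
Option B can be done in place; a minimal sketch, assuming the Parquet data lives at an illustrative path and is partitioned by an event_date column:

```python
# Convert the existing Parquet directory to Delta in place. This writes a
# _delta_log alongside the data and enables ACID transactions. The path and
# partition column are illustrative assumptions.
spark.sql("""
  CONVERT TO DELTA parquet.`s3://datalake/facts/sales_orders`
  PARTITIONED BY (event_date DATE)
""")
```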

Sample Question 6 — Delta Lake and Data Management

A streaming pipeline writes JSON events from a web application into a Bronze Delta table. The source system recently added a new optional field, but the write job now fails because the incoming schema no longer matches the table schema. The team wants to allow the new column while still rejecting incompatible type changes. What configuration should they use on the Delta table?

  A. Disable schema enforcement so that all schema changes are accepted automatically
  B. Enable schema evolution so that new columns can be added under controlled conditions (Correct answer)
  C. Manually recreate the table with the new schema every time the source changes
  D. Partition the table by the new column so that the schema mismatch is ignored

Correct answer: B

Explanation: Enabling schema evolution on the Delta table allows Delta to automatically add new columns when the source schema evolves, while still enforcing the existing schema and rejecting incompatible type changes. This fits a Bronze streaming ingestion pattern where new optional fields appear over time.
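
A minimal sketch of option B, with Auto Loader feeding the Bronze table and the `mergeSchema` write option enabling column additions while incompatible type changes still fail; paths and table names are illustrative.

```python
# Illustrative Bronze ingestion with schema evolution enabled on the write.
events_df = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .option("cloudFiles.schemaLocation", "s3://datalake/_schemas/web_events")
             .load("s3://datalake/raw/web_events"))

(events_df.writeStream
    .format("delta")
    .option("mergeSchema", "true")   # allow new columns; type conflicts still fail
    .option("checkpointLocation", "s3://datalake/_checkpoints/bronze_web_events")
    .outputMode("append")
    .toTable("bronze.web_events"))
```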

Sample Question 7 — ELT with Spark SQL and Python

A retail company ingests daily sales transactions from multiple stores into a large Delta table `sales_bronze` stored in cloud object storage. The table is partitioned by `event_date` (the date the sale occurred). A downstream ELT job creates a curated table `sales_silver` by deduplicating records based on a business key `sale_id`, keeping only the latest record per `sale_id` using a window function over `event_time`. Recently, data quality checks show that some late-arriving records for previous days are missing from `sales_silver`. The ingestion process appends new files to `sales_bronze` every day, including late-arriving records with older `event_date` values. The current `sales_silver` job runs once per day and reads only the latest partition from `sales_bronze` using a filter on `event_date` equal to yesterday. You need to adjust the ELT logic so that late-arriving records are correctly reflected in `sales_silver` while still leveraging partition pruning on `event_date`. Which approach is the most appropriate change?

  A. Change the `sales_silver` job to read all historical partitions from `sales_bronze` every day, recompute the full deduplication, and overwrite the entire `sales_silver` table.
  B. Modify the `sales_silver` job to read only the latest partition from `sales_bronze`, but partition `sales_silver` by ingestion date instead of `event_date` so late-arriving records are not missed.
  C. Adjust the `sales_silver` job to read a configurable sliding window of recent `event_date` partitions from `sales_bronze` (for example, the last N days), then re-run the deduplication for just that window and overwrite only the affected partitions in `sales_silver`. (Correct answer)
  D. Keep reading only yesterday’s `event_date` partition from `sales_bronze`, but add an ORDER BY `event_time` clause without LIMIT in the final write to ensure the latest late-arriving records are included in `sales_silver`.

Correct answer: C

Explanation: Option C expands the read scope to a recent window of `event_date` partitions where late-arriving records are likely to appear, then re-applies the window-based deduplication and overwrites only those affected partitions in `sales_silver`. This correctly incorporates late-arriving data while still using partition pruning and avoiding an expensive full-table recomputation. Option A would include late-arriving data but is unnecessarily expensive for large tables because it recomputes and overwrites the entire dataset every day. Option B changes the target partitioning but still reads only yesterday’s `event_date` from `sales_bronze`, so late-arriving records for older dates are never processed. Option D adds a costly global sort but does not change which partitions are read, so it does not fix the missing late-arriving records.
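
A rough PySpark sketch of option C, assuming a 7-day lookback; the table and column names follow the question, while the lookback value is an assumption to be tuned to how late records actually arrive.

```python
from datetime import date, timedelta
from pyspark.sql import Window, functions as F

# Re-read only a recent window of event_date partitions (partition pruning still applies).
lookback_days = 7  # illustrative; tune to the expected lateness of records
cutoff = (date.today() - timedelta(days=lookback_days)).isoformat()

recent = spark.table("sales_bronze").where(F.col("event_date") >= F.lit(cutoff))

# Keep only the latest record per sale_id within the window.
w = Window.partitionBy("sale_id").orderBy(F.col("event_time").desc())
deduped = (recent
           .withColumn("rn", F.row_number().over(w))
           .where("rn = 1")
           .drop("rn"))

# Overwrite only the affected event_date partitions in sales_silver.
(deduped.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"event_date >= '{cutoff}'")
    .saveAsTable("sales_silver"))
```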

Sample Question 8 — ELT with Spark SQL and Python

A team is building a nightly ELT pipeline in PySpark to aggregate sales data into a gold table. The input silver table `sales_silver` is partitioned by `sale_date` and contains about 500 GB of data per day. The final gold table `sales_daily_agg` will be queried mostly by `sale_date` and `region`. They currently write the result with `result_df.write.mode("overwrite").format("parquet").save("s3://warehouse/gold/sales_daily_agg/")`. The job is slow, and the team notices that the write stage produces thousands of small output files on each run. A new engineer suggests adding `result_df = result_df.repartition(200)` before the write, arguing that repartitioning will always improve performance by reducing the number of small files. Given this scenario, how should the lead data engineer respond?

  A. Agree, because repartition always reduces the number of small files and avoids shuffles when writing large DataFrames.
  B. Disagree, because repartition introduces a shuffle; its benefit depends on the data size, partition count, and downstream access patterns. (Correct answer)
  C. Agree, but only if coalesce is used instead of repartition, since coalesce always avoids shuffles and improves performance.
  D. Disagree, because the only way to control small files is to change the storage format; partitioning and repartitioning do not affect file sizes.

Correct answer: B

Explanation: `repartition` can change the number of partitions and therefore the number of output files, but it does so via a shuffle, which is an expensive wide transformation. Whether it helps depends on the current partitioning, the data size, and how the table will be queried, so it is not always beneficial and should be used deliberately, with the shuffle cost and partitioning strategy in mind.
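
If the team does decide to reshape the output, one hedged alternative is to repartition by the columns the gold table is queried by and accept the shuffle, rather than picking an arbitrary count like 200; the path follows the question and the layout choice is an assumption.

```python
# Illustrative alternative: repartition by the query/partition columns before the
# write (this still triggers a shuffle), so each output partition gets fewer,
# larger files. coalesce(n) is a cheaper option when only reducing file counts.
result_df = result_df.repartition("sale_date", "region")

(result_df.write
    .mode("overwrite")
    .partitionBy("sale_date", "region")
    .format("parquet")
    .save("s3://warehouse/gold/sales_daily_agg/"))
```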

Sample Question 9 — Incremental Data Processing

A data engineering team is migrating a nightly batch pipeline from full refresh to incremental processing. The source is an orders table with 50 million rows and a stable updated_at column that is set on every insert and update. The target is a large fact_orders table in the warehouse. The team wants to reduce runtime and cost while ensuring no orders are missed. Which approach best uses updated_at as a high-water-mark for an incremental load?

  A. On each run, select rows from the source where updated_at > last_successful_run_timestamp and MERGE them into fact_orders. (Correct answer)
  B. On each run, select rows from the source where updated_at >= current_run_timestamp and MERGE them into fact_orders.
  C. On each run, select all rows from the source and MERGE them into fact_orders, but rely on MERGE to skip unchanged rows.
  D. On each run, select rows from the source where updated_at IS NOT NULL and append them to fact_orders without MERGE.

Correct answer: A

Explanation: Using updated_at > last_successful_run_timestamp reads only rows changed since the last successful run and applies them via MERGE, which correctly handles inserts and updates without scanning the full table. Option B compares to the current run time and will usually return no rows. Option C degenerates to a full refresh scan. Option D repeatedly reprocesses the entire table and appends duplicates while not handling updates.
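
A minimal sketch of option A in Spark SQL, assuming `order_id` is the business key and that the last successful run timestamp is tracked by the pipeline; both are assumptions made for illustration.

```python
# High-water-mark incremental load: read only rows changed since the last
# successful run, then MERGE them into the target. The timestamp literal and
# the order_id join key are illustrative.
last_successful_run_ts = "2026-05-01 02:00:00"

spark.sql(f"""
  MERGE INTO fact_orders AS t
  USING (
    SELECT * FROM orders
    WHERE updated_at > TIMESTAMP '{last_successful_run_ts}'
  ) AS s
  ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```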

Sample Question 10 — Incremental Data Processing

A small reference table country_codes has about 250 rows and changes a few times per month. It is used in joins by many downstream fact tables. The current pipeline performs a full refresh of country_codes each night in under 5 seconds. A new engineer proposes building a complex incremental pipeline using CDC and MERGE to update only changed rows. The team wants a simple, reliable solution. What should they do?

  A. Keep the nightly full refresh of country_codes, because the table is small and fast to reload. (Correct answer)
  B. Implement an incremental CDC-based MERGE pipeline to avoid ever re-reading the full table.
  C. Switch to a weekly full refresh schedule and drop the nightly load entirely.
  D. Stop maintaining country_codes in the warehouse and query it directly from the source system.

Correct answer: A

Explanation: For a tiny, slowly changing reference table, a nightly full refresh that completes in seconds is simpler and safer than adding CDC and MERGE complexity. Option B adds unnecessary operational overhead. Option C risks stale data for downstream consumers. Option D breaks the central warehouse model and can cause consistency and access issues.
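
For completeness, option A can stay as simple as a single statement per night; a sketch with illustrative source and target names:

```python
# Nightly full refresh of the small reference table: atomically replace its
# contents from the source extract. Catalog and schema names are illustrative.
spark.sql("""
  CREATE OR REPLACE TABLE warehouse.ref.country_codes AS
  SELECT * FROM staging.ref.country_codes_extract
""")
```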

The 6 DB-DEA Exam Domains

Start the free DB-DEA quick practice test now | All DB-DEA domains | DB-DEA Cheat Sheet