Free DB-DEA Incremental Data Processing Practice Test 2026 — Databricks Data Engineer Associate Questions
Last updated: May 2026 · Aligned with the current Databricks DB-DEA exam · 17% of the exam
This free DB-DEA Incremental Data Processing practice test covers Structured Streaming, Auto Loader, CDC, and Delta Live Tables for incremental and streaming workloads. Each question includes a detailed explanation with real-world Databricks lakehouse context — perfect for DB-DEA exam prep.
Key Topics in DB-DEA Incremental Data Processing
- Structured Streaming
- Auto Loader
- Delta Live Tables
- CDC Patterns
- Watermarks & Triggers
- Medallion Architecture
10 Free DB-DEA Incremental Data Processing Practice Questions with Answers
Each question below includes 4 answer options, the correct answer, and a detailed explanation. These are real questions from the FlashGenius DB-DEA question bank for the Incremental Data Processing domain (17% of the exam).
Sample Question 1 — Incremental Data Processing
A data engineering team is migrating a nightly batch pipeline from full refresh to incremental processing. The source is an orders table with 50 million rows and a stable updated_at column that is set on every insert and update. The target is a large fact_orders table in the warehouse. The team wants to reduce runtime and cost while ensuring no orders are missed. Which approach best uses updated_at as a high-water-mark for an incremental load?
- A. On each run, select rows from the source where updated_at > last_successful_run_timestamp and MERGE them into fact_orders. (Correct answer)
- B. On each run, select rows from the source where updated_at >= current_run_timestamp and MERGE them into fact_orders.
- C. On each run, select all rows from the source and MERGE them into fact_orders, but rely on MERGE to skip unchanged rows.
- D. On each run, select rows from the source where updated_at IS NOT NULL and append them to fact_orders without MERGE.
Correct answer: A
Explanation: Using updated_at > last_successful_run_timestamp reads only rows changed since the last successful run and applies them via MERGE, which correctly handles inserts and updates without scanning the full table. Option B compares against the current run's timestamp, so it skips everything changed between the last run and now and will usually return no rows at all. Option C degenerates to a full refresh scan. Option D repeatedly reprocesses the entire table and appends duplicates while not handling updates.
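A minimal sketch of option A's pattern in a Databricks notebook (where `spark` is predefined), assuming Delta tables throughout; the `etl_watermarks` state table and the `order_id` merge key are hypothetical illustrations:

```python
# Read the high-water-mark saved by the last successful run.
# etl_watermarks and its schema are hypothetical.
last_ts = spark.sql(
    "SELECT last_successful_run_timestamp FROM etl_watermarks "
    "WHERE pipeline = 'fact_orders'"
).collect()[0][0]

# Pull only rows changed since then and upsert them into the fact table.
spark.sql(f"""
    MERGE INTO fact_orders AS t
    USING (
        SELECT * FROM orders
        WHERE updated_at > TIMESTAMP '{last_ts}'
    ) AS s
    ON t.order_id = s.order_id              -- hypothetical business key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Advance the watermark only after the MERGE commits, so a failed run
# simply reprocesses the same window on retry.
new_max = spark.sql(
    f"SELECT MAX(updated_at) FROM orders WHERE updated_at > TIMESTAMP '{last_ts}'"
).collect()[0][0]
if new_max is not None:
    spark.sql(
        "UPDATE etl_watermarks "
        f"SET last_successful_run_timestamp = TIMESTAMP '{new_max}' "
        "WHERE pipeline = 'fact_orders'"
    )
```

Note that the watermark is taken from the data itself (MAX(updated_at)), not from the wall clock, which avoids skipping rows committed while the job was running.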
Sample Question 2 — Incremental Data Processing
A small reference table country_codes has about 250 rows and changes a few times per month. It is used in joins by many downstream fact tables. The current pipeline performs a full refresh of country_codes each night in under 5 seconds. A new engineer proposes building a complex incremental pipeline using CDC and MERGE to update only changed rows. The team wants a simple, reliable solution. What should they do?
- A. Keep the nightly full refresh of country_codes, because the table is small and fast to reload. (Correct answer)
- B. Implement an incremental CDC-based MERGE pipeline to avoid ever re-reading the full table.
- C. Switch to a weekly full refresh schedule and drop the nightly load entirely.
- D. Stop maintaining country_codes in the warehouse and query it directly from the source system.
Correct answer: A
Explanation: For a tiny, slowly changing reference table, a nightly full refresh that completes in seconds is simpler and safer than adding CDC and MERGE complexity. Option B adds unnecessary operational overhead. Option C risks stale data for downstream consumers. Option D breaks the central warehouse model and can cause consistency and access issues.
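For contrast, the entire "pipeline" option A defends can be a single statement. A sketch assuming `spark` is predefined; source and column names are hypothetical:

```python
# CREATE OR REPLACE swaps the Delta table atomically, so downstream joins
# never observe a half-loaded country_codes.
spark.sql("""
    CREATE OR REPLACE TABLE country_codes AS
    SELECT country_code, country_name   -- hypothetical columns
    FROM source_country_codes           -- hypothetical extracted source
""")
```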
Sample Question 3 — Incremental Data Processing
A streaming pipeline ingests click events into a bronze table with columns: event_id, user_id, event_timestamp (business time), and ingestion_timestamp (time the event reached the platform). A downstream daily incremental job builds a silver table by reading only events where ingestion_timestamp is from "yesterday". Analysts report that some clicks appear several days late in the silver table compared to when users actually clicked. The team wants the silver table to reflect the correct business day of the click, even if events arrive late. What is the best change to the incremental logic?
- A. Continue filtering by ingestion_timestamp, but widen the window to the last 7 days on each run.
- B. Filter by event_timestamp for the target business date range and allow reprocessing of a small historical window to capture late events. (Correct answer)
- C. Use the maximum ingestion_timestamp from the previous run as a high-water-mark and filter events with ingestion_timestamp > that value.
- D. Replace ingestion_timestamp with event_id as the watermark and process only events with event_id greater than the last processed value.
Correct answer: B
Explanation: Filtering by event_timestamp aligns the silver table with the business date of the click. Reprocessing a bounded historical window (e.g., last few days of event_timestamp) captures late-arriving events while keeping processing efficient. Option A still attributes events by arrival time. Option C continues to rely solely on ingestion order and can miss out-of-order events. Option D uses event_id, which is not guaranteed to be time-ordered.
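A sketch of option B, assuming `spark` is predefined, both tables are Delta, and a 3-day lateness allowance (size it to how late clicks actually arrive); `bronze_clicks` and `silver_clicks` are hypothetical names:

```python
from datetime import date, timedelta

processing_date = date.today()
window_start = processing_date - timedelta(days=3)  # bounded reprocess window

# Attribute rows by business time (event_timestamp) and upsert on event_id,
# so re-running the same window never duplicates clicks.
spark.sql(f"""
    MERGE INTO silver_clicks AS t
    USING (
        SELECT * FROM bronze_clicks
        WHERE CAST(event_timestamp AS DATE)
              BETWEEN DATE '{window_start}' AND DATE '{processing_date}'
    ) AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```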
Sample Question 4 — Incremental Data Processing
You maintain a dimension table dim_customer as SCD Type 2 in your warehouse. A CDC feed from the source system provides records with operation_type (I, U, D), customer_id, attributes, and updated_at. The business requires preserving history for attribute changes and reflecting hard deletes from the source as of the delete time. Which incremental strategy best satisfies these requirements?
- A. For I and U operations, MERGE into dim_customer updating existing rows in place; ignore D operations to preserve history.
- B. For I and U operations, insert new rows with new effective_start_date and expire previous current rows; for D operations, set an expiry date on the current row. (Correct answer)
- C. For I operations, insert new rows; for U operations, update existing rows in place; for D operations, physically delete rows from dim_customer.
- D. Treat all CDC records as inserts into dim_customer and rely on downstream reports to deduplicate by customer_id and latest updated_at.
Correct answer: B
Explanation: SCD Type 2 requires creating new versions for changes and expiring the previous current row. Option B does this for inserts and updates and models deletes by expiring the current row at the delete time, preserving history while reflecting that the customer was deleted. Option A overwrites history and ignores deletes. Option C loses history and removes deleted customers entirely. Option D leaves ambiguous multiple versions and pushes complex logic downstream.
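On Databricks, Delta Live Tables can manage this pattern directly: APPLY CHANGES with stored_as_scd_type=2 creates and expires row versions for you. A sketch that runs only inside a DLT pipeline; `cdc_customers` is a hypothetical view over the CDC feed:

```python
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("dim_customer")

dlt.apply_changes(
    target="dim_customer",
    source="cdc_customers",
    keys=["customer_id"],
    sequence_by=col("updated_at"),                  # orders I/U/D per key
    apply_as_deletes=expr("operation_type = 'D'"),  # expires the current row
    except_column_list=["operation_type"],
    stored_as_scd_type=2,                           # keep full history
)
```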
Sample Question 5 — Incremental Data Processing
An incremental batch job loads new rows into a large fact_sales table every hour. It uses the following logic:
1. Read source rows where updated_at > last_max_updated_at.
2. Append these rows to fact_sales.
3. Update last_max_updated_at to the maximum updated_at from the current batch.
After a clock skew incident on the source database, some rows have updated_at values slightly earlier than rows processed in a previous batch. Analysts notice missing sales in fact_sales. What is the main flaw in this incremental strategy?
- A. Using updated_at as a watermark instead of ingestion_timestamp causes duplicates.
- B. Using a strict '>' comparison on updated_at can miss records when timestamps are not perfectly ordered. (Correct answer)
- C. Appending rows instead of using MERGE prevents partition pruning on fact_sales.
- D. Updating last_max_updated_at after each batch makes the pipeline non-idempotent.
Correct answer: B
Explanation: The logic assumes updated_at is strictly increasing. With clock skew or out-of-order updates, some rows have updated_at less than or equal to the previous maximum and are skipped by the updated_at > last_max_updated_at filter. Option A misattributes the issue to the choice of column. Option C is unrelated to the missing rows. Option D is not inherently a problem; the flaw is the strict comparison with a non-monotonic watermark.
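One common fix is to pair a small overlap window with MERGE instead of append, so tolerating skew cannot introduce duplicates. A sketch assuming `spark` is predefined; the 15-minute allowance, the `source_sales` name, and the `sale_id` key are assumptions, sized to the worst clock skew you expect:

```python
from datetime import datetime, timedelta

# From pipeline state (step 3 in the question); placeholder value here.
last_max_updated_at = datetime(2026, 5, 1, 2, 0, 0)
safe_start = last_max_updated_at - timedelta(minutes=15)  # tolerate skew

spark.sql(f"""
    MERGE INTO fact_sales AS t
    USING (
        SELECT * FROM source_sales      -- hypothetical source name
        WHERE updated_at >= TIMESTAMP '{safe_start}'
    ) AS s
    ON t.sale_id = s.sale_id            -- hypothetical business key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```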
Sample Question 6 — Incremental Data Processing
You design an incremental pipeline for a large fact_events table stored in a lakehouse. The raw bronze table is partitioned by ingestion_date, while the business wants analytics by event_date (from event_timestamp). Events can arrive up to 3 days late. The silver layer should:
- Attribute events to the correct event_date.
- Avoid scanning the entire bronze table on each run.
- Be safe to re-run for the same processing date without creating duplicates.
Which approach best meets these requirements?
- A. Each day, read bronze partitions where ingestion_date = processing_date, derive event_date, and append to silver. Do not reprocess older ingestion_date partitions.
- B. Each day, read bronze partitions where ingestion_date is in [processing_date - 3, processing_date], derive event_date, upsert into silver for those event_dates using MERGE on event_id. (Correct answer)
- C. Each day, read all bronze partitions, derive event_date, and fully rebuild silver for all dates to ensure correctness.
- D. Each day, read bronze partitions where event_date = processing_date, derive event_date, and append to silver. Use ingestion_date only for partitioning silver.
Correct answer: B
Explanation: Reading only the last 3 days of ingestion_date partitions bounds the scan while covering the known lateness window. Deriving event_date and MERGE-ing on event_id into silver ensures correct event_date attribution, handles late arrivals, and allows safe re-runs without duplicates. Option A misses late events. Option C is effectively a full refresh and too expensive. Option D cannot effectively prune ingestion_date partitions and does not address idempotent upserts.
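A sketch of option B with the Delta Lake Python API, assuming `spark` is predefined and both tables are Delta; the 3-day window comes from the question, while the table names are hypothetical:

```python
from datetime import date, timedelta

from delta.tables import DeltaTable
from pyspark.sql import functions as F

processing_date = date.today()
window_start = processing_date - timedelta(days=3)

# The ingestion_date filter prunes bronze down to four daily partitions.
late_events = (
    spark.table("bronze_events")        # hypothetical bronze table name
    .where(F.col("ingestion_date").between(window_start, processing_date))
    .withColumn("event_date", F.to_date("event_timestamp"))
)

# MERGE on event_id makes re-runs of the same processing_date idempotent.
(
    DeltaTable.forName(spark, "silver_events")   # hypothetical silver name
    .alias("t")
    .merge(late_events.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```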
Sample Question 7 — Incremental Data Processing
A warehouse table fact_orders_incremental is built hourly from a CDC stream. The CDC contains operation_type (I, U, D), order_id, and other attributes. The pipeline logic is:
1. Read CDC events for the last hour.
2. For each batch, write all events as new rows into a staging table.
3. Insert all rows from staging into fact_orders_incremental (append-only).
Analysts report that order-level metrics are inflated because some orders appear multiple times, and cancelled orders (D) are still counted. The team wants a correct, incremental, and rerun-safe design. What is the best next step?
- A. Change the pipeline to MERGE CDC events into fact_orders_incremental on order_id, applying I, U, and D semantics instead of appending. (Correct answer)
- B. Continue appending all CDC events, but filter out D operations and rely only on I and U for reporting.
- C. Aggregate the staging table by order_id before inserting into fact_orders_incremental to reduce duplicates.
- D. Switch from CDC to a periodic full extract of the source orders table and truncate/reload fact_orders_incremental each hour.
Correct answer: A
Explanation: A CDC-based pipeline should apply inserts, updates, and deletes to maintain a correct current state. Using MERGE on order_id with operation_type allows the table to reflect one current row per order and to remove cancelled orders, while remaining incremental and rerun-safe. Option B ignores deletes. Option C does not correctly interpret operation types. Option D abandons CDC benefits and is costly for large tables.
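A sketch of option A, assuming `spark` is predefined and the CDC feed carries an ordering column (assumed here to be `updated_at`): first collapse the batch to the latest event per order, then apply I/U/D semantics in one MERGE. `cdc_events_last_hour` is a hypothetical name:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the newest CDC event per order so MERGE sees one row per key.
latest = (
    spark.table("cdc_events_last_hour")
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
        ),
    )
    .where("rn = 1")
    .drop("rn")
)

(
    DeltaTable.forName(spark, "fact_orders_incremental")
    .alias("t")
    .merge(latest.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.operation_type = 'D'")         # cancelled
    .whenMatchedUpdateAll(condition="s.operation_type <> 'D'")     # updates
    .whenNotMatchedInsertAll(condition="s.operation_type <> 'D'")  # new orders
    .execute()
)
```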
Sample Question 8 — Incremental Data Processing
A daily incremental job populates a curated gold table sales_summary_by_store_date from a silver fact_sales table. The job processes only fact_sales rows where sale_date = processing_date and writes aggregated results for that date. Sometimes, due to upstream delays, additional fact_sales rows for a given sale_date arrive one or two days later. The business requires that sales_summary_by_store_date always reflect the final, complete totals per store and date. What is the most appropriate adjustment to the incremental strategy?
- A. Continue processing only sale_date = processing_date, but add a separate weekly reconciliation job that compares summaries to the source and adjusts differences.
- B. On each run, reprocess and overwrite sales_summary_by_store_date for a rolling window of the last N days (e.g., 3 days) based on sale_date. (Correct answer)
- C. Switch to a full refresh of sales_summary_by_store_date every night by aggregating all history from fact_sales.
- D. Filter fact_sales by ingestion_date instead of sale_date to ensure all late-arriving rows are included in the next run.
Correct answer: B
Explanation: Reprocessing and overwriting a rolling window of recent sale_dates allows late-arriving rows to be incorporated while limiting the amount of data processed. Over time, each date's summary converges to the final, complete total. Option A adds complexity and leaves summaries temporarily incorrect. Option C is more expensive than necessary. Option D misattributes sales to the ingestion date instead of the correct business sale_date.
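A sketch of option B, assuming `spark` is predefined, the gold table is Delta partitioned by sale_date, and a hypothetical `amount` measure column; Delta's replaceWhere overwrites only the rolling window atomically:

```python
from datetime import date, timedelta

from pyspark.sql import functions as F

processing_date = date.today()
window_start = processing_date - timedelta(days=3)  # rolling window (N = 3)

# Re-aggregate only the recent sale_dates that can still receive late rows.
summary = (
    spark.table("fact_sales")
    .where(F.col("sale_date").between(window_start, processing_date))
    .groupBy("store_id", "sale_date")
    .agg(F.sum("amount").alias("total_sales"))  # hypothetical measure column
)

# replaceWhere swaps just this date range, leaving older summaries untouched;
# each re-run converges the window's dates to their final totals.
(
    summary.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            f"sale_date BETWEEN '{window_start}' AND '{processing_date}'")
    .saveAsTable("sales_summary_by_store_date")
)
```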
Sample Question 9 — Incremental Data Processing
You manage a multi-layer architecture (bronze, silver, gold) for a large e-commerce dataset. The bronze orders_raw table is partitioned by ingestion_date. The silver orders_clean table is partitioned by order_date and clustered by customer_id. A new requirement is to support efficient incremental backfills for specific historical date ranges (e.g., reprocess orders for a given week last year) without scanning unrelated data. Which design choice best supports both regular daily incremental loads and targeted backfills?
- A. Keep bronze partitioned by ingestion_date and silver partitioned by order_date; design the incremental jobs to read only the necessary ingestion_date partitions for the target order_date range. (Correct answer)
- B. Repartition bronze by customer_id and silver by ingestion_date so that both layers share the same partitioning scheme, simplifying incremental logic.
- C. Partition both bronze and silver by a composite key (order_date, customer_id) to maximize partition pruning for all queries and backfills.
- D. Stop partitioning silver and rely solely on clustering by customer_id to support both incremental loads and backfills.
Correct answer: A
Explanation: Keeping bronze partitioned by ingestion_date supports efficient ingestion, while partitioning silver by order_date aligns with typical business filters and date-based backfills. Incremental jobs can map the relevant ingestion_date partitions to the target order_date range, enabling efficient daily loads and targeted backfills. Option B misaligns partitioning with date-based requirements. Option C over-partitions and adds operational overhead. Option D forces large scans for both daily loads and backfills.
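A sketch of the backfill read in option A, assuming `spark` is predefined and that raw rows carry (or can derive) order_date; the 3-day pad is an assumption about the maximum lag between order_date and ingestion_date:

```python
from datetime import date, timedelta

from pyspark.sql import functions as F

# Rebuild one historical week of order_dates.
backfill_start, backfill_end = date(2025, 3, 10), date(2025, 3, 16)
pad = timedelta(days=3)  # worst-case order_date -> ingestion_date lag

backfill = (
    spark.table("orders_raw")
    # Partition pruning: scan only the ingestion_date range that can contain
    # the target orders, not the whole bronze table.
    .where(F.col("ingestion_date").between(backfill_start, backfill_end + pad))
    # Then keep exactly the business dates being rebuilt.
    .where(F.col("order_date").between(backfill_start, backfill_end))
)
# `backfill` can now be merged into orders_clean's order_date partitions.
```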
Sample Question 10 — Incremental Data Processing
An incremental pipeline loads data into a silver table user_activity from a bronze table. The job runs hourly and is designed to be idempotent. The logic is:
1. Read bronze rows where ingestion_timestamp between last_run_start and current_run_start.
2. Upsert into user_activity using MERGE on activity_id.
When a run fails halfway through, the orchestrator retries the same hour with the same last_run_start and current_run_start. After the retry, some activities appear twice in downstream aggregates. What is the most likely cause?
- A. The ingestion_timestamp window is too wide and overlaps with the next hour's run, causing duplicates.
- B. The MERGE condition on activity_id is incorrect or missing, causing inserts instead of updates on retry. (Correct answer)
- C. Using ingestion_timestamp instead of event_timestamp makes the pipeline non-idempotent by design.
- D. The pipeline should use append-only writes instead of MERGE to avoid duplicates on retries.
Correct answer: B
Explanation: For an idempotent design, reprocessing the same ingestion window should update existing rows, not insert duplicates. If the MERGE condition does not correctly match on activity_id, retries will insert new rows instead of updating, causing duplicates. Option A describes overlapping windows, which is not indicated here. Option C confuses time semantics with idempotency. Option D would exacerbate duplication on retries.
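A sketch of the corrected MERGE, assuming `spark` is predefined; the explicit `ON t.activity_id = s.activity_id` condition is what makes a retried window update rows instead of re-inserting them, and the half-open window (>, <=) keeps adjacent hourly runs from overlapping at the boundary:

```python
from datetime import datetime

# From the orchestrator's state; placeholder values for the sketch.
last_run_start = datetime(2026, 5, 1, 9, 0)
current_run_start = datetime(2026, 5, 1, 10, 0)

spark.sql(f"""
    MERGE INTO user_activity AS t
    USING (
        SELECT * FROM bronze_activity   -- hypothetical bronze table name
        WHERE ingestion_timestamp >  TIMESTAMP '{last_run_start}'
          AND ingestion_timestamp <= TIMESTAMP '{current_run_start}'
    ) AS s
    ON t.activity_id = s.activity_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```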
How to Study DB-DEA Incremental Data Processing
Combine these DB-DEA Incremental Data Processing practice questions with the official Databricks Academy materials and hands-on practice in a Databricks Community Edition workspace. The DB-DEA exam emphasizes applied knowledge of PySpark, Spark SQL, and Delta Lake, so always relate concepts back to real notebooks and jobs you've built.
About the Databricks DB-DEA Exam
- Questions: 45 multiple choice
- Duration: 90 minutes
- Passing score: 70%
- Cost: $200 USD
- Domains: 6 (Incremental Data Processing is 17% of the exam)
- Validity: 2 years
Other DB-DEA Domains
Start the free DB-DEA Incremental Data Processing practice test now | 10-question quick start | All DB-DEA domains | DB-DEA Cheat Sheet