

Databricks Certified Data Engineer Associate Practice Questions

Master the Databricks Lakehouse Platform Domain

Test your knowledge in the Databricks Lakehouse Platform domain with these 10 practice questions. Each question is designed to help you prepare for the Databricks Certified Data Engineer Associate certification exam with detailed explanations to reinforce your learning.

Question 1

During development, a data engineer accidentally ran a notebook that overwrote a silver Delta table with incorrect data. The team needs to:

  • Quickly restore the table to its state from two hours ago.
  • Avoid manually re-running the entire upstream pipeline.

Which Delta Lake capability should they use to address this issue?

A) Time travel to query or restore the table as of a specific timestamp or version.

B) OPTIMIZE the table to compact files and remove the incorrect data.

C) VACUUM the table to delete old data files and free up storage.

D) Repartition the table on a different column to isolate the incorrect data.


Correct Answer: A

Explanation:

Delta Lake time travel allows querying and restoring a table as of a previous version or timestamp, enabling quick recovery of the correct state without re-running upstream pipelines. OPTIMIZE, VACUUM, or repartitioning change storage layout or clean up files but do not revert data to an earlier logical state.
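The mechanics can be sketched with a toy versioned table. In Databricks SQL the real command is `RESTORE TABLE my_table TO TIMESTAMP AS OF '<timestamp>'`; the class, table contents, and timestamps below are hypothetical stand-ins so the example runs anywhere.

```python
from datetime import datetime, timedelta

# Minimal sketch of Delta time-travel semantics: each commit produces a new
# table version, and a restore rewinds the current state to an earlier one
# by writing a *new* commit containing the old data.

class VersionedTable:
    def __init__(self):
        self.history = []  # list of (version, timestamp, rows)

    def commit(self, rows, ts):
        self.history.append((len(self.history), ts, rows))

    def as_of(self, ts):
        """Return the latest version committed at or before `ts`."""
        eligible = [h for h in self.history if h[1] <= ts]
        return eligible[-1][2] if eligible else None

    def restore_to(self, ts):
        """Restore = a new commit that re-publishes the old rows."""
        rows = self.as_of(ts)
        latest_ts = max(h[1] for h in self.history)
        self.commit(rows, latest_ts + timedelta(seconds=1))
        return rows

now = datetime(2024, 1, 1, 12, 0)
table = VersionedTable()
table.commit(["good row 1", "good row 2"], now - timedelta(hours=3))
table.commit(["BAD OVERWRITE"], now)              # the accidental overwrite
restored = table.restore_to(now - timedelta(hours=2))
print(restored)  # the silver table's state from two hours ago
```

Note that the restore adds a version rather than deleting history, which is why the bad write remains auditable afterward.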

Question 2

A notebook connects to an external database using a username and password currently written directly in the code. During a security review, the team is told to remove credentials from notebooks while keeping the pipeline functional. What should the team do next?

A) Move the credentials into a separate notebook and import that notebook into the pipeline

B) Store the credentials in a Databricks secret and reference the secret from the notebook

C) Save the credentials in a workspace folder with restricted permissions

D) Attach the notebook to a SQL warehouse so the credentials are no longer visible in code


Correct Answer: B

Explanation:

Databricks secret management is the appropriate place to store sensitive credentials. The notebook can reference the secret at runtime without embedding the username and password directly in code.
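The pattern looks like the sketch below. In a real notebook the call is `dbutils.secrets.get(scope=..., key=...)`; the scope and key names and the in-memory store here are hypothetical stand-ins so the example runs outside Databricks.

```python
# Sketch of the secret-lookup pattern: the notebook resolves credentials at
# runtime from a secret store instead of embedding literals in code.

SECRET_STORE = {
    ("etl-scope", "db-user"): "svc_etl",       # hypothetical scope/key/value
    ("etl-scope", "db-password"): "s3cr3t",
}

def get_secret(scope: str, key: str) -> str:
    """Stand-in for dbutils.secrets.get(scope, key)."""
    return SECRET_STORE[(scope, key)]

# The JDBC options are built from secrets, never from hard-coded strings:
jdbc_options = {
    "user": get_secret("etl-scope", "db-user"),
    "password": get_secret("etl-scope", "db-password"),
}
print("credentials resolved at runtime")
```

A side benefit: Databricks redacts secret values if they are accidentally printed in a notebook, which plain workspace files or imported notebooks cannot do.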

Question 3

A large Delta table backs several critical dashboards in Databricks SQL. Users report that dashboard queries have become slower over time. Investigation shows:

  • The table has millions of small files due to frequent micro-batch writes.
  • Queries often filter on `customer_id` and `event_timestamp`.
  • The team recently enabled result caching on the SQL warehouse, but performance is still poor for many queries.

Which action is most likely to provide a sustained performance improvement for these queries?

A) Increase the size of the SQL warehouse to use more compute for the same queries.

B) Run OPTIMIZE on the Delta table and use Z-ORDER by `customer_id` to compact files and cluster data.

C) Disable all caching features so that queries always read the latest data from storage.

D) Add more partitions on `event_timestamp` at the hour level to create many smaller partitions.


Correct Answer: B

Explanation:

Running OPTIMIZE compacts many small files into fewer larger ones, reducing file overhead, and Z-ORDER by `customer_id` clusters data to improve predicate filtering for common queries. This directly addresses the small-file problem and data layout, providing sustained performance gains beyond what more compute or caching alone can offer.
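The small-file effect can be sketched numerically. Real compaction is `OPTIMIZE my_table ZORDER BY (customer_id)`; the file sizes and 128 MB target below are hypothetical, and the greedy bin-packing is only an illustration of why fewer, larger files mean less per-file overhead.

```python
# Sketch of file compaction: many small files are merged into files near a
# target size, so queries open far fewer files.

TARGET_MB = 128  # hypothetical target file size

def compact(file_sizes_mb):
    """Greedy bin-packing: merge small files up to the target size."""
    compacted, current = [], 0
    for size in file_sizes_mb:
        if current + size > TARGET_MB and current > 0:
            compacted.append(current)
            current = 0
        current += size
    if current:
        compacted.append(current)
    return compacted

small_files = [1] * 500              # 500 one-megabyte micro-batch files
after = compact(small_files)
print(len(small_files), "->", len(after))  # 500 -> 4
```

ZORDER then clusters related `customer_id` values inside those larger files so that min/max statistics let the engine skip files that cannot match a filter.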

Question 4

A data engineering team is setting up centralized governance for multiple Databricks workspaces used by different departments. They need fine-grained permissions on tables, consistent catalog.schema.table naming, and cross-workspace governance. Which platform feature should they adopt as the core of their governance strategy?

A) The legacy Hive metastore configured separately in each workspace

B) Unity Catalog as the centralized governance and metadata layer

C) Cluster-level access control lists configured on all-purpose clusters

D) DBFS directory permissions managed by workspace admins


Correct Answer: B

Explanation:

Unity Catalog is the recommended centralized governance layer for Databricks. It provides a three-level namespace (catalog.schema.table), fine-grained permissions, and cross-workspace governance, directly matching the team’s requirements. The legacy Hive metastore, cluster ACLs, and DBFS permissions cannot provide this centralized, object-level control across workspaces.
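The three-level namespace is the foundation of that centralization: every securable has one fully qualified name that is valid in any attached workspace. A minimal sketch, using hypothetical catalog and table names:

```python
# Sketch of Unity Catalog's three-level namespace: every table is addressed
# as catalog.schema.table, so grants on one fully qualified object apply
# across all workspaces attached to the metastore.

def parse_table_name(full_name: str) -> dict:
    parts = full_name.split(".")
    if len(parts) != 3:
        raise ValueError("Unity Catalog names are catalog.schema.table")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}

ref = parse_table_name("finance.sales.transactions")  # hypothetical names
print(ref)
```

A workspace-local Hive metastore, by contrast, only knows `schema.table`, which is why permissions there must be re-created per workspace.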

Question 5

A data engineering team is building a Lakehouse on Databricks. They need a storage layer that supports ACID transactions, schema enforcement, schema evolution, and time travel for their tables. They also want to be able to roll back to previous versions of the data for auditing. Which Databricks storage technology should they use for their tables to meet these requirements?

A) Plain Parquet files stored in cloud object storage without any additional metadata

B) Delta Lake tables stored in cloud object storage with a transaction log

C) CSV files stored in cloud object storage with external table definitions

D) JSON files stored in cloud object storage with manually maintained version folders


Correct Answer: B

Explanation:

Delta Lake tables add a transaction log on top of files in cloud object storage, providing ACID transactions, schema enforcement and evolution, and time travel. This allows reliable updates and the ability to query or restore previous table versions for auditing. Plain Parquet, CSV, or JSON files alone do not provide these transactional and versioning capabilities.
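Schema enforcement is one of those transactional guarantees worth seeing concretely. The sketch below models it with a plain Python check; in real Delta the schema lives in the transaction log and the write fails atomically. The schema and rows are hypothetical.

```python
# Sketch of Delta schema enforcement: a write whose rows don't match the
# table schema is rejected before anything is committed, instead of
# silently corrupting the table.

SCHEMA = {"id": int, "amount": float}  # hypothetical table schema

def enforced_write(table_rows, new_rows):
    for row in new_rows:
        if set(row) != set(SCHEMA) or not all(
            isinstance(row[c], t) for c, t in SCHEMA.items()
        ):
            raise TypeError(f"Row {row!r} does not match table schema")
    table_rows.extend(new_rows)  # "commit" only if every row conforms

table = []
enforced_write(table, [{"id": 1, "amount": 9.99}])   # conforming write: ok
try:
    enforced_write(table, [{"id": 2, "note": "bad column"}])
except TypeError as exc:
    print("rejected:", exc)
```

With plain Parquet, CSV, or JSON files there is no such gatekeeper: a writer with a mismatched schema simply lands bad files next to good ones.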

Question 6

A data engineering team is building a production-grade ingestion pipeline on Databricks. Requirements include:

  • Declarative pipeline definitions.
  • Built-in data quality checks with automatic handling of bad records.
  • Automatic dependency management between tables.
  • Operational monitoring for pipeline health.

They are currently orchestrating multiple notebooks with Jobs and manually managing dependencies and data quality checks. Which Databricks feature best meets these requirements?

A) All-purpose clusters with scheduled notebooks and custom logging logic.

B) Delta Live Tables (DLT) pipelines orchestrated through Workflows (Jobs).

C) Standalone SQL queries scheduled in SQL warehouses without any orchestration.

D) External ETL tools writing to cloud storage, with Databricks only used for ad-hoc queries.


Correct Answer: B

Explanation:

Delta Live Tables provides declarative pipeline definitions, built-in expectations for data quality with automatic handling of bad records, automatic dependency management between tables, and integrated monitoring. Orchestrating DLT pipelines via Workflows (Jobs) fits the production scheduling requirement. The other options require manual management of dependencies and data quality or move core pipeline logic outside Databricks.
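An expectation behaves like the sketch below. In real DLT you declare it with `@dlt.expect_or_drop("valid_id", "id IS NOT NULL")` on a table definition; the rows and the plain-Python predicate here are hypothetical stand-ins.

```python
# Sketch of a DLT-style expectation: declare a data-quality rule and let the
# pipeline drop (and count) records that violate it, with the counts surfaced
# in pipeline monitoring.

def expect_or_drop(rows, name, predicate):
    passed = [r for r in rows if predicate(r)]
    dropped = len(rows) - len(passed)
    print(f"expectation {name!r}: kept {len(passed)}, dropped {dropped}")
    return passed

raw = [{"id": 1}, {"id": None}, {"id": 3}]   # hypothetical input batch
clean = expect_or_drop(raw, "valid_id", lambda r: r["id"] is not None)
```

The declarative part is the key difference from hand-rolled notebooks: you state the rule once, and DLT handles enforcement, metrics, and retries uniformly across the pipeline.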

Question 7

A team has a Delta Lake table registered in Unity Catalog that backs several BI dashboards. Over time, they notice that queries are slowing down. Investigation shows:

  • The table has many partitions with a large number of small files in each.
  • The cluster is appropriately sized, and there are no obvious resource bottlenecks.

Which action is most likely to improve query performance while controlling costs?

A) Run OPTIMIZE on the table periodically and consider using ZORDER on frequently filtered columns.

B) Convert the table from a managed table to an external table to move data to user-managed storage.

C) Increase cluster size significantly and leave the table layout unchanged.

D) Drop and recreate the table as a non-Delta table in CSV format to simplify the file structure.


Correct Answer: A

Explanation:

Running OPTIMIZE compacts many small files into fewer larger files, which improves query performance and can reduce overhead. Using ZORDER on frequently filtered columns further improves data skipping. Changing managed vs external status does not address file layout, simply scaling the cluster increases costs without fixing the root cause, and switching to CSV removes Delta Lake benefits without solving the small-file issue.

Question 8

A company is designing its production ETL pipelines on Databricks. The pipelines run on a fixed schedule every night and must be isolated from ad-hoc analytics workloads to avoid resource contention and unexpected costs. Which cluster strategy best aligns with these requirements?

A) Use a single large all-purpose cluster shared by all analysts and ETL jobs, and keep it running 24/7.

B) Use job clusters for the scheduled ETL workloads so that each job gets its own cluster that terminates when the job finishes.

C) Run ETL jobs on developers’ personal all-purpose clusters to reuse existing capacity.

D) Use only serverless SQL warehouses for all ETL workloads, regardless of whether they run notebooks or SQL.


Correct Answer: B

Explanation:

Job clusters are created for the duration of a job and then terminated, providing strong isolation between workloads and better cost control for scheduled ETL compared to long-running all-purpose clusters. Sharing a single all-purpose cluster or using personal clusters mixes production with ad-hoc workloads, and serverless SQL alone is not appropriate for all notebook-based ETL patterns.
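In the Jobs API, the difference comes down to one field in the task definition: `new_cluster` (a per-run job cluster) versus `existing_cluster_id` (a shared all-purpose cluster). The field values below are hypothetical; the key names follow the Jobs API's `new_cluster` convention.

```python
# Sketch of a Jobs API task using a job cluster: `new_cluster` tells the job
# to create its own cluster for the run and terminate it on completion,
# isolating the nightly ETL from interactive workloads.

job_task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/pipelines/nightly_etl"},  # hypothetical path
    "new_cluster": {                      # created per run, auto-terminated
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 4,
    },
}

# Pointing the task at "existing_cluster_id" instead would reuse a
# long-running all-purpose cluster and lose both isolation and cost control.
print("uses job cluster:", "new_cluster" in job_task)
```

Because the cluster exists only for the run, there is nothing left running at 3 a.m. to accumulate cost or to contend with analysts' morning queries.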

Question 9

A financial services company needs to restrict access to sensitive columns (such as customer SSN) and certain rows (such as VIP customers) in a Unity Catalog table. Different user groups should see different subsets of the data, but all users should query the same logical object. Where should the company primarily implement these access controls?

A) At the cluster level by restricting who can attach to clusters that access the table.

B) Inside notebooks by adding conditional logic to filter rows and mask columns for each user group.

C) At the Unity Catalog table and view level, using permissions and views to enforce row and column-level security.

D) By limiting access to the underlying cloud storage path and not using Unity Catalog permissions.


Correct Answer: C

Explanation:

Unity Catalog provides data-centric permissions at the catalog, schema, table, and view levels. Row- and column-level security is typically implemented using views combined with appropriate grants so that different groups see filtered or masked data while querying a consistent logical object. Cluster-level, notebook-level, or storage-only controls cannot reliably enforce fine-grained, centrally governed access policies.
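The view-based pattern can be sketched as below. In Databricks SQL the real building block is a view whose definition calls `is_account_group_member('<group>')` to filter rows and mask columns per caller; the groups, rows, and masking rule here are hypothetical.

```python
# Sketch of row- and column-level security as a secure view implements it:
# the same logical object returns filtered rows and masked columns
# depending on the caller's group membership.

CUSTOMERS = [
    {"name": "Ann", "ssn": "111-22-3333", "vip": True},
    {"name": "Bob", "ssn": "444-55-6666", "vip": False},
]

def secure_view(user_groups):
    """Equivalent in spirit to a SQL view using is_account_group_member()."""
    rows = []
    for r in CUSTOMERS:
        if r["vip"] and "vip_readers" not in user_groups:
            continue                                   # row-level filter
        masked = dict(r)
        if "pii_readers" not in user_groups:
            masked["ssn"] = "***-**-" + r["ssn"][-4:]  # column-level mask
        rows.append(masked)
    return rows

analyst_rows = secure_view({"analysts"})  # neither vip_readers nor pii_readers
print(analyst_rows)
```

Users are granted SELECT on the view only, not the base table, so the policy cannot be bypassed by querying the table directly.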

Question 10

A company has multiple Databricks workspaces. The data platform team wants one centralized way to govern access to tables so permissions are not managed separately in each workspace folder structure. Which Databricks capability best addresses this requirement?

A) Unity Catalog using catalogs and schemas for centralized governance

B) Workspace folders because they organize notebooks and assets by team

C) All-purpose compute because it can be shared across users

D) Databricks Repos because Git tracks changes to files


Correct Answer: A

Explanation:

Unity Catalog is the correct answer because it provides centralized governance for data assets using a hierarchy that includes catalogs and schemas. The requirement is about centrally managing access to tables, which is a governance function, not a workspace organization or development feature. Workspace folders and Repos help organize code and assets, but they do not provide centralized table-level governance.

Ready to Accelerate Your Databricks Certified Data Engineer Associate Preparation?

Join thousands of professionals who are advancing their careers through expert certification preparation with FlashGenius.

  • ✅ Unlimited practice questions across all Databricks Certified Data Engineer Associate domains
  • ✅ Full-length exam simulations with real-time scoring
  • ✅ AI-powered performance tracking and weak area identification
  • ✅ Personalized study plans with adaptive learning
  • ✅ Mobile-friendly platform for studying anywhere, anytime
  • ✅ Expert explanations and study resources

About Databricks Certified Data Engineer Associate Certification

The Databricks Certified Data Engineer Associate certification validates your expertise in the Databricks Lakehouse Platform and other critical domains. Our comprehensive practice questions are carefully crafted to mirror the actual exam experience and help you identify knowledge gaps before test day.

Practice Resources for Databricks DEA Certification

Strengthen your DB-DEA prep with focused practice questions across the most important exam domains.

Recommended Guide

Databricks Data Engineer Associate: Your Complete 2026 Guide

Preparing for the DB-DEA exam? This complete guide covers exam structure, key topics, study strategy, and real-world preparation tips to help you pass on your first attempt.

  • ✔️ Full exam breakdown (latest blueprint)
  • ✔️ Key domains and high-weight topics
  • ✔️ Study roadmap + preparation strategy
  • ✔️ Tips to avoid common exam mistakes
📘 Read the Complete Guide