
Databricks Certified Data Engineer Associate Practice Questions: Production Pipelines Domain


Master the Production Pipelines Domain

Test your knowledge in the Production Pipelines domain with these 10 practice questions. Each question is designed to help you prepare for the Databricks Certified Data Engineer Associate certification exam with detailed explanations to reinforce your learning.

Question 1

A daily batch pipeline computes customer lifetime value (CLV) and writes results to a warehouse table used by downstream marketing campaigns. After a new version of the pipeline was deployed, monitoring shows that the total number of customers in the CLV table dropped by 30%, but the pipeline tasks all succeeded and no errors were raised. The team has versioned both the old and new pipeline code and configurations. Marketing wants to avoid sending campaigns based on potentially incorrect CLV values while the issue is investigated. What is the most appropriate immediate action for the data engineering team to take?

A) Roll back to the previous pipeline version and backfill the CLV table from the last known good checkpoint

B) Increase the pipeline’s retry count and rerun today’s job to see if the row counts return to normal

C) Disable all monitoring alerts temporarily to avoid unnecessary noise while the team investigates the cause

D) Keep the new pipeline in place but manually correct the CLV table by inserting missing customers using ad-hoc SQL


Correct Answer: A

Explanation:

Rolling back to the last known-good pipeline version and backfilling from a trusted checkpoint restores reliable CLV data for marketing while the investigation continues. Increasing retries will not fix a logic or data defect in the new version; disabling alerts only hides symptoms; and manual ad-hoc corrections are brittle, non-repeatable, and undermine the pipeline's reproducibility.
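The rollback decision in option A depends on catching the anomalous drop in the first place. A minimal sketch of the kind of row-count guard that could flag it, where the function names and the 20% tolerance are purely illustrative (not any Databricks API):

```python
# Illustrative monitoring guard: compare the customer count produced by the
# new pipeline version against the last known-good run.

def drop_ratio(previous_count: int, current_count: int) -> float:
    """Fractional decrease from the last known-good run (0.0 = no drop)."""
    if previous_count <= 0:
        return 0.0
    return max(0.0, (previous_count - current_count) / previous_count)

def should_roll_back(previous_count: int, current_count: int,
                     max_drop: float = 0.2) -> bool:
    """True when the count fell by more than the allowed fraction,
    signalling that the team should revert and backfill."""
    return drop_ratio(previous_count, current_count) > max_drop
```

In the scenario above, a 30% drop exceeds a 20% tolerance, so the guard would fire even though every task "succeeded".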

Question 2

A data engineer needs to introduce a small bug fix to a production batch pipeline that loads orders into the data warehouse. The fix has been implemented and tested in the dev environment. The engineer is under time pressure and considers editing the production job definition directly in the orchestration tool’s UI to apply the fix immediately, planning to update the code repository later. What is the most appropriate action?

A) Apply the fix directly in the production orchestration UI to minimize downtime, then document the change afterward.

B) Commit the fix to version control, run automated tests, and deploy to production through the existing CI/CD pipeline, even if it takes slightly longer.

C) Clone the production job into a separate "hotfix" job, apply the change there, and schedule it to run instead of the original job.

D) Pause the production pipeline until a full regression test suite can be run manually in all environments.


Correct Answer: B

Explanation:

Using version control and the existing CI/CD pipeline ensures that changes are tested, auditable, and reproducible. Even under time pressure, this is the safest and most maintainable way to deploy fixes to production pipelines.

Question 3

A data engineering team is designing CI/CD for a new set of ETL pipelines and ML models. Currently, application teams use a standard CI/CD system for code, but data and ML changes are deployed manually by running scripts directly in production. Recent incidents include a broken schema change that corrupted a production table and a model deployment that silently degraded prediction quality. The team wants a deployment process that reduces risk, enforces checks before production changes, and supports quick rollback if needed. Which approach best aligns with production best practices for data and ML pipelines?

A) Continue manual deployments but require engineers to double-check their scripts and have a peer review before running them in production

B) Integrate pipelines and models into the existing CI/CD system with automated tests, data validation stages, and staged deployments (e.g., dev → test → prod) with gates

C) Deploy all changes directly to production but increase logging so that issues can be detected and fixed faster after they occur

D) Create a separate CI/CD system only for ML models, while keeping ETL pipelines as manual scripts to avoid overcomplicating deployments


Correct Answer: B

Explanation:

Integrating ETL pipelines and ML models into CI/CD with automated tests, data validation stages, and staged environment promotion introduces systematic checks before production changes and enables controlled rollout and rollback. This significantly reduces deployment risk compared to manual scripts and aligns with modern production practices.
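The staged promotion described in option B can be modeled abstractly: each environment has a gate (automated tests, data validation, an approval), and a failed gate stops promotion before the change reaches production. A hypothetical sketch, with gate callables standing in for real checks:

```python
# Sketch of staged promotion with gates (dev -> test -> prod).
# Gate functions are placeholders for automated tests / data validation.
from typing import Callable, Dict, List

def promote(stages: List[str],
            gates: Dict[str, Callable[[], bool]]) -> List[str]:
    """Deploy to each stage in order, stopping at the first failed gate.

    Returns the list of stages that were actually deployed."""
    deployed = []
    for stage in stages:
        gate = gates.get(stage, lambda: True)  # no gate defined -> pass
        if not gate():
            break  # gate failed: do not promote further, prod stays safe
        deployed.append(stage)
    return deployed
```

For example, `promote(["dev", "test", "prod"], {"test": run_data_validation})` would never touch prod if validation fails in test.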

Question 4

A global retailer is refactoring its Databricks jobs to use Databricks Asset Bundles. Previously, each workspace (dev, staging, prod) had jobs manually created in the UI and linked to a Databricks Repo that tracked the `main` branch of a Git repository. They introduce a bundle with this high-level structure in the same Git repo:

- `bundle.yml` defining:
  - A single job resource `nightly_sales_etl` that points to notebooks in the repo
  - Targets: `dev`, `staging`, `prod`, each with its own workspace URL and cluster overrides

The team configures a CI/CD pipeline so that:

- On pull requests, it runs `databricks bundle validate -t dev` only.
- On merges to `main`, it runs `databricks bundle deploy -t dev` and then `databricks bundle run -t dev`.

After enabling the bundle, they notice that changes merged to `main` are correctly deployed and run in the dev workspace, but the staging and prod jobs in their respective workspaces still point to the old manually created jobs and do not reflect the bundle configuration. The engineering manager wants to fully standardize on bundles for all environments while keeping code and configuration centralized. What is the most appropriate next step?

A) Update the Databricks Repo configuration in staging and prod workspaces to track the `dev` branch instead of `main`, so that changes deployed to dev automatically propagate to staging and prod.

B) Extend the CI/CD pipeline with additional stages that, after approvals, run `databricks bundle deploy -t staging` and `databricks bundle deploy -t prod` to apply the same bundle to those workspaces.

C) Delete the `staging` and `prod` targets from `bundle.yml` and continue managing staging and prod jobs manually in the UI while using bundles only for dev.

D) Replace the `nightly_sales_etl` job resource in `bundle.yml` with three separate job resources (one per environment) and deploy them all to the dev workspace, then clone them manually into staging and prod.


Correct Answer: B

Explanation:

To standardize on bundles across environments, the retailer must explicitly deploy the bundle to each target/workspace. Extending CI/CD with stages that run `databricks bundle deploy -t staging` and `databricks bundle deploy -t prod` (with appropriate approvals) ensures that staging and prod jobs are created or updated from the same declarative configuration. Changing Repo branches, keeping manual jobs, or cloning jobs manually does not leverage bundle targets and reintroduces configuration drift.
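A CI/CD stage per target boils down to issuing one deploy command per environment. A toy helper that assembles those invocations (the `databricks bundle deploy -t <target>` subcommand comes from the question; wrapping it in a Python function is purely illustrative):

```python
# Build the CLI invocations that the extended CI/CD stages would run,
# one per bundle target, in promotion order.

def bundle_deploy_commands(targets):
    """One `databricks bundle deploy -t <target>` invocation per target."""
    return [f"databricks bundle deploy -t {target}" for target in targets]
```

A pipeline extended per option B would run these for `["staging", "prod"]` after the respective approval gates.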

Question 5

A team runs a daily ETL pipeline that processes 2 TB of log data and loads it into a warehouse. The job has started to miss its 3-hour SLA as data volume grows. The current implementation reads all data from the previous day and recomputes all aggregates from scratch. The pipeline runs on a fixed-size cluster, and increasing cluster size would significantly increase costs. What is the most appropriate change to improve performance while controlling costs?

A) Double the cluster size so the job finishes faster, accepting the higher compute cost as the price of meeting the SLA

B) Change the pipeline to process only new or changed data each day and compute aggregates incrementally, using partitioning where possible

C) Reduce the number of validation checks in the pipeline so that less time is spent on data quality and more on core transformations

D) Decrease the job frequency to every other day so that there is more time to process the larger batches


Correct Answer: B

Explanation:

Refactoring the pipeline to process only new or changed data and to leverage partitioning for incremental aggregation reduces the amount of data processed each run. This design improvement increases performance and scalability while keeping compute costs under control, instead of relying solely on scaling hardware.
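The core of the incremental design is a watermark: each run processes only records newer than the last one seen and folds them into existing aggregates. A simplified, in-memory sketch (record shape and field names are invented for the example; a real pipeline would do this with partitioned tables):

```python
# Illustrative incremental aggregation with a watermark.

def incremental_totals(records, aggregates, last_watermark):
    """Fold new records (event_time > last_watermark) into running totals.

    Returns (updated_aggregates, new_watermark)."""
    new_watermark = last_watermark
    for rec in records:
        if rec["event_time"] <= last_watermark:
            continue  # already processed in an earlier run
        key = rec["customer_id"]
        aggregates[key] = aggregates.get(key, 0) + rec["amount"]
        new_watermark = max(new_watermark, rec["event_time"])
    return aggregates, new_watermark
```

Each run touches only the new slice of data, so runtime grows with daily volume rather than with total history.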

Question 6

An ML engineer has been training models in a notebook and then manually running the same notebook against production data once a week to generate predictions. The notebook connects directly to production databases using the engineer’s personal credentials. The data platform team wants to turn this into a proper production inference pipeline. Which change should they prioritize first to make this workload production-ready?

A) Increase the notebook’s compute resources so it can process predictions faster

B) Wrap the notebook in a scheduled job but keep using the engineer’s personal credentials for access

C) Refactor the notebook into a version-controlled, orchestrated pipeline that runs with service accounts and defined dependencies

D) Ask the engineer to double-check results manually after each run before sharing predictions


Correct Answer: C

Explanation:

Refactoring the notebook into a version-controlled, orchestrated pipeline that uses service accounts and explicit dependencies addresses core production needs: repeatability, security, and maintainability. Simply adding compute, scheduling the same notebook with personal credentials, or relying on manual checks leaves the workload ad-hoc and fragile.

Question 7

A data engineering team has a nightly batch pipeline that loads transactional data from an operational database into a data warehouse. The pipeline is orchestrated with a DAG scheduler and currently only fails when tasks crash due to code errors. Last month, a silent data issue caused incorrect aggregates to be published for several days because the pipeline never failed, even though half the records were missing in the source extract. The team wants to reduce the risk of silently publishing bad data. What is the most appropriate change to make to this production pipeline?

A) Increase the pipeline schedule to run every hour so that bad data is corrected more quickly when discovered

B) Add automated data quality checks (for volume, null rates, and key constraints) that fail the pipeline when critical thresholds are violated

C) Require manual review of a sample of records after each run before downstream tables are refreshed

D) Add more verbose logging to the existing tasks so engineers can inspect logs if a problem is suspected


Correct Answer: B

Explanation:

Automated data quality checks that validate volume, null rates, and key constraints, and that fail the pipeline on critical threshold violations, directly address the risk of silently publishing bad data. This introduces fail-fast behavior based on data outcomes, not just code errors. Running more frequently, reviewing samples manually, or logging more verbosely does not reliably prevent silent data corruption in a production setting.
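The checks in option B can be expressed as assertions that raise when a threshold is breached, which an orchestrator then surfaces as a failed task. A minimal sketch, assuming made-up thresholds and an illustrative exception type rather than any specific framework:

```python
# Fail-fast data quality gates: volume, null rate, and key uniqueness.

class DataQualityError(Exception):
    """Raised to fail the pipeline run when a critical check is violated."""

def check_batch(rows, expected_min_rows, max_null_rate=0.01, key="order_id"):
    """Fail on low volume, excessive null keys, or duplicate keys."""
    if len(rows) < expected_min_rows:
        raise DataQualityError(
            f"volume too low: {len(rows)} < {expected_min_rows}")
    nulls = sum(1 for r in rows if r.get(key) is None)
    if rows and nulls / len(rows) > max_null_rate:
        raise DataQualityError(f"null rate too high on {key}")
    non_null = [r[key] for r in rows if r.get(key) is not None]
    if len(non_null) != len(set(non_null)):
        raise DataQualityError(f"duplicate values in key column {key}")
    return True
```

In the incident above, a volume check like this would have failed the run the first night half the records went missing, instead of publishing bad aggregates for days.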

Question 8

An ML team has deployed a model-serving pipeline that scores incoming events in near real-time and writes predictions to a feature store. The CI pipeline validates code style, runs unit tests, and checks that the model can be loaded. After a recent model update, business stakeholders observed a significant drop in conversion rates, but there were no errors in the serving pipeline. The team wants to improve their CI/CD process to catch problematic model updates before they impact production. Which addition to the CI/CD pipeline is most appropriate?

A) Add a step that runs the model on a fixed validation dataset and enforces minimum performance thresholds before allowing deployment

B) Require manual approval from a product manager for every model deployment to ensure business alignment

C) Increase the number of unit tests around feature transformation code to reach 100% coverage

D) Add a linter that enforces strict coding style rules for all model training scripts


Correct Answer: A

Explanation:

Running the model on a representative validation dataset and enforcing minimum performance thresholds introduces a performance gate in CI/CD, directly addressing the risk of deploying models that degrade business metrics while remaining technically valid. Manual approvals, more unit tests, and stricter linting do not systematically prevent performance regressions.
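Option A's gate is simple to express: evaluate the candidate model on a fixed validation set and refuse deployment below a metric floor. A hypothetical sketch, where `predict_fn` stands in for loading and running the real model:

```python
# CI performance gate: block deployment when validation accuracy is too low.

def accuracy(predict_fn, validation_set):
    """Fraction of validation examples the model labels correctly."""
    correct = sum(1 for features, label in validation_set
                  if predict_fn(features) == label)
    return correct / len(validation_set)

def performance_gate(predict_fn, validation_set, min_accuracy=0.9):
    """Return True only if the model clears the minimum accuracy bar."""
    return accuracy(predict_fn, validation_set) >= min_accuracy
```

Keeping the validation dataset fixed makes gate results comparable across model versions, which is what lets the gate catch a silent regression.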

Question 9

A data engineering team is designing a new feature for a mobile app that shows users their current loyalty points balance within a few seconds of making a purchase. The existing data platform runs hourly batch jobs that update a warehouse used for analytics dashboards. The team is considering reusing the hourly batch pipeline to power the in-app loyalty balance feature. Which approach best meets the requirements while balancing complexity and latency?

A) Reuse the existing hourly batch pipeline and have the app read loyalty balances from the warehouse, accepting up to one hour of staleness.

B) Build a streaming pipeline that consumes purchase events in near real time and updates a low-latency store used by the app for loyalty balances.

C) Increase the frequency of the batch pipeline to run every 5 minutes and have the app read from the warehouse.

D) Generate loyalty balances on-demand by querying all historical transactions from the warehouse each time a user opens the app.


Correct Answer: B

Explanation:

The requirement is to show updated loyalty balances within a few seconds of purchase, which calls for low-latency processing. A streaming pipeline that updates a low-latency store is appropriate for this user-facing, near-real-time feature, whereas batch pipelines are optimized for throughput and cost, not second-level latency.
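At its core, the streaming design consumes purchase events as they arrive and upserts per-user balances into a low-latency store the app reads. A toy in-memory model of that update path (event shape and the points rule are invented for the example):

```python
# Toy model of the streaming approach: each purchase event updates a
# low-latency key-value store that the mobile app reads.

def apply_purchase_events(events, balance_store, points_per_dollar=1):
    """Update each user's loyalty balance as purchase events arrive."""
    for event in events:
        user = event["user_id"]
        earned = int(event["amount"] * points_per_dollar)
        balance_store[user] = balance_store.get(user, 0) + earned
    return balance_store
```

Because the store is updated per event rather than per hourly batch, the app sees a fresh balance within seconds of the purchase.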

Question 10

A global enterprise has standardized on Databricks for production data pipelines. The platform team has:

- Enabled serverless compute for production jobs in specific regions.
- Restricted interactive all-purpose clusters to development workspaces only.

A data engineer is designing a multi-step production workflow:

1. A daily batch ETL task that prepares curated feature tables.
2. A scheduled ML batch scoring task that runs twice per day using those features.
3. A short validation task that checks scoring outputs and updates a status table.

The workload is moderately spiky, with larger volumes on certain days. The team wants to minimize operational overhead and avoid using interactive compute in production, while keeping costs reasonable. Which approach best aligns with these constraints and the platform team's policies?

A) Run all three tasks on a single, shared all-purpose cluster in the production workspace to maximize reuse of compute and minimize startup overhead.

B) Configure each task in the workflow to use serverless compute in the production workspace, leveraging per-task auto-scaling and avoiding cluster management.

C) Use serverless compute only for the ML scoring task and run the ETL and validation tasks on an all-purpose cluster in a development workspace.

D) Create a large, fixed-size job cluster for the ETL and scoring tasks, and use serverless compute only for the short validation task to reduce startup time.


Correct Answer: B

Explanation:

Configuring each production task to use serverless compute in the production workspace respects the policy that interactive all-purpose clusters are limited to development, while leveraging per-task auto-scaling and fully managed compute for a moderately spiky workload. Running on an all-purpose cluster in production (A) violates governance. Splitting production tasks onto a dev all-purpose cluster (C) breaks environment separation and policy. Using a large fixed job cluster for the main tasks (D) increases management overhead and may be overprovisioned compared to using serverless for those compute-heavy steps.
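The chosen design is a three-task dependency chain where no task carries a cluster specification, so each runs on managed serverless compute. A sketch of that shape as a declarative definition plus a tiny dependency resolver; the structure loosely echoes a Databricks multi-task job, but the field names here are illustrative and should be checked against the actual Jobs API or bundle schema:

```python
# Illustrative workflow definition: no cluster spec on any task, so in a
# serverless-enabled workspace each task runs on managed compute.
workflow = {
    "name": "daily_feature_and_scoring",
    "tasks": [
        {"task_key": "prepare_features", "depends_on": []},
        {"task_key": "batch_scoring", "depends_on": ["prepare_features"]},
        {"task_key": "validate_outputs", "depends_on": ["batch_scoring"]},
    ],
}

def execution_order(tasks):
    """Resolve an acyclic dependency list into a valid run order."""
    done, order = set(), []
    pending = list(tasks)
    while pending:
        for task in pending:
            if all(dep in done for dep in task["depends_on"]):
                order.append(task["task_key"])
                done.add(task["task_key"])
                pending.remove(task)
                break
    return order
```

Declaring dependencies rather than schedules is what lets the orchestrator run the scoring task only after fresh features exist, with each step sized independently by serverless auto-scaling.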


About Databricks Certified Data Engineer Associate Certification

The Databricks Certified Data Engineer Associate certification validates your expertise in production pipelines and other critical domains. Our comprehensive practice questions are carefully crafted to mirror the actual exam experience and help you identify knowledge gaps before test day.
