Databricks Certified Data Engineer Associate Practice Questions: Production Pipelines Domain
Master the Production Pipelines Domain
Test your knowledge in the Production Pipelines domain with these 10 practice questions. Each question is designed to help you prepare for the Databricks Certified Data Engineer Associate certification exam with detailed explanations to reinforce your learning.
Question 1
A daily batch pipeline computes customer lifetime value (CLV) and writes results to a warehouse table used by downstream marketing campaigns. After a new version of the pipeline was deployed, monitoring shows that the total number of customers in the CLV table dropped by 30%, but the pipeline tasks all succeeded and no errors were raised. The team has versioned both the old and new pipeline code and configurations. Marketing wants to avoid sending campaigns based on potentially incorrect CLV values while the issue is investigated. What is the most appropriate immediate action for the data engineering team to take?
Correct Answer: A
Rolling back to the last known-good pipeline version and backfilling from a trusted checkpoint restores reliable CLV data for marketing while investigation continues. Increasing retries will not fix a logic or data issue in the new version, disabling alerts hides symptoms, and manual ad-hoc corrections are brittle, non-repeatable, and undermine determinism.
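A monitoring guard like the one that caught this incident can be sketched in a few lines. This is a minimal illustration, not Databricks-specific code; the function name and the 10% threshold are our own assumptions.

```python
def should_halt_and_roll_back(previous_count: int, current_count: int,
                              max_drop_pct: float = 10.0) -> bool:
    """Return True when the new table's row count has dropped by more than
    max_drop_pct percent versus the last known-good run (hypothetical guard)."""
    if previous_count == 0:
        return False  # no baseline to compare against
    drop_pct = (previous_count - current_count) / previous_count * 100
    return drop_pct > max_drop_pct

# A 30% drop, as in the scenario, trips the guard:
assert should_halt_and_roll_back(1_000_000, 700_000) is True
# Normal day-to-day variation does not:
assert should_halt_and_roll_back(1_000_000, 980_000) is False
```

On Databricks, once such a guard fires, the rollback itself could combine redeploying the versioned known-good pipeline code with restoring the table to a trusted state, for example via Delta Lake time travel (`RESTORE TABLE ... TO VERSION AS OF ...`).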
Question 2
A data engineer needs to introduce a small bug fix to a production batch pipeline that loads orders into the data warehouse. The fix has been implemented and tested in the dev environment. The engineer is under time pressure and considers editing the production job definition directly in the orchestration tool’s UI to apply the fix immediately, planning to update the code repository later. What is the most appropriate action?
Correct Answer: B
Using version control and the existing CI/CD pipeline ensures that changes are tested, auditable, and reproducible. Even under time pressure, this is the safest and most maintainable way to deploy fixes to production pipelines.
Question 3
A data engineering team is designing CI/CD for a new set of ETL pipelines and ML models. Currently, application teams use a standard CI/CD system for code, but data and ML changes are deployed manually by running scripts directly in production. Recent incidents include a broken schema change that corrupted a production table and a model deployment that silently degraded prediction quality. The team wants a deployment process that reduces risk, enforces checks before production changes, and supports quick rollback if needed. Which approach best aligns with production best practices for data and ML pipelines?
Correct Answer: B
Integrating ETL pipelines and ML models into CI/CD with automated tests, data validation stages, and staged environment promotion introduces systematic checks before production changes and enables controlled rollout and rollback. This significantly reduces deployment risk compared to manual scripts and aligns with modern production practices.
Question 4
A global retailer is refactoring its Databricks jobs to use Databricks Asset Bundles. Previously, each workspace (dev, staging, prod) had jobs manually created in the UI and linked to a Databricks Repo that tracked the `main` branch of a Git repository. They introduce a bundle with this high-level structure in the same Git repo:

- `bundle.yml` defining:
  - A single job resource `nightly_sales_etl` that points to notebooks in the repo
  - Targets: `dev`, `staging`, `prod`, each with its own workspace URL and cluster overrides

The team configures a CI/CD pipeline so that:

- On pull requests, it runs `databricks bundle validate -t dev` only.
- On merges to `main`, it runs `databricks bundle deploy -t dev` and then `databricks bundle run -t dev`.

After enabling the bundle, they notice that changes merged to `main` are correctly deployed and run in the dev workspace, but staging and prod jobs in their respective workspaces still point to the old manually created jobs and do not reflect the bundle configuration. The engineering manager wants to fully standardize on bundles for all environments while keeping code and configuration centralized. What is the most appropriate next step?
Correct Answer: B
To standardize on bundles across environments, the retailer must explicitly deploy the bundle to each target/workspace. Extending CI/CD with stages that run `databricks bundle deploy -t staging` and `databricks bundle deploy -t prod` (with appropriate approvals) ensures that staging and prod jobs are created or updated from the same declarative configuration. Changing Repo branches, keeping manual jobs, or cloning jobs manually does not leverage bundle targets and reintroduces configuration drift.
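The extra CI stages amount to running the same two CLI commands per target. As a sketch (the helper name is ours, and a real pipeline would run these via its CI system's stage definitions rather than Python), the commands for each stage might look like:

```python
def bundle_stage_commands(target: str) -> list[list[str]]:
    """Commands a CI stage would run, in order, for one bundle target."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]

# Staging and prod stages reuse the same declarative bundle config; only
# the target name differs, and prod would sit behind a manual approval.
assert bundle_stage_commands("staging")[1] == [
    "databricks", "bundle", "deploy", "-t", "staging"
]
# Each command could then be executed with subprocess.run(cmd, check=True).
```

Because all targets are defined in one `bundle.yml`, deploying this way keeps staging and prod in lockstep with dev and eliminates drift from manually created jobs.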
Question 5
A team runs a daily ETL pipeline that processes 2 TB of log data and loads it into a warehouse. The job has started to miss its 3-hour SLA as data volume grows. The current implementation reads all data from the previous day and recomputes all aggregates from scratch. The pipeline runs on a fixed-size cluster, and increasing cluster size would significantly increase costs. What is the most appropriate change to improve performance while controlling costs?
Correct Answer: B
Refactoring the pipeline to process only new or changed data and to leverage partitioning for incremental aggregation reduces the amount of data processed each run. This design improvement increases performance and scalability while keeping compute costs under control, instead of relying solely on scaling hardware.
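The core of incremental processing is a watermark: each run folds in only records newer than the last processed timestamp instead of recomputing from scratch. A minimal sketch, using plain Python dicts in place of warehouse tables (all names are illustrative):

```python
def incremental_aggregate(records, totals, last_watermark):
    """Fold only records newer than the watermark into the running totals,
    returning the updated totals and the new watermark."""
    new_watermark = last_watermark
    for r in records:
        if r["event_time"] <= last_watermark:
            continue  # already processed in an earlier run
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
        new_watermark = max(new_watermark, r["event_time"])
    return totals, new_watermark

records = [
    {"event_time": 1, "customer": "a", "amount": 10},
    {"event_time": 2, "customer": "a", "amount": 5},
]
totals, wm = incremental_aggregate(records, {"a": 100}, last_watermark=1)
assert totals == {"a": 105} and wm == 2  # only the event at time 2 was new
```

On a real 2 TB workload the same idea would be expressed with partition pruning and merge/upsert logic so each run touches only new partitions, keeping runtime roughly proportional to daily change volume rather than total history.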
Question 6
An ML engineer has been training models in a notebook and then manually running the same notebook against production data once a week to generate predictions. The notebook connects directly to production databases using the engineer’s personal credentials. The data platform team wants to turn this into a proper production inference pipeline. Which change should they prioritize first to make this workload production-ready?
Correct Answer: C
Refactoring the notebook into a version-controlled, orchestrated pipeline that uses service accounts and explicit dependencies addresses core production needs: repeatability, security, and maintainability. Simply adding compute, scheduling the same notebook with personal credentials, or relying on manual checks leaves the workload ad-hoc and fragile.
Question 7
A data engineering team has a nightly batch pipeline that loads transactional data from an operational database into a data warehouse. The pipeline is orchestrated with a DAG scheduler and currently only fails when tasks crash due to code errors. Last month, a silent data issue caused incorrect aggregates to be published for several days because the pipeline never failed, even though half the records were missing in the source extract. The team wants to reduce the risk of silently publishing bad data. What is the most appropriate change to make to this production pipeline?
Correct Answer: B
Automated data quality checks that validate volume, null rates, and key constraints and fail the pipeline on critical threshold violations directly address the risk of silently publishing bad data. This introduces fail-fast behavior based on data outcomes, not just code errors. Increasing frequency, manual review, or more verbose logging do not reliably prevent silent data corruption in a production setting.
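The fail-fast behavior can be as simple as a task that runs after the extract and raises when a check fails, which in turn fails the DAG. A minimal sketch, assuming records are dicts keyed by `order_id` (all names and thresholds are hypothetical):

```python
class DataQualityError(Exception):
    """Raised to fail the pipeline when a critical data check is violated."""

def run_quality_checks(records, expected_min_rows,
                       max_null_rate=0.01, key="order_id"):
    """Fail fast on volume drops, null keys, or duplicate keys."""
    if len(records) < expected_min_rows:
        raise DataQualityError(
            f"volume check failed: {len(records)} < {expected_min_rows}")
    nulls = sum(1 for r in records if r.get(key) is None)
    if records and nulls / len(records) > max_null_rate:
        raise DataQualityError(f"null-rate check failed for {key}")
    keys = [r[key] for r in records if r.get(key) is not None]
    if len(keys) != len(set(keys)):
        raise DataQualityError(f"uniqueness check failed for {key}")
```

In the incident described, an extract with half the expected records would have tripped the volume check and stopped the pipeline before incorrect aggregates were published.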
Question 8
An ML team has deployed a model-serving pipeline that scores incoming events in near real-time and writes predictions to a feature store. The CI pipeline validates code style, runs unit tests, and checks that the model can be loaded. After a recent model update, business stakeholders observed a significant drop in conversion rates, but there were no errors in the serving pipeline. The team wants to improve their CI/CD process to catch problematic model updates before they impact production. Which addition to the CI/CD pipeline is most appropriate?
Correct Answer: A
Running the model on a representative validation dataset and enforcing minimum performance thresholds introduces a performance gate in CI/CD, directly addressing the risk of deploying models that degrade business metrics while remaining technically valid. Manual approvals, more unit tests, or stricter linting do not systematically prevent performance regressions.
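A performance gate is just an evaluation step whose failure blocks the release. The sketch below uses plain accuracy on held-out labels; the function name and the 0.80 threshold are illustrative, and a real gate would use the metric that tracks the business outcome (here, conversion).

```python
def performance_gate(y_true, y_pred, min_accuracy=0.80):
    """CI gate: raise (failing the build) when validation accuracy
    falls below the agreed threshold; return the accuracy otherwise."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    if accuracy < min_accuracy:
        raise ValueError(
            f"model gate failed: accuracy {accuracy:.2f} < {min_accuracy}")
    return accuracy

# 9 of 10 validation labels correct -> 0.90, which passes the gate:
assert performance_gate([1] * 10, [1] * 9 + [0], min_accuracy=0.80) == 0.9
```

The key point is that the gate runs automatically in CI/CD on a fixed, representative validation set, so a model that loads and serves without errors can still be rejected for degraded quality.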
Question 9
A data engineering team is designing a new feature for a mobile app that shows users their current loyalty points balance within a few seconds of making a purchase. The existing data platform runs hourly batch jobs that update a warehouse used for analytics dashboards. The team is considering reusing the hourly batch pipeline to power the in-app loyalty balance feature. Which approach best meets the requirements while balancing complexity and latency?
Correct Answer: B
The requirement is to show updated loyalty balances within a few seconds of purchase, which calls for low-latency processing. A streaming pipeline that updates a low-latency store is appropriate for this user-facing, near-real-time feature, whereas batch pipelines are optimized for throughput and cost, not seconds-level latency.
Question 10
A global enterprise has standardized on Databricks for production data pipelines. The platform team has:

- Enabled serverless compute for production jobs in specific regions.
- Restricted interactive all-purpose clusters to development workspaces only.

A data engineer is designing a multi-step production workflow:

1. A daily batch ETL task that prepares curated feature tables.
2. A scheduled ML batch scoring task that runs twice per day using those features.
3. A short validation task that checks scoring outputs and updates a status table.

The workload is moderately spiky, with larger volumes on certain days. The team wants to minimize operational overhead and avoid using interactive compute in production, while keeping costs reasonable. Which approach best aligns with these constraints and the platform team’s policies?
Correct Answer: B
Configuring each production task to use serverless compute in the production workspace respects the policy that interactive all-purpose clusters are limited to development, while leveraging per-task auto-scaling and fully managed compute for a moderately spiky workload. Running on an all-purpose cluster in production (A) violates governance. Splitting production tasks onto a dev all-purpose cluster (C) breaks environment separation and policy. Using a large fixed job cluster for the main tasks (D) increases management overhead and may be overprovisioned compared to using serverless for those compute-heavy steps.
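The three-task dependency chain can be expressed declaratively in a job definition. The sketch below loosely mirrors the shape of a Databricks Jobs payload (`tasks`, `task_key`, `depends_on`), but the job and task names are illustrative and the compute settings are omitted; the `run_order` helper is ours, added only to show how the scheduler derives execution order from the dependencies.

```python
job = {
    "name": "nightly_feature_and_scoring",
    "tasks": [
        {"task_key": "etl_features"},
        {"task_key": "batch_scoring",
         "depends_on": [{"task_key": "etl_features"}]},
        {"task_key": "validate_outputs",
         "depends_on": [{"task_key": "batch_scoring"}]},
    ],
}

def run_order(job):
    """Order tasks so each runs only after its depends_on tasks (simple case)."""
    done, order = set(), []
    while len(order) < len(job["tasks"]):
        for t in job["tasks"]:
            deps = {d["task_key"] for d in t.get("depends_on", [])}
            if t["task_key"] not in done and deps <= done:
                done.add(t["task_key"])
                order.append(t["task_key"])
    return order

assert run_order(job) == ["etl_features", "batch_scoring", "validate_outputs"]
```

With serverless compute selected for each task, the platform sizes and scales compute per run, which suits the spiky volumes without a standing overprovisioned cluster.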
Ready to Accelerate Your Databricks Certified Data Engineer Associate Preparation?
Join thousands of professionals who are advancing their careers through expert certification preparation with FlashGenius.
- ✅ Unlimited practice questions across all Databricks Certified Data Engineer Associate domains
- ✅ Full-length exam simulations with real-time scoring
- ✅ AI-powered performance tracking and weak area identification
- ✅ Personalized study plans with adaptive learning
- ✅ Mobile-friendly platform for studying anywhere, anytime
- ✅ Expert explanations and study resources
About Databricks Certified Data Engineer Associate Certification
The Databricks Certified Data Engineer Associate certification validates your expertise in production pipelines and other critical domains. Our comprehensive practice questions are carefully crafted to mirror the actual exam experience and help you identify knowledge gaps before test day.
Practice Resources for Databricks DEA Certification
Strengthen your DB-DEA prep with focused practice questions across the most important exam domains.
Databricks Data Engineer Associate: Your Complete 2026 Guide
Preparing for the DB-DEA exam? This complete guide covers exam structure, key topics, study strategy, and real-world preparation tips to help you pass on your first attempt.
- ✔️ Full exam breakdown (latest blueprint)
- ✔️ Key domains and high-weight topics
- ✔️ Study roadmap + preparation strategy
- ✔️ Tips to avoid common exam mistakes