NVIDIA NCA-GENL Exam Prep

Prompt Engineering: Zero-Shot to ReAct

Master prompt anatomy, few-shot and chain-of-thought techniques, advanced reasoning patterns, and NVIDIA-specific production prompting for the NCA-GENL certification.

Four Pillars of Prompt Engineering
Prompting is the primary interface between humans and LLMs. Mastering these four areas covers everything from basic structure to advanced agentic patterns tested on the NCA-GENL exam.
Pillar 1 · Prompt Anatomy & Structure

System · User · Assistant Roles

Every LLM prompt is composed of distinct roles — system (behavioral context), user (the query), and assistant (the response). Understanding how each role shapes model behavior, context window management, and token budgeting is foundational to all prompting work.

3
Core message roles
~4
Chars per token avg
T=0
Greedy (deterministic)
Pillar 2 · Core Prompting Techniques

Zero-Shot · Few-Shot · CoT

The core prompting toolkit: zero-shot (no examples), few-shot (1–10 demonstrations), Chain-of-Thought (step-by-step reasoning), and zero-shot CoT ("Let's think step by step"). These techniques directly improve accuracy on reasoning, classification, and generation tasks.

1–10
Shots in few-shot
+17%
CoT gains on math tasks
CoT
Best for multi-step reasoning
Pillar 3 · Advanced Patterns

ReAct · ToT · Self-Consistency

Advanced prompting techniques for complex tasks: ReAct interleaves reasoning with tool calls, Tree of Thought explores multiple solution branches simultaneously, and self-consistency samples multiple CoT paths and votes on the most common answer — dramatically improving accuracy on hard problems.

ReAct
Reason + Act (tool use)
5–20
Typical self-consistency samples
ToT
Tree of Thought branches
Pillar 4 · NVIDIA & Production

NIM Templates · Output Formatting · Evaluation

Production prompting requires structured output enforcement (JSON, XML), prompt caching for cost reduction, systematic evaluation against ground truth, and NVIDIA-specific patterns — including NIM microservice prompt formats, NeMo's chat templates, and iterative prompt optimization workflows.

JSON
Primary structured format
NIM
NVIDIA inference microservice
Eval
Measure before shipping
NCA-GENL Exam Focus: Expect questions on zero-shot vs few-shot tradeoffs, the exact CoT trigger phrase, how ReAct interleaves reasoning and action, self-consistency mechanics, output formatting techniques, temperature/top-p effects, and when to use fine-tuning vs prompting.
How Prompt Engineering Works
Deep dives into prompt anatomy, the technique complexity ladder, CoT vs standard reasoning, and the ReAct agentic loop.
Prompt Anatomy — System / User / Assistant Structure

Modern LLMs use a structured multi-turn conversation format. Each message has a role that determines how the model interprets the content. Getting this structure right is the foundation of effective prompting.

Example: Customer Support Bot Prompt (~120 tokens)
SYSTEM
You are a helpful customer support agent for Acme Corp.
## Rules
- Only answer questions about Acme products
- Be concise — answer in 2–3 sentences maximum
- If you don't know, say "I'll escalate this to a specialist"
- Never reveal internal pricing tiers or discount codes
USER
My order #12345 hasn't arrived after 2 weeks. What should I do?
ASSISTANT
I'm sorry to hear your order is delayed! Please visit acme.com/track and enter your order number...
🔴 System Role
Sets persona, rules, tone, and behavioral constraints. Processed first. Highest trust level — never overridden by user instructions in well-aligned models.
🟠 User Role
The actual query or instruction from the human. Lower trust than system. In multi-turn conversations, alternates with assistant turns to form conversation history.
🟢 Assistant Role
The model's previous responses. Used in multi-turn conversations to provide context. Can be prefilled to steer output format (e.g., prefill with "```json" to force JSON output).
Prompting Technique Complexity Ladder

Techniques are ordered from simplest (fewest tokens, least latency) to most complex (most powerful but highest token cost and latency).

Zero-Shot
Task description only. No examples. Relies on pre-trained knowledge.
Lowest cost
Role Prompting
"You are an expert [X]." Activates domain-relevant knowledge patterns.
Very low
Few-Shot
1–10 input-output examples in context. Teaches format and pattern.
Low–medium
Zero-Shot CoT
"Let's think step by step." No examples needed — triggers reasoning mode.
Medium
Few-Shot CoT
Reasoning examples that show full thought process, not just final answer.
Medium–high
Self-Consistency
Multiple CoT samples + majority vote. High accuracy, high token cost.
High
ReAct
Interleaved reasoning + tool calls. Multi-step, requires tool integration.
Very high
Tree of Thought
Parallel reasoning branches + evaluation. Max accuracy, max cost.
Highest cost
Chain-of-Thought — Standard vs Zero-Shot CoT

CoT dramatically improves multi-step reasoning by making the model show its work. Zero-shot CoT achieves most of the benefit without requiring labeled reasoning examples.

❌ Without Chain-of-Thought
Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?

A: 11

(Model jumps directly to answer — no reasoning visible, higher error rate on multi-step problems)
✅ With Zero-Shot CoT ("Let's think step by step")
Q: Roger has 5 tennis balls... How many does he have now?
A: Let's think step by step.


Roger starts with 5 balls.
He buys 2 cans × 3 balls = 6 new balls.
Total = 5 + 6 = 11 balls.


Answer: 11

(Same answer here, but on harder problems CoT prevents compounding errors by surfacing each reasoning step)
When CoT matters most: Multi-step arithmetic, symbolic reasoning, commonsense inference, and logical deduction. CoT provides less benefit for simple classification or factual recall tasks where single-step reasoning is sufficient.
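A minimal sketch of applying the zero-shot CoT trigger in code and pulling out the final answer. The "Answer:" extraction convention and regex are assumptions about how the model formats its reply.

```python
import re

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_messages(question: str) -> list[dict]:
    """Zero-shot CoT: append the trigger phrase so the model reasons before answering."""
    return [{"role": "user", "content": f"{question}\n{COT_TRIGGER}"}]

def extract_final_answer(completion: str) -> str | None:
    """Grab the last 'Answer: ...' line; assumes the model ends its reasoning that way."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

msgs = zero_shot_cot_messages(
    "Roger has 5 tennis balls. He buys 2 cans of tennis balls. "
    "Each can has 3 balls. How many tennis balls does he have now?"
)
# Send msgs through any chat client (see the earlier sketch), then run
# extract_final_answer() on the completion text.
```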
ReAct — Reasoning + Acting Interleaved

ReAct enables LLMs to act like agents by alternating between natural language reasoning (Thought) and executable tool calls (Action), reading the result (Observation), and looping until a final answer is reached.

Thought 1
I need to find the current CEO of NVIDIA. Let me search for this.
Action 1
Search["NVIDIA CEO 2026"]
Obs 1
Jensen Huang is the co-founder and CEO of NVIDIA Corporation.
Thought 2
I found the CEO. Now I need to verify when Jensen Huang founded NVIDIA to complete the full answer.
Action 2
Search["NVIDIA founding year Jensen Huang"]
Obs 2
NVIDIA was founded in 1993 by Jensen Huang, Chris Malachowsky, and Curtis Priem.
Final Answer
Jensen Huang is the CEO of NVIDIA. He co-founded the company in 1993.
ReAct vs Plan-and-Execute: ReAct decides actions dynamically one step at a time based on each observation. Plan-and-Execute (used in more advanced agents) first generates a complete plan, then executes all steps. ReAct is more adaptive; Plan-and-Execute is more predictable and auditable.
Production Prompt Patterns — Structured Output & NIM

Production systems need reliable, parseable output. These patterns ensure LLMs return machine-readable responses compatible with downstream applications.

JSON Output Enforcement — System Prompt Pattern
// System prompt that forces valid JSON output
"You are a product data extractor. Your task is to extract structured product information from unstructured text.

ALWAYS respond with valid JSON only. No explanation, no markdown, no extra text.

Use exactly this schema:
{
  \"product_name\": string,
  \"price\": number,
  \"category\": string,
  \"in_stock\": boolean
}

If a field cannot be determined, use null."

// Additional techniques:
// 1. Prefill assistant turn with "{" to force JSON start
// 2. Use grammar-constrained generation (NVIDIA NIM / TRT-LLM)
// 3. Set temperature=0 for deterministic structured output
1 · Prompt Caching

Cache the static system prompt at the prefix KV cache level. Only recompute the dynamic user turn. On NVIDIA NIM, long system prompts can be cached, reducing prefill latency and cost by 50–80% for repeat requests.

2 · Grammar-Constrained Decoding

Use NVIDIA TensorRT-LLM's structured output support to constrain the token sampling vocabulary to only tokens valid at each position in a JSON/XML schema — guaranteeing schema-valid output without relying solely on instruction following.

3 · Prompt Evaluation Pipeline

Build a systematic eval: create a golden dataset of (prompt, expected output) pairs, run all prompt variants against it, and measure accuracy, format adherence, and latency. Treat prompt changes like code changes — version-controlled and evaluated before deployment (a code sketch of this eval loop follows the list below).

4 · Fine-Tuning vs Prompting Decision

Use prompting when: task is well-defined with clear instructions, examples fit in context, low-volume use case. Use fine-tuning when: prompt alone can't achieve target accuracy, specific style/format is consistently needed, latency/cost at scale prohibits long prompts.
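A minimal sketch of the evaluation loop from item 3 above, assuming a golden set of (input, expected) pairs and a generic run_model callable; exact-match scoring and JSON-validity checking are illustrative criteria.

```python
import json

def is_valid_json(text: str) -> bool:
    """Format-adherence check: does the raw output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def evaluate_prompt(run_model, system_prompt: str, golden_set: list[dict]) -> dict:
    """Score one prompt variant against (input, expected) pairs.

    run_model is any callable(system_prompt, user_input) -> str.
    """
    correct = valid = 0
    for case in golden_set:
        output = run_model(system_prompt, case["input"])
        valid += is_valid_json(output)
        correct += output.strip() == case["expected"].strip()
    n = len(golden_set)
    return {"accuracy": correct / n, "format_adherence": valid / n}

# Version-control the prompt and its scores together, and gate deployment
# on the new variant matching or beating the current production scores.
```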

Compare Prompting Approaches
Side-by-side comparison of techniques, parameters, output formats, and production strategies. Filter by category.
Concept | Option A | Option B | When to Choose
System vs User Prompt (Anatomy) | System prompt — Sets persona, rules, and behavioral constraints. Highest trust. Cannot be overridden by user input in well-aligned models. | User prompt — The actual task or query. Lower trust. Processed after system context is established. | System prompt for stable instructions that apply to all queries; user prompt for task-specific dynamic content
Zero-Shot vs Few-Shot (Anatomy) | Zero-shot — Task description only, no examples. Fast, minimal tokens, relies on pre-trained knowledge. | Few-shot — 1–10 example pairs demonstrate desired format and behavior. Slower, more tokens, higher accuracy on format-sensitive tasks. | Zero-shot for standard tasks; few-shot when format is novel or accuracy needs improvement without fine-tuning
Temperature vs Top-p (Anatomy) | Temperature — Scales entire probability distribution. T=0 = greedy (deterministic). T=1 = unmodified. Higher = more random. | Top-p (nucleus sampling) — Samples only from the top tokens whose cumulative probability ≥ p. More stable than temperature for controlling diversity. | Lower temperature (0–0.3) for factual/code tasks; top-p 0.9 for balanced creative tasks; never set both high simultaneously
Context Window Management (Anatomy) | Prompt compression — Summarize chat history, remove redundant context, use retrieved snippets instead of full documents. | Extended context models — Use models with 128K+ context windows (e.g., Llama 3.1 with 128K, NIM endpoints) without compression. | Compression for cost/latency efficiency; extended context when full document fidelity is critical and cost is acceptable
Role Prompting vs Persona (Anatomy) | Role prompting — "You are a senior software engineer." Activates relevant domain knowledge and vocabulary patterns. | Persona prompting — Detailed character description with name, communication style, and domain expertise. More specific behavioral control. | Role prompting for expertise activation; persona prompting for consistent character in customer-facing products
CoT vs Standard Prompting (Technique) | Standard prompt — Direct question, immediate answer. Efficient for simple tasks. Higher error rate on multi-step reasoning. | Chain-of-Thought — Forces intermediate reasoning steps before final answer. +3–17% accuracy on reasoning tasks. Higher token cost. | Standard for simple classification or retrieval; CoT for arithmetic, logic, multi-step planning, or any task with intermediate dependencies
Zero-Shot CoT vs Few-Shot CoT (Technique) | Zero-shot CoT — "Let's think step by step" appended to the question. No examples needed. Good general performance. | Few-shot CoT — Include 3–8 solved examples with full reasoning chains. Higher accuracy, especially for domain-specific problem types. | Zero-shot CoT as default baseline; few-shot CoT when zero-shot misses domain-specific reasoning patterns
Greedy Decoding vs Sampling (Technique) | Greedy decoding — Always pick highest-probability token. Deterministic, efficient. Best for structured/factual output. | Sampling — Probabilistic token selection (temperature > 0). Diverse, creative outputs. Required for self-consistency. | Greedy for production fact-based tasks; sampling for creative generation and self-consistency ensemble methods
Direct Instruction vs Exemplar (Technique) | Direct instruction — "Classify this sentiment as positive, negative, or neutral. Return only the label." | Exemplar-based — Show 3 examples of input → label pairs, then present the test input. Format is taught by demonstration. | Direct instruction when the task is standard; exemplar-based when format is unconventional or edge cases need clear demonstration
Positive vs Negative Instructions (Technique) | Positive ("Do X") — "Always respond in bullet points. Keep answers under 100 words." | Negative ("Don't do X") — "Don't write long paragraphs. Don't use jargon." Less reliable — models sometimes follow the negated behavior. | Always prefer positive instructions; use negative only as reinforcement ("do X, not Y") since models follow positive directives more reliably
ReAct vs Plan-and-Execute (Advanced) | ReAct — Interleaved Thought → Action → Observation. Adaptive — each action decided based on prior observations. | Plan-and-Execute — Generate full plan first, then execute all steps. More predictable, auditable, and parallelizable. | ReAct for open-ended tasks requiring adaptation; Plan-and-Execute for structured workflows with known step sequences
Self-Consistency vs Single CoT (Advanced) | Single CoT — One reasoning chain, greedy or sampled. Fast, low cost. Sensitive to initial reasoning path quality. | Self-consistency — N=5–20 CoT samples, majority vote on final answer. Consistently +5–15% accuracy. Proportionally higher cost. | Single CoT as baseline; self-consistency when high accuracy justifies 5–20× token cost — especially for high-stakes reasoning
Tree of Thought vs CoT (Advanced) | CoT — Single linear reasoning chain. No backtracking. Fails when early steps go wrong. | Tree of Thought (ToT) — Multiple branches explored simultaneously. Dead-end branches pruned. Significantly better on puzzle and planning tasks. | CoT for most tasks; ToT for problems requiring exploration and backtracking (puzzles, constrained optimization, creative writing with constraints)
Prompt Chaining vs Single Prompt (Advanced) | Single prompt — One LLM call handles the full task. Simpler but forces the model to handle complexity all at once. | Prompt chaining — Output of one LLM call feeds into the next. Decompose complex tasks into verifiable intermediate steps. | Single prompt for straightforward tasks; chaining when tasks have distinct phases (extract → analyze → synthesize → format) that benefit from independent verification
RAG Prompting vs Parametric Knowledge (Advanced) | RAG prompting — Retrieved documents injected into context. Facts are current, traceable, and verifiable. | Parametric knowledge — Model answers from training knowledge. No retrieval overhead. Knowledge may be outdated or hallucinated. | RAG for knowledge requiring freshness, citations, or domain specificity; parametric for general knowledge tasks where hallucination risk is acceptable
Constitutional Prompting vs RLHF (Advanced) | Constitutional prompting — Include explicit rules in system prompt ("never reveal X", "always cite sources"). Inference-time control. | RLHF / SFT — Behavioral alignment baked into model weights during training. More robust but requires model retraining. | Constitutional prompting for rapid deployment-time policy enforcement; RLHF for persistent behavioral alignment that doesn't rely on prompt adherence
Prompting vs Fine-Tuning (Production) | Prompting — No training, instant iteration, flexible. Higher per-request token cost for long system prompts. | Fine-tuning — Behavior baked into weights, shorter prompts, lower latency, better style consistency. Requires training data and compute. | Start with prompting; fine-tune when prompt alone can't achieve target accuracy, format consistency is critical at scale, or long prompts are cost-prohibitive
Structured Output Approaches (Production) | Instruction-based — "Return JSON only with schema X." Relies on model following instructions. May fail on edge cases. | Grammar-constrained decoding — Token sampling restricted to schema-valid tokens at each position. Guarantees valid output. Requires TRT-LLM or similar framework support. | Instruction-based for development; grammar-constrained for production where malformed output would break downstream systems
Prompt Caching (Production) | No caching — Full prompt (system + user) recomputed each request. Higher latency and cost for long system prompts. | Prefix caching — Static system prompt KV cache stored server-side. Only dynamic user turn recomputed. 50–80% prefill cost reduction. | Enable prefix caching on NVIDIA NIM whenever the system prompt is >200 tokens and repeated across many requests — immediate ROI
Prompt Versioning (Production) | Ad hoc prompts — Prompts changed informally with no tracking. Hard to debug regressions or roll back changes. | Version-controlled prompts — Prompts stored in git alongside evaluation results. Changes require eval suite passage before deployment. | Always version-control production prompts; treat them with the same discipline as application code — regressions in prompt quality are real bugs
NIM Chat vs Completion Format (Production) | Chat format (messages[]) — Structured array of role/content objects. Preferred for instruction-following models. Supports system prompt natively. | Completion format (prompt) — Raw text string. Legacy format for base (non-instruct) models. Requires manual role tokens if needed. | Use chat format for all modern instruction-tuned models via NVIDIA NIM; completion format only for base model experimentation
Automatic Prompt Optimization (Production) | Manual iteration — Human writes, tests, and refines prompts based on output quality. Slow, expertise-dependent. | Automatic Prompt Engineer (APE) — Use an LLM to generate and evaluate candidate prompts against a dataset. Finds non-obvious phrasings. | Manual for initial prompt design; APE or DSPy-style optimization when manual iteration plateaus and a labeled eval set is available
Real-World Prompting Examples
Four end-to-end scenarios showing how prompting techniques are applied in practice — from few-shot formatting to full ReAct agent loops.
Example 1 · Few-Shot Prompting

Technical Support Classifier — Teaching Format with Examples

A SaaS company needs to classify support tickets into categories (billing, bug, feature-request, account) so they can route them to the right team. Zero-shot prompting gives inconsistent category names. Few-shot fixes this.

Problem: Zero-shot returns "billing issue", "Billing", "payment problem" for the same category — breaking the downstream routing logic.

  1. Define the canonical output set: Restrict to exactly four labels: billing | bug | feature-request | account. Include this as a constraint in the system prompt.
  2. Create 3–5 labeled examples: Select edge cases that demonstrate the hardest classification boundaries — e.g., "I can't login" (account, not bug) vs "Login button doesn't work" (bug).
  3. Format examples consistently: Each shot follows the pattern: Ticket: "[text]" → Category: [label]. Consistent formatting teaches the model to match the exact output format (see the prompt-construction sketch after this example).
  4. Set temperature to 0: Classification tasks benefit from greedy decoding — only one correct answer exists, so randomness hurts accuracy.
  5. Validate on held-out set: Run 100 historical tickets through the few-shot prompt and confirm >95% match human labels before deploying.
Result: Classification accuracy improves from 72% (zero-shot) to 94% (few-shot with 5 examples) with consistent, machine-parseable output format. Routing logic works reliably. No fine-tuning required.
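A minimal sketch of the few-shot classifier prompt from steps 1–4, with the example pairs passed as alternating user/assistant turns; the specific tickets and the mention of model settings are illustrative.

```python
# Minimal sketch of the few-shot ticket classifier (steps 1-4 above).
# The example tickets are illustrative assumptions.
LABELS = ["billing", "bug", "feature-request", "account"]

SYSTEM = (
    "You classify support tickets. Respond with exactly one label from: "
    + " | ".join(LABELS)
)

FEW_SHOT = [
    ('Ticket: "I was charged twice this month."', "Category: billing"),
    ('Ticket: "I can\'t login to my account."', "Category: account"),
    ('Ticket: "Login button doesn\'t work on mobile."', "Category: bug"),
    ('Ticket: "Please add dark mode."', "Category: feature-request"),
]

def build_messages(ticket_text: str) -> list[dict]:
    """Few-shot pattern: alternate example inputs (user) and labels (assistant)."""
    messages = [{"role": "system", "content": SYSTEM}]
    for ticket, label in FEW_SHOT:
        messages.append({"role": "user", "content": ticket})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f'Ticket: "{ticket_text}"'})
    return messages

# Send with temperature=0 (greedy) so the single correct label is returned consistently.
```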
Example 2 · Chain-of-Thought + Self-Consistency

Financial Analysis — Reliable Reasoning at High Stakes

A fintech company uses an LLM to analyze earnings reports and estimate year-over-year revenue growth. Single CoT passes give correct answers ~78% of the time — not good enough for financial decisions.

Solution: Self-consistency with Chain-of-Thought reasoning.

  1. Write few-shot CoT prompt: Include 3 worked examples that show full arithmetic reasoning: "Q3 revenue was $X. Q3 last year was $Y. Growth = (X-Y)/Y × 100 = Z%." Make the reasoning explicit and consistent.
  2. Set temperature to 0.7: Self-consistency requires diverse reasoning paths, so sampling (not greedy) is needed. Temperature 0.7 produces variety without incoherence.
  3. Sample N=10 completions: Generate 10 independent CoT responses for each earnings report query. Each response independently calculates the answer via its own reasoning path.
  4. Extract final answers: Parse the "Answer: X%" from each of the 10 responses using a regex or secondary extraction prompt.
  5. Take majority vote: The answer appearing most frequently (e.g., "12.4%" in 7/10 responses) is returned as the final answer (the sampling-and-voting loop is sketched in code below).
Result: Accuracy improves from 78% (single CoT) to 91% (self-consistency N=10). The additional cost (10× token usage) is justified given that each analysis informs a financial decision worth thousands of dollars.
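A minimal sketch of the sample-and-vote loop from steps 2–5; ask_model stands in for any chat-completion call, and the answer-extraction regex assumes the "Answer: X%" convention described in step 4.

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"Answer:\s*([-\d.]+%?)")

def self_consistency(ask_model, messages: list[dict], n: int = 10) -> str | None:
    """Sample n CoT completions at temperature > 0 and majority-vote the final answer.

    ask_model is any callable(messages, temperature) -> str completion.
    """
    answers = []
    for _ in range(n):
        completion = ask_model(messages, temperature=0.7)  # diverse reasoning paths
        match = ANSWER_RE.search(completion)
        if match:
            answers.append(match.group(1))
    if not answers:
        return None
    # Majority vote: the most frequent extracted answer wins.
    return Counter(answers).most_common(1)[0][0]
```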
Example 3 · ReAct Agent Loop

Research Assistant — Multi-Step Tool-Augmented Question Answering

A legal research tool needs to answer complex questions that require looking up case law, checking statutes, and cross-referencing dates — impossible to answer reliably from parametric knowledge alone.

Solution: ReAct prompting with search and database tool calls.

  1. Define tool signatures in system prompt: Tell the model about available tools: search_caselaw(query), lookup_statute(code, section), get_date(case_id). Each tool described with input/output format.
  2. Provide ReAct format example: Show one complete Thought → Action → Observation → Thought → ... → Answer trace in the few-shot examples so the model knows exactly what format to follow.
  3. Implement tool execution layer: Parse Action outputs from the model, execute the actual tool, and inject results as Observation in the next turn (see the loop sketch after this example).
  4. Add loop termination logic: Detect when the model produces "Final Answer:" to stop the agent loop. Also add max-step guardrails (e.g., abort after 10 steps) to prevent infinite loops.
  5. Log full traces for audit: Legal use case requires every reasoning step and source citation to be preserved for human review and compliance.
Result: Answer accuracy on multi-hop legal questions improves from 43% (single RAG call) to 81% (ReAct with 3–5 tool calls). Auditable reasoning traces satisfy legal compliance requirements.
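A minimal sketch of the agent loop from steps 3–4. The Action-parsing regex assumes the 'Action N: Tool["input"]' format shown in the trace earlier; the real tools (search_caselaw, lookup_statute, get_date) would be registered in the tools dict.

```python
import re

ACTION_RE = re.compile(r"Action \d+:\s*(\w+)\[\"?(.*?)\"?\]")

def react_loop(ask_model, tools: dict, messages: list[dict], max_steps: int = 10) -> str:
    """ReAct loop: Thought -> Action -> Observation, repeated until a Final Answer."""
    for step in range(1, max_steps + 1):
        reply = ask_model(messages)
        messages.append({"role": "assistant", "content": reply})

        if "Final Answer:" in reply:                      # termination condition (step 4)
            return reply.split("Final Answer:", 1)[1].strip()

        match = ACTION_RE.search(reply)
        if not match:
            return "ABORTED: no action or final answer produced"
        tool_name, tool_input = match.groups()
        observation = tools[tool_name](tool_input)        # execute the real tool (step 3)
        # Feed the result back as an Observation turn for the next reasoning step.
        messages.append(
            {"role": "user", "content": f"Observation {step}: {observation}"}
        )
    return "ABORTED: max steps reached"                   # guardrail against infinite loops
```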
Example 4 · Production NIM Deployment

Enterprise Document Extraction — Structured Output at Scale

A logistics company processes 50,000 shipping documents per day. They need to extract 12 structured fields from each document (shipper, receiver, weight, dimensions, HS code, etc.) into a validated JSON object for their ERP system.

Solution: Optimized production prompt pipeline on NVIDIA NIM.

  1. Design system prompt with JSON schema: Define the exact output schema in the system prompt with field types, required vs optional, and format constraints. Set temperature=0 for deterministic output.
  2. Enable grammar-constrained decoding: Configure TensorRT-LLM's guided decoding with the JSON schema. Token sampling is constrained so only syntactically valid JSON tokens are possible at each position — zero malformed outputs (a request sketch follows this example).
  3. Enable prefix caching on NIM: The 800-token system prompt (schema + instructions) is identical for all 50K daily requests. NIM caches the KV state for this prefix, eliminating 800 tokens of prefill computation per request — ~65% latency reduction.
  4. Build eval pipeline: Maintain 500 manually annotated documents as ground truth. Every prompt change is evaluated against this set for field-level accuracy before deployment to production.
  5. Implement fallback routing: If any required field is null in the extracted JSON, route document to human review queue rather than failing silently.
Result: 99.2% schema-valid JSON output (vs 87% instruction-only baseline). Prefix caching reduces P50 latency from 340ms to 118ms. Human review queue drops 70% vs previous template-based extraction system.
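Steps 1, 2, and 5 sketched as a single OpenAI-compatible request to a NIM endpoint. The nvext/guided_json extra-body extension is how NIM deployments commonly expose guided decoding, but the exact field names, endpoint URL, and model are assumptions to verify against the deployed NIM's documentation.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://nim-host:8000/v1", api_key="not-needed")  # placeholder NIM endpoint

# Illustrative subset of the 12-field schema from the scenario.
SHIPPING_SCHEMA = {
    "type": "object",
    "properties": {
        "shipper": {"type": ["string", "null"]},
        "receiver": {"type": ["string", "null"]},
        "weight_kg": {"type": ["number", "null"]},
        "hs_code": {"type": ["string", "null"]},
    },
    "required": ["shipper", "receiver", "weight_kg", "hs_code"],
}

document_text = "..."  # raw text of one shipping document (placeholder)

response = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",  # placeholder model name
    temperature=0,                        # deterministic structured output (step 1)
    messages=[
        {"role": "system", "content": "Extract shipping fields as JSON matching the provided schema. Use null for unknown fields."},
        {"role": "user", "content": document_text},
    ],
    # Guided decoding via the NIM/TRT-LLM extension (step 2); field name may vary by NIM version.
    extra_body={"nvext": {"guided_json": SHIPPING_SCHEMA}},
)

parsed = json.loads(response.choices[0].message.content)
# Fallback routing (step 5): any null required field goes to human review.
needs_human_review = any(parsed.get(k) is None for k in SHIPPING_SCHEMA["required"])
```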
Practice Quiz — Prompt Engineering
10 NCA-GENL style questions with instant explanations. Covers all four prompting pillars.
Prompt Advisor
Describe your prompting challenge and get a tailored technique recommendation.
Memory Hooks — Flip Cards
8 core prompt engineering concepts to lock in before exam day. Click to flip.
Pillar 2 · Technique

Zero-shot CoT trigger phrase?

Click to reveal →

"Let's think step by step."

Appended to the question, this phrase activates chain-of-thought reasoning without any labeled examples. Works because it instructs the model to allocate tokens to intermediate reasoning before the final answer.

Pillar 1 · Anatomy

3 message roles in LLM conversations?

Click to reveal →

1. System — persona, rules, behavioral constraints. Highest trust.
2. User — the human query. Lower trust. Dynamic per request.
3. Assistant — prior model responses. Context for multi-turn conversations.

Pillar 2 · Technique

Few-shot vs Zero-shot — key tradeoff?

Click to reveal →

Zero-shot: No examples → lower tokens, faster, relies on pre-training.
Few-shot: 1–10 examples → higher tokens, slower, but teaches format and improves accuracy on novel or format-sensitive tasks.
Use few-shot when zero-shot accuracy is insufficient.

Pillar 3 · Advanced

ReAct loop components?

Click to reveal →

Thought → what the model reasons it should do next
Action → the tool call it executes (search, calc, DB query)
Observation → the tool's return value
Repeats until → Final Answer

ReAct = Reasoning + Acting

Pillar 3 · Advanced

How does self-consistency work?

Click to reveal →

1. Sample N CoT paths (temperature > 0 for diversity, typically N=5–20)
2. Extract final answer from each path
3. Take majority vote — most common answer wins

Correct reasoning paths converge; incorrect ones diverge. +5–15% over single CoT.

Pillar 1 · Anatomy

Temperature 0 vs Temperature 1?

Click to reveal →

T=0: Greedy decoding — always pick highest probability token. Fully deterministic. Best for: code, JSON, factual Q&A.

T=1: Unmodified probability distribution. Balanced creativity. Use for: conversation, CoT, balanced tasks.

T>1: Amplified randomness. Creative/experimental only.

Pillar 3 · Advanced

Tree of Thought vs Chain-of-Thought?

Click to reveal →

CoT: Single linear reasoning chain. No backtracking. If early step is wrong, final answer is wrong.

ToT: Explores multiple parallel branches simultaneously. Prunes dead ends. Backtracks. Much better for puzzles, planning, and creative tasks with constraints.

Pillar 4 · Production

When to fine-tune vs prompt?

Click to reveal →

Prompt when: Task is well-defined, examples fit in context, low volume, need fast iteration.

Fine-tune when: Prompt alone can't hit accuracy target, consistent style needed at scale, long prompts are too expensive, latency is critical.

Click any card to flip · Click again to return

🟢 NVIDIA NCA-GENL Exam Prep Platform

Ready to Pass the NCA-GENL?

Access 500+ practice questions, full topic guides, and adaptive flashcards — all aligned to the latest NVIDIA NCA-GENL exam objectives.