Master prompt anatomy, few-shot and chain-of-thought techniques, advanced reasoning patterns, and NVIDIA-specific production prompting for the NCA-GENL certification.
Every LLM prompt is composed of distinct roles — system (behavioral context), user (the query), and assistant (the response). Understanding how each role shapes model behavior, context window management, and token budgeting is foundational to all prompting work.
The core prompting toolkit: zero-shot (no examples), few-shot (1–10 demonstrations), Chain-of-Thought (step-by-step reasoning), and zero-shot CoT ("Let's think step by step"). These techniques directly improve accuracy on reasoning, classification, and generation tasks.
Advanced prompting techniques for complex tasks: ReAct interleaves reasoning with tool calls, Tree of Thought explores multiple solution branches simultaneously, and self-consistency samples multiple CoT paths and votes on the most common answer — dramatically improving accuracy on hard problems.
Production prompting requires structured output enforcement (JSON, XML), prompt caching for cost reduction, systematic evaluation against ground truth, and NVIDIA-specific patterns — including NIM microservice prompt formats, NeMo's chat templates, and iterative prompt optimization workflows.
Modern LLMs use a structured multi-turn conversation format. Each message has a role that determines how the model interprets the content. Getting this structure right is the foundation of effective prompting.
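A minimal sketch of the chat message structure, assuming an OpenAI-compatible endpoint such as a locally deployed NVIDIA NIM microservice (the base URL, API key, and model name below are placeholders):

```python
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server (e.g., a local NIM deployment)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # placeholder model name
    messages=[
        # System: persona, rules, behavioral constraints (highest trust)
        {"role": "system", "content": "You are a concise technical assistant. Answer in at most three sentences."},
        # User: the first query (dynamic per request)
        {"role": "user", "content": "What is a context window?"},
        # Assistant: a prior model turn, kept as context for multi-turn conversations
        {"role": "assistant", "content": "It is the maximum number of tokens the model can attend to at once."},
        # User: the follow-up question the model answers now
        {"role": "user", "content": "And what happens when a conversation exceeds it?"},
    ],
    temperature=0.2,
    max_tokens=128,
)
print(response.choices[0].message.content)
```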
Techniques are ordered from simplest (fewest tokens, least latency) to most complex (most powerful but highest token cost and latency).
CoT dramatically improves multi-step reasoning by making the model show its work. Zero-shot CoT achieves most of the benefit without requiring labeled reasoning examples.
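A quick sketch of zero-shot CoT: the trigger phrase is appended to the question so the model produces intermediate reasoning before the answer. The question and client setup are illustrative, assuming the same placeholder OpenAI-compatible endpoint as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

question = (
    "A warehouse ships 240 boxes per day. Each truck holds 36 boxes. "
    "How many trucks are needed per day?"
)

# Zero-shot CoT: the appended phrase makes the model spend tokens on
# step-by-step reasoning before committing to a final answer.
response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
    temperature=0.0,  # greedy decoding: one deterministic reasoning chain
)
print(response.choices[0].message.content)
```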
ReAct enables LLMs to act like agents by alternating between natural language reasoning (Thought) and executable tool calls (Action), reading the result (Observation), and looping until a final answer is reached.
Production systems need reliable, parseable output. These patterns ensure LLMs return machine-readable responses compatible with downstream applications.
Cache the static system prompt at the prefix KV cache level. Only recompute the dynamic user turn. On NVIDIA NIM, long system prompts can be cached, reducing prefill latency and cost by 50–80% for repeat requests.
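Prefix caching itself is configured server-side; what the prompt author controls is keeping the long, static instructions in a stable prefix so the cache can be reused. A minimal sketch, assuming a NIM deployment with prefix/KV caching enabled (the URL, model name, and policy text are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Keep the stable policy text in one unchanging system prompt so the server
# can reuse its KV cache across requests; only the user turn varies per call.
STATIC_SYSTEM_PROMPT = (
    "You are a support triage assistant for AcmeCo.\n"
    "Rules:\n"
    "1. Never reveal internal ticket IDs.\n"
    "2. Always answer in English.\n"
    "(... several hundred tokens of stable policy text ...)"
)

def triage(ticket_text: str) -> str:
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable prefix
            {"role": "user", "content": ticket_text},             # recomputed each request
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```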
Use NVIDIA TensorRT-LLM's structured output support to constrain the token sampling vocabulary to only tokens valid at each position in a JSON/XML schema — guaranteeing schema-valid output without relying solely on instruction following.
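A hedged sketch of schema-constrained generation. Some NIM deployments expose guided decoding through an `nvext` extension on the OpenAI-compatible API; the `guided_json` field name below is an assumption, so check your deployment's structured-generation documentation before relying on it:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# JSON Schema the decoder is constrained to: tokens that would violate it are never sampled.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "feature-request", "account"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["category", "priority"],
}

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "Classify the support ticket. Return JSON only."},
        {"role": "user", "content": "I was charged twice for my subscription this month."},
    ],
    # Assumption: this deployment accepts guided decoding via the nvext extension.
    extra_body={"nvext": {"guided_json": TICKET_SCHEMA}},
    temperature=0.0,
)
print(json.loads(response.choices[0].message.content))
```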
Build a systematic eval: create a golden dataset of (prompt, expected output) pairs, run all prompt variants against it, measure accuracy/format adherence/latency. Treat prompt changes like code changes — version-controlled and evaluated before deployment.
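A minimal sketch of such an eval harness; the golden dataset, prompt variants, and exact-match scoring rule are all illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Golden dataset: (input, expected output) pairs, version-controlled alongside the prompts.
GOLDEN = [
    ("I was double-billed this month.", "billing"),
    ("The export button crashes the app.", "bug"),
    ("Please add dark mode.", "feature-request"),
]

PROMPT_VARIANTS = {
    "v1": "Classify the ticket as billing, bug, feature-request, or account. Return only the label.",
    "v2": "You are a triage bot. Output exactly one label from: billing | bug | feature-request | account.",
}

def run_eval(system_prompt: str) -> float:
    """Run one prompt variant over the golden set and return exact-match accuracy."""
    correct = 0
    for ticket, expected in GOLDEN:
        reply = client.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": ticket}],
            temperature=0.0,
        ).choices[0].message.content.strip().lower()
        correct += int(reply == expected)
    return correct / len(GOLDEN)

for name, prompt in PROMPT_VARIANTS.items():
    print(name, run_eval(prompt))
```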
Use prompting when: task is well-defined with clear instructions, examples fit in context, low-volume use case. Use fine-tuning when: prompt alone can't achieve target accuracy, specific style/format is consistently needed, latency/cost at scale prohibits long prompts.
| Concept | Option A | Option B | When to Choose |
|---|---|---|---|
| System vs User Prompt (anatomy) | System prompt — Sets persona, rules, and behavioral constraints. Highest trust. Cannot be overridden by user input in well-aligned models. | User prompt — The actual task or query. Lower trust. Processed after system context is established. | System prompt for stable instructions that apply to all queries; user prompt for task-specific dynamic content |
| Zero-Shot vs Few-Shot (anatomy) | Zero-shot — Task description only, no examples. Fast, minimal tokens, relies on pre-trained knowledge. | Few-shot — 1–10 example pairs demonstrate desired format and behavior. Slower, more tokens, higher accuracy on format-sensitive tasks. | Zero-shot for standard tasks; few-shot when format is novel or accuracy needs improvement without fine-tuning |
| Temperature vs Top-p (anatomy) | Temperature — Scales entire probability distribution. T=0 = greedy (deterministic). T=1 = unmodified. Higher = more random. | Top-p (nucleus sampling) — Samples only from the top tokens whose cumulative probability ≥ p. More stable than temperature for controlling diversity. | Lower temperature (0–0.3) for factual/code tasks; Top-p 0.9 for balanced creative tasks; never set both high simultaneously |
| Context Window Management (anatomy) | Prompt compression — Summarize chat history, remove redundant context, use retrieved snippets instead of full documents. | Extended context models — Use models with 128K+ context windows (e.g., Llama 3.1 with 128K, NIM endpoints) without compression. | Compression for cost/latency efficiency; extended context when full document fidelity is critical and cost is acceptable |
| Role Prompting vs Persona (anatomy) | Role prompting — "You are a senior software engineer." Activates relevant domain knowledge and vocabulary patterns. | Persona prompting — Detailed character description with name, communication style, and domain expertise. More specific behavioral control. | Role prompting for expertise activation; persona prompting for consistent character in customer-facing products |
| CoT vs Standard Prompting (technique) | Standard prompt — Direct question, immediate answer. Efficient for simple tasks. Higher error rate on multi-step reasoning. | Chain-of-Thought — Forces intermediate reasoning steps before final answer. +3–17% accuracy on reasoning tasks. Higher token cost. | Standard for simple classification or retrieval; CoT for arithmetic, logic, multi-step planning, or any task with intermediate dependencies |
| Zero-Shot CoT vs Few-Shot CoT (technique) | Zero-shot CoT — "Let's think step by step" appended to the question. No examples needed. Good general performance. | Few-shot CoT — Include 3–8 solved examples with full reasoning chains. Higher accuracy, especially for domain-specific problem types. | Zero-shot CoT as default baseline; few-shot CoT when zero-shot misses domain-specific reasoning patterns |
| Greedy Decoding vs Sampling (technique) | Greedy decoding — Always pick highest-probability token. Deterministic, efficient. Best for structured/factual output. | Sampling — Probabilistic token selection (temperature > 0). Diverse, creative outputs. Required for self-consistency. | Greedy for production fact-based tasks; sampling for creative generation and self-consistency ensemble methods |
| Direct Instruction vs Exemplar (technique) | Direct instruction — "Classify this sentiment as positive, negative, or neutral. Return only the label." | Exemplar-based — Show 3 examples of input → label pairs, then present the test input. Format is taught by demonstration. | Direct instruction when the task is standard; exemplar-based when format is unconventional or edge cases need clear demonstration |
| Positive vs Negative Instructions (technique) | Positive ("Do X") — "Always respond in bullet points. Keep answers under 100 words." | Negative ("Don't do X") — "Don't write long paragraphs. Don't use jargon." Less reliable — models sometimes follow the negated behavior. | Always prefer positive instructions; use negative only as reinforcement ("do X, not Y") since models follow positive directives more reliably |
| ReAct vs Plan-and-Execute (advanced) | ReAct — Interleaved Thought → Action → Observation. Adaptive — each action decided based on prior observations. | Plan-and-Execute — Generate full plan first, then execute all steps. More predictable, auditable, and parallelizable. | ReAct for open-ended tasks requiring adaptation; Plan-and-Execute for structured workflows with known step sequences |
| Self-Consistency vs Single CoT (advanced) | Single CoT — One reasoning chain, greedy or sampled. Fast, low cost. Sensitive to initial reasoning path quality. | Self-consistency — N=5–20 CoT samples, majority vote on final answer. Consistently +5–15% accuracy. Proportionally higher cost. | Single CoT as baseline; self-consistency when high accuracy justifies 5–20× token cost — especially for high-stakes reasoning |
| Tree of Thought vs CoT (advanced) | CoT — Single linear reasoning chain. No backtracking. Fails when early steps go wrong. | Tree of Thought (ToT) — Multiple branches explored simultaneously. Dead-end branches pruned. Significantly better on puzzle and planning tasks. | CoT for most tasks; ToT for problems requiring exploration and backtracking (puzzles, constrained optimization, creative writing with constraints) |
| Prompt Chaining vs Single Prompt (advanced) | Single prompt — One LLM call handles the full task. Simpler but forces the model to handle complexity all at once. | Prompt chaining — Output of one LLM call feeds into the next. Decompose complex tasks into verifiable intermediate steps. | Single prompt for straightforward tasks; chaining when tasks have distinct phases (extract → analyze → synthesize → format) that benefit from independent verification |
| RAG Prompting vs Parametric Knowledge (advanced) | RAG prompting — Retrieved documents injected into context. Facts are current, traceable, and verifiable. | Parametric knowledge — Model answers from training knowledge. No retrieval overhead. Knowledge may be outdated or hallucinated. | RAG for knowledge requiring freshness, citations, or domain specificity; parametric for general knowledge tasks where hallucination risk is acceptable |
| Constitutional Prompting vs RLHF (advanced) | Constitutional prompting — Include explicit rules in system prompt ("never reveal X", "always cite sources"). Inference-time control. | RLHF / SFT — Behavioral alignment baked into model weights during training. More robust but requires model retraining. | Constitutional prompting for rapid deployment-time policy enforcement; RLHF for persistent behavioral alignment that doesn't rely on prompt adherence |
| Prompting vs Fine-Tuning (production) | Prompting — No training, instant iteration, flexible. Higher per-request token cost for long system prompts. | Fine-tuning — Behavior baked into weights, shorter prompts, lower latency, better style consistency. Requires training data and compute. | Start with prompting; fine-tune when prompt alone can't achieve target accuracy, format consistency is critical at scale, or long prompts are cost-prohibitive |
| Structured Output Approaches (production) | Instruction-based — "Return JSON only with schema X." Relies on model following instructions. May fail on edge cases. | Grammar-constrained decoding — Token sampling restricted to schema-valid tokens at each position. Guarantees valid output. Requires TRT-LLM or similar framework support. | Instruction-based for development; grammar-constrained for production where malformed output would break downstream systems |
| Prompt Caching (production) | No caching — Full prompt (system + user) recomputed each request. Higher latency and cost for long system prompts. | Prefix caching — Static system prompt KV cache stored server-side. Only dynamic user turn recomputed. 50–80% prefill cost reduction. | Enable prefix caching on NVIDIA NIM whenever the system prompt is >200 tokens and repeated across many requests — immediate ROI |
| Prompt Versioning (production) | Ad hoc prompts — Prompts changed informally with no tracking. Hard to debug regressions or roll back changes. | Version-controlled prompts — Prompts stored in git alongside evaluation results. Changes require eval suite passage before deployment. | Always version-control production prompts; treat them with the same discipline as application code — regressions in prompt quality are real bugs |
| NIM Chat vs Completion Format (production) | Chat format (messages[]) — Structured array of role/content objects. Preferred for instruction-following models. Supports system prompt natively. | Completion format (prompt) — Raw text string. Legacy format for base (non-instruct) models. Requires manual role tokens if needed. | Use chat format for all modern instruction-tuned models via NVIDIA NIM; completion format only for base model experimentation |
| Automatic Prompt Optimization (production) | Manual iteration — Human writes, tests, and refines prompts based on output quality. Slow, expertise-dependent. | Automatic Prompt Engineer (APE) — Use an LLM to generate and evaluate candidate prompts against a dataset. Finds non-obvious phrasings. | Manual for initial prompt design; APE or DSPy-style optimization when manual iteration plateaus and a labeled eval set is available |
A SaaS company needs to classify support tickets into categories (billing, bug, feature-request, account) so they can route them to the right team. Zero-shot prompting gives inconsistent category names. Few-shot fixes this.
Problem: Zero-shot returns "billing issue", "Billing", "payment problem" for the same category — breaking the downstream routing logic.
Solution: Few-shot prompting with a constrained label set.
Constrain the output to exactly four labels: billing | bug | feature-request | account. Include this as a constraint in the system prompt.
Format every few-shot example as Ticket: "[text]" → Category: [label]. Consistent formatting teaches the model to match the exact output format.
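A sketch of the resulting few-shot prompt; the label set comes from the scenario above, while the example tickets, endpoint, and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

FEW_SHOT_MESSAGES = [
    {"role": "system", "content":
        "Classify each support ticket into exactly one category: "
        "billing | bug | feature-request | account. Return only the label."},
    # Few-shot demonstrations teach the exact output format.
    {"role": "user", "content": 'Ticket: "I was charged twice this month."'},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": 'Ticket: "The dashboard shows a blank page after login."'},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": 'Ticket: "Can you add SSO support for Okta?"'},
    {"role": "assistant", "content": "feature-request"},
]

def classify(ticket_text: str) -> str:
    # Present the new ticket in the same format as the demonstrations.
    messages = FEW_SHOT_MESSAGES + [{"role": "user", "content": f'Ticket: "{ticket_text}"'}]
    reply = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct", messages=messages, temperature=0.0
    )
    return reply.choices[0].message.content.strip()
```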
A fintech company uses an LLM to analyze earnings reports and estimate year-over-year revenue growth. Single CoT passes give correct answers ~78% of the time — not good enough for financial decisions.
Solution: Self-consistency with Chain-of-Thought reasoning.
A legal research tool needs to answer complex questions that require looking up case law, checking statutes, and cross-referencing dates — impossible to answer reliably from parametric knowledge alone.
Solution: ReAct prompting with search and database tool calls.
Define the tools available to the model: search_caselaw(query), lookup_statute(code, section), get_date(case_id). Each tool is described with its input/output format.
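A simplified sketch of the ReAct loop with the three tools from this scenario; the tool implementations, prompt wording, parsing, and stop condition are all illustrative, and the endpoint is a placeholder:

```python
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Tool stubs: a real system would query a case-law index and a statute database.
def search_caselaw(query): return f"Top case for '{query}': Smith v. Jones (2019)."
def lookup_statute(code, section): return f"{code} §{section}: limitation period is 4 years."
def get_date(case_id): return "2019-06-14"

TOOLS = {"search_caselaw": search_caselaw, "lookup_statute": lookup_statute, "get_date": get_date}

SYSTEM = (
    "Answer legal research questions by interleaving Thought, Action, and Observation.\n"
    "Available actions: search_caselaw(query), lookup_statute(code, section), get_date(case_id).\n"
    "Format:\nThought: <reasoning>\nAction: <tool>(<args>)\n"
    "Write 'Final Answer: <answer>' once you have enough information."
)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": transcript}],
            temperature=0.0,
            stop=["Observation:"],  # our code, not the model, supplies observations
        ).choices[0].message.content
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        match = re.search(r"Action:\s*(\w+)\((.*)\)", reply)
        if not match:
            break
        name, raw_args = match.groups()
        args = [a.strip().strip("\"'") for a in raw_args.split(",") if a.strip()]
        observation = TOOLS[name](*args)  # execute the requested tool call
        transcript += f"Observation: {observation}\n"
    return transcript
```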
A logistics company processes 50,000 shipping documents per day. They need to extract 12 structured fields from each document (shipper, receiver, weight, dimensions, HS code, etc.) into a validated JSON object for their ERP system.
Solution: Optimized production prompt pipeline on NVIDIA NIM.
If a field cannot be found, return null in the extracted JSON and route the document to a human review queue rather than failing silently.
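A sketch of the extraction step of such a pipeline; the field list is abbreviated (the scenario extracts 12 fields), and the prompt wording, endpoint, and validation logic are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Abbreviated field list for the sketch; the real pipeline extracts 12 fields.
EXTRACTION_SYSTEM_PROMPT = (
    "Extract shipping fields from the document and return JSON only, with keys: "
    "shipper, receiver, weight_kg, hs_code. "
    "If a field cannot be found, set its value to null. Do not guess."
)

REQUIRED_FIELDS = ("shipper", "receiver", "weight_kg", "hs_code")

def extract(document_text: str) -> dict | None:
    reply = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[{"role": "system", "content": EXTRACTION_SYSTEM_PROMPT},
                  {"role": "user", "content": document_text}],
        temperature=0.0,
    ).choices[0].message.content
    try:
        record = json.loads(reply)
    except json.JSONDecodeError:
        return None  # malformed output: caller routes the document to human review
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return None  # missing fields also go to human review rather than failing silently
    return record
```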
"Let's think step by step."
Appended to the question, this phrase activates chain-of-thought reasoning without any labeled examples. It works because it instructs the model to allocate tokens to intermediate reasoning before the final answer.
1. System — persona, rules, behavioral constraints. Highest trust.
2. User — the human query. Lower trust. Dynamic per request.
3. Assistant — prior model responses. Context for multi-turn conversations.
Zero-shot: No examples → lower tokens, faster, relies on pre-training.
Few-shot: 1–10 examples → higher tokens, slower, but teaches format and improves accuracy on novel or format-sensitive tasks.
Use few-shot when zero-shot accuracy is insufficient.
Thought → what the model reasons it should do next
Action → the tool call it executes (search, calc, DB query)
Observation → the tool's return value
Repeats until → Final Answer
ReAct = Reasoning + Acting
1. Sample N CoT paths (temperature > 0 for diversity, typically N=5–20)
2. Extract final answer from each path
3. Take majority vote — most common answer wins
Correct reasoning paths converge; incorrect ones diverge. +5–15% over single CoT.
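A sketch of the voting loop; N, the temperature, and the answer-extraction heuristic are all illustrative, and the endpoint is a placeholder:

```python
import re
from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

def self_consistent_answer(question: str, n: int = 10) -> str:
    answers = []
    for _ in range(n):
        reply = client.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",
            messages=[{"role": "user",
                       "content": question + "\n\nLet's think step by step. "
                                  "End with 'Answer: <value>'."}],
            temperature=0.7,  # sampling > 0 so the reasoning paths differ
        ).choices[0].message.content
        match = re.search(r"Answer:\s*(.+)", reply)
        if match:
            answers.append(match.group(1).strip())
    if not answers:
        return ""
    # Majority vote: correct reasoning paths tend to converge on the same answer.
    return Counter(answers).most_common(1)[0][0]
```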
T=0: Greedy decoding — always pick highest probability token. Fully deterministic. Best for: code, JSON, factual Q&A.
T=1: Unmodified probability distribution. Balanced creativity. Use for: conversation, CoT, balanced tasks.
T>1: Amplified randomness. Creative/experimental only.
CoT: Single linear reasoning chain. No backtracking. If early step is wrong, final answer is wrong.
ToT: Explores multiple parallel branches simultaneously. Prunes dead ends. Backtracks. Much better for puzzles, planning, and creative tasks with constraints.
Prompt when: Task is well-defined, examples fit in context, low volume, need fast iteration.
Fine-tune when: Prompt alone can't hit accuracy target, consistent style needed at scale, long prompts are too expensive, latency is critical.