Master prompt anatomy, few-shot and chain-of-thought techniques, advanced reasoning patterns, and NVIDIA-specific production prompting for the NCA-GENL certification.
Every LLM prompt is composed of distinct roles — system (behavioral context), user (the query), and assistant (the response). Understanding how each role shapes model behavior, context window management, and token budgeting is foundational to all prompting work.
The core prompting toolkit: zero-shot (no examples), few-shot (1–10 demonstrations), Chain-of-Thought (step-by-step reasoning), and zero-shot CoT ("Let's think step by step"). These techniques directly improve accuracy on reasoning, classification, and generation tasks.
Advanced prompting techniques for complex tasks: ReAct interleaves reasoning with tool calls, Tree of Thought explores multiple solution branches simultaneously, and self-consistency samples multiple CoT paths and votes on the most common answer — dramatically improving accuracy on hard problems.
Production prompting requires structured output enforcement (JSON, XML), prompt caching for cost reduction, systematic evaluation against ground truth, and NVIDIA-specific patterns — including NIM microservice prompt formats, NeMo's chat templates, and iterative prompt optimization workflows.
Modern LLMs use a structured multi-turn conversation format. Each message has a role that determines how the model interprets the content. Getting this structure right is the foundation of effective prompting.
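A minimal sketch of the chat message structure, assuming an OpenAI-compatible endpoint such as a locally deployed NVIDIA NIM microservice (the base URL, API key, and model name below are placeholders):

```python
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server (e.g., a local NIM deployment)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # placeholder model name
    messages=[
        # System: persona, rules, behavioral constraints (highest trust)
        {"role": "system", "content": "You are a concise technical assistant. Answer in at most three sentences."},
        # User: the first query (dynamic per request)
        {"role": "user", "content": "What is a context window?"},
        # Assistant: a prior model turn, kept as context for multi-turn conversations
        {"role": "assistant", "content": "It is the maximum number of tokens the model can attend to at once."},
        # User: the follow-up question the model answers now
        {"role": "user", "content": "And what happens when a conversation exceeds it?"},
    ],
    temperature=0.2,
    max_tokens=128,
)
print(response.choices[0].message.content)
```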
Techniques are ordered from simplest (fewest tokens, least latency) to most complex (most powerful but highest token cost and latency).
CoT dramatically improves multi-step reasoning by making the model show its work. Zero-shot CoT achieves most of the benefit without requiring labeled reasoning examples.
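A quick sketch of zero-shot CoT: the trigger phrase is appended to the question so the model produces intermediate reasoning before the answer. The question and client setup are illustrative, assuming the same placeholder OpenAI-compatible endpoint as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

question = (
    "A warehouse ships 240 boxes per day. Each truck holds 36 boxes. "
    "How many trucks are needed per day?"
)

# Zero-shot CoT: the appended phrase makes the model spend tokens on
# step-by-step reasoning before committing to a final answer.
response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
    temperature=0.0,  # greedy decoding: one deterministic reasoning chain
)
print(response.choices[0].message.content)
```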
ReAct enables LLMs to act like agents by alternating between natural language reasoning (Thought) and executable tool calls (Action), reading the result (Observation), and looping until a final answer is reached.
Production systems need reliable, parseable output. These patterns ensure LLMs return machine-readable responses compatible with downstream applications.
Cache the static system prompt at the prefix KV cache level. Only recompute the dynamic user turn. On NVIDIA NIM, long system prompts can be cached, reducing prefill latency and cost by 50–80% for repeat requests.
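Prefix caching itself is configured server-side; what the prompt author controls is keeping the long, static instructions in a stable prefix so the cache can be reused. A minimal sketch, assuming a NIM deployment with prefix/KV caching enabled (the URL, model name, and policy text are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Keep the stable policy text in one unchanging system prompt so the server
# can reuse its KV cache across requests; only the user turn varies per call.
STATIC_SYSTEM_PROMPT = (
    "You are a support triage assistant for AcmeCo.\n"
    "Rules:\n"
    "1. Never reveal internal ticket IDs.\n"
    "2. Always answer in English.\n"
    "(... several hundred tokens of stable policy text ...)"
)

def triage(ticket_text: str) -> str:
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable prefix
            {"role": "user", "content": ticket_text},             # recomputed each request
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```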
Use NVIDIA TensorRT-LLM's structured output support to constrain the token sampling vocabulary to only tokens valid at each position in a JSON/XML schema — guaranteeing schema-valid output without relying solely on instruction following.
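A hedged sketch of schema-constrained generation. Some NIM deployments expose guided decoding through an `nvext` extension on the OpenAI-compatible API; the `guided_json` field name below is an assumption, so check your deployment's structured-generation documentation before relying on it:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# JSON Schema the decoder is constrained to: tokens that would violate it are never sampled.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "feature-request", "account"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["category", "priority"],
}

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "Classify the support ticket. Return JSON only."},
        {"role": "user", "content": "I was charged twice for my subscription this month."},
    ],
    # Assumption: this deployment accepts guided decoding via the nvext extension.
    extra_body={"nvext": {"guided_json": TICKET_SCHEMA}},
    temperature=0.0,
)
print(json.loads(response.choices[0].message.content))
```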
Build a systematic eval: create a golden dataset of (prompt, expected output) pairs, run all prompt variants against it, measure accuracy/format adherence/latency. Treat prompt changes like code changes — version-controlled and evaluated before deployment.
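A minimal sketch of such an eval harness; the golden dataset, prompt variants, and exact-match scoring rule are all illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Golden dataset: (input, expected output) pairs, version-controlled alongside the prompts.
GOLDEN = [
    ("I was double-billed this month.", "billing"),
    ("The export button crashes the app.", "bug"),
    ("Please add dark mode.", "feature-request"),
]

PROMPT_VARIANTS = {
    "v1": "Classify the ticket as billing, bug, feature-request, or account. Return only the label.",
    "v2": "You are a triage bot. Output exactly one label from: billing | bug | feature-request | account.",
}

def run_eval(system_prompt: str) -> float:
    """Run one prompt variant over the golden set and return exact-match accuracy."""
    correct = 0
    for ticket, expected in GOLDEN:
        reply = client.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": ticket}],
            temperature=0.0,
        ).choices[0].message.content.strip().lower()
        correct += int(reply == expected)
    return correct / len(GOLDEN)

for name, prompt in PROMPT_VARIANTS.items():
    print(name, run_eval(prompt))
```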
Use prompting when: task is well-defined with clear instructions, examples fit in context, low-volume use case. Use fine-tuning when: prompt alone can't achieve target accuracy, specific style/format is consistently needed, latency/cost at scale prohibits long prompts.
| Concept | Option A | Option B | When to Choose |
|---|---|---|---|
| System vs User Prompt (anatomy) | System prompt — Sets persona, rules, and behavioral constraints. Highest trust. Cannot be overridden by user input in well-aligned models. | User prompt — The actual task or query. Lower trust. Processed after system context is established. | System prompt for stable instructions that apply to all queries; user prompt for task-specific dynamic content |
| Zero-Shot vs Few-Shot (anatomy) | Zero-shot — Task description only, no examples. Fast, minimal tokens, relies on pre-trained knowledge. | Few-shot — 1–10 example pairs demonstrate desired format and behavior. Slower, more tokens, higher accuracy on format-sensitive tasks. | Zero-shot for standard tasks; few-shot when format is novel or accuracy needs improvement without fine-tuning |
| Temperature vs Top-p (anatomy) | Temperature — Scales entire probability distribution. T=0 = greedy (deterministic). T=1 = unmodified. Higher = more random. | Top-p (nucleus sampling) — Samples only from the top tokens whose cumulative probability ≥ p. More stable than temperature for controlling diversity. | Lower temperature (0–0.3) for factual/code tasks; Top-p 0.9 for balanced creative tasks; never set both high simultaneously |
| Context Window Management (anatomy) | Prompt compression — Summarize chat history, remove redundant context, use retrieved snippets instead of full documents. | Extended context models — Use models with 128K+ context windows (e.g., Llama 3.1 with 128K, NIM endpoints) without compression. | Compression for cost/latency efficiency; extended context when full document fidelity is critical and cost is acceptable |
| Role Prompting vs Persona (anatomy) | Role prompting — "You are a senior software engineer." Activates relevant domain knowledge and vocabulary patterns. | Persona prompting — Detailed character description with name, communication style, and domain expertise. More specific behavioral control. | Role prompting for expertise activation; persona prompting for consistent character in customer-facing products |
| CoT vs Standard Prompting (technique) | Standard prompt — Direct question, immediate answer. Efficient for simple tasks. Higher error rate on multi-step reasoning. | Chain-of-Thought — Forces intermediate reasoning steps before final answer. +3–17% accuracy on reasoning tasks. Higher token cost. | Standard for simple classification or retrieval; CoT for arithmetic, logic, multi-step planning, or any task with intermediate dependencies |
| Zero-Shot CoT vs Few-Shot CoT (technique) | Zero-shot CoT — "Let's think step by step" appended to the question. No examples needed. Good general performance. | Few-shot CoT — Include 3–8 solved examples with full reasoning chains. Higher accuracy, especially for domain-specific problem types. | Zero-shot CoT as default baseline; few-shot CoT when zero-shot misses domain-specific reasoning patterns |
| Greedy Decoding vs Sampling (technique) | Greedy decoding — Always pick highest-probability token. Deterministic, efficient. Best for structured/factual output. | Sampling — Probabilistic token selection (temperature > 0). Diverse, creative outputs. Required for self-consistency. | Greedy for production fact-based tasks; sampling for creative generation and self-consistency ensemble methods |
| Direct Instruction vs Exemplar (technique) | Direct instruction — "Classify this sentiment as positive, negative, or neutral. Return only the label." | Exemplar-based — Show 3 examples of input → label pairs, then present the test input. Format is taught by demonstration. | Direct instruction when the task is standard; exemplar-based when format is unconventional or edge cases need clear demonstration |
| Positive vs Negative Instructions (technique) | Positive ("Do X") — "Always respond in bullet points. Keep answers under 100 words." | Negative ("Don't do X") — "Don't write long paragraphs. Don't use jargon." Less reliable — models sometimes follow the negated behavior. | Always prefer positive instructions; use negative only as reinforcement ("do X, not Y") since models follow positive directives more reliably |
| ReAct vs Plan-and-Execute (advanced) | ReAct — Interleaved Thought → Action → Observation. Adaptive — each action decided based on prior observations. | Plan-and-Execute — Generate full plan first, then execute all steps. More predictable, auditable, and parallelizable. | ReAct for open-ended tasks requiring adaptation; Plan-and-Execute for structured workflows with known step sequences |
| Self-Consistency vs Single CoT (advanced) | Single CoT — One reasoning chain, greedy or sampled. Fast, low cost. Sensitive to initial reasoning path quality. | Self-consistency — N=5–20 CoT samples, majority vote on final answer. Consistently +5–15% accuracy. Proportionally higher cost. | Single CoT as baseline; self-consistency when high accuracy justifies 5–20× token cost — especially for high-stakes reasoning |
| Tree of Thought vs CoT (advanced) | CoT — Single linear reasoning chain. No backtracking. Fails when early steps go wrong. | Tree of Thought (ToT) — Multiple branches explored simultaneously. Dead-end branches pruned. Significantly better on puzzle and planning tasks. | CoT for most tasks; ToT for problems requiring exploration and backtracking (puzzles, constrained optimization, creative writing with constraints) |
| Prompt Chaining vs Single Prompt (advanced) | Single prompt — One LLM call handles the full task. Simpler but forces the model to handle complexity all at once. | Prompt chaining — Output of one LLM call feeds into the next. Decompose complex tasks into verifiable intermediate steps. | Single prompt for straightforward tasks; chaining when tasks have distinct phases (extract → analyze → synthesize → format) that benefit from independent verification |
| RAG Prompting vs Parametric Knowledge (advanced) | RAG prompting — Retrieved documents injected into context. Facts are current, traceable, and verifiable. | Parametric knowledge — Model answers from training knowledge. No retrieval overhead. Knowledge may be outdated or hallucinated. | RAG for knowledge requiring freshness, citations, or domain specificity; parametric for general knowledge tasks where hallucination risk is acceptable |
| Constitutional Prompting vs RLHF (advanced) | Constitutional prompting — Include explicit rules in system prompt ("never reveal X", "always cite sources"). Inference-time control. | RLHF / SFT — Behavioral alignment baked into model weights during training. More robust but requires model retraining. | Constitutional prompting for rapid deployment-time policy enforcement; RLHF for persistent behavioral alignment that doesn't rely on prompt adherence |
| Prompting vs Fine-Tuning (production) | Prompting — No training, instant iteration, flexible. Higher per-request token cost for long system prompts. | Fine-tuning — Behavior baked into weights, shorter prompts, lower latency, better style consistency. Requires training data and compute. | Start with prompting; fine-tune when prompt alone can't achieve target accuracy, format consistency is critical at scale, or long prompts are cost-prohibitive |
| Structured Output Approaches (production) | Instruction-based — "Return JSON only with schema X." Relies on model following instructions. May fail on edge cases. | Grammar-constrained decoding — Token sampling restricted to schema-valid tokens at each position. Guarantees valid output. Requires TRT-LLM or similar framework support. | Instruction-based for development; grammar-constrained for production where malformed output would break downstream systems |
| Prompt Caching (production) | No caching — Full prompt (system + user) recomputed each request. Higher latency and cost for long system prompts. | Prefix caching — Static system prompt KV cache stored server-side. Only dynamic user turn recomputed. 50–80% prefill cost reduction. | Enable prefix caching on NVIDIA NIM whenever the system prompt is >200 tokens and repeated across many requests — immediate ROI |
| Prompt Versioning (production) | Ad hoc prompts — Prompts changed informally with no tracking. Hard to debug regressions or roll back changes. | Version-controlled prompts — Prompts stored in git alongside evaluation results. Changes require eval suite passage before deployment. | Always version-control production prompts; treat them with the same discipline as application code — regressions in prompt quality are real bugs |
| NIM Chat vs Completion Format (production) | Chat format (messages[]) — Structured array of role/content objects. Preferred for instruction-following models. Supports system prompt natively. | Completion format (prompt) — Raw text string. Legacy format for base (non-instruct) models. Requires manual role tokens if needed. | Use chat format for all modern instruction-tuned models via NVIDIA NIM; completion format only for base model experimentation |
| Automatic Prompt Optimization (production) | Manual iteration — Human writes, tests, and refines prompts based on output quality. Slow, expertise-dependent. | Automatic Prompt Engineer (APE) — Use an LLM to generate and evaluate candidate prompts against a dataset. Finds non-obvious phrasings. | Manual for initial prompt design; APE or DSPy-style optimization when manual iteration plateaus and a labeled eval set is available |
A SaaS company needs to classify support tickets into categories (billing, bug, feature-request, account) so they can route them to the right team. Zero-shot prompting gives inconsistent category names. Few-shot fixes this.
Problem: Zero-shot returns "billing issue", "Billing", "payment problem" for the same category — breaking the downstream routing logic.
Solution: Few-shot prompting with a constrained label set.
Constrain the output to exactly four labels: billing | bug | feature-request | account. Include this as a constraint in the system prompt.
Format every few-shot example as Ticket: "[text]" → Category: [label]. Consistent formatting teaches the model to match the exact output format.
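A sketch of the resulting few-shot prompt; the label set comes from the scenario above, while the example tickets, endpoint, and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

FEW_SHOT_MESSAGES = [
    {"role": "system", "content":
        "Classify each support ticket into exactly one category: "
        "billing | bug | feature-request | account. Return only the label."},
    # Few-shot demonstrations teach the exact output format.
    {"role": "user", "content": 'Ticket: "I was charged twice this month."'},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": 'Ticket: "The dashboard shows a blank page after login."'},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": 'Ticket: "Can you add SSO support for Okta?"'},
    {"role": "assistant", "content": "feature-request"},
]

def classify(ticket_text: str) -> str:
    # Present the new ticket in the same format as the demonstrations.
    messages = FEW_SHOT_MESSAGES + [{"role": "user", "content": f'Ticket: "{ticket_text}"'}]
    reply = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct", messages=messages, temperature=0.0
    )
    return reply.choices[0].message.content.strip()
```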
A fintech company uses an LLM to analyze earnings reports and estimate year-over-year revenue growth. Single CoT passes give correct answers ~78% of the time — not good enough for financial decisions.
Solution: Self-consistency with Chain-of-Thought reasoning.
A legal research tool needs to answer complex questions that require looking up case law, checking statutes, and cross-referencing dates — impossible to answer reliably from parametric knowledge alone.
Solution: ReAct prompting with search and database tool calls.
Define the tools available to the model: search_caselaw(query), lookup_statute(code, section), get_date(case_id). Each tool is described with its input/output format.
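A simplified sketch of the ReAct loop with the three tools from this scenario; the tool implementations, prompt wording, parsing, and stop condition are all illustrative, and the endpoint is a placeholder:

```python
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Tool stubs: a real system would query a case-law index and a statute database.
def search_caselaw(query): return f"Top case for '{query}': Smith v. Jones (2019)."
def lookup_statute(code, section): return f"{code} §{section}: limitation period is 4 years."
def get_date(case_id): return "2019-06-14"

TOOLS = {"search_caselaw": search_caselaw, "lookup_statute": lookup_statute, "get_date": get_date}

SYSTEM = (
    "Answer legal research questions by interleaving Thought, Action, and Observation.\n"
    "Available actions: search_caselaw(query), lookup_statute(code, section), get_date(case_id).\n"
    "Format:\nThought: <reasoning>\nAction: <tool>(<args>)\n"
    "Write 'Final Answer: <answer>' once you have enough information."
)

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": transcript}],
            temperature=0.0,
            stop=["Observation:"],  # our code, not the model, supplies observations
        ).choices[0].message.content
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        match = re.search(r"Action:\s*(\w+)\((.*)\)", reply)
        if not match:
            break
        name, raw_args = match.groups()
        args = [a.strip().strip("\"'") for a in raw_args.split(",") if a.strip()]
        observation = TOOLS[name](*args)  # execute the requested tool call
        transcript += f"Observation: {observation}\n"
    return transcript
```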
A logistics company processes 50,000 shipping documents per day. They need to extract 12 structured fields from each document (shipper, receiver, weight, dimensions, HS code, etc.) into a validated JSON object for their ERP system.
Solution: Optimized production prompt pipeline on NVIDIA NIM.
If a field cannot be found, return null in the extracted JSON and route the document to a human review queue rather than failing silently.
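A sketch of the extraction step of such a pipeline; the field list is abbreviated (the scenario extracts 12 fields), and the prompt wording, endpoint, and validation logic are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

# Abbreviated field list for the sketch; the real pipeline extracts 12 fields.
EXTRACTION_SYSTEM_PROMPT = (
    "Extract shipping fields from the document and return JSON only, with keys: "
    "shipper, receiver, weight_kg, hs_code. "
    "If a field cannot be found, set its value to null. Do not guess."
)

REQUIRED_FIELDS = ("shipper", "receiver", "weight_kg", "hs_code")

def extract(document_text: str) -> dict | None:
    reply = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[{"role": "system", "content": EXTRACTION_SYSTEM_PROMPT},
                  {"role": "user", "content": document_text}],
        temperature=0.0,
    ).choices[0].message.content
    try:
        record = json.loads(reply)
    except json.JSONDecodeError:
        return None  # malformed output: caller routes the document to human review
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return None  # missing fields also go to human review rather than failing silently
    return record
```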
"Let's think step by step."
Appended to the question, this phrase activates chain-of-thought reasoning without any labeled examples. It works because it instructs the model to allocate tokens to intermediate reasoning before the final answer.
1. System — persona, rules, behavioral constraints. Highest trust.
2. User — the human query. Lower trust. Dynamic per request.
3. Assistant — prior model responses. Context for multi-turn conversations.
Zero-shot: No examples → lower tokens, faster, relies on pre-training.
Few-shot: 1–10 examples → higher tokens, slower, but teaches format and improves accuracy on novel or format-sensitive tasks.
Use few-shot when zero-shot accuracy is insufficient.
Thought → what the model reasons it should do next
Action → the tool call it executes (search, calc, DB query)
Observation → the tool's return value
Repeats until → Final Answer
ReAct = Reasoning + Acting
1. Sample N CoT paths (temperature > 0 for diversity, typically N=5–20)
2. Extract final answer from each path
3. Take majority vote — most common answer wins
Correct reasoning paths converge; incorrect ones diverge. +5–15% over single CoT.
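A sketch of the voting loop; N, the temperature, and the answer-extraction heuristic are all illustrative, and the endpoint is a placeholder:

```python
import re
from collections import Counter
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # placeholder NIM endpoint

def self_consistent_answer(question: str, n: int = 10) -> str:
    answers = []
    for _ in range(n):
        reply = client.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",
            messages=[{"role": "user",
                       "content": question + "\n\nLet's think step by step. "
                                  "End with 'Answer: <value>'."}],
            temperature=0.7,  # sampling > 0 so the reasoning paths differ
        ).choices[0].message.content
        match = re.search(r"Answer:\s*(.+)", reply)
        if match:
            answers.append(match.group(1).strip())
    if not answers:
        return ""
    # Majority vote: correct reasoning paths tend to converge on the same answer.
    return Counter(answers).most_common(1)[0][0]
```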
T=0: Greedy decoding — always pick highest probability token. Fully deterministic. Best for: code, JSON, factual Q&A.
T=1: Unmodified probability distribution. Balanced creativity. Use for: conversation, CoT, balanced tasks.
T>1: Amplified randomness. Creative/experimental only.
CoT: Single linear reasoning chain. No backtracking. If early step is wrong, final answer is wrong.
ToT: Explores multiple parallel branches simultaneously. Prunes dead ends. Backtracks. Much better for puzzles, planning, and creative tasks with constraints.
Prompt when: Task is well-defined, examples fit in context, low volume, need fast iteration.
Fine-tune when: Prompt alone can't hit accuracy target, consistent style needed at scale, long prompts are too expensive, latency is critical.