This domain accounts for roughly 10–15% of the NCE. Questions test conceptual understanding — recognizing research designs, interpreting statistical results, and distinguishing reliability from validity.
The #1 NCE Trap: Confusing reliability and validity, and confusing Type I and Type II errors. Reliability = consistency; validity = accuracy. Type I = false positive (rejecting a true null); Type II = false negative (missing a real effect). These distinctions appear repeatedly across multiple question formats.
Four Core Content Areas
🔬
Research Design
How We Investigate
Experimental, quasi-experimental, correlational, descriptive, and qualitative designs — plus threats to internal and external validity.
Key exam: identifying design type from a scenario
📊
Descriptive Statistics
Describing Data
Central tendency, variability, normal distribution, skewness, standard scores (z, T, stanine), correlation coefficients.
Key exam: normal curve properties, skew direction
📋
Reliability & Validity
Test Quality
Five reliability types (test-retest, parallel forms, split-half, inter-rater, Cronbach's alpha) and four validity types (content, construct, concurrent, predictive).
Key exam: matching type to scenario
📏
Scales of Measurement
NOIR Framework
Nominal, Ordinal, Interval, Ratio — hierarchical scales with increasing mathematical power and properties.
Key exam: classifying a variable by scale type
High-Priority Exam Topics at a Glance
Topic
What the NCE Tests
Common Trap
Experimental Design
Random assignment, IV/DV, control group
Correlation ≠ causation; quasi-exp has no random assignment
Normal Distribution
68-95-99.7 rule, mean=median=mode
Confusing positive/negative skew with mean position
z-scores & T-scores
Convert and interpret standard scores
T-score mean=50 SD=10; not same as t-test statistic
Validity Types
Match the validity type (content, construct, concurrent, predictive) to a scenario
Face validity ≠ true validity; construct is broadest
NOIR Scales
Classify a variable's measurement scale
IQ = interval (no true zero), not ratio; class rank = ordinal
Type I / II Errors
Distinguish false positive from false negative
Type I = α level set by researcher; Type II = β (missed effect)
Research Design
The NCE tests your ability to identify a research design from a brief scenario, understand what conclusions each design allows, and recognize threats to validity.
Five Major Research Designs
Gold Standard
True Experimental
Random assignment of participants to conditions. Researcher manipulates the independent variable (IV) and measures the dependent variable (DV). Control group receives no treatment or a comparison treatment.
Only design that allows cause-and-effect conclusions. Random assignment equates groups on all known and unknown variables.
Approximation
Quasi-Experimental
Resembles experimental design but lacks random assignment. Groups may be pre-existing (e.g., classrooms, clinics). Researcher still manipulates an IV and measures a DV.
Cannot fully rule out confounds. Allows some causal inference but weaker than true experimental.
Relationship Study
Correlational
Examines the relationship between two or more variables without manipulation. No IV or DV — just variables. Produces a correlation coefficient (r).
CANNOT establish causation — only association. "Correlation ≠ causation" is the single most tested research principle.
Observation
Descriptive
Documents characteristics of a phenomenon as it naturally occurs. Methods: surveys, case studies, naturalistic observation, archival research.
No manipulation; no cause-effect claims. Describes "what is" — generates hypotheses for further testing.
Meaning & Experience
Qualitative
Non-numerical; explores lived experience, meaning, and context. Methods: phenomenology, grounded theory, ethnography, narrative inquiry, case study.
Results not statistically generalizable. Aims for depth (transferability) over breadth. Trustworthiness replaces reliability/validity.
Key Research Concepts: Variables, Hypotheses & Ethics
Foundational concepts required for all research design questions
Research
Variables
Independent Variable (IV): The variable the researcher manipulates — the presumed cause. Dependent Variable (DV): The outcome measured — the presumed effect. Extraneous/Confounding Variables: Uncontrolled variables that can explain results. Operational definition: How a variable is specifically measured or defined.
Hypotheses
Null hypothesis (H₀): States no relationship or no effect — what we attempt to disprove. Alternative hypothesis (H₁): States a relationship or effect exists. A statistically significant result (p < .05) means we reject H₀ — not that we have "proved" the alternative.
Research Ethics
Informed consent: Participants must understand and voluntarily agree. Confidentiality: Data protected; distinguishable from anonymity (identity not collected at all). Debriefing: Explain true purpose after deception studies. IRB: Institutional Review Board must approve human subjects research.
Sampling
Random sampling: Every member of population has equal chance of selection — enables generalizability. Convenience sampling: Use of available participants — common but limits external validity. Stratified sampling: Population divided into subgroups, then randomly sampled proportionally.
⚑ NCE Focus: Distinguish random assignment (used within a study to create groups — ensures internal validity) from random sampling (used to select participants from a population — ensures external validity). These serve completely different purposes and are frequently confused on the NCE.
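The distinction above can be made concrete with a short Python sketch (illustrative only; the participant IDs and function name are our own, not exam content). It shows random assignment: one pool of already-recruited participants is split into two groups purely by chance.

```python
import random

def randomly_assign(participants, seed=None):
    """Randomly assign participants to two groups (treatment, control).

    This is random ASSIGNMENT (internal validity): each participant has
    an equal chance of landing in either group, which equates the groups
    on known and unknown variables in expectation.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs (invented for illustration)
people = [f"P{i}" for i in range(1, 21)]
treatment, control = randomly_assign(people, seed=42)
```

Random sampling, by contrast, would govern which 20 people entered the study from the larger population in the first place.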
Threats to Internal Validity
History
An external event (outside the study) occurs during the study and affects outcomes — not the treatment.
Maturation
Participants naturally change over time (grow, fatigue, learn) regardless of the intervention.
Testing Effect
Taking the pretest affects performance on the posttest — practice or sensitization effect.
Instrumentation
The measurement tool or observers change over time, creating inconsistency in measurement.
Regression to Mean
Extreme scores at pretest tend to move toward the mean at posttest — not because of treatment.
Selection Bias
Non-equivalent groups at the start — systematic differences between treatment and control groups.
Mortality / Attrition
Differential dropout — if certain types of participants leave one group more than another, results are skewed.
Diffusion
Control group learns about or adopts the treatment, reducing the difference between groups.
📊 Correlation Strength — Interpreting r Values
|r| = 0.90–1.00
Very Strong
|r| = 0.70–0.89
Strong
|r| = 0.50–0.69
Moderate
|r| = 0.30–0.49
Weak
|r| = 0.00–0.29
Negligible
Key: Strength = absolute value of r. Direction (positive/negative) indicates the type of relationship, not its strength. r² = coefficient of determination = proportion of shared variance.
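As a quick self-check, the strength bands and the coefficient of determination above can be expressed in a few lines of Python (a study aid, not exam content; the function names are our own):

```python
def correlation_strength(r):
    """Classify |r| using the bands in the table above."""
    a = abs(r)  # strength ignores direction
    if a >= 0.90:
        return "Very Strong"
    if a >= 0.70:
        return "Strong"
    if a >= 0.50:
        return "Moderate"
    if a >= 0.30:
        return "Weak"
    return "Negligible"

def shared_variance(r):
    """Coefficient of determination (r squared): proportion of shared variance."""
    return r ** 2

strength = correlation_strength(-0.78)  # "Strong" despite the negative sign
r2 = shared_variance(0.72)              # about 0.52, i.e. ~52% shared variance
```

Note that r = −0.78 classifies as Strong: the sign tells direction only.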
Statistics & the Normal Curve
Descriptive statistics, the normal distribution, standard scores, and inferential testing — the numerical backbone of NCE research questions.
Central Tendency & Variability
Describing the center and spread of a distribution
Statistics
Measures of Central Tendency
Mean: Arithmetic average — sensitive to outliers; most commonly used. Median: Middle value when data are ordered; robust to outliers; best for skewed distributions. Mode: Most frequently occurring value; only measure for nominal data; a distribution can have no mode or multiple modes.
Measures of Variability
Range: Max minus min — simple but highly sensitive to outliers. Variance (σ²): Average squared deviation from the mean. Standard Deviation (σ or SD): Square root of variance; same units as data; most used measure of spread. Larger SD = more spread out scores.
⚑ NCE Focus: Know which measure of central tendency is most appropriate for each distribution type. Skewed data → use median (not affected by extreme scores). Nominal data → use mode only. The mean is pulled toward the tail in a skewed distribution.
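A tiny Python demonstration of the focus note above, using invented test scores: one extreme low score drags the mean well below the median, while the median barely moves.

```python
from statistics import mean, median, mode

scores = [70, 72, 74, 75, 75, 76, 78]   # hypothetical, roughly symmetric
with_outlier = scores + [20]            # one extreme low score

m1, md1 = mean(scores), median(scores)            # ~74.3 and 75
m2, md2 = mean(with_outlier), median(with_outlier)  # 67.5 and 74.5

# The mean is pulled toward the low tail (negative skew): mean < median.
```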
Skewness — Mean, Median & Mode Relationships
Tail points RIGHT
Positive Skew
Mode < Median < Mean
Mean is pulled toward the positive (right) tail by high outliers. Example: income distribution — a few very high earners pull the mean up. Median better represents the "typical" person.
Symmetric distribution
Normal (No Skew)
Mode = Median = Mean
Perfect bell curve — mean, median, and mode are identical. The 68-95-99.7 rule applies. Most psychological tests are designed to approximate a normal distribution.
Tail points LEFT
Negative Skew
Mean < Median < Mode
Mean is pulled toward the negative (left) tail by low outliers. Example: an easy test where most people score high but a few score very low pulls the mean down.
The Normal Distribution — 68-95-99.7 Rule
68%
of scores fall within ±1 SD of the mean
95%
of scores fall within ±2 SD of the mean
99.7%
of scores fall within ±3 SD of the mean
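The 68-95-99.7 percentages are not arbitrary; they fall out of the standard normal curve's cumulative distribution. A quick verification in Python using only the standard library (illustrative, not exam content):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, SD 1

def within(k):
    """Proportion of scores within +/- k SD of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

p1, p2, p3 = within(1), within(2), within(3)
# roughly 0.6827, 0.9545, 0.9973 -- the 68-95-99.7 (empirical) rule
```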
Standard Score Comparison
z-score
Mean = 0
SD = 1
Formula: z = (X − μ) / σ. Negative = below mean; positive = above. Basis for all other standard scores.
T-score
Mean = 50
SD = 10
T = 50 + 10z. Eliminates negative scores. Used in MMPI, many personality tests. T=60 = 1 SD above mean.
IQ (WAIS/WISC)
Mean = 100
SD = 15
IQ = 100 + 15z. Stanford-Binet historically used SD=16. IQ 115 = +1 SD. IQ 70 = −2 SD (intellectual disability threshold).
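The conversions in the table reduce to simple linear formulas. A minimal Python sketch (the function names are our own):

```python
def z_score(x, mean, sd):
    """Standard score: how many SDs a raw score sits from the mean."""
    return (x - mean) / sd

def t_score(z):
    """T-score: mean 50, SD 10."""
    return 50 + 10 * z

def iq_score(z):
    """Deviation IQ (WAIS/WISC style): mean 100, SD 15."""
    return 100 + 15 * z

z = z_score(115, mean=100, sd=15)  # an IQ of 115 is z = +1
t = t_score(z)                     # 60
iq = iq_score(z)                   # 115
```

So z = +1, T = 60, and IQ = 115 all describe the same relative standing: one SD above the mean, roughly the 84th percentile.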
Inferential Statistics
Using sample data to draw conclusions about populations
Inferential
Statistical Significance
p-value: Probability of obtaining results at least as extreme as observed, assuming H₀ is true. p < .05: Reject H₀ (statistically significant at 5% level). p < .01: More stringent threshold. Significance ≠ practical importance — statistical significance can occur with large samples even for trivial effects.
Common Statistical Tests
t-test: Compare means of 2 groups. ANOVA: Compare means of 3+ groups (avoids inflated Type I error from multiple t-tests). Chi-square (χ²): Relationships between categorical variables. Pearson r: Correlation between two continuous variables. Spearman rho: Correlation for ranked/ordinal data.
⚑ NCE Focus: Know when to use each test — t-test for 2 groups, ANOVA for 3+. The reason to use ANOVA instead of multiple t-tests is to control the familywise Type I error rate. Statistical significance (p < .05) means the probability of getting these results by chance is less than 5% — it does NOT prove the alternative hypothesis.
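For mean comparisons, the "which test?" decision boils down to counting groups. A toy Python helper capturing that heuristic (our own function, purely a memory aid):

```python
def choose_mean_comparison_test(n_groups):
    """Pick a mean-comparison test by group count (NCE heuristic).

    Two groups: t-test. Three or more: ANOVA, which controls the
    familywise Type I error rate that multiple t-tests would inflate.
    """
    if n_groups < 2:
        raise ValueError("need at least two groups to compare means")
    return "t-test" if n_groups == 2 else "ANOVA"
```

Categorical-by-categorical questions point to chi-square instead, and correlation questions to Pearson r (continuous) or Spearman rho (ranked).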
Type I & Type II Errors
Type I Error · False Positive
Alpha Error (α)
Rejecting the null hypothesis when it is actually true. Concluding there is an effect when there really isn't one. The probability of a Type I error equals the significance level (α = .05 means 5% chance of false positive).
"The boy who cried wolf" — claiming something real when it isn't.
Type II Error · False Negative
Beta Error (β)
Failing to reject the null hypothesis when it is actually false. Missing a real effect — concluding there's no difference when one actually exists. Reduced by increasing sample size, effect size, or α level.
"Missing the wolf" — failing to detect something real that exists.
Statistical Power = 1 − β = the probability of correctly detecting a true effect. Power increases with larger sample size, larger effect size, and higher α level (but higher α also increases Type I error risk).
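The relationships above (power rises with sample size, effect size, and α) can be verified numerically. Below is a simplified one-sided z-test power calculation in Python, a sketch under the assumption of a known population SD; real power analyses typically use t-distributions and software such as G*Power.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def power_one_sided_z(effect_size, n, alpha=0.05):
    """Power of a one-sample, one-sided z-test (simplified sketch).

    effect_size is the standardized mean difference (Cohen's d style,
    assuming known sigma). Power = 1 - beta.
    """
    z_crit = Z.inv_cdf(1 - alpha)           # rejection cutoff
    return Z.cdf(effect_size * sqrt(n) - z_crit)

low_n = power_one_sided_z(0.5, n=10)    # roughly .47 -- badly underpowered
high_n = power_one_sided_z(0.5, n=50)   # roughly .97 -- well powered
```

With d = 0.5, power climbs from roughly .47 at n = 10 to about .97 at n = 50, and loosening α from .05 to .10 also raises power, at the cost of more Type I errors, just as stated above.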
Assessment & Measurement
Reliability, validity, scales of measurement, norm vs. criterion-referenced testing, and test score interpretation — the assessment concepts most frequently tested on the NCE.
NOIR Scales of Measurement
N
Nominal
Properties: Categories only; no order; no meaningful distance between values. The weakest scale.
OK operations: Count (frequency), mode, chi-square Examples: Gender, diagnosis (DSM category), race/ethnicity, type of treatment, yes/no responses
O
Ordinal
Properties: Rank order; unequal intervals between ranks; no true zero. Knows position, not distance.
OK operations: Median, percentile rank, Spearman rho Examples: Class rank, Likert-scale responses, severity ratings, socioeconomic status levels
I
Interval
Properties: Equal intervals between values; no true zero (zero is arbitrary, not absence of trait).
OK operations: Mean, SD, Pearson r, t-test, ANOVA Examples: IQ scores, SAT scores, temperature (°C/°F), most standardized psychological tests
R
Ratio
Properties: Equal intervals + true zero (zero = complete absence of the attribute). Highest level scale.
OK operations: All mathematical operations including ratios Examples: Height, weight, age, income, reaction time, number of absences
Key exam trap: IQ scores are interval, not ratio. An IQ of 0 doesn't mean "no intelligence" — zero is not meaningful. Similarly, a person with an IQ of 100 does not have "twice the intelligence" of someone with IQ 50. Likert scales are technically ordinal (the interval between "agree" and "strongly agree" may not equal that between "neutral" and "agree").
Reliability — Five Types
Stability over time
Test-Retest
Same test administered to same group on two occasions; scores correlated. Measures temporal stability. Time interval matters — too short = carryover; too long = real change.
Key word: "same test, two times"
Equivalence across forms
Parallel / Alternate Forms
Two equivalent versions of the same test administered; scores correlated. Eliminates practice effects from test-retest. Expensive — requires creating two equivalent forms.
Key word: "two equivalent versions"
Internal consistency
Split-Half
Single test split into two halves (e.g., odd vs. even items); halves correlated. Corrected upward using the Spearman-Brown prophecy formula to estimate full-test reliability.
Key word: "one test, two halves"
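The Spearman-Brown correction mentioned above is a one-line formula. A Python sketch (general form, with n = 2 for the split-half case):

```python
def spearman_brown(r_half, n=2):
    """Spearman-Brown prophecy: reliability of a test n times as long.

    For the split-half correction, n = 2: the full test is twice the
    length of each half, and shorter tests are generally less reliable.
    """
    return (n * r_half) / (1 + (n - 1) * r_half)

full = spearman_brown(0.70)  # a half-test correlation of .70 corrects upward
```

A half-test correlation of .70 corrects to about .82 for the full-length test, which is why split-half coefficients are always adjusted before reporting.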
Internal consistency
Cronbach's Alpha (α)
The most widely used measure of internal consistency. Represents the average of all possible split-half correlations. Values range 0–1; α ≥ .70 is generally acceptable; α ≥ .90 preferred for high-stakes decisions.
Key word: "average of all split-halves"
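Cronbach's alpha can also be computed directly from raw item scores. A small standard-library Python sketch (the data sets are invented; rows are people, columns are items):

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha from per-person lists of item scores.

    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    """
    k = len(item_scores[0])                       # number of items
    items = list(zip(*item_scores))               # scores grouped per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical data: perfectly consistent items vs. noisier items
perfect = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
noisy = [[1, 2], [2, 1], [3, 4], [4, 3]]
alpha_perfect = cronbach_alpha(perfect)  # 1.0
alpha_noisy = cronbach_alpha(noisy)      # 0.75
```

Perfectly consistent items yield α = 1.0; the noisier set yields α = .75, just above the conventional .70 cutoff mentioned above.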
Agreement between raters
Inter-Rater Reliability
Two or more raters/observers score the same subject; their ratings are correlated or compared using Cohen's kappa (categorical) or intraclass correlation (continuous). Essential for observational or projective measures.
Key word: "two raters, same person"
Validity — Four Types
Domain coverage
Content Validity
Does the test adequately sample the full content domain it claims to measure? Evaluated by expert judgment, not by correlation. A math exam covering only addition lacks content validity if algebra is also in the curriculum.
Key: expert review; no correlation coefficient needed
Current performance prediction
Concurrent Validity
Test scores correlate with another established criterion measured at the same time. A new depression scale given alongside the BDI-II — if they correlate strongly, the new scale has concurrent validity.
Key: "concurrent" = same time; "at once"
Future performance prediction
Predictive Validity
Test scores correlate with a criterion measured in the future. SAT scores predicting college GPA. An aptitude test given now that predicts job performance later. Criterion is measured after the test.
Key: "predictive" = future; criterion comes later
Theoretical construct
Construct Validity
Does the test measure the theoretical construct it claims to measure? The broadest validity type — encompassing content, criterion, convergent, and discriminant evidence. Required for psychological constructs like "anxiety" or "intelligence."
Key: broadest type; requires multiple lines of evidence
⚖️ Reliability vs. Validity — The Critical Distinction
Concept
Definition
Relationship
Example
Reliability
Consistency — produces the same results repeatedly
Necessary but NOT sufficient for validity
A scale that consistently reads 5 lbs too heavy is reliable but not valid
Validity
Accuracy — measures what it claims to measure
Implies reliability — a valid test must be reliable
If the scale consistently overestimates, it's not valid for measuring true weight
Neither
Inconsistent AND inaccurate
Worst outcome — random error dominates
Scale reads 4 lbs one day, 7 lbs next — neither consistent nor accurate
Valid only?
Cannot exist — an inconsistent test cannot be accurate
Validity requires reliability as a prerequisite
Impossible: accuracy requires consistency first
Norm-Referenced vs. Criterion-Referenced Assessment
Two fundamentally different frameworks for interpreting test scores
Assessment
Norm-Referenced
Compares an individual's score to a normative group (standardization sample). Results expressed as percentile ranks, standard scores (z, T, IQ), or stanines. Designed to produce a spread of scores — bell curve distribution. Most standardized psychological tests (WAIS, MMPI). Purpose: rank individuals relative to peers.
Criterion-Referenced
Compares performance to a predetermined standard or criterion — not to other people. Results expressed as percentage correct or mastery/non-mastery. A score is interpreted regardless of how others perform. Examples: driver's license test, professional licensing exams, NCLEX. Purpose: determine if a standard has been met.
⚑ NCE Focus: The NCE itself is criterion-referenced — you pass by meeting a set score, not by outperforming others. The Standard Error of Measurement (SEM) reflects the precision of individual scores — a smaller SEM means greater measurement precision. Confidence intervals around a score use SEM to reflect uncertainty.
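The SEM referenced above follows the classical test theory formula SEM = SD × √(1 − reliability), and the confidence interval is score ± z × SEM. A short Python sketch with hypothetical numbers:

```python
from math import sqrt

def sem(sd, reliability):
    """Standard Error of Measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def confidence_interval(score, sd, reliability, z=1.96):
    """Approximate 95% confidence band around an observed score."""
    margin = z * sem(sd, reliability)
    return (score - margin, score + margin)

# Hypothetical: IQ-style test with SD = 15 and reliability of .91
lo, hi = confidence_interval(100, sd=15, reliability=0.91)
```

With SD = 15 and reliability .91, SEM = 4.5, so an observed score of 100 carries a 95% band of roughly 91 to 109. Higher reliability shrinks the SEM and tightens the band.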
Practice Quiz — Research, Statistics & Assessment
10 NCE-style questions. Select the best answer for each.
Question 1 of 10
A researcher finds a correlation of r = +0.72 between childhood stress and adult anxiety. The coefficient of determination (r²) for this relationship is approximately 0.52. This means:
A. The correlation is not statistically significant because r² is less than r
B. Approximately 52% of the variance in adult anxiety is explained by childhood stress
C. 52% of adults with childhood stress will develop anxiety disorders
D. Childhood stress causes adult anxiety in just over half of cases
The coefficient of determination (r²) represents the proportion of variance in one variable that is explained by the other. r = 0.72, so r² = 0.72² ≈ 0.52 = 52% of shared variance. This does NOT imply causation (correlation study only), and it does NOT mean 52% of people develop the outcome. Options C and D incorrectly imply causation and prediction of specific cases.
Question 2 of 10
On a test where most students scored very high but a small number scored extremely low, the distribution would be:
A. Positively skewed, with Mean > Median > Mode
B. Normal, with Mean = Median = Mode
C. Negatively skewed, with Mean < Median < Mode
D. Bimodal, with two distinct peaks in the distribution
When most scores are high but a few are extremely low, the tail extends to the left — this is a negative skew. In a negatively skewed distribution, the mean is pulled downward by the low outliers, so Mean < Median < Mode. The mode (most common score) remains at the high end. A classic example: an easy exam where most people score 90–100 but a few score very low.
Question 3 of 10
A researcher conducts a study and rejects the null hypothesis. Later it is determined that the null hypothesis was actually true. This is an example of:
A. Type I error — rejecting a true null hypothesis (false positive)
B. Type II error — failing to reject a false null hypothesis (false negative)
C. A problem with statistical power — the study was underpowered
D. An acceptable outcome when p < .05 was used as the significance level
Type I error (α) = rejecting H₀ when H₀ is true = false positive. The researcher concluded there was an effect when there really wasn't one. This is the "boy who cried wolf" error. Type II error = failing to reject H₀ when H₀ is false. Setting α = .05 means we accept a 5% chance of making a Type I error — so while it's "expected" statistically, it's still an error when it occurs.
Question 4 of 10
A counselor wants to assess whether a new therapy outcomes scale gives consistent results when administered to the same clients one week apart. Which reliability method should be used?
A. Split-half reliability — dividing the scale into two halves and correlating them
B. Test-retest reliability — administering the same test twice and correlating the scores
C. Inter-rater reliability — having two counselors independently score the same client
D. Parallel forms reliability — creating a second equivalent version of the scale
Test-retest reliability measures temporal stability — consistency of scores across time. The scenario describes administering "the same test to the same clients one week apart" — the defining feature of test-retest. Split-half measures internal consistency from one administration. Inter-rater involves multiple scorers. Parallel forms requires two equivalent test versions.
Question 5 of 10
IQ scores, SAT scores, and most standardized psychological tests are classified on which scale of measurement?
A. Nominal — categories with no inherent order or numeric meaning
B. Ordinal — rank-ordered with unequal intervals between values
C. Interval — equal intervals between values but no true zero point
D. Ratio — equal intervals with a true zero representing complete absence
IQ and SAT scores are interval scale. They have equal intervals (the gap between 100 and 110 = the gap between 90 and 100), but there is no true zero — an IQ of 0 does not mean "zero intelligence." Because there's no true zero, you cannot make ratio statements: a person with IQ 150 does not have "twice the intelligence" of someone with IQ 75. Ratio scales (height, weight, age) do have a true zero.
Question 6 of 10
A researcher wants to compare therapy outcome scores across three treatment groups (CBT, DBT, and medication). Which statistical test is most appropriate?
A. Pearson r — to examine the correlation between treatment type and outcomes
B. Chi-square — to compare frequencies across the three groups
C. t-test — to compare means of two independent groups
D. ANOVA — to compare means across three or more independent groups
ANOVA (Analysis of Variance) is used when comparing means across 3 or more groups. Pearson r measures correlation between continuous variables. Chi-square tests relationships between categorical variables (not mean comparisons). A t-test can only compare 2 groups — running multiple t-tests across 3 groups would inflate the Type I error rate, which is exactly why ANOVA exists.
Question 7 of 10
A student scores 60 on a psychological measure that uses T-scores (mean = 50, SD = 10). What does this score indicate?
A. The student scored at the mean for this measure
B. The student scored 1 standard deviation above the mean
C. The student scored 2 standard deviations above the mean
D. The student scored at the 60th percentile
For T-scores: mean = 50, SD = 10. A score of 60 = 50 + 1(10) = 1 standard deviation above the mean. This corresponds to approximately the 84th percentile on the normal curve. T = 70 would be +2 SD; T = 50 = mean; T = 40 = −1 SD. Option D is incorrect — a score of 60 on a T-score scale does not equal the 60th percentile (it's actually the 84th).
Question 8 of 10
A test development team asks a panel of subject matter experts to review each item and judge whether it adequately represents the content domain of the construct being measured. This process evaluates:
A. Construct validity — whether the test measures the theoretical construct
B. Predictive validity — whether test scores predict future performance
C. Content validity — whether the test adequately covers the full domain
D. Concurrent validity — whether the test correlates with another measure given simultaneously
Content validity is established through expert judgment — reviewing whether test items adequately sample the full domain. It does NOT involve computing a correlation coefficient. Construct validity (broadest) requires multiple sources of evidence including convergent and discriminant evidence. Predictive and concurrent validity both require correlating test scores with an external criterion.
Question 9 of 10
A researcher uses random assignment to place participants into treatment and control groups. The primary purpose of random assignment is to:
A. Ensure the sample is representative of the larger population (external validity)
B. Equate groups on known and unknown variables, ruling out selection bias (internal validity)
C. Increase statistical power by reducing between-group variance
D. Prevent participant dropout (mortality) from affecting the results
Random assignment's purpose is to create equivalent groups — it equates participants on all characteristics (known and unknown) through probability, eliminating selection bias and supporting causal inference. This ensures internal validity. It does NOT address external validity (random sampling does that), does not directly reduce variance, and does not prevent attrition.
Question 10 of 10
Which of the following correlation coefficients indicates the strongest relationship between two variables?
A. r = +0.45
B. r = −0.78
C. r = +0.30
D. r = −0.15
The strength of a correlation is determined by its absolute value — direction (positive/negative) does not indicate strength. |−0.78| = 0.78, which is greater than |+0.45| = 0.45, |+0.30| = 0.30, and |−0.15| = 0.15. Therefore r = −0.78 represents the strongest relationship. A negative correlation simply means the variables move in opposite directions — it can be just as strong as a positive one.
Memory Hooks
Mnemonics and shortcuts for the statistical and research concepts most commonly tested on the NCE.
🎯
Reliability vs. Validity — Dartboard
Reliable but not valid = darts clustered together but away from the bullseye. Valid = darts clustered on the bullseye. An unreliable test cannot be valid. A reliable test can still fail to be valid. Reliability is necessary but NOT sufficient for validity.
Mnemonic: "You must be consistent before you can be accurate."
🐺
Type I vs. Type II Errors
Type I = False Positive — "The boy who cried wolf" — you say there's an effect when there isn't. Type II = False Negative — "Missing the wolf" — a real effect exists but you didn't detect it. α controls Type I; power (1−β) reduces Type II.
Mnemonic: "Type I = I cried wolf. Type II = I missed the wolf."
📏
NOIR Scales — Power Increases
N–O–I–R = progressively more powerful scales. Nominal (labels only) → Ordinal (rank) → Interval (equal gaps, no zero) → Ratio (equal gaps + true zero). IQ = Interval (zero isn't "no intelligence"). Age = Ratio (zero = birth).
Mnemonic: "NOIR — each step adds a new superpower."
📐
Skew — Mean Follows the Tail
The mean is always pulled toward the tail. Positive skew = tail right → Mean > Median > Mode. Negative skew = tail left → Mean < Median < Mode. Think: income = positive skew (the ultra-wealthy pull the mean up past the median).
Mnemonic: "The mean chases the tail like a dog."
🔢
Standard Score Quick Reference
z: mean=0, SD=1. T: mean=50, SD=10. IQ: mean=100, SD=15. Stanine: mean=5, SD=2. SAT: mean=500, SD=100. Pattern: each adds a zero to the mean, except stanines (smallest scale). T=60 = z=+1 = IQ=115 = stanine=7.
Mnemonic: "0, 50, 100, 500 — the means keep adding a zero."
🔬
Random Assignment vs. Random Sampling
Random Sampling = selecting WHO is in the study → external validity (generalizability). Random Assignment = deciding WHICH GROUP participants go into → internal validity (causal inference). They serve different purposes and are the most commonly confused research design concepts.
Mnemonic: "Sampling = who's IN. Assignment = which GROUP."
⚡ Research & Stats Quick-Reference Cheat Sheet
Concept / Term
Key Fact
Common Trap
r²
Proportion of shared variance between two variables
r² ≠ r; does not imply causation
Type I Error (α)
Reject true H₀ = false positive
Opposite of Type II; α = p-value threshold
Type II Error (β)
Fail to reject false H₀ = false negative
Reduced by increasing power/sample size
Test-retest reliability
Same test, two times → temporal stability
≠ parallel forms (which uses two versions)
Content validity
Expert review of domain coverage; no correlation needed
≠ face validity (which is just appearance)
IQ scale type
Interval — equal intervals, no true zero
Students often say "ratio" — wrong; IQ 0 ≠ no intelligence
Negative skew
Tail left; Mean < Median < Mode
Students reverse this — remember "mean follows the tail"
ANOVA
Compare means of 3+ groups
t-test = 2 groups only; multiple t-tests inflate Type I error
Correlation ≠ Causation
r shows association, not cause-and-effect
Most commonly tested research principle on the NCE
T-score 60
+1 SD above mean (T mean=50, SD=10)
T=60 ≠ 60th percentile; it's approximately the 84th
Flashcards & Study Advisor
Flashcards — Research, Statistics & Assessment
Statistics
What does r² (coefficient of determination) tell you, and how does it differ from r?
Answer
r² = proportion of variance in one variable explained by the other (shared variance). r = strength and direction of the linear relationship. If r = 0.80, r² = 0.64 → 64% shared variance. r² is always positive; r can be negative.
Research
What is the key difference between a true experimental design and a quasi-experimental design?
Answer
True experimental = random assignment to conditions → can establish causation. Quasi-experimental = NO random assignment (uses pre-existing groups) → weaker causal inference. Both involve manipulation of an IV and measurement of a DV.
Errors
Define Type I and Type II errors and identify which is controlled by the significance level (α).
Answer
Type I = reject true H₀ = false positive (α controls this). Type II = fail to reject false H₀ = false negative (β). Power = 1 − β. Setting α = .05 means you accept a 5% chance of a Type I error. Reducing α increases Type II error risk.
NOIR
Why are IQ scores classified as interval scale rather than ratio scale?
Answer
IQ is interval because it lacks a true zero — an IQ of 0 does not mean "zero intelligence." Without a true zero, you cannot form ratio statements (IQ 100 ≠ twice IQ 50). Ratio scale requires meaningful zero (e.g., height, age, weight where 0 = complete absence).
Reliability
Which reliability type uses the Spearman-Brown prophecy formula, and why?
Answer
Split-half reliability uses the Spearman-Brown formula to correct for the fact that splitting a test in half creates a shorter test — and shorter tests are generally less reliable. The formula estimates what the full-length test's reliability would be.
Normal Curve
What percentage of scores fall within ±1, ±2, and ±3 standard deviations of the mean?
Answer
±1 SD = 68% of scores. ±2 SD = 95% of scores. ±3 SD = 99.7% of scores. This is the 68-95-99.7 rule (empirical rule). The mean = median = mode in a perfectly normal distribution.
Skewness
In a positively skewed distribution, what is the correct order of mean, median, and mode?
Answer
Positive skew (tail right): Mode < Median < Mean. The mean is pulled farthest toward the positive tail by high outliers. The median is the best measure of central tendency for skewed distributions. Negative skew reverses this: Mean < Median < Mode.
Validity
Which type of validity is established by expert review and does NOT require computing a correlation coefficient?
Answer
Content validity — established through systematic expert review of whether items adequately sample the content domain. No correlation needed. Concurrent and predictive validity both require correlating the test with an external criterion. Construct validity requires multiple lines of evidence.
Research Design — Exam Focus
True experimental design is the ONLY design that allows cause-and-effect conclusions. Random assignment to groups is the defining feature. Without it, you have quasi-experimental at best.
Correlational research cannot establish causation — this is the single most tested research principle on the NCE. Even a very high correlation (r = 0.99) cannot prove causation.
Random assignment vs. random sampling is a critical distinction: random assignment → internal validity (causation). Random sampling → external validity (generalizability). They are not interchangeable.
History threat = external event during the study. Maturation threat = participants naturally change. Regression to the mean = extreme scores at pretest naturally move toward average at posttest.
Qualitative research uses terms like "transferability" (not generalizability) and "trustworthiness" (not reliability/validity) — different epistemological framework.
Descriptive Statistics — Exam Focus
Mean is sensitive to outliers; use median for skewed distributions. Mode is the only appropriate measure for nominal data.
Standard deviation vs. variance: SD = √Variance. Both measure spread. SD is in the original units; variance is in squared units. Larger SD = more spread in scores.
Negative skew: tail goes LEFT; Mean < Median < Mode. Most people score HIGH but a few score very low. Positive skew: tail goes RIGHT; Mode < Median < Mean. Most people score LOW but a few score very high (e.g., income).
Correlation strength: determined by the absolute value of r, not the sign. r = −0.85 is stronger than r = +0.40. The sign tells direction only.
r² interpretation: always square r to get shared variance. r = 0.70 → r² = 0.49 → 49% shared variance. This is the coefficient of determination.
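The two points above — strength comes from |r|, shared variance from r² — can be checked directly (a small sketch; the helper name is mine):

```python
def describe_r(r: float) -> tuple[float, float]:
    """Return (strength, shared variance) for a correlation coefficient.

    Strength is the absolute value |r|; the sign gives direction only.
    Shared variance is r², the coefficient of determination.
    """
    return abs(r), r * r

strength_a, _ = describe_r(-0.85)          # strong negative relationship
strength_b, _ = describe_r(0.40)           # moderate positive relationship
assert strength_a > strength_b             # r = -0.85 is the stronger of the two
print(f"r = 0.70 -> r^2 = {0.70 ** 2:.2f}")  # 0.49, i.e., 49% shared variance
```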
Normal Curve & Standard Scores — Exam Focus
68-95-99.7 rule: ±1 SD = 68%; ±2 SD = 95%; ±3 SD = 99.7%. Must know these for interpreting standard scores on the NCE.
T-score: mean = 50, SD = 10. T = 60 → +1 SD → ~84th percentile. T = 70 → +2 SD → ~98th percentile. Used in MMPI-3, many personality measures.
IQ: mean = 100, SD = 15 (WAIS/WISC). IQ 115 = +1 SD; IQ 130 = +2 SD. Intellectual disability is typically defined as IQ ≤ 70 (−2 SD) plus deficits in adaptive behavior.
z-score is the basis for all other standard scores. Positive z = above mean; negative z = below mean. z = (X − mean) / SD.
Stanines: 1–9 scale, mean = 5, SD = 2. Stanines 4–6 = average range. Broader bands than other standard scores — used to classify general performance levels.
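Because every standard score above is a linear rescaling of z, the conversions reduce to one-liners. A minimal sketch using only the Python standard library (function names are mine):

```python
from statistics import NormalDist

def z_score(x: float, mean: float, sd: float) -> float:
    """z = (X - mean) / SD — the basis for all other standard scores."""
    return (x - mean) / sd

def to_t(z: float) -> float:
    """T-score: mean 50, SD 10."""
    return 50 + 10 * z

def to_iq(z: float) -> float:
    """Deviation IQ: mean 100, SD 15."""
    return 100 + 15 * z

def percentile(z: float) -> float:
    """Percentage of the normal curve falling below z."""
    return NormalDist().cdf(z) * 100

z = z_score(115, mean=100, sd=15)      # IQ 115 -> z = +1
print(to_t(z), round(percentile(z)))   # 60.0 84  (T = 60, ~84th percentile)
```

The same `percentile` helper confirms the empirical rule: the area between z = −1 and z = +1 is about 68%.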
Reliability & Validity — Exam Focus
Test-retest = same test, two times (stability). Parallel forms = two equivalent versions (equivalence). Split-half = one test, two halves (internal consistency, corrected with Spearman-Brown). Cronbach's alpha = most common internal consistency measure. Inter-rater = two raters, one person (agreement).
Content validity = expert review, no correlation. Concurrent validity = correlates with criterion now. Predictive validity = correlates with future criterion. Construct validity = broadest; measures the theoretical construct.
Reliability is necessary but not sufficient for validity. A perfectly reliable test can measure the wrong thing. A valid test must be reliable.
Face validity is NOT a true form of validity — it just means the test appears to measure what it claims. Does not require empirical evidence.
SEM (Standard Error of Measurement): smaller SEM = more precise scores = more reliable test. Used to create confidence intervals around individual scores.
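The SEM relationship — higher reliability means a smaller SEM and a tighter confidence band — follows from the standard formula SEM = SD × √(1 − r), where r is the test's reliability. A sketch under that assumption (numbers are illustrative):

```python
from math import sqrt

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def ci95(observed: float, sd: float, reliability: float) -> tuple[float, float]:
    """Approximate 95% confidence band around an observed score (±1.96 SEM)."""
    margin = 1.96 * sem(sd, reliability)
    return observed - margin, observed + margin

# An IQ of 110 on a test with SD 15 and reliability .91:
# SEM = 15 * sqrt(.09) = 4.5, so the 95% band is roughly 101-119.
print(ci95(110, 15, 0.91))
```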
NOIR Scales & Inferential Stats — Exam Focus
Nominal: categories only (diagnosis, gender). Mode only. Chi-square.
Ordinal: rank order, unequal intervals (class rank, Likert). Median, percentile.
Interval: equal intervals, no true zero (IQ, SAT, temperature °C). Mean, SD, Pearson r.
Ratio: equal intervals + true zero (height, age, income). All operations.
IQ = Interval (most commonly missed NOIR question). No true zero → cannot make ratio statements.
t-test = 2 groups. ANOVA = 3+ groups (controls familywise error). Chi-square = frequencies of categorical data. Pearson r = correlation between two continuous variables.
Type I error (α) = false positive = reject true H₀. Set by researcher as significance level (typically .05). Type II error (β) = false negative = fail to reject false H₀. Reduced by increasing sample size or effect size.
Power = 1 − β. To increase power: larger sample size, larger effect size, higher α level. A study with power = .80 has an 80% chance of detecting a true effect.
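All three power relationships above can be demonstrated numerically. This is a sketch for the simplest case only — a one-tailed one-sample z-test, where power = 1 − β = P(Z > z_α − d√n) under the alternative; the function name and scenario are mine:

```python
from math import sqrt
from statistics import NormalDist

def power_one_sample_z(d: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-tailed one-sample z-test for effect size d and sample n.

    Power = 1 - beta = P(Z > z_alpha - d * sqrt(n)) under the alternative.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    return 1 - nd.cdf(z_alpha - d * sqrt(n))

# Power rises with sample size, with effect size, and with a higher alpha level:
assert power_one_sample_z(0.5, 50) > power_one_sample_z(0.5, 20)
assert power_one_sample_z(0.8, 20) > power_one_sample_z(0.5, 20)
assert power_one_sample_z(0.5, 20, alpha=0.10) > power_one_sample_z(0.5, 20, alpha=0.05)
```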