Course Quality Check

Run the same QC pipeline used in the team-wide audit on a single Timeback course before shipping it to students.

Stella Cole -- APWH Intervention

course-04124f7fd5 · AP World History · Hole-filling course
3 issues to address before shipping:
  • Longest answer isn't the correct one too often
  • Correct answers aren't systematically longer
  • No pattern-matching shortcuts
Score: 208 · Significant

Detailed findings

Each section below explains what was checked, what we found, and what to do about it. Lower scores are better; the score combines all check results.

What the AI reviewer said about the course

The course's biggest strength is its tight, evidence-based remediation design: it cites the student's exact diagnostic errors, names the misconceptions, and drills production of precise AP terminology through coordinated articles and FRQs. Its biggest weakness is severe length/elaboration bias in MCQ construction—correct answers average 31 characters longer than distractors and are the unique longest option 52.5% of the time, giving a test-savvy student a reliable shortcut that bypasses the very content knowledge the course aims to build.

Course pedagogy review (5 dimensions)

Passes
Course is right-scoped for the gap
The course is tightly scoped to the specific gaps identified on Stella's diagnostic. The welcome article explicitly cites her 46/55 MCQ vs 21/55 writing score and lists the exact question numbers (Q11-Q51) where she struggled. Each session targets concrete misses—Tamerlane's collapse mechanism, Mughal political marriages, the Enlightenment-as-engine, the tributary system, the Ottoman-WWI causal chain—and articles repeatedly reference her specific wrong answers (e.g., "you wrote 'to get more land'") to correct them. This is a remediation course, not a full APWH rebuild.
Passes
Articles teach what practice tests
Each session follows a consistent pattern: a brief preview article naming the concepts, a content article that walks through each diagnostic error with the correct answer, and then MCQs and FRQs that test exactly those concepts. Practice items map cleanly back to the article content—e.g., Session 5 articles teach the tributary/Canton systems, and Q30, Q33, Q34, Q42 test those exact mechanisms. FRQ prompts even cite the student's diagnostic answers and ask her to name the precise term taught in the article.
Passes
Course names and corrects misconceptions
This is one of the course's strongest features. Articles explicitly quote the student's wrong answers and name the misconception (e.g., "you wrote 'to get more land'—that misses the crucial context: this was about recovery of losses, not new conquest"; "syncretism describes cultural blending; the wedding's significance was political"). MCQ distractors frequently embody the exact misconceptions she had on the diagnostic (Q14, Q41, Q56 are explicit "which best corrects this claim?" items). The mechanism-vs.-cause distinction for the slave trade is surfaced repeatedly.
Passes
Question demand matches the AP skill
The named gap is producing precise AP terminology rather than gesturing at concepts. The FRQs directly target this—they require Stella to "name the specific AP concept" in sentence 1 and explain the mechanism in sentence 2, which is exactly the production-from-memory skill the diagnostic flagged. MCQs go beyond term-matching by asking students to distinguish causes from mechanisms (Q23, Q49), select the corrective claim (Q14, Q41, Q47, Q56), or identify structural reasoning (Q5, Q34). Demand matches the gap.
Issue
No pattern-matching shortcuts
The deterministic measurements show strict-longest-correct rate at 52.5% (threshold 35%) and mean correct-answer length exceeds distractor length by 31.3 chars (threshold 8). Both metrics substantially exceed thresholds, indicating a test-savvy student could pick the longest, most elaborated option to score well above chance without content mastery. Specificity asymmetry (20%) and absolute-language asymmetry (2.5%) are within thresholds, but the length bias alone is severe enough to fail this check. Examples include Q1, Q5, Q8, Q9, Q29, Q34, Q43, Q50 where the correct answer is markedly longer and more elaborated.

Course-level statistical signals

Issue
Longest answer isn't the correct one too often
21/40 = 52.5% of items have the correct answer as the strictly longest option with a ≥8-char gap (threshold ≤35%, chance = 25%)
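This measurement is deterministic and easy to reproduce outside the pipeline. A minimal sketch in Python: the item shape (a list of choice strings plus a correct-answer index) is a hypothetical stand-in for the real course data model, while the ≥8-char gap comes from the check definition.

```python
def strict_longest_correct_rate(items, min_gap=8):
    """Fraction of items whose keyed answer is the strictly longest
    choice, with at least `min_gap` characters over the runner-up."""
    flagged = 0
    for item in items:
        lengths = [len(choice) for choice in item["choices"]]
        correct_len = lengths[item["correct"]]
        other_max = max(l for i, l in enumerate(lengths)
                        if i != item["correct"])
        # tiny gaps are measurement noise, so require a meaningful gap
        if correct_len - other_max >= min_gap:
            flagged += 1
    return flagged / len(items)
```

For this course the measured rate is 21/40 = 52.5%, against a 35% threshold and a 25% chance baseline.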
Passes
Correct answers are spread across A/B/C/D
χ² = 0.80 (critical value at p < .01: 11.34); distribution: A=10, B=10, C=8, D=12
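The position-balance statistic is a plain chi-square goodness-of-fit test against a uniform expectation over the four positions. A stdlib-only sketch, using the counts reported above:

```python
def position_chi_square(counts):
    """Chi-square statistic for correct-answer position counts
    against a uniform expectation (df = positions - 1)."""
    n = sum(counts.values())
    expected = n / len(counts)  # 10 per position for 40 items
    return sum((obs - expected) ** 2 / expected
               for obs in counts.values())
```

With the df = 3 critical value of 11.34 at p < .01, the observed 0.80 is nowhere near significant, so the check passes.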
Issue
Correct answers aren't systematically longer
mean correct = 134.9 chars, mean distractor = 103.6 chars, diff = +31.3, ratio = 1.30 (thresholds: diff ≤ 8, ratio ≤ 1.25)
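The course-level length bias is just the difference and ratio of mean choice lengths, pooled over all items. A sketch under the same hypothetical item shape as above, with the thresholds taken from the report:

```python
def length_bias(items):
    """Return (diff, ratio) of mean correct-answer length vs mean
    distractor length; the report flags diff > 8 or ratio > 1.25."""
    correct_lens, distractor_lens = [], []
    for item in items:
        for i, choice in enumerate(item["choices"]):
            bucket = correct_lens if i == item["correct"] else distractor_lens
            bucket.append(len(choice))
    mean_c = sum(correct_lens) / len(correct_lens)
    mean_d = sum(distractor_lens) / len(distractor_lens)
    return mean_c - mean_d, mean_c / mean_d
```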
Passes
Correct answers don't systematically cite more factors than distractors
8/40 = 20.0% of items have correct answer citing ≥2 more factors than ALL distractors (threshold ≤25%)
Passes
Distractors don't carry absolute language the correct answer lacks
1/40 = 2.5% of items have ≥2 distractors with hard absolutes (always/never/only/all/none/every/must) while correct does not (threshold ≤25%)
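The absolutes check is a word-list scan. A sketch using the word list from the check definition; the punctuation stripping is a hypothetical detail of this sketch:

```python
ABSOLUTES = {"always", "never", "only", "all", "none", "every", "must"}

def has_absolute(text):
    """True if the text contains a hard-absolute word."""
    return any(word.strip(".,;:!?'\"") in ABSOLUTES
               for word in text.lower().split())

def absolute_asymmetry(item):
    """True when >= 2 distractors carry a hard absolute while the
    correct answer carries none."""
    correct = item["choices"][item["correct"]]
    distractors = [c for i, c in enumerate(item["choices"])
                   if i != item["correct"]]
    return (not has_absolute(correct)
            and sum(map(has_absolute, distractors)) >= 2)
```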

Item-level issues found

Issue                                      Items affected
No absolute language traps                 13 of 40 (32%)
No wildly long answer                      10 of 40 (25%)
Stem is a question                          6 of 40 (15%)
Correct answer is unambiguously correct     9 of 40 (22%)
Wrong answers are plausible                21 of 40 (52%)
Question stem is clear                     13 of 40 (32%)
Explanation actually teaches               40 of 40 (100%)
Question demand matches lesson goal        19 of 40 (47%)

FRQ checks (16 FRQs in this course)

Issue                                      FRQs affected
FRQ prompt is clear                         6 of 16 (37%)
Rubric scores what the prompt asks          3 of 15 (20%)
Passes
FRQs collectively serve the course goal
This is an exceptionally well-aligned gap-fill course. The diagnostic identified Stella's specific weakness — she recognizes content (84% MCQ) but cannot name precise AP mechanisms in writing (38% FRQ), consistently substituting vague descriptions or naming effects instead of causes/mechanisms. Every FRQ directly targets this gap with the correct cognitive demand. Each session pairs a "two-sentence rule" precision drill (FRQs 1, 3, 5, 7, 9, 11, 13, 15) — explicitly forcing her to name the precise AP term and explain the mechanism, often referencing her exact diagnostic errors ("Your diagnostic answer: 'idk ngl'") — with a stimulus-based analysis FRQ (2, 4, 6, 8, 10, 12, 14, 16) that requires her to deploy those terms in extended writing. The articles teach the same mechanisms (Tamerlane's internal fragmentation, Enlightenment as cause vs. reform as effect, plantation demand vs. slave-trade mechanism, tributary/Canton systems, decolonization vs. Cold War alliances, mandate system, Soviet dissolution factors) that the FRQs then test. The cognitive demand matches the gap exactly: precision-of-naming and cause-vs-mechanism discrimination, not generic synthesis. No orphan content, no format mismatches.

Glossary

Every check explained — what it looks at, why it matters, how to fix it.


Item structure & format

Right number of choices
What this checks: Each MCQ should have exactly four answer choices.
Why it matters: AP exam MCQs always have four options; deviation breaks student expectations and platform rendering.
How to fix: Edit the item to have exactly four choices.
Has an explanation
What this checks: Every MCQ must include an explanation of why the right answer is right.
Why it matters: Students need to learn from wrong answers — without an explanation, the question only tests, it doesn't teach.
How to fix: Add an explanation field that walks through why the keyed answer is correct.
Correct-answer letter exists
What this checks: The keyed correct-answer letter must actually match one of the choices.
Why it matters: If the key says 'C' but there's no choice C, students cannot ever get credit.
How to fix: Re-check the answer key against the choices.
No placeholder text
What this checks: The item must not contain markers like [TODO], [INSERT], or [PLACEHOLDER].
Why it matters: Placeholders that escape into production are visible to students and signal an unfinished item.
How to fix: Replace the placeholder with the intended content.
No wildly long answer
What this checks: No single answer choice should be more than twice as long as the shortest choice in the same item.
Why it matters: Length asymmetry lets test-savvy students guess the correct answer without reading carefully.
How to fix: Trim the long choice or expand the short ones so the four options are similar in length.
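Under that rule the per-item check reduces to a single comparison. A sketch, again assuming a hypothetical item shape with a "choices" list:

```python
def has_wildly_long_choice(item):
    """True if any choice is more than twice as long as the
    shortest choice in the same item."""
    lengths = [len(c) for c in item["choices"]]
    return max(lengths) > 2 * min(lengths)
```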
All choices distinct
What this checks: All four choices must be different from each other.
Why it matters: Duplicate choices waste a slot and confuse students.
How to fix: Rewrite duplicate choices.
Stem is a question
What this checks: The stem should end with a question mark or use clear question language ('Which...', 'What...', 'Best describes...').
Why it matters: Statement-style stems leave students guessing what they're being asked.
How to fix: Rewrite the stem as a clear question.
No 'all/none of the above'
What this checks: Choices must not include 'all of the above' or 'none of the above'.
Why it matters: These trap options test reading strategy more than content knowledge and aren't used on the AP exam.
How to fix: Replace the option with a real distractor that targets a misconception.
No absolute language traps
What this checks: Choices must not contain absolute words like 'always', 'never', 'only', 'must' (especially when only the wrong answers contain them).
Why it matters: Test-savvy students learn to eliminate any choice with an absolute, regardless of content.
How to fix: Soften the absolute language, or move it into the correct answer too if it's content-relevant.
Wrong answers have rationale
What this checks: Each wrong answer should be backed by a rationale explaining what misconception it targets.
Why it matters: Distractors written without a rationale tend to be implausible or arbitrary, missing the chance to diagnose student thinking.
How to fix: For each distractor, add a one-line note describing the misconception or partial understanding it represents.
Wrong-answer feedback shown to students
What this checks: When a student picks a wrong answer, they should see feedback explaining why it's wrong.
Why it matters: Per-distractor feedback is the most direct teaching moment — silence on a wrong answer is a missed opportunity.
How to fix: Add per-choice feedback so students see specific guidance when they choose wrong.

Course-level answer bias

Longest answer isn't the correct one too often
What this checks: Across the course, the correct answer should NOT be the strictly-longest option (with a meaningful ≥8-char gap to the next-longest) more than ~35% of the time. Chance baseline is 25%. Tiny gaps (1–4 chars) are filtered out as measurement noise — invisible to students.
Why it matters: If the correct answer is consistently and noticeably the longest, students can pass the course by always picking the longest option without learning anything.
How to fix: For items where the correct answer is meaningfully longer, either trim the correct answer or expand the distractors so they match in length.
Correct answers are spread across A/B/C/D
What this checks: Across the course, correct-answer positions should be roughly evenly distributed across A/B/C/D (statistical chi-square test).
Why it matters: Position skew lets students guess from a pattern (e.g., 'always pick C') instead of from understanding.
How to fix: Re-shuffle the position of correct answers so they're balanced across A/B/C/D.
Correct answers aren't systematically longer
What this checks: Across the course, the average length of correct answers should be close to the average length of wrong answers (not more than ~25% longer).
Why it matters: Even if no single item triggers the per-item length check, a systematic length difference across the course rewards length-guessing strategies.
How to fix: Audit the course for items where the correct answer is noticeably longer; equalize lengths.
Correct answers don't systematically cite more factors than distractors
What this checks: Across the course, no more than ~25% of items should have the correct answer enumerating ≥2 more factors/mechanisms (counted via 'and' + comma joiners) than ALL distractors. A +1 factor gap is treated as counting noise (e.g., appositive commas counted as list items) and ignored.
Why it matters: Even when option lengths are balanced, a systematic 'correct answer integrates more factors' pattern lets test-savvy students pick the most-elaborated option without engaging with content.
How to fix: In the distractor generator, instruct: distractors must cite the SAME NUMBER of factors/mechanisms/criteria as the correct answer — distractors are wrong because the factors are wrong, not because there are fewer of them.
Distractors don't carry absolute language the correct answer lacks
What this checks: Across the course, no more than ~25% of items should have ≥2 distractors containing hard absolutes (always, never, only, all, none, every, must) while the correct answer has none.
Why it matters: Test-savvy students eliminate options with absolutes by reflex — if your distractors carry them and your correct answers don't, students can pass without content knowledge.
How to fix: Either (a) instruct the distractor generator to avoid hard absolutes entirely, or (b) ensure the correct answer carries an absolute when distractors do (rare but valid for content reasons).

Question quality (AI review)

Correct answer is unambiguously correct
What this checks: An AI reviewer (Claude Sonnet) confirms that the keyed answer is the unambiguous best answer (no other choice is equally defensible).
Why it matters: Items with two defensible answers frustrate strong students and undermine the validity of the question.
How to fix: Tighten the stem or the distractors so only one answer is defensible.
Wrong answers are plausible
What this checks: An AI reviewer confirms that each wrong answer would be picked by a student with a real misconception (not absurd or trivially obvious).
Why it matters: Implausible distractors make the item easy to guess and don't help diagnose what students don't understand.
How to fix: Replace weak distractors with ones grounded in known student misconceptions for the topic.
Question stem is clear
What this checks: An AI reviewer confirms the stem is self-contained, unambiguous, and doesn't cue the answer.
Why it matters: Ambiguous stems test reading skill rather than the targeted content knowledge.
How to fix: Rewrite the stem to be specific, complete, and free of giveaway cues.
Explanation actually teaches
What this checks: An AI reviewer confirms the explanation explains WHY the correct answer is correct AND WHY each distractor is wrong (not just 'Correct!').
Why it matters: Trivial explanations like 'Correct!' miss the most valuable teaching moment in the entire item.
How to fix: Rewrite the explanation to walk through the reasoning for the correct answer and address the most likely wrong-answer choices.

Lesson↔question alignment (AI review)

Question demand matches lesson goal
What this checks: An AI reviewer compares the cognitive level the question demands (recall vs analyze vs synthesize) to the level the lesson is meant to build.
Why it matters: If the lesson teaches mechanism but the questions only test vocabulary, students 'pass' without learning the targeted skill.
How to fix: Either upgrade the question to require the targeted cognitive level, or move the question to a recall-focused lesson.
Article supports the question
What this checks: For items linked to an article/passage, an AI reviewer confirms the article contains what's needed to answer the question.
Why it matters: If the article doesn't teach what the question tests, the practice is broken — students are being tested on content they weren't given.
How to fix: Either expand the article to cover the missing content, or move the question to a different article that teaches it.

Course pedagogy (AI review)

Course is right-scoped for the gap
What this checks: An AI curriculum reviewer judges whether the course is tightly scoped to the named gap (not over-bloated, not so thin it misses the gap).
Why it matters: Hole-filling courses should fill the specific gap; rebuilds and stretched-thin courses both miss the point.
How to fix: Trim irrelevant content if over-broad, or add focused articles/practice if under-scoped.
Articles teach what practice tests
What this checks: An AI reviewer judges whether articles teach a concept first, then practice tests that concept, in a clear sequence.
Why it matters: Random ordering or orphan questions break the teaching loop — students get tested on things they were never taught.
How to fix: Reorder so each concept appears in an article before the related practice item.
Course names and corrects misconceptions
What this checks: An AI reviewer judges whether the course explicitly names the likely wrong-thinking patterns and corrects them.
Why it matters: Hole-filling exists because the student got something wrong — generic content review without naming the misconception is unlikely to fix it.
How to fix: Add explicit 'common mistake' or 'why students confuse X with Y' sections to articles.
Question demand matches the AP skill
What this checks: An AI reviewer judges whether the cognitive demand of the questions matches the AP skill being remediated.
Why it matters: A 'mechanism' gap requires explain-the-mechanism items; a 'discrimination' gap requires distinguishing-cases items.
How to fix: Re-author questions to target the specific AP skill type the course names.
No pattern-matching shortcuts
What this checks: An AI reviewer judges whether students CAN'T pass the exit ticket via shortcuts like answer-length bias or keyword cueing.
Why it matters: If the student can pass without learning the concept, the course's value is zero regardless of how good the teaching is.
How to fix: Audit for length bias, keyword cueing in stems, and vocabulary lifted verbatim from articles into correct answers.

FRQ quality

Prompt is real (not a placeholder)
What this checks: FRQ stems must not be placeholders like 'test prompt' or 'TODO'.
Why it matters: Placeholder prompts ship to students and make the course visibly broken.
How to fix: Replace the placeholder with the actual FRQ prompt.
FRQ has a rubric
What this checks: Every FRQ must have a non-empty rubric.
Why it matters: Without a rubric, the AI grader has no scoring criteria and student responses cannot be graded.
How to fix: Author a rubric with point allocations and accepted answer paths.
Rubric uses criteria language
What this checks: The rubric should mention point allocations (e.g., '1 point') or scoring criteria language.
Why it matters: Free-form rubric text without explicit criteria is hard for the grader to apply consistently.
How to fix: Restructure the rubric into explicit criteria with point values.
Autograder URL is set
What this checks: Each FRQ must point to an autograder URL.
Why it matters: Without a grader URL, FRQ submissions go nowhere — students don't get scores or feedback.
How to fix: Wire the FRQ to a configured autograder.
Autograder URL is well-formed
What this checks: The autograder URL must use a valid scheme and not contain typos like the old /api/ prefix or doubled https://.
Why it matters: Malformed URLs silently fail to grade — the platform does not warn you.
How to fix: Verify the URL against the canonical grader endpoint format.
FRQ marked as required response
What this checks: The interaction element must have required="true" so students can't skip it.
Why it matters: Without required="true", students can advance without responding.
How to fix: Set the required attribute on the FRQ interaction.
Expected response length set
What this checks: The FRQ should declare an expected number of lines so the response box is sized correctly.
Why it matters: Without a sizing hint, students see a tiny box for a long-essay prompt and write less than they should.
How to fix: Add the expected-lines attribute matching the rubric's expected response length.
FRQ outcome declarations are canonical
What this checks: The FRQ should declare the canonical set of outcome variables (API_RESPONSE, FEEDBACK_VISIBILITY, GENERATED_FEEDBACK, SCORE).
Why it matters: Non-canonical outcomes prevent the grader from writing scores back or showing generated feedback to students.
How to fix: Migrate the FRQ XML to the canonical outcome declaration pattern.
Rubric is in the correct QTI location
What this checks: The rubric must live in <qti-rubric-block> inside <qti-item-body> per the QTI spec — not in metadata.rubric or metadata.modelAnswer.
Why it matters: Rubrics in non-standard fields are invisible to the platform's grading tooling and to graders who follow the spec — the rubric content might exist but won't be applied where it counts.
How to fix: Move the rubric content from metadata.rubric (or metadata.modelAnswer) into a <qti-rubric-block use="ext:criteria" view="scorer"> element inside <qti-item-body>, with one block per criterion.
FRQ uses autograding (preferred)
What this checks: The FRQ should be configured for autograding rather than manual/human scoring.
Why it matters: Manual scoring is allowed, but autograded FRQs scale better and give students faster feedback — autograding is preferred where feasible.
How to fix: If the FRQ is currently set to scoringType="manual" or requiresHumanScoring=true, consider building an autograder for it.
FRQ prompt is clear
What this checks: An AI reviewer confirms the prompt is unambiguous and well-specified (clear task verb, scope, expected response form).
Why it matters: Ambiguous prompts get a wide range of responses that the rubric can't fairly score.
How to fix: Tighten the prompt's task verb, add scope constraints, and specify response form (essay/list/diagram).
Rubric scores what the prompt asks
What this checks: An AI reviewer confirms the rubric criteria map to what the prompt asks (no orphan criteria, no unscored prompt parts).
Why it matters: Misalignment means students do what the prompt asks but get scored on something else.
How to fix: Walk through the prompt and rubric line by line; ensure each prompt part has a scoring path.
FRQs collectively serve the course goal
What this checks: An AI curriculum reviewer judges whether the FRQs in the course (taken together) actually work toward the course's named goal — and that no FRQ ships with broken/placeholder content.
Why it matters: Even if individual FRQs are fine, an off-target FRQ set or any single broken FRQ undermines the whole course.
How to fix: Re-author off-target FRQs to match the gap; fix or remove any broken FRQs.