Course Quality Check

Run the same QC pipeline used in the team-wide audit on a single Timeback course before shipping it to students.

Stella Cole -- APWH Intervention

course-04124f7fd5 · AP World History · Hole-filling course
3 issues to address before shipping:
  • Longest answer isn't the correct one too often
  • Correct answers aren't systematically longer
  • No pattern-matching shortcuts
Score: 208 · Significant

Detailed findings

Each section below explains what was checked, what we found, and what to do about it. Lower scores are better; the score combines all check results.

What the AI reviewer said about the course

The course's biggest strength is its tight, evidence-based remediation design: it cites the student's exact diagnostic errors, names the misconceptions, and drills production of precise AP terminology through coordinated articles and FRQs. Its biggest weakness is severe length/elaboration bias in MCQ construction—correct answers average 31 characters longer than distractors and are the unique longest option 52.5% of the time, giving a test-savvy student a reliable shortcut that bypasses the very content knowledge the course aims to build.

Course pedagogy review (5 dimensions)

Passes
Course is right-scoped for the gap
The course is tightly scoped to the specific gaps identified on Stella's diagnostic. The welcome article explicitly cites her 46/55 MCQ vs 21/55 writing score and lists the exact question numbers (Q11-Q51) where she struggled. Each session targets concrete misses—Tamerlane's collapse mechanism, Mughal political marriages, the Enlightenment-as-engine, the tributary system, the Ottoman-WWI causal chain—and articles repeatedly reference her specific wrong answers (e.g., "you wrote 'to get more land'") to correct them. This is a remediation course, not a full APWH rebuild.
Passes
Articles teach what practice tests
Each session follows a consistent pattern: a brief preview article naming the concepts, a content article that walks through each diagnostic error with the correct answer, and then MCQs and FRQs that test exactly those concepts. Practice items map cleanly back to the article content—e.g., Session 5 articles teach the tributary/Canton systems, and Q30, Q33, Q34, Q42 test those exact mechanisms. FRQ prompts even cite the student's diagnostic answers and ask her to name the precise term taught in the article.
Passes
Course names and corrects misconceptions
This is one of the course's strongest features. Articles explicitly quote the student's wrong answers and name the misconception (e.g., "you wrote 'to get more land'—that misses the crucial context: this was about recovery of losses, not new conquest"; "syncretism describes cultural blending; the wedding's significance was political"). MCQ distractors frequently embody the exact misconceptions she had on the diagnostic (Q14, Q41, Q56 are explicit "which best corrects this claim?" items). The mechanism-vs.-cause distinction for the slave trade is surfaced repeatedly.
Passes
Question demand matches the AP skill
The named gap is producing precise AP terminology rather than gesturing at concepts. The FRQs directly target this—they require Stella to "name the specific AP concept" in sentence 1 and explain the mechanism in sentence 2, which is exactly the production-from-memory skill the diagnostic flagged. MCQs go beyond term-matching by asking students to distinguish causes from mechanisms (Q23, Q49), select the corrective claim (Q14, Q41, Q47, Q56), or identify structural reasoning (Q5, Q34). Demand matches the gap.
Issue
No pattern-matching shortcuts
The deterministic measurements show strict-longest-correct rate at 52.5% (threshold 35%) and mean correct-answer length exceeds distractor length by 31.3 chars (threshold 8). Both metrics substantially exceed thresholds, indicating a test-savvy student could pick the longest, most elaborated option to score well above chance without content mastery. Specificity asymmetry (20%) and absolute-language asymmetry (2.5%) are within thresholds, but the length bias alone is severe enough to fail this check. Examples include Q1, Q5, Q8, Q9, Q29, Q34, Q43, Q50 where the correct answer is markedly longer and more elaborated.

Course-level statistical signals

Issue
Longest answer isn't the correct one too often
21/40 = 52.5% of items have the correct answer as the strictly longest option with a ≥8-char gap (threshold ≤35%, chance = 25%)
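This measurement is deterministic and easy to reproduce outside the pipeline. A minimal sketch in Python: the item shape (a list of choice strings plus a correct-answer index) is a hypothetical stand-in for the real course data model, while the ≥8-char gap comes from the check definition.

```python
def strict_longest_correct_rate(items, min_gap=8):
    """Fraction of items whose keyed answer is the strictly longest
    choice, with at least `min_gap` characters over the runner-up."""
    flagged = 0
    for item in items:
        lengths = [len(choice) for choice in item["choices"]]
        correct_len = lengths[item["correct"]]
        other_max = max(l for i, l in enumerate(lengths)
                        if i != item["correct"])
        # tiny gaps are measurement noise, so require a meaningful gap
        if correct_len - other_max >= min_gap:
            flagged += 1
    return flagged / len(items)
```

For this course the measured rate is 21/40 = 52.5%, against a 35% threshold and a 25% chance baseline.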
Passes
Correct answers are spread across A/B/C/D
χ² = 0.80 (critical value at p < .01: 11.34); distribution: A=10, B=10, C=8, D=12
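The position-balance statistic is a plain chi-square goodness-of-fit test against a uniform expectation over the four positions. A stdlib-only sketch, using the counts reported above:

```python
def position_chi_square(counts):
    """Chi-square statistic for correct-answer position counts
    against a uniform expectation (df = positions - 1)."""
    n = sum(counts.values())
    expected = n / len(counts)  # 10 per position for 40 items
    return sum((obs - expected) ** 2 / expected
               for obs in counts.values())
```

With the df = 3 critical value of 11.34 at p < .01, the observed 0.80 is nowhere near significant, so the check passes.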
Issue
Correct answers aren't systematically longer
mean correct = 134.9 chars, mean distractor = 103.6 chars, diff = +31.3, ratio = 1.30 (thresholds: diff ≤ 8, ratio ≤ 1.25)
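The course-level length bias is just the difference and ratio of mean choice lengths, pooled over all items. A sketch under the same hypothetical item shape as above, with the thresholds taken from the report:

```python
def length_bias(items):
    """Return (diff, ratio) of mean correct-answer length vs mean
    distractor length; the report flags diff > 8 or ratio > 1.25."""
    correct_lens, distractor_lens = [], []
    for item in items:
        for i, choice in enumerate(item["choices"]):
            bucket = correct_lens if i == item["correct"] else distractor_lens
            bucket.append(len(choice))
    mean_c = sum(correct_lens) / len(correct_lens)
    mean_d = sum(distractor_lens) / len(distractor_lens)
    return mean_c - mean_d, mean_c / mean_d
```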
Passes
Correct answers don't systematically cite more factors than distractors
8/40 = 20.0% of items have correct answer citing ≥2 more factors than ALL distractors (threshold ≤25%)
Passes
Distractors don't carry absolute language the correct answer lacks
1/40 = 2.5% of items have ≥2 distractors with hard absolutes (always/never/only/all/none/every/must) while correct does not (threshold ≤25%)
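The absolutes check is a word-list scan. A sketch using the word list from the check definition; the punctuation stripping is a hypothetical detail of this sketch:

```python
ABSOLUTES = {"always", "never", "only", "all", "none", "every", "must"}

def has_absolute(text):
    """True if the text contains a hard-absolute word."""
    return any(word.strip(".,;:!?'\"") in ABSOLUTES
               for word in text.lower().split())

def absolute_asymmetry(item):
    """True when >= 2 distractors carry a hard absolute while the
    correct answer carries none."""
    correct = item["choices"][item["correct"]]
    distractors = [c for i, c in enumerate(item["choices"])
                   if i != item["correct"]]
    return (not has_absolute(correct)
            and sum(map(has_absolute, distractors)) >= 2)
```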

Item-level issues found

Issue                                      Items affected
No absolute language traps                 13 of 40 (32%)
No wildly long answer                      10 of 40 (25%)
Stem is a question                          6 of 40 (15%)
Correct answer is unambiguously correct     9 of 40 (22%)
Wrong answers are plausible                21 of 40 (52%)
Question stem is clear                     13 of 40 (32%)
Explanation actually teaches               40 of 40 (100%)
Question demand matches lesson goal        19 of 40 (47%)

FRQ checks (16 FRQs in this course)

Issue                                      FRQs affected
FRQ prompt is clear                         6 of 16 (37%)
Rubric scores what the prompt asks          3 of 15 (20%)
Passes
FRQs collectively serve the course goal
This is an exceptionally well-aligned gap-fill course. The diagnostic identified Stella's specific weakness — she recognizes content (84% MCQ) but cannot name precise AP mechanisms in writing (38% FRQ), consistently substituting vague descriptions or naming effects instead of causes/mechanisms. Every FRQ directly targets this gap with the correct cognitive demand. Each session pairs a "two-sentence rule" precision drill (FRQs 1, 3, 5, 7, 9, 11, 13, 15) — explicitly forcing her to name the precise AP term and explain the mechanism, often referencing her exact diagnostic errors ("Your diagnostic answer: 'idk ngl'") — with a stimulus-based analysis FRQ (2, 4, 6, 8, 10, 12, 14, 16) that requires her to deploy those terms in extended writing. The articles teach the same mechanisms (Tamerlane's internal fragmentation, Enlightenment as cause vs. reform as effect, plantation demand vs. slave-trade mechanism, tributary/Canton systems, decolonization vs. Cold War alliances, mandate system, Soviet dissolution factors) that the FRQs then test. The cognitive demand matches the gap exactly: precision-of-naming and cause-vs-mechanism discrimination, not generic synthesis. No orphan content, no format mismatches.

Glossary

Every check explained — what it looks at, why it matters, how to fix it.


Item structure & format

Right number of choices
What this checks: Each MCQ should have exactly four answer choices.
Why it matters: AP exam MCQs always have four options; deviation breaks student expectations and platform rendering.
How to fix: Edit the item to have exactly four choices.
Has an explanation
What this checks: Every MCQ must include an explanation of why the right answer is right.
Why it matters: Students need to learn from wrong answers — without an explanation, the question only tests, it doesn't teach.
How to fix: Add an explanation field that walks through why the keyed answer is correct.
Correct-answer letter exists
What this checks: The keyed correct-answer letter must actually match one of the choices.
Why it matters: If the key says 'C' but there's no choice C, students cannot ever get credit.
How to fix: Re-check the answer key against the choices.
No placeholder text
What this checks: The item must not contain markers like [TODO], [INSERT], or [PLACEHOLDER].
Why it matters: Placeholders that escape into production are visible to students and signal an unfinished item.
How to fix: Replace the placeholder with the intended content.
No wildly long answer
What this checks: No single answer choice should be more than twice as long as the shortest choice in the same item.
Why it matters: Length asymmetry lets test-savvy students guess the correct answer without reading carefully.
How to fix: Trim the long choice or expand the short ones so the four options are similar in length.
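Under that rule the per-item check reduces to a single comparison. A sketch, again assuming a hypothetical item shape with a "choices" list:

```python
def has_wildly_long_choice(item):
    """True if any choice is more than twice as long as the
    shortest choice in the same item."""
    lengths = [len(c) for c in item["choices"]]
    return max(lengths) > 2 * min(lengths)
```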
All choices distinct
What this checks: All four choices must be different from each other.
Why it matters: Duplicate choices waste a slot and confuse students.
How to fix: Rewrite duplicate choices.
Stem is a question
What this checks: The stem should end with a question mark or use clear question language ('Which...', 'What...', 'Best describes...').
Why it matters: Statement-style stems leave students guessing what they're being asked.
How to fix: Rewrite the stem as a clear question.
No 'all/none of the above'
What this checks: Choices must not include 'all of the above' or 'none of the above'.
Why it matters: These trap options test reading strategy more than content knowledge and aren't used on the AP exam.
How to fix: Replace the option with a real distractor that targets a misconception.
No absolute language traps
What this checks: Choices must not contain absolute words like 'always', 'never', 'only', 'must' (especially when only the wrong answers contain them).
Why it matters: Test-savvy students learn to eliminate any choice with an absolute, regardless of content.
How to fix: Soften the absolute language, or move it into the correct answer too if it's content-relevant.
Wrong answers have rationale
What this checks: Each wrong answer should be backed by a rationale explaining what misconception it targets.
Why it matters: Distractors written without a rationale tend to be implausible or arbitrary, missing the chance to diagnose student thinking.
How to fix: For each distractor, add a one-line note describing the misconception or partial understanding it represents.
Wrong-answer feedback shown to students
What this checks: When a student picks a wrong answer, they should see feedback explaining why it's wrong.
Why it matters: Per-distractor feedback is the most direct teaching moment — silence on a wrong answer is a missed opportunity.
How to fix: Add per-choice feedback so students see specific guidance when they choose wrong.

Course-level answer bias

Longest answer isn't the correct one too often
What this checks: Across the course, the correct answer should NOT be the strictly-longest option (with a meaningful ≥8-char gap to the next-longest) more than ~35% of the time. Chance baseline is 25%. Tiny gaps (1–4 chars) are filtered out as measurement noise — invisible to students.
Why it matters: If the correct answer is consistently and noticeably the longest, students can pass the course by always picking the longest option without learning anything.
How to fix: For items where the correct answer is meaningfully longer, either trim the correct answer or expand the distractors so they match in length.
Correct answers are spread across A/B/C/D
What this checks: Across the course, correct-answer positions should be roughly evenly distributed across A/B/C/D (statistical chi-square test).
Why it matters: Position skew lets students guess from a pattern (e.g., 'always pick C') instead of from understanding.
How to fix: Re-shuffle the position of correct answers so they're balanced across A/B/C/D.
Correct answers aren't systematically longer
What this checks: Across the course, the average length of correct answers should be close to the average length of wrong answers (not more than ~25% longer).
Why it matters: Even if no single item triggers the per-item length check, a systematic length difference across the course rewards length-guessing strategies.
How to fix: Audit the course for items where the correct answer is noticeably longer; equalize lengths.
Correct answers don't systematically cite more factors than distractors
What this checks: Across the course, no more than ~25% of items should have the correct answer enumerating ≥2 more factors/mechanisms (counted via 'and' + comma joiners) than ALL distractors. A +1 factor gap is treated as counting noise (e.g., appositive commas counted as list items) and ignored.
Why it matters: Even when option lengths are balanced, a systematic 'correct answer integrates more factors' pattern lets test-savvy students pick the most-elaborated option without engaging with content.
How to fix: In the distractor generator, instruct: distractors must cite the SAME NUMBER of factors/mechanisms/criteria as the correct answer — distractors are wrong because the factors are wrong, not because there are fewer of them.
Distractors don't carry absolute language the correct answer lacks
What this checks: Across the course, no more than ~25% of items should have ≥2 distractors containing hard absolutes (always, never, only, all, none, every, must) while the correct answer has none.
Why it matters: Test-savvy students eliminate options with absolutes by reflex — if your distractors carry them and your correct answers don't, students can pass without content knowledge.
How to fix: Either (a) instruct the distractor generator to avoid hard absolutes entirely, or (b) ensure the correct answer carries an absolute when distractors do (rare but valid for content reasons).

Question quality (AI review)

Correct answer is unambiguously correct
What this checks: An AI reviewer (Claude Sonnet) confirms that the keyed answer is the unambiguous best answer (no other choice is equally defensible).
Why it matters: Items with two defensible answers frustrate strong students and undermine the validity of the question.
How to fix: Tighten the stem or the distractors so only one answer is defensible.
Wrong answers are plausible
What this checks: An AI reviewer confirms that each wrong answer would be picked by a student with a real misconception (not absurd or trivially obvious).
Why it matters: Implausible distractors make the item easy to guess and don't help diagnose what students don't understand.
How to fix: Replace weak distractors with ones grounded in known student misconceptions for the topic.
Question stem is clear
What this checks: An AI reviewer confirms the stem is self-contained, unambiguous, and doesn't cue the answer.
Why it matters: Ambiguous stems test reading skill rather than the targeted content knowledge.
How to fix: Rewrite the stem to be specific, complete, and free of giveaway cues.
Explanation actually teaches
What this checks: An AI reviewer confirms the explanation explains WHY the correct answer is correct AND WHY each distractor is wrong (not just 'Correct!').
Why it matters: Trivial explanations like 'Correct!' miss the most valuable teaching moment in the entire item.
How to fix: Rewrite the explanation to walk through the reasoning for the correct answer and address the most likely wrong-answer choices.

Lesson↔question alignment (AI review)

Question demand matches lesson goal
What this checks: An AI reviewer compares the cognitive level the question demands (recall vs analyze vs synthesize) to the level the lesson is meant to build.
Why it matters: If the lesson teaches mechanism but the questions only test vocabulary, students 'pass' without learning the targeted skill.
How to fix: Either upgrade the question to require the targeted cognitive level, or move the question to a recall-focused lesson.
Article supports the question
What this checks: For items linked to an article/passage, an AI reviewer confirms the article contains what's needed to answer the question.
Why it matters: If the article doesn't teach what the question tests, the practice is broken — students are being tested on content they weren't given.
How to fix: Either expand the article to cover the missing content, or move the question to a different article that teaches it.

Course pedagogy (AI review)

Course is right-scoped for the gap
What this checks: An AI curriculum reviewer judges whether the course is tightly scoped to the named gap (not over-bloated, not so thin it misses the gap).
Why it matters: Hole-filling courses should fill the specific gap; rebuilds and stretched-thin courses both miss the point.
How to fix: Trim irrelevant content if over-broad, or add focused articles/practice if under-scoped.
Articles teach what practice tests
What this checks: An AI reviewer judges whether articles teach a concept first, then practice tests that concept, in a clear sequence.
Why it matters: Random ordering or orphan questions break the teaching loop — students get tested on things they were never taught.
How to fix: Reorder so each concept appears in an article before the related practice item.
Course names and corrects misconceptions
What this checks: An AI reviewer judges whether the course explicitly names the likely wrong-thinking patterns and corrects them.
Why it matters: Hole-filling exists because the student got something wrong — generic content review without naming the misconception is unlikely to fix it.
How to fix: Add explicit 'common mistake' or 'why students confuse X with Y' sections to articles.
Question demand matches the AP skill
What this checks: An AI reviewer judges whether the cognitive demand of the questions matches the AP skill being remediated.
Why it matters: A 'mechanism' gap requires explain-the-mechanism items; a 'discrimination' gap requires distinguishing-cases items.
How to fix: Re-author questions to target the specific AP skill type the course names.
No pattern-matching shortcuts
What this checks: An AI reviewer judges whether students CAN'T pass the exit ticket via shortcuts like answer-length bias or keyword cueing.
Why it matters: If the student can pass without learning the concept, the course's value is zero regardless of how good the teaching is.
How to fix: Audit for length bias, keyword cueing in stems, and vocabulary lifted verbatim from articles into correct answers.

FRQ quality

Prompt is real (not a placeholder)
What this checks: FRQ stems must not be placeholders like 'test prompt' or 'TODO'.
Why it matters: Placeholder prompts ship to students and make the course visibly broken.
How to fix: Replace the placeholder with the actual FRQ prompt.
FRQ has a rubric
What this checks: Every FRQ must have a non-empty rubric.
Why it matters: Without a rubric, the AI grader has no scoring criteria and student responses cannot be graded.
How to fix: Author a rubric with point allocations and accepted answer paths.
Rubric uses criteria language
What this checks: The rubric should mention point allocations (e.g., '1 point') or scoring criteria language.
Why it matters: Free-form rubric text without explicit criteria is hard for the grader to apply consistently.
How to fix: Restructure the rubric into explicit criteria with point values.
Autograder URL is set
What this checks: Each FRQ must point to an autograder URL.
Why it matters: Without a grader URL, FRQ submissions go nowhere — students don't get scores or feedback.
How to fix: Wire the FRQ to a configured autograder.
Autograder URL is well-formed
What this checks: The autograder URL must use a valid scheme and not contain typos like the old /api/ prefix or doubled https://.
Why it matters: Malformed URLs silently fail to grade — the platform does not warn you.
How to fix: Verify the URL against the canonical grader endpoint format.
FRQ marked as required response
What this checks: The interaction element must have required="true" so students can't skip it.
Why it matters: Without required="true", students can advance without responding.
How to fix: Set the required attribute on the FRQ interaction.
Expected response length set
What this checks: The FRQ should declare an expected number of lines so the response box is sized correctly.
Why it matters: Without a sizing hint, students see a tiny box for a long-essay prompt and write less than they should.
How to fix: Add the expected-lines attribute matching the rubric's expected response length.
FRQ outcome declarations are canonical
What this checks: The FRQ should declare the canonical set of outcome variables (API_RESPONSE, FEEDBACK_VISIBILITY, GENERATED_FEEDBACK, SCORE).
Why it matters: Non-canonical outcomes prevent the grader from writing scores back or showing generated feedback to students.
How to fix: Migrate the FRQ XML to the canonical outcome declaration pattern.
Rubric is in the correct QTI location
What this checks: The rubric must live in <qti-rubric-block> inside <qti-item-body> per the QTI spec — not in metadata.rubric or metadata.modelAnswer.
Why it matters: Rubrics in non-standard fields are invisible to the platform's grading tooling and to graders who follow the spec — the rubric content might exist but won't be applied where it counts.
How to fix: Move the rubric content from metadata.rubric (or metadata.modelAnswer) into a <qti-rubric-block use="ext:criteria" view="scorer"> element inside <qti-item-body>, with one block per criterion.
FRQ uses autograding (preferred)
What this checks: The FRQ should be configured for autograding rather than manual/human scoring.
Why it matters: Manual scoring is allowed, but autograded FRQs scale better and give students faster feedback — autograding is preferred where feasible.
How to fix: If the FRQ is currently set to scoringType="manual" or requiresHumanScoring=true, consider building an autograder for it.
FRQ prompt is clear
What this checks: An AI reviewer confirms the prompt is unambiguous and well-specified (clear task verb, scope, expected response form).
Why it matters: Ambiguous prompts get a wide range of responses that the rubric can't fairly score.
How to fix: Tighten the prompt's task verb, add scope constraints, and specify response form (essay/list/diagram).
Rubric scores what the prompt asks
What this checks: An AI reviewer confirms the rubric criteria map to what the prompt asks (no orphan criteria, no unscored prompt parts).
Why it matters: Misalignment means students do what the prompt asks but get scored on something else.
How to fix: Walk through the prompt and rubric line by line; ensure each prompt part has a scoring path.
FRQs collectively serve the course goal
What this checks: An AI curriculum reviewer judges whether the FRQs in the course (taken together) actually work toward the course's named goal — and that no FRQ ships with broken/placeholder content.
Why it matters: Even if individual FRQs are fine, an off-target FRQ set or any single broken FRQ undermines the whole course.
How to fix: Re-author off-target FRQs to match the gap; fix or remove any broken FRQs.