Course Quality Check

Run the same QC pipeline used in the team-wide audit on a single Timeback course before shipping it to students.

AP Human Geography — Hole-Filler (f51e74e2)

hf-f51e74e2-20260427-0902 · AP Human Geography · Hole-filling course
✓ No critical headline issues — review item-level details below to spot any remaining concerns.
Score: 166 · Moderate

Detailed findings

Each section below explains what was checked, what we found, and what to do about it. Lower scores are better; the score combines all check results.

What the AI reviewer said about the course

The course's biggest strength is its disciplined misconception-first design: each article names the specific wrong-thinking pattern (height-wins, falling-line-falls-globally, aging-equals-growth) and the practice items are engineered to catch students who hold that misconception, with clean answer-engineering that doesn't leak the answer. The biggest weakness is the complete absence of explanations on every question — students who miss an item get no feedback explaining why their distractor was wrong, which undercuts the otherwise strong remediation design.

Course pedagogy review (5 dimensions)

Passes
Course is right-scoped for the gap
The course targets five specific, commonly-confused AP HuG concepts (site/situation, two-line trend graph reading, aging populations/dependency ratio, bid-rent/infrastructure, and diffusion types) with one focused article per gap plus an exit ticket. This is exactly the right scope for a hole-filler — not a full Unit 1-7 rebuild, but each chosen topic represents a high-yield discriminator where students with broad mastery often slip. The articles stay tight on their named gap rather than wandering into adjacent units.
Passes
Articles teach what practice tests
Each article teaches a specific concept with worked examples and an explicit misconception, and the practice questions for that section directly test that concept (e.g., the trend-graph article's four-step protocol maps cleanly onto questions 7-12; the bid-rent "height ≠ winner" point maps onto questions 19, 22, 23, 37). The exit ticket revisits all five gaps in sequence. Practice does not stray into untaught material.
Passes
Course names and corrects misconceptions
Every article contains an explicit "Misconception Alert" or wrong/right contrast table (Chicago-as-hub-is-site, falling-line-means-global-decline, aging-means-fast-growth, tallest-tower-wins, hierarchical-only-means-top-down). Many questions are explicitly built around catching these misconceptions — e.g., Q6 asks why a student's site/situation reasoning is wrong, Q8 names the weighting misconception, Q15 baits the doubling-time inversion, Q17 catches the population-decline-means-low-dependency error.
Passes
Question demand matches the AP skill
The gaps are discrimination/classification gaps (which concept applies, which diffusion type, which factor explains the pattern), and the questions are appropriately scenario-based discrimination items rather than vocabulary recall. Items present novel cities (Bruges, Durban, Vladivostok, São Paulo), novel data scenarios, and ask students to apply the framework — matching the AP stimulus-based MCQ demand level.
Passes
No pattern-matching shortcuts
All deterministic metrics fall well below thresholds — strict-longest-correct is 0% (vs 35% threshold), specificity asymmetry 2.5% (vs 25%), absolute-language asymmetry 15% (vs 25%), and mean length difference is -1.4 chars (correct answers are actually slightly shorter on average). No systematic shortcut exists for a test-savvy student to exploit.

Course-level statistical signals

Passes
Longest answer isn't the correct one too often
0/40 = 0.0% of items have the correct answer as the strictly longest choice with a ≥8-char gap (threshold ≤35%, chance = 25%)
Passes
Correct answers are spread across A/B/C/D
χ²=5.00 (crit p<.01: 11.34); distribution={'A': 6, 'B': 7, 'C': 13, 'D': 14}
Passes
Correct answers aren't systematically longer
mean correct=76.9 chars, mean distractor=78.3 chars, diff=-1.4, ratio=0.98 (thresholds: diff≤8, ratio≤1.25)
Passes
Correct answers don't systematically cite more factors than distractors
1/40 = 2.5% of items have correct answer citing ≥2 more factors than ALL distractors (threshold ≤25%)
Passes
Distractors don't carry absolute language the correct answer lacks
6/40 = 15.0% of items have ≥2 distractors with hard absolutes (always/never/only/all/none/every/must) while correct does not (threshold ≤25%)

Item-level issues found

Issue · Items affected
Has an explanation · 40 of 40 (100%)
Wrong answers have rationale · 40 of 40 (100%)
Wrong-answer feedback shown to students · 40 of 40 (100%)
No absolute language traps · 11 of 40 (27%)
Stem is a question · 1 of 40 (2%)
Correct answer is unambiguously correct · 1 of 40 (2%)
Wrong answers are plausible · 14 of 38 (36%)
Question stem is clear · 9 of 38 (23%)
Explanation actually teaches · 38 of 38 (100%)
Question demand matches lesson goal · 10 of 40 (25%)

Glossary

Every check explained — what it looks at, why it matters, how to fix it.


Item structure & format

Right number of choices
What this checks: Each MCQ should have exactly four answer choices.
Why it matters: AP exam MCQs always have four options; deviation breaks student expectations and platform rendering.
How to fix: Edit the item to have exactly four choices.
Has an explanation
What this checks: Every MCQ must include an explanation of why the right answer is right.
Why it matters: Students need to learn from wrong answers — without an explanation, the question only tests, it doesn't teach.
How to fix: Add an explanation field that walks through why the keyed answer is correct.
Correct-answer letter exists
What this checks: The keyed correct-answer letter must actually match one of the choices.
Why it matters: If the key says 'C' but there's no choice C, students cannot ever get credit.
How to fix: Re-check the answer key against the choices.
No placeholder text
What this checks: The item must not contain markers like [TODO], [INSERT], or [PLACEHOLDER].
Why it matters: Placeholders that escape into production are visible to students and signal an unfinished item.
How to fix: Replace the placeholder with the intended content.
No wildly long answer
What this checks: No single answer choice should be more than twice as long as the shortest choice in the same item.
Why it matters: Length asymmetry lets test-savvy students guess the correct answer without reading carefully.
How to fix: Trim the long choice or expand the short ones so the four options are similar in length.
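The 2× rule above can be sketched as a short check. This is a minimal illustration, assuming each item's choices arrive as a plain list of strings; the real pipeline's data model may differ.

```python
# Sketch of the per-item length check: flag the item if any choice
# is more than twice as long as the shortest choice in the same item.
def has_wildly_long_choice(choices: list[str]) -> bool:
    lengths = [len(c) for c in choices]
    return max(lengths) > 2 * min(lengths)
```

For example, an item pairing a four-character choice with a full-sentence choice is flagged, while four similarly sized choices pass.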
All choices distinct
What this checks: All four choices must be different from each other.
Why it matters: Duplicate choices waste a slot and confuse students.
How to fix: Rewrite duplicate choices.
Stem is a question
What this checks: The stem should end with a question mark or use clear question language ('Which...', 'What...', 'Best describes...').
Why it matters: Statement-style stems leave students guessing what they're being asked.
How to fix: Rewrite the stem as a clear question.
No 'all/none of the above'
What this checks: Choices must not include 'all of the above' or 'none of the above'.
Why it matters: These trap options test reading strategy more than content knowledge and aren't used on the AP exam.
How to fix: Replace the option with a real distractor that targets a misconception.
No absolute language traps
What this checks: Choices must not contain absolute words like 'always', 'never', 'only', 'must' (especially when only the wrong answers contain them).
Why it matters: Test-savvy students learn to eliminate any choice with an absolute, regardless of content.
How to fix: Soften the absolute language, or move it into the correct answer too if it's content-relevant.
Wrong answers have rationale
What this checks: Each wrong answer should be backed by a rationale explaining what misconception it targets.
Why it matters: Distractors written without a rationale tend to be implausible or arbitrary, missing the chance to diagnose student thinking.
How to fix: For each distractor, add a one-line note describing the misconception or partial-understanding it represents.
Wrong-answer feedback shown to students
What this checks: When a student picks a wrong answer, they should see feedback explaining why it's wrong.
Why it matters: Per-distractor feedback is the most direct teaching moment — silence on a wrong answer is a missed opportunity.
How to fix: Add per-choice feedback so students see specific guidance when they choose wrong.

Course-level answer bias

Longest answer isn't the correct one too often
What this checks: Across the course, the correct answer should NOT be the strictly-longest option (with a meaningful ≥8-char gap to the next-longest) more than ~35% of the time. Chance baseline is 25%. Tiny gaps (1–4 chars) are filtered out as measurement noise — invisible to students.
Why it matters: If the correct answer is consistently and noticeably the longest, students can pass the course by always picking the longest option without learning anything.
How to fix: For items where the correct answer is meaningfully longer, either trim the correct or expand the distractors so they match in length.
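The rule above can be sketched in a few lines. This is an illustrative sketch, assuming items are available as `(choices, correct_index)` pairs, which is not necessarily the pipeline's actual data shape.

```python
# Sketch of the course-level longest-answer check: the keyed answer
# counts only if it is strictly longest AND beats the next-longest
# choice by at least 8 characters (smaller gaps are noise).
def strictly_longest_correct(choices, correct_index, min_gap=8):
    lengths = [len(c) for c in choices]
    others = [n for i, n in enumerate(lengths) if i != correct_index]
    return lengths[correct_index] - max(others) >= min_gap

def longest_answer_rate(items):
    """Fraction of flagged items, compared against the ~35% threshold."""
    flagged = sum(strictly_longest_correct(c, i) for c, i in items)
    return flagged / len(items)
```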
Correct answers are spread across A/B/C/D
What this checks: Across the course, correct-answer positions should be roughly evenly distributed across A/B/C/D (statistical chi-square test).
Why it matters: Position skew lets students guess from a pattern (e.g., 'always pick C') instead of from understanding.
How to fix: Re-shuffle the position of correct answers so they're balanced across A/B/C/D.
Correct answers aren't systematically longer
What this checks: Across the course, the average length of correct answers should be close to the average length of wrong answers (not more than ~25% longer).
Why it matters: Even if no single item triggers the per-item length check, a systematic length difference across the course rewards length-guessing strategies.
How to fix: Audit the course for items where the correct answer is noticeably longer; equalize lengths.
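A minimal sketch of this course-level signal, again assuming items as hypothetical `(choices, correct_index)` pairs:

```python
# Mean correct-answer length vs mean distractor length across a course.
# Reported thresholds: diff <= 8 chars, ratio <= 1.25.
def length_bias(items):
    correct, distractors = [], []
    for choices, ci in items:
        for i, choice in enumerate(choices):
            (correct if i == ci else distractors).append(len(choice))
    mean_c = sum(correct) / len(correct)
    mean_d = sum(distractors) / len(distractors)
    return mean_c - mean_d, mean_c / mean_d
```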
Correct answers don't systematically cite more factors than distractors
What this checks: Across the course, no more than ~25% of items should have the correct answer enumerating ≥2 more factors/mechanisms (counted via 'and' + comma joiners) than ALL distractors. A +1 factor gap is treated as counting noise (e.g., appositive commas counted as list items) and ignored.
Why it matters: Even when option lengths are balanced, a systematic 'correct answer integrates more factors' pattern lets test-savvy students pick the most-elaborated option without engaging with content.
How to fix: In the distractor generator, instruct: distractors must cite the SAME NUMBER of factors/mechanisms/criteria as the correct answer — distractors are wrong because the factors are wrong, not because there are fewer of them.
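The joiner-counting heuristic described above can be sketched as follows. This is an interpretation of the stated rule, not the pipeline's actual implementation, and it assumes the same hypothetical `(choices, correct_index)` item shape as elsewhere.

```python
import re

# Rough factor count: 1 base factor plus one per 'and' or comma joiner.
def factor_count(text):
    return 1 + len(re.findall(r'\band\b', text)) + text.count(',')

# Flag only when the correct answer cites >=2 more factors than ALL
# distractors; a +1 gap is treated as counting noise and ignored.
def cites_more_factors(choices, correct_index):
    counts = [factor_count(c) for c in choices]
    others = [n for i, n in enumerate(counts) if i != correct_index]
    return counts[correct_index] - max(others) >= 2
```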
Distractors don't carry absolute language the correct answer lacks
What this checks: Across the course, no more than ~25% of items should have ≥2 distractors containing hard absolutes (always, never, only, all, none, every, must) while the correct answer has none.
Why it matters: Test-savvy students eliminate options with absolutes by reflex — if your distractors carry them and your correct answers don't, students can pass without content knowledge.
How to fix: Either (a) instruct the distractor generator to avoid hard absolutes entirely, or (b) ensure the correct answer carries an absolute when distractors do (rare but valid for content reasons).
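The per-item flag behind this signal can be sketched directly from the word list given above (the item shape is again an assumption for illustration):

```python
import re

# Hard absolutes named by the check, matched as whole words.
ABSOLUTES = re.compile(r'\b(always|never|only|all|none|every|must)\b', re.IGNORECASE)

# Flag an item when >=2 distractors carry a hard absolute
# while the correct answer carries none.
def absolute_trap(choices, correct_index):
    carries = [bool(ABSOLUTES.search(c)) for c in choices]
    if carries[correct_index]:
        return False
    return sum(carries[i] for i in range(len(choices)) if i != correct_index) >= 2
```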

Question quality (AI review)

Correct answer is unambiguously correct
What this checks: An AI reviewer (Claude Sonnet) confirms that the keyed answer is the unambiguous best answer (no other choice is equally defensible).
Why it matters: Items with two defensible answers frustrate strong students and undermine the validity of the question.
How to fix: Tighten the stem or the distractors so only one answer is defensible.
Wrong answers are plausible
What this checks: An AI reviewer confirms that each wrong answer would be picked by a student with a real misconception (not absurd or trivially obvious).
Why it matters: Implausible distractors make the item easy to guess and don't help diagnose what students don't understand.
How to fix: Replace weak distractors with ones grounded in known student misconceptions for the topic.
Question stem is clear
What this checks: An AI reviewer confirms the stem is self-contained, unambiguous, and doesn't cue the answer.
Why it matters: Ambiguous stems test reading skill rather than the targeted content knowledge.
How to fix: Rewrite the stem to be specific, complete, and free of giveaway cues.
Explanation actually teaches
What this checks: An AI reviewer confirms the explanation explains WHY the correct answer is correct AND WHY each distractor is wrong (not just 'Correct!').
Why it matters: Trivial explanations like 'Correct!' miss the most valuable teaching moment in the entire item.
How to fix: Rewrite the explanation to walk through the reasoning for the correct answer and address the most likely wrong-answer choices.

Lesson↔question alignment (AI review)

Question demand matches lesson goal
What this checks: An AI reviewer compares the cognitive level the question demands (recall vs analyze vs synthesize) to the level the lesson is meant to build.
Why it matters: If the lesson teaches mechanism but the questions only test vocabulary, students 'pass' without learning the targeted skill.
How to fix: Either upgrade the question to require the targeted cognitive level, or move the question to a recall-focused lesson.
Article supports the question
What this checks: For items linked to an article/passage, an AI reviewer confirms the article contains what's needed to answer the question.
Why it matters: If the article doesn't teach what the question tests, the practice is broken — students are being tested on content they weren't given.
How to fix: Either expand the article to cover the missing content, or move the question to a different article that teaches it.

Course pedagogy (AI review)

Course is right-scoped for the gap
What this checks: An AI curriculum reviewer judges whether the course is tightly scoped to the named gap (not over-bloated, not so thin it misses the gap).
Why it matters: Hole-filling courses should fill the specific gap; rebuilds and stretched-thin courses both miss the point.
How to fix: Trim irrelevant content if over-broad, or add focused articles/practice if under-scoped.
Articles teach what practice tests
What this checks: An AI reviewer judges whether articles teach a concept first, then practice tests that concept, in a clear sequence.
Why it matters: Random ordering or orphan questions break the teaching loop — students get tested on things they were never taught.
How to fix: Reorder so each concept appears in an article before the related practice item.
Course names and corrects misconceptions
What this checks: An AI reviewer judges whether the course explicitly names the likely wrong-thinking patterns and corrects them.
Why it matters: Hole-filling exists because the student got something wrong — generic content review without naming the misconception is unlikely to fix it.
How to fix: Add explicit 'common mistake' or 'why students confuse X with Y' sections to articles.
Question demand matches the AP skill
What this checks: An AI reviewer judges whether the cognitive demand of the questions matches the AP skill being remediated.
Why it matters: A 'mechanism' gap requires explain-the-mechanism items; a 'discrimination' gap requires distinguishing-cases items.
How to fix: Re-author questions to target the specific AP skill type the course names.
No pattern-matching shortcuts
What this checks: An AI reviewer judges whether students CAN'T pass the exit ticket via shortcuts like answer-length bias or keyword cueing.
Why it matters: If the student can pass without learning the concept, the course's value is zero regardless of how good the teaching is.
How to fix: Audit for length bias, keyword cueing in stems, and vocabulary lifted verbatim from articles into correct answers.

FRQ quality

Prompt is real (not a placeholder)
What this checks: FRQ stems must not be placeholders like 'test prompt' or 'TODO'.
Why it matters: Placeholder prompts ship to students and make the course visibly broken.
How to fix: Replace the placeholder with the actual FRQ prompt.
FRQ has a rubric
What this checks: Every FRQ must have a non-empty rubric.
Why it matters: Without a rubric, the AI grader has no scoring criteria and student responses cannot be graded.
How to fix: Author a rubric with point allocations and accepted answer paths.
Rubric uses criteria language
What this checks: The rubric should mention point allocations (e.g., '1 point') or scoring criteria language.
Why it matters: Free-form rubric text without explicit criteria is hard for the grader to apply consistently.
How to fix: Restructure the rubric into explicit criteria with point values.
Autograder URL is set
What this checks: Each FRQ must point to an autograder URL.
Why it matters: Without a grader URL, FRQ submissions go nowhere — students don't get scores or feedback.
How to fix: Wire the FRQ to a configured autograder.
Autograder URL is well-formed
What this checks: The autograder URL must use a valid scheme and not contain typos like the old /api/ prefix or doubled https://.
Why it matters: Malformed URLs silently fail to grade — the platform does not warn you.
How to fix: Verify the URL against the canonical grader endpoint format.
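The well-formedness rules named above can be sketched as a validator. This is an interpretation of the stated rules only; the example hostname is hypothetical, not the real grader endpoint.

```python
from urllib.parse import urlparse

# Sketch of the stated rules: one valid http(s) scheme with a host,
# no doubled scheme, and no legacy /api/ path prefix.
def grader_url_ok(url: str) -> bool:
    if url.count('://') > 1:                      # doubled https:// etc.
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ('https', 'http') or not parsed.netloc:
        return False
    if parsed.path.startswith('/api/'):           # old prefix typo
        return False
    return True
```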
FRQ marked as required response
What this checks: The interaction element must have required="true" so students can't skip it.
Why it matters: Without required="true", students can advance without responding.
How to fix: Set the required attribute on the FRQ interaction.
Expected response length set
What this checks: The FRQ should declare an expected number of lines so the response box is sized correctly.
Why it matters: Without a sizing hint, students see a tiny box for a long-essay prompt and write less than they should.
How to fix: Add the expected-lines attribute matching the rubric's expected response length.
FRQ outcome declarations are canonical
What this checks: The FRQ should declare the canonical set of outcome variables (API_RESPONSE, FEEDBACK_VISIBILITY, GENERATED_FEEDBACK, SCORE).
Why it matters: Non-canonical outcomes prevent the grader from writing scores back or showing generated feedback to students.
How to fix: Migrate the FRQ XML to the canonical outcome declaration pattern.
Rubric is in the correct QTI location
What this checks: The rubric must live in <qti-rubric-block> inside <qti-item-body> per the QTI spec — not in metadata.rubric or metadata.modelAnswer.
Why it matters: Rubrics in non-standard fields are invisible to the platform's grading tooling and to graders who follow the spec — the rubric content might exist but won't be applied where it counts.
How to fix: Move the rubric content from metadata.rubric (or metadata.modelAnswer) into a <qti-rubric-block use="ext:criteria" view="scorer"> element inside <qti-item-body>, with one block per criterion.
FRQ uses autograding (preferred)
What this checks: The FRQ should be configured for autograding rather than manual/human scoring.
Why it matters: Manual scoring is allowed, but autograded FRQs scale better and give students faster feedback — autograding is preferred where feasible.
How to fix: If the FRQ is currently set to scoringType="manual" or requiresHumanScoring=true, consider building an autograder for it.
FRQ prompt is clear
What this checks: An AI reviewer confirms the prompt is unambiguous and well-specified (clear task verb, scope, expected response form).
Why it matters: Ambiguous prompts get a wide range of responses that the rubric can't fairly score.
How to fix: Tighten the prompt's task verb, add scope constraints, and specify response form (essay/list/diagram).
Rubric scores what the prompt asks
What this checks: An AI reviewer confirms the rubric criteria map to what the prompt asks (no orphan criteria, no unscored prompt parts).
Why it matters: Misalignment means students do what the prompt asks but get scored on something else.
How to fix: Walk through the prompt and rubric line by line; ensure each prompt part has a scoring path.
FRQs collectively serve the course goal
What this checks: An AI curriculum reviewer judges whether the FRQs in the course (taken together) actually work toward the course's named goal — and that no FRQ ships with broken/placeholder content.
Why it matters: Even if individual FRQs are fine, an off-target FRQ set or any single broken FRQ undermines the whole course.
How to fix: Re-author off-target FRQs to match the gap; fix or remove any broken FRQs.