Course Quality Check

Run the same QC pipeline used in the team-wide audit on a single Timeback course before shipping it to students.

Luca Sanchez — APHG MCQ Practice

course-2dd8fbed74 · AP Human Geography · MCQ practice bank
1 issue to address before shipping:
  • Course is right-scoped for the gap
105 · Low / no issues

Detailed findings

Each section below explains what was checked, what we found, and what to do about it. Lower scores are better; the score combines all check results.

What the AI reviewer said about the course

The course's biggest strength is well-constructed, AP-aligned stimulus-based MCQs with appropriate cognitive demand and no exploitable answer-pattern shortcuts. Its biggest weakness is that it is not a hole-filling remediation course at all — it has zero instructional articles, no targeted gap, no misconception remediation in feedback, and spans the entire APHG curriculum, making it function as a generic practice bank rather than a focused brush-up.

Course pedagogy review (5 dimensions)

  • Issue: Course is right-scoped for the gap
    The course is named "APHG MCQ Practice" with no specific gap identified, and the 100 questions span essentially the entire APHG curriculum — population/migration (Units 2-3), cultural geography (Unit 3), political geography (Unit 4), agriculture (Unit 5), urban geography (Unit 6), and industrialization/development (Unit 7). This is a full-course practice bank, not a tightly-scoped hole-filling remediation. For a brush-up course, this breadth violates the compactness virtue — there is no identifiable gap being targeted, just generic comprehensive MCQ practice.
  • n/a: Articles teach what practice tests — not applicable for this archetype
  • n/a: Course names and corrects misconceptions — not applicable for this archetype
  • Passes: Question demand matches the AP skill
    The questions consistently operate at the application/analysis level appropriate to AP Human Geography MCQs. Items require students to apply models (gravity, Burgess, Hoyt, multiple nuclei, DTM stages), interpret data tables and stimuli, and discriminate between similar concepts (e.g., contagious vs. hierarchical diffusion in item 75; nation vs. state in item 60; pull factors vs. natural increase in item 61). The cognitive demand matches what the AP exam requires for stimulus-based MCQs.
  • Passes: No pattern-matching shortcuts
    All deterministic metrics are well within thresholds — strict-longest-correct rate is 0.0% (far below 35%), specificity asymmetry is 4.0% (below 25%), absolute-language asymmetry is 0.0% (below 25%), and the mean length difference is only +0.4 characters (well below 8). Position distribution shows mild C-skew (32%) but not exploitable. No systematic shortcut allows a test-savvy student to bypass content knowledge.

Course-level statistical signals

  • Passes: Longest answer isn't the correct one too often
    0/100 = 0.0% of items have correct = strictly longest with ≥8 char gap (threshold ≤35%, chance=25%)
  • Passes: Correct answers are spread across A/B/C/D
    χ²=3.92 (crit p<.01: 11.34); distribution: A=25, B=25, C=32, D=18
  • Passes: Correct answers aren't systematically longer
    mean correct=102.5 chars, mean distractor=102.1 chars, diff=+0.4, ratio=1.00 (thresholds: diff≤8, ratio≤1.25)
  • Passes: Correct answers don't systematically cite more factors than distractors
    4/100 = 4.0% of items have correct answer citing ≥2 more factors than ALL distractors (threshold ≤25%)
  • Passes: Distractors don't carry absolute language the correct answer lacks
    0/100 = 0.0% of items have ≥2 distractors with hard absolutes (always/never/only/all/none/every/must) while correct does not (threshold ≤25%)

Item-level issues found

Issue                                        Items affected
No absolute language traps                   16 of 100 (16%)
Stem is a question                           3 of 100 (3%)
Correct answer is unambiguously correct      4 of 100 (4%)
Wrong answers are plausible                  15 of 100 (15%)
Question stem is clear                       17 of 100 (17%)
Explanation actually teaches                 97 of 100 (97%)
Question demand matches lesson goal          34 of 100 (34%)

Glossary

Every check explained — what it looks at, why it matters, how to fix it.


Item structure & format

Right number of choices
What this checks: Each MCQ should have exactly four answer choices.
Why it matters: AP exam MCQs always have four options; deviation breaks student expectations and platform rendering.
How to fix: Edit the item to have exactly four choices.
Has an explanation
What this checks: Every MCQ must include an explanation of why the right answer is right.
Why it matters: Students need to learn from wrong answers — without an explanation, the question only tests, it doesn't teach.
How to fix: Add an explanation field that walks through why the keyed answer is correct.
Correct-answer letter exists
What this checks: The keyed correct-answer letter must actually match one of the choices.
Why it matters: If the key says 'C' but there's no choice C, students cannot ever get credit.
How to fix: Re-check the answer key against the choices.
No placeholder text
What this checks: The item must not contain markers like [TODO], [INSERT], or [PLACEHOLDER].
Why it matters: Placeholders that escape into production are visible to students and signal an unfinished item.
How to fix: Replace the placeholder with the intended content.
No wildly long answer
What this checks: No single answer choice should be more than twice as long as the shortest choice in the same item.
Why it matters: Length asymmetry lets test-savvy students guess the correct answer without reading carefully.
How to fix: Trim the long choice or expand the short ones so the four options are similar in length.
All choices distinct
What this checks: All four choices must be different from each other.
Why it matters: Duplicate choices waste a slot and confuse students.
How to fix: Rewrite duplicate choices.
Stem is a question
What this checks: The stem should end with a question mark or use clear question language ('Which...', 'What...', 'Best describes...').
Why it matters: Statement-style stems leave students guessing what they're being asked.
How to fix: Rewrite the stem as a clear question.
No 'all/none of the above'
What this checks: Choices must not include 'all of the above' or 'none of the above'.
Why it matters: These trap options test reading strategy more than content knowledge and aren't used on the AP exam.
How to fix: Replace the option with a real distractor that targets a misconception.
No absolute language traps
What this checks: Choices must not contain absolute words like 'always', 'never', 'only', 'must' (especially when only the wrong answers contain them).
Why it matters: Test-savvy students learn to eliminate any choice with an absolute, regardless of content.
How to fix: Soften the absolute language, or move it into the correct answer too if it's content-relevant.
Wrong answers have rationale
What this checks: Each wrong answer should be backed by a rationale explaining what misconception it targets.
Why it matters: Distractors written without a rationale tend to be implausible or arbitrary, missing the chance to diagnose student thinking.
How to fix: For each distractor, add a one-line note describing the misconception or partial-understanding it represents.
Wrong-answer feedback shown to students
What this checks: When a student picks a wrong answer, they should see feedback explaining why it's wrong.
Why it matters: Per-distractor feedback is the most direct teaching moment — silence on a wrong answer is a missed opportunity.
How to fix: Add per-choice feedback so students see specific guidance when they choose wrong.
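
The structure and format checks above are deterministic and easy to script. A minimal sketch in Python, assuming each item is a dict with stem, choices, correct, and explanation fields (the field names are illustrative, not the pipeline's actual schema):

    import re

    PLACEHOLDER = re.compile(r"\[(TODO|INSERT|PLACEHOLDER)\]", re.IGNORECASE)
    QUESTION_CUE = re.compile(r"\b(which|what|best describes)\b", re.IGNORECASE)

    def structure_issues(item: dict) -> list[str]:
        """Per-item structure/format problems, mirroring the checks described above."""
        issues = []
        choices = item["choices"]            # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
        texts = list(choices.values())

        if len(choices) != 4:
            issues.append("not exactly four choices")
        if not item.get("explanation", "").strip():
            issues.append("missing explanation")
        if item["correct"] not in choices:
            issues.append("keyed letter has no matching choice")
        if PLACEHOLDER.search(item["stem"] + " " + " ".join(texts)):
            issues.append("placeholder text present")
        if texts and max(map(len, texts)) > 2 * min(map(len, texts)):
            issues.append("one choice is more than twice the length of the shortest")
        if len({t.strip().lower() for t in texts}) != len(texts):
            issues.append("duplicate choices")
        if not (item["stem"].rstrip().endswith("?") or QUESTION_CUE.search(item["stem"])):
            issues.append("stem is not phrased as a question")
        if any("of the above" in t.lower() for t in texts):
            issues.append("'all/none of the above' option")
        return issues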

Course-level answer bias

Longest answer isn't the correct one too often
What this checks: Across the course, the correct answer should NOT be the strictly-longest option (with a meaningful ≥8-char gap to the next-longest) more than ~35% of the time. Chance baseline is 25%. Tiny gaps (1–4 chars) are filtered out as measurement noise — invisible to students.
Why it matters: If the correct answer is consistently and noticeably the longest, students can pass the course by always picking the longest option without learning anything.
How to fix: For items where the correct answer is meaningfully longer, either trim the correct or expand the distractors so they match in length.
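
A sketch of how this signal could be computed over the same illustrative item dicts as above (the ≥8-character gap and ≤35% threshold are the values this report uses):

    def strict_longest_correct_rate(items: list[dict], gap: int = 8) -> float:
        """Share of items whose correct choice is strictly longest by at least `gap` characters."""
        flagged = 0
        for item in items:
            lengths = {letter: len(text) for letter, text in item["choices"].items()}
            correct_len = lengths.pop(item["correct"])
            if correct_len >= max(lengths.values()) + gap:
                flagged += 1
        return flagged / len(items)

    # Pass when strict_longest_correct_rate(items) <= 0.35 (chance baseline is 0.25).
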
Correct answers are spread across A/B/C/D
What this checks: Across the course, correct-answer positions should be roughly evenly distributed across A/B/C/D (statistical chi-square test).
Why it matters: Position skew lets students guess from a pattern (e.g., 'always pick C') instead of from understanding.
How to fix: Re-shuffle the position of correct answers so they're balanced across A/B/C/D.
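
For this course's keyed positions (A 25, B 25, C 32, D 18, with 25 expected per position under uniformity), the statistic reproduces the reported value. A sketch using scipy — an assumption; the pipeline may compute the test differently:

    from scipy.stats import chisquare

    observed = [25, 25, 32, 18]       # correct-answer positions A/B/C/D from this report
    stat, p = chisquare(observed)     # expected counts default to the uniform mean, 25 each
    # stat = (0 + 0 + 49 + 49) / 25 = 3.92, below the p < .01 critical value of 11.34 (3 df),
    # so the position distribution passes.
    print(round(stat, 2))
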
Correct answers aren't systematically longer
What this checks: Across the course, the average length of correct answers should be close to the average length of wrong answers (not more than ~25% longer).
Why it matters: Even if no single item triggers the per-item length check, a systematic length difference across the course rewards length-guessing strategies.
How to fix: Audit the course for items where the correct answer is noticeably longer; equalize lengths.
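
A sketch of the course-level length comparison, again over the illustrative item dicts:

    def answer_length_bias(items: list[dict]) -> tuple[float, float]:
        """Return (mean correct length minus mean distractor length, ratio of the two means)."""
        correct_lens, distractor_lens = [], []
        for item in items:
            for letter, text in item["choices"].items():
                (correct_lens if letter == item["correct"] else distractor_lens).append(len(text))
        mean_correct = sum(correct_lens) / len(correct_lens)
        mean_distractor = sum(distractor_lens) / len(distractor_lens)
        return mean_correct - mean_distractor, mean_correct / mean_distractor

    # Pass when diff <= 8 characters and ratio <= 1.25 (this course: +0.4 and 1.00).
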
Correct answers don't systematically cite more factors than distractors
What this checks: Across the course, no more than ~25% of items should have the correct answer enumerating ≥2 more factors/mechanisms (counted via 'and' + comma joiners) than ALL distractors. A +1 factor gap is treated as counting noise (e.g., appositive commas counted as list items) and ignored.
Why it matters: Even when option lengths are balanced, a systematic 'correct answer integrates more factors' pattern lets test-savvy students pick the most-elaborated option without engaging with content.
How to fix: In the distractor generator, instruct: distractors must cite the SAME NUMBER of factors/mechanisms/criteria as the correct answer — distractors are wrong because the factors are wrong, not because there are fewer of them.
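
A sketch of the factor-counting heuristic described above ('and' plus comma joiners), with a +1 gap treated as noise:

    import re

    def factor_count(text: str) -> int:
        """Rough factor count: 1 + the number of 'and' / comma joiners in the choice."""
        return 1 + len(re.findall(r",|\band\b", text))

    def elaboration_asymmetry_rate(items: list[dict]) -> float:
        """Share of items where the correct answer cites >= 2 more factors than every distractor."""
        flagged = 0
        for item in items:
            counts = {letter: factor_count(text) for letter, text in item["choices"].items()}
            correct = counts.pop(item["correct"])
            if counts and all(correct >= c + 2 for c in counts.values()):
                flagged += 1
        return flagged / len(items)

    # Pass when elaboration_asymmetry_rate(items) <= 0.25; a +1 gap is ignored as counting noise.
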
Distractors don't carry absolute language the correct answer lacks
What this checks: Across the course, no more than ~25% of items should have ≥2 distractors containing hard absolutes (always, never, only, all, none, every, must) while the correct answer has none.
Why it matters: Test-savvy students eliminate options with absolutes by reflex — if your distractors carry them and your correct answers don't, students can pass without content knowledge.
How to fix: Either (a) instruct the distractor generator to avoid hard absolutes entirely, or (b) ensure the correct answer carries an absolute when distractors do (rare but valid for content reasons).
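
A sketch of the absolute-language asymmetry count, using the word list this check names:

    import re

    ABSOLUTES = re.compile(r"\b(always|never|only|all|none|every|must)\b", re.IGNORECASE)

    def absolute_asymmetry_rate(items: list[dict]) -> float:
        """Share of items with >= 2 absolute-laden distractors while the correct answer has none."""
        flagged = 0
        for item in items:
            correct_text = item["choices"][item["correct"]]
            distractors_with_absolutes = sum(
                1 for letter, text in item["choices"].items()
                if letter != item["correct"] and ABSOLUTES.search(text)
            )
            if distractors_with_absolutes >= 2 and not ABSOLUTES.search(correct_text):
                flagged += 1
        return flagged / len(items)

    # Pass when absolute_asymmetry_rate(items) <= 0.25 (this course: 0.0%).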

Question quality (AI review)

Correct answer is unambiguously correct
What this checks: An AI reviewer (Claude Sonnet) confirms that the keyed answer is the unambiguous best answer (no other choice is equally defensible).
Why it matters: Items with two defensible answers frustrate strong students and undermine the validity of the question.
How to fix: Tighten the stem or the distractors so only one answer is defensible.
Wrong answers are plausible
What this checks: An AI reviewer confirms that each wrong answer would be picked by a student with a real misconception (not absurd or trivially obvious).
Why it matters: Implausible distractors make the item easy to guess and don't help diagnose what students don't understand.
How to fix: Replace weak distractors with ones grounded in known student misconceptions for the topic.
Question stem is clear
What this checks: An AI reviewer confirms the stem is self-contained, unambiguous, and doesn't cue the answer.
Why it matters: Ambiguous stems test reading skill rather than the targeted content knowledge.
How to fix: Rewrite the stem to be specific, complete, and free of giveaway cues.
Explanation actually teaches
What this checks: An AI reviewer confirms the explanation explains WHY the correct answer is correct AND WHY each distractor is wrong (not just 'Correct!').
Why it matters: Trivial explanations like 'Correct!' miss the most valuable teaching moment in the entire item.
How to fix: Rewrite the explanation to walk through the reasoning for the correct answer and address the most likely wrong-answer choices.
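
The four checks above are judged by an AI reviewer rather than by deterministic rules. Purely to illustrate the shape of such a check, a sketch using the Anthropic Python SDK; the model id, prompt wording, and PASS/FAIL response format are assumptions, not the pipeline's actual reviewer configuration:

    from anthropic import Anthropic

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def review_keyed_answer(stem: str, choices: dict, correct: str) -> str:
        """Ask a Claude Sonnet reviewer whether the keyed answer is unambiguously correct (illustrative)."""
        prompt = (
            "Review this AP Human Geography multiple-choice item.\n"
            f"Stem: {stem}\n"
            + "".join(f"{letter}. {text}\n" for letter, text in choices.items())
            + f"Keyed answer: {correct}\n\n"
            "Is the keyed answer the single unambiguous best answer? "
            "Reply PASS or FAIL, then one sentence of rationale."
        )
        message = client.messages.create(
            model="claude-sonnet-4-5",  # assumption: any current Claude Sonnet model id
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text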

Lesson↔question alignment (AI review)

Question demand matches lesson goal
What this checks: An AI reviewer compares the cognitive level the question demands (recall vs analyze vs synthesize) to the level the lesson is meant to build.
Why it matters: If the lesson teaches mechanism but the questions only test vocabulary, students 'pass' without learning the targeted skill.
How to fix: Either upgrade the question to require the targeted cognitive level, or move the question to a recall-focused lesson.
Article supports the question
What this checks: For items linked to an article/passage, an AI reviewer confirms the article contains what's needed to answer the question.
Why it matters: If the article doesn't teach what the question tests, the practice is broken — students are being tested on content they weren't given.
How to fix: Either expand the article to cover the missing content, or move the question to a different article that teaches it.

Course pedagogy (AI review)

Course is right-scoped for the gap
What this checks: An AI curriculum reviewer judges whether the course is tightly scoped to the named gap (not over-bloated, not so thin it misses the gap).
Why it matters: Hole-filling courses should fill the specific gap; rebuilds and stretched-thin courses both miss the point.
How to fix: Trim irrelevant content if over-broad, or add focused articles/practice if under-scoped.
Articles teach what practice tests
What this checks: An AI reviewer judges whether articles teach a concept first, then practice tests that concept, in a clear sequence.
Why it matters: Random ordering or orphan questions break the teaching loop — students get tested on things they were never taught.
How to fix: Reorder so each concept appears in an article before the related practice item.
Course names and corrects misconceptions
What this checks: An AI reviewer judges whether the course explicitly names the likely wrong-thinking patterns and corrects them.
Why it matters: Hole-filling exists because the student got something wrong — generic content review without naming the misconception is unlikely to fix it.
How to fix: Add explicit 'common mistake' or 'why students confuse X with Y' sections to articles.
Question demand matches the AP skill
What this checks: An AI reviewer judges whether the cognitive demand of the questions matches the AP skill being remediated.
Why it matters: A 'mechanism' gap requires explain-the-mechanism items; a 'discrimination' gap requires distinguishing-cases items.
How to fix: Re-author questions to target the specific AP skill type the course names.
No pattern-matching shortcuts
What this checks: An AI reviewer judges whether students CAN'T pass the exit ticket via shortcuts like answer-length bias or keyword cueing.
Why it matters: If the student can pass without learning the concept, the course's value is zero regardless of how good the teaching is.
How to fix: Audit for length bias, keyword cueing in stems, and vocabulary lifted verbatim from articles into correct answers.

FRQ quality

Prompt is real (not a placeholder)
What this checks: FRQ stems must not be placeholders like 'test prompt' or 'TODO'.
Why it matters: Placeholder prompts ship to students and make the course visibly broken.
How to fix: Replace the placeholder with the actual FRQ prompt.
FRQ has a rubric
What this checks: Every FRQ must have a non-empty rubric.
Why it matters: Without a rubric, the AI grader has no scoring criteria and student responses cannot be graded.
How to fix: Author a rubric with point allocations and accepted answer paths.
Rubric uses criteria language
What this checks: The rubric should mention point allocations (e.g., '1 point') or scoring criteria language.
Why it matters: Free-form rubric text without explicit criteria is hard for the grader to apply consistently.
How to fix: Restructure the rubric into explicit criteria with point values.
Autograder URL is set
What this checks: Each FRQ must point to an autograder URL.
Why it matters: Without a grader URL, FRQ submissions go nowhere — students don't get scores or feedback.
How to fix: Wire the FRQ to a configured autograder.
Autograder URL is well-formed
What this checks: The autograder URL must use a valid scheme and not contain typos like the old /api/ prefix or doubled https://.
Why it matters: Malformed URLs silently fail to grade — the platform does not warn you.
How to fix: Verify the URL against the canonical grader endpoint format.
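
A sketch of this URL validation; the report doesn't define the canonical grader endpoint format, so these rules only cover the problems named above:

    from urllib.parse import urlparse

    def autograder_url_problems(url: str) -> list[str]:
        """Flag the malformations this check names; it does not verify the canonical endpoint."""
        problems = []
        if not url:
            problems.append("no autograder URL set")
            return problems
        if url.count("://") > 1:
            problems.append("doubled scheme (e.g. 'https://https://...')")
        parsed = urlparse(url)
        if parsed.scheme not in ("https", "http"):
            problems.append("missing or invalid scheme")
        if "/api/" in parsed.path:
            problems.append("old /api/ prefix in the path")
        return problems
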
FRQ marked as required response
What this checks: The interaction element must have required="true" so students can't skip it.
Why it matters: Without required="true", students can advance without responding.
How to fix: Set the required attribute on the FRQ interaction.
Expected response length set
What this checks: The FRQ should declare an expected number of lines so the response box is sized correctly.
Why it matters: Without a sizing hint, students see a tiny box for a long-essay prompt and write less than they should.
How to fix: Add the expected-lines attribute matching the rubric's expected response length.
FRQ outcome declarations are canonical
What this checks: The FRQ should declare the canonical set of outcome variables (API_RESPONSE, FEEDBACK_VISIBILITY, GENERATED_FEEDBACK, SCORE).
Why it matters: Non-canonical outcomes prevent the grader from writing scores back or showing generated feedback to students.
How to fix: Migrate the FRQ XML to the canonical outcome declaration pattern.
Rubric is in the correct QTI location
What this checks: The rubric must live in <qti-rubric-block> inside <qti-item-body> per the QTI spec — not in metadata.rubric or metadata.modelAnswer.
Why it matters: Rubrics in non-standard fields are invisible to the platform's grading tooling and to graders who follow the spec — the rubric content might exist but won't be applied where it counts.
How to fix: Move the rubric content from metadata.rubric (or metadata.modelAnswer) into a <qti-rubric-block use="ext:criteria" view="scorer"> element inside <qti-item-body>, with one block per criterion.
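
Several of the structural checks above (required response, expected response length, rubric location) can be run as one pass over the FRQ's QTI XML. A rough sketch, assuming the FRQ's interaction is a qti-extended-text-interaction element (the report only says "the interaction element", so that name is an assumption):

    import xml.etree.ElementTree as ET

    def local_name(tag: str) -> str:
        """Strip any XML namespace prefix from an element tag."""
        return tag.rsplit("}", 1)[-1]

    def frq_structure_problems(qti_xml: str) -> list[str]:
        """Check required="true", expected-lines, and rubric placement inside qti-item-body."""
        problems = []
        root = ET.fromstring(qti_xml)
        elements = list(root.iter())

        interactions = [e for e in elements if local_name(e.tag) == "qti-extended-text-interaction"]
        if not any(e.get("required") == "true" for e in interactions):
            problems.append('interaction is not marked required="true"')
        if not any(e.get("expected-lines") for e in interactions):
            problems.append("no expected-lines sizing hint on the interaction")

        bodies = [e for e in elements if local_name(e.tag) == "qti-item-body"]
        rubric_in_body = any(
            local_name(child.tag) == "qti-rubric-block"
            for body in bodies for child in body.iter()
        )
        if not rubric_in_body:
            problems.append("rubric is not in a <qti-rubric-block> inside <qti-item-body>")
        return problems
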
FRQ uses autograding (preferred)
What this checks: The FRQ should be configured for autograding rather than manual/human scoring.
Why it matters: Manual scoring is allowed, but autograded FRQs scale better and give students faster feedback — autograding is preferred where feasible.
How to fix: If the FRQ is currently set to scoringType="manual" or requiresHumanScoring=true, consider building an autograder for it.
FRQ prompt is clear
What this checks: An AI reviewer confirms the prompt is unambiguous and well-specified (clear task verb, scope, expected response form).
Why it matters: Ambiguous prompts get a wide range of responses that the rubric can't fairly score.
How to fix: Tighten the prompt's task verb, add scope constraints, and specify response form (essay/list/diagram).
Rubric scores what the prompt asks
What this checks: An AI reviewer confirms the rubric criteria map to what the prompt asks (no orphan criteria, no unscored prompt parts).
Why it matters: Misalignment means students do what the prompt asks but get scored on something else.
How to fix: Walk through the prompt and rubric line by line; ensure each prompt part has a scoring path.
FRQs collectively serve the course goal
What this checks: An AI curriculum reviewer judges whether the FRQs in the course (taken together) actually work toward the course's named goal — and that no FRQ ships with broken/placeholder content.
Why it matters: Even if individual FRQs are fine, an off-target FRQ set or any single broken FRQ undermines the whole course.
How to fix: Re-author off-target FRQs to match the gap; fix or remove any broken FRQs.