63 skills. 9 critics. 4 quality gates. A NAS in my office. I’m a history teacher.
People ask me how I use AI in my classroom. Usually they’re expecting me to say I paste essay prompts into ChatGPT and use the output to save time on lesson planning.
That’s not what I do.
Over the past year, I’ve built something I didn’t set out to build. It started with a specific problem — I teach AP World History and AP US History, and I grade a lot of essays — and it turned into what I can only describe as a teaching operating system. It runs on Claude, Anthropic’s AI. Not the chatbot interface most people know, but Claude Code, a tool designed for developers.
I’m not a developer. I’m a history teacher at an international school in Shanghai. I didn’t write code before this. I still don’t think of myself as someone who writes code. The problems in my classroom were specific. The tools were good enough.
Here’s what I built.
The Grading Harness
This is the centerpiece. It’s a Python application (yes, I know) that runs my grading of student essays through a five-phase pipeline. I drop a folder of 28 student essays into a directory on my computer. Here’s what happens:
Before anything runs, I do my own first pass: I read each essay, make notes on what I’m seeing, and record my initial assessment of the student’s argument, evidence, and writing. Those notes go into the system as the starting point. The pipeline builds on my read; it doesn’t replace it.
Phase 1: A main grader takes my initial notes and produces three documents, including a glance card with row-by-row rubric scoring and a full breakdown with detailed feedback, AP rubric alignment, and anticipated student questions.
Phase 2: Nine critics review the grader’s work. Six look through specific lenses — does the feedback match the rubric? Is the tone appropriate? Would a student learn from this? Three are adversarial — their job is to find holes, inconsistencies, anything the main grader got wrong. All nine run simultaneously.
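A minimal sketch of how nine critics might run at the same time, assuming each critic is a single model call. This is an illustration, not the harness’s actual code: the critic names, `run_critic`, and `run_all_critics` are hypothetical, and the model call is replaced with a stand-in so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names. The post describes six lens critics and three
# adversarial ones, but only hints at three of the lenses.
LENS_CRITICS = ["rubric_match", "tone", "student_learning",
                "lens_4", "lens_5", "lens_6"]
ADVERSARIAL_CRITICS = ["adversary_1", "adversary_2", "adversary_3"]


def run_critic(name: str, draft: str) -> dict:
    """Stand-in for one model call; a real critic would send the draft to the model."""
    return {"critic": name, "concerns": [f"{name}: placeholder concern"]}


def run_all_critics(draft: str) -> list[dict]:
    """Run every critic on the same draft concurrently and collect the reports."""
    critics = LENS_CRITICS + ADVERSARIAL_CRITICS
    # Threads fit here because a real critic call would mostly wait on the network.
    with ThreadPoolExecutor(max_workers=len(critics)) as pool:
        # pool.map returns results in the same order as the input list.
        return list(pool.map(lambda name: run_critic(name, draft), critics))


reports = run_all_critics("Sample grader draft.")
print(len(reports))  # 9
```

Because every critic sees the same draft and none depends on another’s output, the nine reports can be gathered in roughly the time of the slowest single call.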
Phase 3: A synthesizer reads everything — my original notes, the grader’s drafts, and all nine critic reports — and decides which concerns are real. It revises the drafts based on consensus, not just majority opinion.
Phase 4: Four quality gates. Each one is pass/fail:
- A stress test that runs six adversarial scenarios against the feedback
- An independent expert review — a fresh set of eyes that doesn’t see the critics’ reports
- A parent-perception review: would a parent reading this feedback on their phone at 10 PM be confused, concerned, or offended?
- An admin review: would my principal defend this feedback if a parent complained?
Any gate failure triggers a revision and re-test. If it still fails, the essay gets flagged for my manual review with a file explaining exactly what tripped.
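The revise-and-retest rule can be sketched as a small loop. This assumes each gate returns a pass/fail flag plus a reason, and that one revision is attempted before flagging; `run_gates`, the demo gates, and `soften` are hypothetical names for illustration. The real system also writes a file explaining the failure, which this sketch represents as a returned dictionary.

```python
from typing import Callable

# A gate takes the feedback text and returns (passed, reason).
Gate = Callable[[str], tuple[bool, str]]


def run_gates(feedback: str, gates: dict[str, Gate],
              revise: Callable[[str, str], str]) -> dict:
    """Run each gate in order; on failure revise once and re-test; flag if it fails again."""
    for name, gate in gates.items():
        passed, reason = gate(feedback)
        if passed:
            continue
        feedback = revise(feedback, reason)   # one revision attempt
        passed, reason = gate(feedback)       # re-test the same gate
        if not passed:
            return {"status": "flagged", "gate": name,
                    "reason": reason, "feedback": feedback}
    return {"status": "passed", "feedback": feedback}


# Hypothetical demo gates and reviser.
def parent_gate(text: str) -> tuple[bool, str]:
    return ("lazy" not in text, "the word 'lazy' could offend a parent")


def length_gate(text: str) -> tuple[bool, str]:
    return (len(text) < 200, "feedback is too long to read on a phone")


def soften(text: str, reason: str) -> str:
    return text.replace("lazy", "rushed")


result = run_gates("The thesis is lazy.",
                   {"parent": parent_gate, "length": length_gate}, soften)
print(result["status"], "|", result["feedback"])
```

One simplification to note: after a revision, this sketch does not re-run gates that already passed, which a stricter version might want to do.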
Phase 5: A voice filter checks every output against a custom dictionary of AI-generated language: the hedging, the filler, the false enthusiasm that makes AI writing sound like AI writing. If an output’s score is too high, that output gets rewritten.
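A dictionary-based score like this could be as simple as a weighted phrase count. The sketch below assumes that approach; the phrases, weights, and `THRESHOLD` are invented for illustration, since the post does not show the real dictionary or how its score is computed.

```python
import re

# Hypothetical entries and weights; the author's real dictionary is not shown.
AI_PHRASES = {
    "delve into": 3,
    "it is important to note": 3,
    "great job": 1,
    "overall": 1,
}
THRESHOLD = 4  # hypothetical cutoff


def voice_score(text: str) -> int:
    """Sum the weight of every flagged phrase occurrence, ignoring case."""
    lowered = text.lower()
    return sum(weight * len(re.findall(re.escape(phrase), lowered))
               for phrase, weight in AI_PHRASES.items())


def needs_rewrite(text: str) -> bool:
    """True when the text scores above the cutoff and should be rewritten."""
    return voice_score(text) > THRESHOLD


print(needs_rewrite("Overall, great job! Let's delve into the thesis."))  # True
print(needs_rewrite("Your thesis names a clear cause."))                  # False
```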
The whole pipeline runs about 16 model calls per essay. A class set of 28 essays processes in one batch. The output is three polished Word documents per student, in my voice, calibrated to AP standards.
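One way the figure of about 16 calls could add up, assuming one model call per component and no revision loops. The breakdown is an assumption based on the phases described above, not a confirmed accounting; a gate failure or a voice rewrite would add calls.

```python
# Assumed breakdown: one call per component, no revision or rewrite loops.
CALLS_PER_ESSAY = {
    "main grader": 1,
    "critics": 9,
    "synthesizer": 1,
    "quality gates": 4,
    "voice filter": 1,
}
ESSAYS_PER_CLASS_SET = 28

per_essay = sum(CALLS_PER_ESSAY.values())
per_class_set = per_essay * ESSAYS_PER_CLASS_SET

print(per_essay)      # 16
print(per_class_set)  # 448
```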
I built this because I was spending 15–20 hours per week grading essays. I still read every essay like I always have, and I still review every output. The difference is that I’m reviewing a first draft that’s already been stress-tested by nine critics and four quality gates, not building feedback from zero. The time that frees up goes to talking to students about their writing.
A word on data: I know FERPA. I know PIPL. I know the regulations that govern how student information can be used in US and Chinese educational contexts. Student names, IDs, and any personally identifying details never touch this workflow. That wasn’t an afterthought — the system is built around that constraint from the ground up.
63 Skills
Claude Code lets you create reusable “skills” — saved workflows you invoke with a slash command. I have 63 of them. Some examples:
/ap-grade runs the full grading harness from any folder.
/today-class pulls up today’s lesson plan and materials.
/curriculum-builder runs a multi-agent pipeline that plans, drafts, critiques, and synthesizes lesson plans — with its own quality gates.
/email-batch scans my Gmail, classifies every unread message, drafts replies in my voice, and saves them as Gmail drafts. It only touches my personal Gmail — not the school’s system, so no student or parent information moves through it. It never sends anything; I review every draft. Instead of spending 45 minutes on email, I spend 10 minutes reviewing and clicking send.
/deep-research runs a 13-agent research pipeline for going deep on a topic before building a lesson around it.
I have skills for writing feedback, building assessments, generating discussion questions, creating sub plans, drafting letters of recommendation, processing PDFs, building presentations, and about forty other things.
Each one was built to solve a specific problem I had in my classroom. None of them existed when I started. I described what I needed, Claude Code helped me build it, and now I use them every day.
30+ Automation Hooks
This is the part that surprises people. Claude Code can run automated scripts triggered by specific events. I have hooks on almost every available event type — over 30 of them.
The one I’m proudest of: when I type a prompt, a hook automatically analyzes what I’m asking and adjusts how deeply Claude thinks before responding. A simple question gets a quick answer. A complex grading task gets deep reasoning. I don’t have to specify — it figures it out.
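The routing decision inside a hook like that can be pictured as a small classifier. This is a sketch under stated assumptions: the keyword list, the word-count cutoffs, and the three depth labels are all hypothetical, and how the result is handed back to Claude Code is left out because the post does not describe it.

```python
# Hypothetical signals that a prompt needs deeper reasoning.
DEEP_KEYWORDS = ("grade", "rubric", "essay", "critique", "synthesize")


def choose_depth(prompt: str) -> str:
    """Return 'quick', 'medium', or 'deep' from keyword hits and prompt length."""
    words = prompt.lower().split()
    # Count the words that contain any of the keywords.
    hits = sum(any(keyword in word for keyword in DEEP_KEYWORDS) for word in words)
    if hits >= 2 or len(words) > 80:
        return "deep"
    if hits == 1 or len(words) > 25:
        return "medium"
    return "quick"


print(choose_depth("What time is third period?"))              # quick
print(choose_depth("Draft a rubric for Friday"))               # medium
print(choose_depth("Grade this essay against the AP rubric"))  # deep
```

A keyword heuristic is the simplest possible version; the real hook may well use the model itself to judge complexity.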
Other hooks handle safety checks before tools execute, load project context when I open a session, preserve important information when conversations get long, and track task progress automatically.
A NAS in My Office
There’s a Synology NAS — basically a small server — sitting next to my router. It runs Claude 24/7 in a Docker container. Every morning at 7 AM, it generates a storage report. It organizes my downloads. It scans my photo library. On Sundays, it sends me a weekly digest.
Why I’m Writing This
I’m not writing this to brag. I’m writing it because I think most conversations about AI in education are happening at the wrong altitude.
The debate is usually: “Should teachers use AI?” or “Is AI-generated work cheating?” Those are fine questions. But they assume AI is a thing you occasionally consult — like a search engine or a calculator.
What I built is different. It’s not a tool I use sometimes. It’s infrastructure. It’s the plumbing underneath my teaching practice. The grading pipeline doesn’t replace me — it gives me 15 hours a week back. The skills don’t think for me — they handle the mechanical parts so I can focus on the human parts.
I’m a history teacher. I didn’t study computer science. I didn’t take a bootcamp. I didn’t have an engineering friend build this for me. I described problems in plain English, and the tools were good enough that a non-technical person could build production-grade systems.
If I can do this, other teachers can too. The conversation about whether AI belongs in education has run its course. The more interesting question: what would you build if you had the tools?
Two years ago, my own setup looked like this: open a chat window, paste a question, read the answer, close the tab. There’s a difference between AI as a tool you occasionally pick up and AI as infrastructure that runs underneath everything. We’re at the beginning of something. I built one version of what it looks like. I’d love to see yours.
If you’re thinking through how AI might fit into your own practice — not as an occasional tool, but as something built around the work — my book goes deeper. The AI Doesn’t Know Your Students is available on Amazon and at shouldiuse.ai/book.
David Jacobson teaches AP World History and AP US History at an international school in Shanghai. He’s the author of The AI Doesn’t Know Your Students and writes weekly at shouldiuse.ai.
