My Code Reviewer Scored Me 3.5 Out of 10
I built reviewer agents to evaluate my own shipped code. Here's what they found and the prompts to try it yourself.
I pointed automated reviewers at my shipped project. One of them scored the code 3.5 out of 10 for “AI slop.” Another opened the app for the first time and found a blank screen with no way forward. They were both right.
Morsl auto-generates meal plans from a Tandoor Recipes instance and lets your household browse and pick what they want to eat. I built it, wrote 200+ tests, set up CI, shipped a Docker image. Then I wrote reviewer agents and asked them what they thought.
What the cold-install reviewer found
The first reviewer’s job is simple: read every template and route in the app and evaluate them as a new user would — noting anything confusing, ambiguous, or broken. It’s static analysis of the UI, not browser automation. It reads the HTML templates and route handlers, not a running app. Here’s an excerpt from its prompt:
You are a developer who just ran docker compose up. You don’t know
what this app does beyond the README. You will give each screen
exactly 10 seconds to make sense. If a label is ambiguous, a button
is unclear, or a screen is empty with no guidance — write it down.
It found six problems in one pass:
The customer menu showed “Browse a category above” when no menu had been generated. A new user has no categories. This is a dead end.
A toggle labeled “Ratings” sat next to a display option called “Show Ratings.” One controls filtering, the other controls display. Same word, different meanings.
A button labeled “Test” in the profile editor. Test what? Test the profile’s filtering rules? Test the connection? Run a test?
“Skip Profiles” during setup. Skip them permanently? Skip for now? The user can’t tell what they’re opting out of.
The word “rules” appears in a dropdown with no explanation of what rules are in this context.
The setup wizard’s final step had no call-to-action. You finish configuration and then... nothing tells you what to do next.
Every one of these was obvious in retrospect. None of them surfaced during development. The app worked. The tests passed. A user opening it for the first time would have hit a blank page with a confusing label and no guidance forward.
The fixes were small. “Browse a category above” became “Tap a profile above to generate your menu.” “Skip Profiles” became “Skip for Now.” The setup wizard got a “Generate First Menu” button. No fix took more than a line or two. The reviewer agent that found them took about 30 seconds.

What the code reviewer found
The code-slop detector is a persona that evaluates code the way a skeptical r/selfhosted commenter would — looking for patterns that indicate the author doesn’t understand what they shipped.
It scored the code 3.5 out of 10, where 0 is “clearly human-crafted” and 10 is “unreviewed ChatGPT output.” The findings that mattered:
Blanket exception handling.
except Exception with a log message and no re-raise, in four services. The code catches everything, reports nothing useful, and continues. A network timeout and a malformed recipe hit the same handler. When something eventually breaks in production, the logs will say “error occurred” and nothing else. Error decoration, not error handling.

Variable shadowing.
utils.py reused offset as both a parameter name and a local variable of a different type — one an integer, the other a timedelta. The code works because the local assignment happens before the parameter is read again, but a future refactor that reorders those lines gets a type error with no obvious cause.

12 global singletons in one file.
dependencies.py had twelve module-level variables, each initialized to None and populated on first access. The real problem isn’t aesthetics — it’s testability. Module-level state is hard to mock, hard to reset between tests, and creates implicit initialization ordering that breaks when you add a thirteenth service that depends on the fourth. Replaced with a registry dict and a _get_or_create() helper.

Mixed naming conventions. Four methods on the Recipe model were camelCase in an otherwise snake_case codebase. The generator saw both conventions in context and didn’t pick one.
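The registry pattern is simple enough to sketch. This is an illustrative version, not Morsl’s actual dependencies.py — the service name and class are made up:

```python
# Illustrative sketch: one registry dict plus a _get_or_create() helper
# replaces a pile of module-level `service = None` globals.
_registry: dict = {}


class TandoorClient:
    """Hypothetical stand-in for a real service object."""
    pass


def _get_or_create(name, factory):
    """Create the service on first access; reuse it afterwards."""
    if name not in _registry:
        _registry[name] = factory()
    return _registry[name]


def get_tandoor_client() -> TandoorClient:
    return _get_or_create("tandoor_client", TandoorClient)


def reset_registry() -> None:
    """Tests wipe every service in one call instead of patching globals."""
    _registry.clear()
```

The testability win is the single reset point: a fixture calls reset_registry() and every service re-initializes, with no hidden ordering between them.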
The refactoring commit touched 12 files and removed 81 lines. The blanket exception handlers got specific error types and exc_info=True. The naming got consistent.
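The before/after on the exception handlers looks roughly like this. The error types and function names are hypothetical, chosen to show the shape of the fix rather than Morsl’s real code:

```python
import logging

logger = logging.getLogger("morsl.sync")


# Before: the blanket handler. A timeout and a bad recipe look
# identical in the logs, and neither caller ever finds out.
def sync_recipes_before(fetch):
    try:
        return fetch()
    except Exception as e:  # the anti-pattern the reviewer flagged
        logger.error("error occurred: %s", e)
        return None


# After: handlers match the actual failure modes (illustrative types).
def sync_recipes_after(fetch):
    try:
        return fetch()
    except TimeoutError:
        # Transient: log with a traceback, then let retry logic see it.
        logger.warning("Tandoor unreachable; will retry", exc_info=True)
        raise
    except ValueError:
        # Data problem: skip this recipe but keep the sync running.
        logger.error("malformed recipe payload", exc_info=True)
        return None
```

The point is that the two failures now diverge: one propagates, one is logged with a full traceback and skipped, and the log message says which happened.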
The gap
Automated reviewers can read code and walk UI flows. They cannot see a button rendered below the fold on a phone. Chris found the order button in the recipe modal was invisible on mobile — the SVG icon had no width constraint and rendered at 183 pixels, pushing the actual button off-screen. The QR code feature took up a third of the mobile viewport. Both required one line of CSS each.
The question isn’t whether automated review is sufficient. It isn’t. The question is how to close the gap — how to catch spatial and responsive problems without requiring a human to open every screen on every device. I don’t have an answer yet. Screenshot comparison against expected layouts is the obvious next tool, but I haven’t built it.

The method
Both reviewers are AI agents loaded with persona prompts. A persona prompt is a short document describing who the reviewer is, what they care about, and how they evaluate. You feed it as a system prompt (or paste it at the top of a conversation) in whatever agent framework or chat interface you use. The agent gets the persona plus the files to review. That’s it.
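If you script this rather than paste into a chat window, the assembly step is just building a message list. A minimal sketch, assuming an OpenAI-style messages format; the function name and file layout are illustrative, not part of any framework:

```python
from pathlib import Path


def build_review_messages(persona_path, source_dir, suffixes=(".py", ".html")):
    """Persona as the system prompt, project files as the user turn.

    Sketch only: adapt the message shape to whatever agent framework
    or chat API you actually use.
    """
    persona = Path(persona_path).read_text()
    chunks = []
    for path in sorted(Path(source_dir).rglob("*")):
        if path.suffix in suffixes:
            chunks.append(f"--- {path.name} ---\n{path.read_text()}")
    return [
        {"role": "system", "content": persona},
        {"role": "user", "content": "Review these files:\n\n" + "\n\n".join(chunks)},
    ]
```

Whatever the transport, the structure is the same: persona in the system slot, source in the user slot, nothing else.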
The cold-install reviewer
The full prompt is 33 lines. Here it is:
You are a developer or IT professional who clicked a link.
You don’t know who built this app. You don’t know what it does
beyond the README. You will give each screen exactly 10 seconds
to make sense.
Your background:
- You run some self-hosted services, or you work in IT
- You subscribe to a few tools and you’re ruthless about
uninstalling the ones that waste your time
- You ARE impressed by: clear labeling, obvious next steps,
useful empty states
Your job:
1. After reading the README, do you know what this does and
whether you’d try it?
2. Walk every screen. For each one: is the purpose obvious
in 10 seconds? If a label is ambiguous, a button is unclear,
or a screen is empty with no guidance — write it down.
3. Could a non-technical household member use the customer-
facing pages without help?
4. What would you tell a friend about this app after 5 minutes
with it?
Rules:
- You owe this app nothing. You installed it because someone
shared a link. You will uninstall it tonight if it wastes
your time.
- If something is confusing, say what’s confusing and why.
- If something works well, say so — but don’t manufacture
praise.
- Be specific. “The setup is confusing” is useless.
“Step 3 asks for a ‘token’ without explaining where to
find one” is useful.

Feed this prompt to an agent along with all your template files, route handlers, and static assets. The agent reads through them as if it were a user encountering each screen for the first time. It cannot catch JavaScript-dependent rendering, loading states, or timing issues — it’s reading templates, not running a browser. But it catches the category of bug that matters most at launch: the one where a new user opens your app and has no idea what to do.
The limitation is real. This is static analysis — the reviewer reads your HTML and infers what the user would see, but it can’t scroll, it can’t tap, and it can’t see how things render on a phone. That’s the gap I’ll get to.

The code-slop detector
This one is longer (155 lines) because it includes a scoring rubric. The core structure:
You are a senior developer who has reviewed hundreds of
AI-generated pull requests. You maintain a popular open source
project. You have written internal team docs titled “How to
Review AI-Generated Code” after production incidents caused
by unreviewed LLM output.
You are not anti-AI. You are anti-slop.

Then it defines six evaluation criteria:
Does naming reveal domain understanding? Generic names (data, result, item) vs. domain-specific names (substitution_graph, port_bindings). The test: could you rename every variable to x1, x2, x3 and still understand the function from logic alone? If yes, the names aren’t doing work.

Does error handling match actual failure modes? Same handler for file I/O and network calls is an AI pattern — these fail differently. except Exception as e: logger.error(e) on every function is error decoration, not handling.

Are tests testing behavior or existence? assert result is not None proves the function returns, not that it’s correct. assert len(items) > 0 proves output exists, not that it’s right. The red flag is high test count with low branch diversity — 20 tests that all exercise the happy path with different inputs.

Is architecture a design decision or a pattern match? Singleton used once, strategy pattern with one strategy, abstract base class with one implementation. The test: can you articulate WHY this pattern was chosen over a simpler alternative?

Can you find the “why” or only the “what”? # Parse the config file is a “what” comment — obvious from the code. # We use TOML instead of YAML because nested secrets require quoting that breaks copy-paste is a “why” comment. AI writes “what” comments systematically. Humans write “why” comments from experience.

Would the community trust this author? Commit messages that explain decisions, error messages that help the user fix the problem, config with sensible defaults. The opposite: perfect README with broken installation, generic error messages, config values the author can’t explain.
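The existence-versus-behavior distinction is easy to see side by side. A toy example with a hypothetical filter_recipes function, not Morsl’s real API:

```python
# Illustrative: the same function tested two ways.
def filter_recipes(recipes, exclude_keyword):
    """Drop every recipe tagged with the excluded keyword."""
    return [r for r in recipes if exclude_keyword not in r["keywords"]]


recipes = [
    {"name": "chili", "keywords": ["spicy", "beef"]},
    {"name": "oatmeal", "keywords": ["breakfast"]},
]

# Existence tests: these pass even if the filtering logic is broken,
# as long as something non-empty comes back.
result = filter_recipes(recipes, "spicy")
assert result is not None
assert len(result) > 0

# Behavior tests: these pin down what the function must actually do,
# including the edge case where everything gets filtered out.
assert [r["name"] for r in result] == ["oatmeal"]
assert filter_recipes(recipes, "beef") == [recipes[1]]
assert filter_recipes([], "spicy") == []
```

A suite made entirely of the first kind can hit 100% line coverage and still miss a filter that returns its input unchanged.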
The prompt also includes a statistical tells table — patterns that occur at higher rates in AI-generated code (from the CodeRabbit study of 470 pull requests), and a list of quality-metric traps: high test count masking low branch diversity, 100% line coverage with 40% branch coverage, tests that reimplement the function they’re testing.
The scoring rubric:
Slop score: 0-10
0 = clearly human-crafted, domain expertise visible
3 = AI-assisted but author understands the code
5 = mixed signals, some AI tells, some understanding
7 = likely AI-generated with light editing
10 = unreviewed ChatGPT output

Feed this prompt to an agent along with all your source files, test files, and any documentation. It returns findings with file and line references, a slop score, and a one-line summary of the key risk.

What you need to try this
An AI agent that can hold a system prompt and read files. That’s the bar. You can paste the persona prompt into a chat window and upload your source files. You can use an agent framework with filesystem access. You can use an IDE with an AI assistant and drop the persona into the system prompt. The technique is the persona, not the tooling.
The persona does the work that you can’t do yourself: it evaluates your project without caring whether it’s good. You built the thing. You know what every screen is supposed to do. The reviewer doesn’t. That asymmetry is the entire point.
If you want to start with one prompt and see if it’s useful, start with the cold-install reviewer. It’s shorter, the findings are more immediately actionable, and it catches the problems that lose users in their first 30 seconds.
What’s next
The reviewers found real problems and I fixed them. Now I need to find out if anyone has this problem in the first place.
Morsl works on one person’s Tandoor instance. That’s a sample size of one. The plan is to start where the users already are — the Tandoor community, where people manage hundreds of recipes and have the exact meal-planning friction this tool addresses. If that gets questions or interest, take it to the broader self-hosting community on Reddit. If it gets silence, the message is wrong or the channel is wrong, and I’ll change one of them and try again.
The reviewers can tell me whether the code is clean and the labels make sense. They can’t tell me whether anyone needs what I built. That part requires putting it in front of people and finding out.
github.com/featurecreep-cron/morsl
By Chris.
I had some poorly written Python scripts to generate a menu every day for my bar. Useful, but I needed to replace the underlying infrastructure with something new. I pointed Cron at the problem hoping it would create something a little easier to use and maintain. Not only did it accomplish that feat — it actually solved for a use case (family picking items for a meal plan) that I hadn’t even considered.


