Give Any AI Model a Verify Loop Before It Answers
The exact four-step verbatim instruction to paste at the top of any complex prompt, in any AI tool
Key points
- The Instruction That Forces a Plan Before an Answer
- What Each of the Four Steps Actually Does
- Why This Beats a First-Draft Answer
- The Five-Minute Before-and-After Test
- When This Is the Wrong Tool
What You'll Learn
- The exact four-step verbatim instruction to paste at the top of any complex prompt, in any AI tool
- Why forcing a model to comprehend, plan, execute, and check outperforms asking for a first-draft answer
- A five-minute before-and-after test method that shows the difference on your own work
- Exactly which tasks this is overkill for, so it does not slow down simple requests
- The documented research behind why plan-then-verify prompting changes output quality
Somewhere in the last year, an Executive reading this sent an AI-drafted board summary, contract clause, or financial projection straight through without a second look, because the first paragraph read clean and confident. Then someone on the other end caught the error. Maybe it was a misread clause, a stat that did not match the source document, or a plan that ignored a constraint mentioned two messages earlier. The output looked finished. It was not checked.
That gap between "looks done" and "is correct" is where high-stakes AI work actually fails. A model asked a complex question tends to produce its first coherent line of reasoning and run with it, the same way a person under time pressure grabs the first plausible answer instead of the best one. For a one-line lookup that is fine. For a pricing model, a client-facing memo, or a legal summary, that first draft is exactly the answer that should not go out unexamined.
What follows is a single block of text, just under 70 words, that changes how a model approaches a hard question before it writes a single sentence of the answer. It needs no new tool or subscription, works in ChatGPT, Claude, Gemini, or Copilot, and takes less time to add than it took to read this paragraph.
The Instruction That Forces a Plan Before an Answer
Paste this at the top of any complex prompt, before the actual question or task. It works in any chat interface, free or paid, because it is plain instruction text, not a special mode or setting.
Before you answer, work through four steps silently: comprehend what I actually need and note any ambiguity, sketch two possible approaches and pick the stronger one, execute carefully and show your work where it matters, then check your own answer against my original request before you reply. Give me the final answer, plus a short list of what you verified and what I should still check myself.
Comprehend
Note what is actually being asked and flag any ambiguity before writing a word of the answer.
Plan
Sketch two possible approaches and commit to the stronger one before executing either.
Execute
Carry out the chosen approach and show the work where it matters, not just the final line.
Verify
Check the answer against the original request and hand back what still needs a human look.
This mirrors the deliberate reasoning discipline built into Anthropic's more advanced reasoning-focused models, generally described as comprehend, strategize, execute, and verify. Those systems get that structure through training and architecture. An Executive using ChatGPT Plus, Claude Pro, Copilot, or Gemini Advanced gets a close approximation of it just by asking for it in plain language, on any of those models, every time, at no extra charge.
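For anyone who sends prompts from a script instead of a chat window, the paste step can be sketched in a few lines of Python. This is a minimal illustration, not part of any AI tool: the names VERIFY_LOOP_INSTRUCTION and with_verify_loop are made up for this example, and the only thing the function does is place the instruction text above the task.

```python
# The instruction text is copied verbatim from the article.
# The constant and function names are hypothetical, chosen for this sketch.
VERIFY_LOOP_INSTRUCTION = (
    "Before you answer, work through four steps silently: "
    "comprehend what I actually need and note any ambiguity, "
    "sketch two possible approaches and pick the stronger one, "
    "execute carefully and show your work where it matters, "
    "then check your own answer against my original request before you reply. "
    "Give me the final answer, plus a short list of what you verified "
    "and what I should still check myself."
)


def with_verify_loop(prompt: str) -> str:
    """Return the prompt with the four-step instruction placed above it."""
    task = prompt.strip()
    if not task:
        raise ValueError("prompt is empty")
    # A blank line keeps the instruction visually separate from the task.
    return f"{VERIFY_LOOP_INSTRUCTION}\n\n{task}"


example = with_verify_loop("Summarize the termination clause in the attached contract.")
print(example)
```

The resulting text can be pasted into any chat interface or passed to whatever client library is already in use.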
What Each of the Four Steps Actually Does
Comprehend. Most bad AI answers are not reasoning failures, they are misread requests. The model answers a nearby question instead of the one asked, because the prompt left something ambiguous and the model picked an interpretation without saying so. Asking it to flag ambiguity first surfaces that gap before it becomes a wrong answer. Research on ambiguous requests backs this up directly: models generally know when a question is underspecified, but left to their own judgment they answer directly instead of flagging it more than 95 percent of the time. Telling the model explicitly to check for false premises and state assumptions has been shown to noticeably change that behavior, because clarifying is not the model's default move unless asked for.
Plan. Sketching two approaches and picking the stronger one is the plan-then-solve step. "Plan-and-Solve Prompting" by Wang and colleagues, published at ACL 2023, tested a closely related idea, devising a plan first and then carrying it out, against standard chain-of-thought prompting (the "let's think step by step" instruction) on math and logic benchmarks. Plain chain-of-thought reasoning tends to fail in three specific ways: calculation errors, missing steps, and misreading what the question asked. Forcing a plan before execution reduced those errors, most clearly in the missing-step and calculation categories, because the model commits to a structure before it starts writing prose.
Execute. "Show your work where it matters" is doing two jobs. It keeps the model from skipping steps in the middle of a complex answer, and it gives an Executive reading the output a visible trail to check against, instead of a confident paragraph with no way to audit how it got there.
Verify. This is the step most people skip when writing their own prompts, and it is the one with the clearest documented evidence behind it. Anthropic's own engineering team tested a dedicated "think" tool, structurally the same idea as the verify step here, on a customer service benchmark called tau-bench. Giving the model a structured pause to check tool outputs and policy compliance before responding, paired with an optimized prompt, took the airline-domain score from 0.370 to 0.570 on the pass^1 metric, a 54 percent relative improvement. The gain held up even when the same task was repeated multiple times in a row, which is the harder bar to clear. On a separate coding benchmark (SWE-bench), the same pause-before-responding step produced a measured, statistically significant improvement (Welch's t-test, p less than .001) across 174 trials. That is not a marketing number. It is a controlled benchmark result Anthropic published with the methodology attached.
54%
Relative score improvement from adding a verify step, on Anthropic's tau-bench airline benchmark
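The 54 percent figure is a relative gain, not a jump of 54 points, and the arithmetic is worth seeing once. The short Python sketch below uses the two airline-domain scores quoted above; the function name relative_improvement is chosen for this example only.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Gain expressed as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


baseline_score = 0.370  # tau-bench airline pass^1 without the pause, as quoted above
improved_score = 0.570  # with the pause, as quoted above

absolute_gain = improved_score - baseline_score
relative_gain = relative_improvement(baseline_score, improved_score)

print(f"absolute gain: {absolute_gain:.3f}")  # 0.200, i.e. 20 points on a 0-1 scale
print(f"relative gain: {relative_gain:.1%}")  # 54.1%
```

The same 0.200 absolute gain would look far smaller as a relative number on a higher baseline, which is why the baseline belongs next to any headline percentage.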
Why This Beats a First-Draft Answer
A separate and widely cited paper, "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., published at ICLR 2023), found that when a model works through a problem multiple independent times and the most common final answer is kept, accuracy on grade-school math problems improved by close to 18 percentage points over standard single-pass chain-of-thought prompting. The mechanism is straightforward: a genuinely correct answer tends to be reachable by more than one line of reasoning, while a wrong answer usually only survives one flawed path. The four-step instruction above builds a lighter version of that same idea into a single pass, by making the model consider two approaches and then explicitly check its own conclusion, instead of relying on one uninterrupted first attempt.
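The agreement check at the heart of self-consistency is simple enough to sketch. The Python below shows only the voting step, under the assumption that several final answers to the same question have already been collected from separate runs; the sample answers are invented for illustration, and majority_answer is a hypothetical helper name, not code from the paper.

```python
from collections import Counter


def majority_answer(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of runs that agreed on it."""
    if not answers:
        raise ValueError("need at least one answer")
    # Light normalization so " 18 " and "18" count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical final answers from five independent runs of one math question.
samples = ["18", "18", "26", "18", " 18 "]
answer, agreement = majority_answer(samples)
print(answer, agreement)  # 18 0.8
```

A low agreement share is itself useful: it marks the questions where a human should look before anything gets sent.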
It is worth being direct about the limits here too. Self-verification is not perfect, and research such as "Large Language Models Cannot Self-Correct Reasoning Yet" (Huang et al., ICLR 2024) has found that a model checking its own work without any outside reference can miss its own errors, sometimes even making an answer worse by second-guessing a correct result. The fix in the instruction above is the "what I should still check myself" line at the end. It does not ask the model to claim certainty. It asks the model to hand back a short, honest list of what it actually verified against your original request, and where you, not the model, are the final check.
The Five-Minute Before-and-After Test
Do not take this on faith. Test it on a real task this week, using this exact method:
1. Pick a complex prompt already sitting in your queue, something like a client proposal, a hiring decision writeup, a pricing analysis, or a contract summary.
2. Run it once as-is, in a fresh chat window, and read the answer once, exactly as a first pass. Note whether you would send that answer without editing it.
3. Open a second fresh chat window, paste the four-step instruction above the same exact prompt, and run it again.
4. Compare the two outputs side by side against one question only: which one would you actually send, unedited, right now?
The difference usually shows up in three places: the second answer states an assumption the first one silently made, the second answer flags a constraint the first one missed, or the second answer includes a verification note that catches something before it reaches a client.
Total setup time: about 20 seconds to paste the instruction, plus the time it takes to run the prompt twice, roughly 2 to 4 minutes depending on task length. No new tool, no subscription upgrade, no settings change.
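For readers who would rather script the comparison than juggle two chat windows, the test reduces to two calls with the same task. The Python sketch below is an assumption-heavy outline: ask_model stands in for whatever function sends a prompt to a model in a fresh conversation and returns its reply, and fake_model exists only so the example runs without any account or API key.

```python
from typing import Callable


def run_before_after(
    task: str,
    instruction: str,
    ask_model: Callable[[str], str],
) -> dict[str, str]:
    """Run the same task twice: once as-is, once with the instruction on top."""
    # Each call is assumed to start a fresh conversation, like a new chat window.
    before = ask_model(task)
    after = ask_model(f"{instruction}\n\n{task}")
    return {"before": before, "after": after}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only reports prompt length."""
    return f"[reply to {len(prompt.split())} words]"


results = run_before_after(
    task="Summarize the contract.",
    instruction="Check your answer first.",  # shortened placeholder, not the full instruction
    ask_model=fake_model,
)
print(results["before"])  # [reply to 3 words]
print(results["after"])   # [reply to 7 words]
```

The judgment call stays manual on purpose: the only question that matters is which of the two replies would go out unedited.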
When This Is the Wrong Tool
Do not use this for a one-line factual lookup, like confirming a date, converting a currency, or checking a simple definition. The four-step structure adds real latency and, in most chat interfaces, real cost, since the model generates more tokens working through the steps even when you only see the final answer. Anthropic's own guidance on a structurally similar technique is explicit on this point: a verification step shows no measurable benefit on single, non-sequential requests where the model's default answer is already good enough.
The instruction earns its cost on prompts with more than one right way to approach them, real ambiguity in what is being asked, or a consequence if the answer is wrong and nobody catches it. It is dead weight on anything short, simple, and low-stakes. In practice:
- Use it for: a financial model, a legal summary, a hiring memo, a client proposal, or a strategy document
- Skip it for: a quick internal Slack message, a one-paragraph email reply, a calendar check, or anything where being wrong costs 30 seconds to fix, not a client relationship or a compliance problem
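The use-it-or-skip-it rule above comes down to three yes-or-no questions, and any single yes is enough. A rough Python sketch of that rule follows; the PromptProfile fields and the function name are invented for this example, and the answers to the three questions still come from the person writing the prompt.

```python
from dataclasses import dataclass


@dataclass
class PromptProfile:
    """Three judgment calls about a prompt, answered by the person sending it."""
    multiple_approaches: bool  # more than one right way to approach it
    ambiguous: bool            # real ambiguity in what is being asked
    costly_if_wrong: bool      # a consequence if an error goes uncaught


def should_use_verify_loop(profile: PromptProfile) -> bool:
    """Any one of the three conditions justifies the extra latency and tokens."""
    return profile.multiple_approaches or profile.ambiguous or profile.costly_if_wrong


contract_summary = PromptProfile(multiple_approaches=True, ambiguous=True, costly_if_wrong=True)
slack_reply = PromptProfile(multiple_approaches=False, ambiguous=False, costly_if_wrong=False)

print(should_use_verify_loop(contract_summary))  # True
print(should_use_verify_loop(slack_reply))       # False
```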
Action Steps Summary
1. Save the instruction. Store the four-step verbatim prompt somewhere you can paste it from in two seconds: a notes app, a text expander, or a pinned message.
2. Run the before-and-after test once. Pick one real, complex prompt from this week and compare the answer with and without the instruction, using the "would I send this unedited" standard.
3. Apply it only to high-stakes prompts. Use it on anything with real ambiguity or real consequences, and skip it entirely on quick lookups and short replies.
4. Read the verification list, not just the answer. The "what you verified and what I should still check" section is the part that tells you where the model's confidence ends and your own review needs to start.
5. Treat it as a default for anything client-facing. Once tested, make this the standing instruction for any prompt whose output goes to a client, a board, or a regulator.