When someone on your team tells you "just add 'take a deep breath and work through this step by step' to your prompts, results go up," the first reaction is to assume they're joking. The phrase is too corny. The model doesn't breathe. It sounds like pseudoscience.
It isn't pseudoscience. It's published research from Google DeepMind, with follow-ups from a dozen other labs. The same goes for the other tricks that sound silly but work: adding a detailed expert persona, framing the task with stakes, adding a fake tip, challenging the model to prove itself, and asking it to rate its own confidence.
The Prompt Enhancer wraps your prompt in these techniques so you can use them without memorizing which phrase belongs to which paper. This post walks through the research so you know which layers actually help you and which ones are overkill for your task.
The five techniques, in the order I'd adopt them
1. Detailed personas, not generic ones
"You are a helpful assistant" does close to nothing. "You are a senior backend security reviewer with fifteen years of experience reviewing authentication code, with specific expertise in OAuth 2.1, session rotation, and timing attacks" changes the output substantially.
The research is called ExpertPrompting (Xu et al., 2023). The authors trained a model variant on detailed persona prompts and got about 96% of reference ChatGPT capability on evaluated benchmarks. A companion paper, Better Zero-Shot Reasoning with Role-Play Prompting (Kong et al., NAACL 2024), showed accuracy on a Last Letter task rising from 23.8% to 84.2% with detailed role-play framing. Roughly a 3.5-fold improvement for a prompt change that costs maybe fifty tokens.
The mechanism is pattern matching. The model saw a lot of expert writing during training. Priming it as that expert steers the output distribution toward the higher-effort register.
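Mechanically, the technique is just prefixing. A minimal sketch in Python (the function name and persona wording are my own illustration, not from the paper):

```python
def with_expert_persona(task: str, role: str, expertise: list[str]) -> str:
    """Prefix a task with a detailed expert persona.

    The detail does the work: a concrete role plus named expertise
    areas steers output far more than a generic 'helpful assistant'.
    """
    persona = (
        f"You are a {role} with specific expertise in "
        f"{', '.join(expertise)}."
    )
    return f"{persona}\n\n{task}"

prompt = with_expert_persona(
    "Review this login handler for vulnerabilities.",
    role="senior backend security reviewer",
    expertise=["OAuth 2.1", "session rotation", "timing attacks"],
)
```

The seniority marker and the named subfields matter more than the word "expert" itself; they select the register.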
2. "Take a deep breath, step by step"
The line that seems the most ridiculous is the best-documented. Google DeepMind's OPRO paper (Yang et al., 2023) used an LLM to automatically search for prompt phrasings that maximize accuracy on math benchmarks. The phrase the system converged on wasn't "think carefully" or "be rigorous." It was "Take a deep breath and work on this problem step by step."
On GSM8K math problems the reported results were:
- Plain prompt: ~34% accuracy
- "Let's think step by step": ~71.8%
- "Take a deep breath and work on this step by step": ~80.2%
The breath phrase adds a small further bump above basic chain-of-thought because it's specifically correlated in the training data with deliberate, methodical output. It's not meditation; it's a trigger for the chain-of-thought register.
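In code this is a literal suffix. A sketch (the helper name is mine; the phrase itself is worth keeping verbatim, since that exact wording is what was benchmarked):

```python
# The exact phrase OPRO converged on; keep it verbatim, since this
# is the wording that was benchmarked.
BREATH_PHRASE = "Take a deep breath and work on this problem step by step."

def with_cot_trigger(task: str, phrase: str = BREATH_PHRASE) -> str:
    """Append a chain-of-thought trigger to a task prompt."""
    return f"{task}\n\n{phrase}"

prompt = with_cot_trigger("If a train leaves at 3pm traveling 60 mph...")
```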
3. Emotional framing and challenge framing
The EmotionPrompt paper (Li et al., ICLR 2024 Spotlight) tested eleven emotional stimulus prompts across multiple LLMs. The phrases that worked included:
- "This is very important to my career."
- "You'd better be sure."
- "Take pride in your work and give it your best."
- "Embrace challenges as opportunities for growth."
Reported improvements range from about 8% on instruction-following benchmarks up to ~115% relative improvement on complex reasoning tasks versus baseline. The effect is strongest on hard multi-step problems and weakest on easy factual queries, which fits the pattern-match theory: hard-problem territory in the training data is where careful, high-stakes language lives.
Challenge framing ("I bet you can't solve this cleanly; prove me wrong") is one of the stronger variants. The model doesn't feel challenged; it produces outputs similar to what high-challenge language was paired with in training.
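These stimuli are also plain suffixes. A sketch using the phrases quoted above (the dictionary keys and the selector function are my own illustration):

```python
# EmotionPrompt stimuli plus the challenge variant, verbatim.
STIMULI = {
    "career": "This is very important to my career.",
    "certainty": "You'd better be sure.",
    "pride": "Take pride in your work and give it your best.",
    "growth": "Embrace challenges as opportunities for growth.",
    "challenge": "I bet you can't solve this cleanly. Prove me wrong.",
}

def with_stakes(task: str, stimulus: str = "career") -> str:
    """Append an emotional or challenge stimulus to a task prompt."""
    return f"{task}\n\n{STIMULI[stimulus]}"
```

Pick one stimulus, not several; the effect doesn't compound, and each one adds tokens.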
4. Incentive framing
This is the "I'll tip you $200" thing. It sounds manipulative and stupid. It still works, within limits.
Bsharat et al.'s 26 Principles paper (2023) included tipping language among prompting strategies they tested; they report up to about 45% improvement on response quality in human evaluation contexts. Follow-up tests (not peer-reviewed but reproducible) found an interesting non-linearity: very small tips ($0.10) did worse than no tip (the model seems to pattern-match small tips with low-effort transactional contexts), while $100 to $1000 tips produced the biggest gains. Above $1000 the effect plateaus.
The model is not motivated. It pattern-matches on stakes language. Use this sparingly; it adds tokens and the effect is smaller than the first three techniques.
5. Self-evaluation / confidence scoring
This one is subtler. You add: "After your answer, rate your confidence 0-1 for correctness / completeness / operational safety. If any score is below 0.9, state what's missing and refine before finalizing."
The technique forces an explicit verification pass. The model produces an answer, reads it, grades it, and often catches errors it would otherwise ship. Research from Tian et al. (2023) and many follow-ups shows that well-designed self-evaluation improves calibration, though models tend to be overconfident by default. So use a high threshold (0.9+) rather than 0.5.
Caveat: self-evaluation helps more on tasks with explicit success criteria (did the code compile, does this schema validate) and helps less on open-ended writing where "correctness" is fuzzy.
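As a wrapper, the self-evaluation layer looks like this (a sketch with my own names, using the instruction text from above and a configurable threshold):

```python
SELF_EVAL = (
    "After your answer, rate your confidence 0-1 for correctness, "
    "completeness, and operational safety. If any score is below "
    "{threshold}, state what's missing and refine before finalizing."
)

def with_self_eval(task: str, threshold: float = 0.9) -> str:
    """Append a self-evaluation pass to a task prompt.

    A high threshold (0.9+) offsets the models' tendency toward
    overconfidence; 0.5 almost never triggers a refinement.
    """
    return f"{task}\n\n{SELF_EVAL.format(threshold=threshold)}"
```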
What doesn't work
Politeness. "Please," "kindly," "if you don't mind," "thank you." Studies that measure politeness phrases consistently find zero or negligible impact on output quality. Skip them; they cost tokens.
Generic personas. "You are a helpful assistant" is a wash. If you're going to add a persona, be specific about the role, the expertise areas, and the approach.
Stacking every technique at once. There are diminishing returns. Three layers (persona + step-by-step + self-check) is usually the sweet spot; five layers produce mild further gains at significant token cost. Measure for your own task; default to fewer unless evidence says otherwise.
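The three-layer sweet spot composes directly. A minimal sketch, with the layer text inlined so it stands alone (the function name, example task, and persona are all illustrative):

```python
def enhance(task: str, persona: str, self_check: bool = True) -> str:
    """Stack the three-layer sweet spot: persona + step-by-step + self-check."""
    parts = [
        persona,  # detailed expert persona goes first
        task,
        "Take a deep breath and work on this problem step by step.",
    ]
    if self_check:
        parts.append(
            "After your answer, rate your confidence 0-1 for correctness, "
            "completeness, and operational safety. If any score is below "
            "0.9, state what's missing and refine before finalizing."
        )
    return "\n\n".join(parts)

prompt = enhance(
    "Migrate this schema without downtime.",
    persona=(
        "You are a senior database reliability engineer with deep "
        "expertise in online schema migrations and replication lag."
    ),
)
```

Order matters in practice: persona before the task, triggers and checks after it.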
The caveats that honest writeups include
Published percentage improvements are benchmark-specific. They won't reproduce exactly on your task. Use the research to pick techniques that are likely to help; use your own evaluation to decide whether they actually do on what you're shipping.
Model versions change the baseline. A phrase that improved GPT-3.5 might do nothing for Claude Opus 4.7. The Prompt Enhancer lets you toggle each technique on and off so you can A/B against your own workflow.
Emotional / challenge framing can inflate verbosity. Add a length cap when you use them, or the model writes a novel.
Self-evaluation isn't magic. A confidently wrong model will still confidently score itself 0.95.
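All four caveats point the same way: measure on your own tasks. A minimal A/B harness sketch (every name here is my own; `score_fn` stands in for whatever evaluation you trust, human or automated):

```python
def ab_test(tasks, variants, score_fn, trials_per=1):
    """Compare prompt variants on your own tasks.

    `variants` maps a variant name to a function task -> prompt;
    `score_fn(prompt, task)` is your own quality measure.
    Returns the mean score per variant.
    """
    results = {name: [] for name in variants}
    for task in tasks:
        for name, make_prompt in variants.items():
            for _ in range(trials_per):
                results[name].append(score_fn(make_prompt(task), task))
    return {name: sum(s) / len(s) for name, s in results.items()}

# Toggle one technique per variant to see what it buys you.
variants = {
    "plain": lambda t: t,
    "breath": lambda t: t
    + "\n\nTake a deep breath and work on this problem step by step.",
}
```

One variant per toggled technique, a handful of representative tasks, and a scoring function you believe in beats any benchmark number from a paper.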
Where this fits in the toolchain
The Prompt Enhancer is the general-purpose version. It also powers a prompt-quality selector we've wired into the Mega Analyzer and Single Site Gen. Those two tools produce long composite prompts, and the Enhanced and Aggressive levels wrap the already-generated prompt in the same research-backed techniques before you copy it out.
If you just want the distilled version: detailed persona + step-by-step + self-check covers 80% of the benefit with 20% of the token cost. Add stakes framing when the task is genuinely important. Reserve challenge / incentive framing for the hardest problems where the small extra bump is worth the verbosity.
Primary research citations
- Xu, B. et al. (2023). ExpertPrompting: Instructing Large Language Models to be Distinguished Experts. arXiv:2305.14688
- Kong, A. et al. (2024). Better Zero-Shot Reasoning with Role-Play Prompting. NAACL 2024. arXiv:2308.07702
- Yang, C. et al. (2023). Large Language Models as Optimizers. Google DeepMind. arXiv:2309.03409
- Li, C. et al. (2023). Large Language Models Understand and Can be Enhanced by Emotional Stimuli. ICLR 2024 Spotlight. arXiv:2307.11760
- Bsharat, S. M., Myrzakhan, A., & Shen, Z. (2023). Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4. arXiv:2312.16171
- Tian, K. et al. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. EMNLP 2023.
If you want to go deeper, read them in that order.