
The Reflection Pattern: Building Self-Correcting AI Systems

AI Agents · Alin Dobra · 12 min read


Here's an uncomfortable truth: your LLM's first answer is rarely its best answer.

Ask GPT-4 to write code, and it works—mostly. Ask it to review that same code, and it finds bugs. Ask it to fix those bugs, and you get better code. This isn't magic. It's the reflection pattern.

Reflection is simple: make the AI critique its own work, then improve based on that critique. The result? Dramatically better outputs with minimal extra cost.

What Is the Reflection Pattern?

Reflection adds a self-review loop to AI generation:

text
  Generate ──> Critique ──> Good enough? ──Yes──> Output
                   ▲              │
                   │              No
                   │              ▼
                   └────────── Improve
                       (iterate)

Instead of:

text
User → LLM → Output

You get:

text
User → LLM → Draft → LLM (critic) → Feedback → LLM → Improved → ... → Final Output

The same model that makes mistakes can often catch those mistakes when asked to look again with fresh eyes.

Why Reflection Works

1. Different Prompts Activate Different Capabilities

When you ask an LLM to "write code," it's in generation mode—optimizing for producing something that looks right. When you ask it to "review this code for bugs," it's in analysis mode—optimizing for finding problems.

These are different cognitive tasks that activate different patterns in the model.
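
You can see the shift with two calls to the same model: one framed as production, one framed as review. A minimal sketch using the OpenAI Python client (the buggy snippet is purely illustrative):

python
import openai

client = openai.OpenAI()

# Generation framing: produce something that looks right
draft = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a Python function that adds two numbers."}],
).choices[0].message.content

# Analysis framing: hunt for problems in something that already exists
buggy = "def add(a, b):\n    return a - b"  # deliberately wrong, for illustration
review = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Review this code for bugs:\n{buggy}"}],
).choices[0].message.content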

2. Reduced Cognitive Load

Generating AND critiquing simultaneously is hard. Separating them lets the model focus:

Single Pass               | With Reflection
--------------------------|--------------------------
Generate correct code     | Generate code (any code)
While avoiding bugs       | Then: Find bugs
While being efficient     | Then: Optimize
While handling edge cases | Then: Check edge cases
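
In code, the right-hand column is just a chain of focused passes. A rough sketch, assuming a generate(prompt) helper that wraps a single LLM call (as in the later snippets):

python
def staged_generation(task: str) -> str:
    # Pass 1: just get something working
    code = generate(f"Write code for this task:\n{task}")
    # Pass 2: focus only on bugs
    code = generate(f"Find and fix any bugs in this code. Return the corrected code:\n{code}")
    # Pass 3: focus only on efficiency
    code = generate(f"Optimize this code without changing its behavior. Return the code:\n{code}")
    # Pass 4: focus only on edge cases
    code = generate(f"Add handling for edge cases this code misses. Return the code:\n{code}")
    return code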

3. Explicit Reasoning

Reflection forces the model to articulate what's wrong and why. This explicit reasoning often surfaces issues that implicit reasoning misses.
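
One way to get that explicit reasoning is to ask for it before the verdict. A small sketch (the prompt wording is just one option, again assuming a generate helper):

python
CRITIQUE_TEMPLATE = """For every problem you find, state:
1. What is wrong (quote the relevant part)
2. Why it is wrong
3. How to fix it
Only after listing all problems, give an overall verdict."""

def explicit_critique(output: str) -> str:
    return generate(f"{CRITIQUE_TEMPLATE}\n\nOutput to review:\n{output}")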

Basic Reflection Implementation

Here's a minimal but complete implementation:

python
import openai
from dataclasses import dataclass

@dataclass
class ReflectionResult:
    final_output: str
    iterations: int
    critiques: list[str]
    improvements: list[str]

class ReflectionAgent:
    def __init__(self, max_iterations: int = 3):
        self.client = openai.OpenAI()
        self.max_iterations = max_iterations

    def generate(self, task: str) -> ReflectionResult:
        """Generate with reflection loop"""

        # Initial generation
        current_output = self._initial_generate(task)

        critiques = []
        improvements = []

        for i in range(self.max_iterations):
            # Critique the current output
            critique = self._critique(task, current_output)
            critiques.append(critique)

            # Check if good enough
            if self._is_satisfactory(critique):
                break

            # Improve based on critique
            improved = self._improve(task, current_output, critique)
            improvements.append(improved)
            current_output = improved

        return ReflectionResult(
            final_output=current_output,
            iterations=i + 1,
            critiques=critiques,
            improvements=improvements
        )

    def _initial_generate(self, task: str) -> str:
        """First attempt at the task"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": task
            }]
        )
        return response.choices[0].message.content

    def _critique(self, task: str, output: str) -> str:
        """Critique the current output"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """You are a critical reviewer. Analyze the output for:
1. Correctness - Are there any errors or bugs?
2. Completeness - Does it fully address the task?
3. Quality - Could it be clearer, more efficient, or better structured?
4. Edge cases - Are there scenarios not handled?

Be specific and actionable. If the output is excellent, say "APPROVED" and explain why."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nOutput to review:\n{output}"
            }]
        )
        return response.choices[0].message.content

    def _is_satisfactory(self, critique: str) -> bool:
        """Check if the critique indicates approval (guarding against "NOT APPROVED")"""
        text = critique.upper()
        return "APPROVED" in text and "NOT APPROVED" not in text

    def _improve(self, task: str, current: str, critique: str) -> str:
        """Improve based on critique"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Improve the output based on the critique. Address all issues raised."
            }, {
                "role": "user",
                "content": f"""Original task: {task}

Current output:
{current}

Critique:
{critique}

Please provide an improved version that addresses all the issues."""
            }]
        )
        return response.choices[0].message.content


# Usage
agent = ReflectionAgent(max_iterations=3)
result = agent.generate(
    "Write a Python function to find the longest palindromic substring in a string."
)

print(f"Final output after {result.iterations} iterations:")
print(result.final_output)

Reflection Patterns

Pattern 1: Self-Reflection (Single Model)

The same model generates and critiques:

python
def self_reflect(task: str) -> str:
    # Generate
    output = generate(task)

    # Self-critique
    critique = generate(f"Review this output for issues:\n{output}")

    # Self-improve
    if needs_improvement(critique):
        output = generate(f"Improve this based on feedback:\n{output}\n\nFeedback:\n{critique}")

    return output

Pros: Simple, cheap, fast
Cons: Same blind spots in generation and critique
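
The snippet above leans on two helpers that the post doesn't define: generate and needs_improvement. One possible sketch of each (illustrative, not prescriptive):

python
import openai

client = openai.OpenAI()

def generate(prompt: str, model: str = "gpt-4o") -> str:
    """Single LLM call; shared by the snippets in this post."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def needs_improvement(critique: str) -> bool:
    """Crude check: treat anything not explicitly approved as needing work."""
    text = critique.upper()
    return not ("APPROVED" in text and "NOT APPROVED" not in text)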

Pattern 2: Critic Model (Different Persona)

Use different system prompts to create distinct "personas":

python
def critic_reflect(task: str) -> str:
    # Generator persona
    output = call_llm(
        system="You are an expert programmer. Write clean, efficient code.",
        user=task
    )

    # Critic persona (different mindset)
    critique = call_llm(
        system="""You are a senior code reviewer known for finding subtle bugs.
        You never approve code without thorough analysis.
        Look for: bugs, edge cases, performance issues, security vulnerabilities.""",
        user=f"Review this code:\n{output}"
    )

    # Improver persona
    if not is_approved(critique):
        output = call_llm(
            system="You are a developer responding to code review feedback.",
            user=f"Address this feedback:\n{critique}\n\nOriginal code:\n{output}"
        )

    return output

Pros: Different perspectives, catches more issues
Cons: More prompt engineering required
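
Both this pattern and the next assume a small call_llm helper (and an is_approved check) that the post doesn't define. One minimal sketch that accepts the calling styles used here, reusing the client from the generate sketch above:

python
def call_llm(user: str = "", system: str = "", prompt: str = "", model: str = "gpt-4o") -> str:
    """Thin wrapper over a chat completion; accepts either system/user or a bare prompt."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user or prompt})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

def is_approved(critique: str) -> bool:
    """Inverse of the needs_improvement sketch from Pattern 1."""
    return not needs_improvement(critique)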

Pattern 3: Multi-Model Reflection

Use different models for generation and critique:

python
def multi_model_reflect(task: str) -> str:
    # Fast model for generation
    output = call_llm(model="gpt-4o-mini", prompt=task)

    # Powerful model for critique
    critique = call_llm(
        model="gpt-4o",
        prompt=f"Carefully review this for correctness:\n{output}"
    )

    # Fast model implements fixes
    if needs_improvement(critique):
        output = call_llm(
            model="gpt-4o-mini",
            prompt=f"Fix these issues:\n{critique}\n\nCode:\n{output}"
        )

    return output

Pros: Cost-effective, leverages model strengths
Cons: More complex orchestration

Pattern 4: Verified Reflection (with Code Execution)

Don't just critique—actually test:

python
from hopx import Sandbox

def verified_reflect(task: str) -> str:
    output = generate_code(task)

    for attempt in range(3):
        # Actually run the code
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            sandbox.files.write("/app/solution.py", output)
            result = sandbox.commands.run("python /app/solution.py")

            if result.exit_code == 0:
                # Code runs - but is it correct?
                verification = verify_output(result.stdout, task)
                if verification.passed:
                    return output
                critique = verification.feedback
            else:
                critique = f"Code failed with error:\n{result.stderr}"

            # Improve based on actual execution feedback
            output = improve_code(output, critique)

        finally:
            sandbox.kill()

    return output

Pros: Ground truth verification, catches runtime errors
Cons: Requires sandboxed execution, slower
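
Here, generate_code and improve_code are thin prompt wrappers in the spirit of generate; verify_output is the interesting one, since it decides whether output that runs is also output that's right. One way to sketch it is with another LLM call (you could equally compare against known expected values):

python
from dataclasses import dataclass

@dataclass
class Verification:
    passed: bool
    feedback: str

def verify_output(stdout: str, task: str) -> Verification:
    """Ask the model whether the program's output actually satisfies the task."""
    verdict = generate(
        f"Task: {task}\n\nProgram output:\n{stdout}\n\n"
        "Does this output satisfy the task? Start your reply with PASS or FAIL, then explain."
    )
    return Verification(passed=verdict.strip().upper().startswith("PASS"), feedback=verdict)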

Advanced: Structured Reflection

For complex tasks, use structured critique formats:

python
import json
import openai
from pydantic import BaseModel
from typing import Literal

client = openai.OpenAI()

class CritiqueItem(BaseModel):
    category: Literal["correctness", "completeness", "efficiency", "style", "security"]
    severity: Literal["critical", "major", "minor", "suggestion"]
    description: str
    location: str  # Line number or section
    suggested_fix: str

class StructuredCritique(BaseModel):
    approved: bool
    summary: str
    issues: list[CritiqueItem]

def structured_reflect(task: str, output: str) -> StructuredCritique:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Analyze the output and provide structured feedback.
            Return JSON matching this schema:
            {
                "approved": boolean,
                "summary": "overall assessment",
                "issues": [
                    {
                        "category": "correctness|completeness|efficiency|style|security",
                        "severity": "critical|major|minor|suggestion",
                        "description": "what's wrong",
                        "location": "where in the code",
                        "suggested_fix": "how to fix it"
                    }
                ]
            }"""
        }, {
            "role": "user",
            "content": f"Task: {task}\n\nOutput:\n{output}"
        }],
        response_format={"type": "json_object"}
    )

    return StructuredCritique(**json.loads(response.choices[0].message.content))


# Usage with prioritized fixes
def reflect_with_priority(task: str) -> str:
    output = generate(task)

    for _ in range(3):
        critique = structured_reflect(task, output)

        if critique.approved:
            break

        # Fix critical issues first
        critical = [i for i in critique.issues if i.severity == "critical"]
        major = [i for i in critique.issues if i.severity == "major"]

        if critical:
            output = fix_issues(output, critical)
        elif major:
            output = fix_issues(output, major)
        else:
            break  # Only minor issues remain

    return output

Real-World Example: Code Generation with Testing

Here's a complete example that generates code, writes tests, runs them, and iterates:

python
from hopx import Sandbox
import openai

class TestDrivenReflection:
    def __init__(self):
        self.client = openai.OpenAI()

    def generate_with_tests(self, task: str) -> dict:
        """Generate code that passes tests"""

        # Step 1: Generate initial code
        code = self._generate_code(task)

        # Step 2: Generate tests
        tests = self._generate_tests(task, code)

        # Step 3: Run and iterate
        for attempt in range(5):
            result = self._run_tests(code, tests)

            if result["passed"]:
                return {
                    "code": code,
                    "tests": tests,
                    "attempts": attempt + 1,
                    "status": "success"
                }

            # Reflect and improve
            code = self._improve_from_failure(task, code, tests, result["error"])

        return {
            "code": code,
            "tests": tests,
            "attempts": 5,
            "status": "max_attempts_reached"
        }

    def _generate_code(self, task: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Write clean, well-documented Python code. Include type hints."
            }, {
                "role": "user",
                "content": task
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _generate_tests(self, task: str, code: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Write pytest tests for this code. Include:
- Happy path tests
- Edge cases (empty input, large input, invalid input)
- Boundary conditions
Make tests thorough but not excessive."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nCode:\n```python\n{code}\n```"
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _run_tests(self, code: str, tests: str) -> dict:
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            # Install pytest
            sandbox.commands.run("pip install pytest -q")

            # Write code and tests
            sandbox.files.write("/app/solution.py", code)
            sandbox.files.write("/app/test_solution.py", f"from solution import *\n\n{tests}")

            # Run tests
            result = sandbox.commands.run("cd /app && python -m pytest test_solution.py -v")

            return {
                "passed": result.exit_code == 0,
                "output": result.stdout,
                # pytest reports failure details on stdout, so include both streams
                "error": f"{result.stdout}\n{result.stderr}" if result.exit_code != 0 else None
            }

        finally:
            sandbox.kill()

    def _improve_from_failure(self, task: str, code: str, tests: str, error: str) -> str:
        prompt = f"Task: {task}\n\nCurrent code:\n{code}\n\nTests:\n{tests}\n\nTest error:\n{error}\n\nProvide the fixed code only."

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "The code failed tests. Analyze the error and fix the code. Focus on the specific failure."
            }, {
                "role": "user",
                "content": prompt
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _extract_code(self, content: str) -> str:
        if "```python" in content:
            return content.split("```python")[1].split("```")[0].strip()
        elif "```" in content:
            return content.split("```")[1].split("```")[0].strip()
        return content.strip()


# Usage
agent = TestDrivenReflection()
result = agent.generate_with_tests(
    "Write a function `merge_sorted_lists(list1, list2)` that merges two sorted lists into one sorted list."
)

print(f"Status: {result['status']}")
print(f"Attempts: {result['attempts']}")
print(f"\nFinal code:\n{result['code']}")

Reflection for Different Tasks

Writing Tasks

python
def reflect_on_writing(draft: str, requirements: str) -> str:
    critique_prompt = f"""Review this writing for:
1. Clarity - Is it easy to understand?
2. Accuracy - Are all facts correct?
3. Completeness - Does it cover all requirements?
4. Tone - Is it appropriate for the audience?
5. Structure - Is it well-organized?
6. Grammar - Any errors?

Requirements: {requirements}

Draft:
{draft}"""

    critique = generate(critique_prompt)

    if needs_revision(critique):
        improved = generate(f"Revise based on this feedback:\n{critique}\n\nDraft:\n{draft}")
        return improved

    return draft

Data Analysis

python
def reflect_on_analysis(analysis: str, data_description: str) -> str:
    critique_prompt = f"""Review this data analysis for:
1. Statistical validity - Are methods appropriate?
2. Interpretation - Are conclusions supported by data?
3. Completeness - Are there unexplored angles?
4. Clarity - Would a non-expert understand?
5. Visualization - Are charts appropriate and clear?

Data: {data_description}

Analysis:
{analysis}"""

    critique = generate(critique_prompt)
    # ... improve based on critique

Decision Making

python
def reflect_on_decision(decision: str, context: str) -> str:
    critique_prompt = f"""Play devil's advocate on this decision:
1. What could go wrong?
2. What alternatives weren't considered?
3. What assumptions might be wrong?
4. Who might be negatively affected?
5. What's the worst-case scenario?

Context: {context}

Proposed decision:
{decision}"""

    critique = generate(critique_prompt)

    # Generate balanced view
    balanced = generate(f"""
Given this decision and critique, provide a balanced recommendation.

Decision: {decision}
Critique: {critique}

Should we proceed, modify, or reconsider?""")

    return balanced

When NOT to Use Reflection

Reflection isn't always worth the cost:

Skip Reflection When     | Why
-------------------------|------------------------------------------------------
Simple factual queries   | "What's the capital of France?" doesn't need review
Time-critical responses  | Latency matters more than perfection
Creative brainstorming   | Critique can kill creativity
The task is trivial      | Overhead exceeds benefit
You're already using CoT | Chain-of-thought includes implicit reflection

Cost Consideration

Reflection typically multiplies your token usage by 2-3x:

text
Without reflection: 1 LLM call
With 2 iterations:  5 LLM calls (generate + critique + improve + critique + improve)

Use reflection when quality matters more than cost.
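
If you want real numbers for your workload instead of a rule of thumb, total up the usage the API reports on each call. A small sketch, reusing the client from the earlier snippets:

python
def tracked_call(messages: list[dict], totals: dict, model: str = "gpt-4o") -> str:
    """Chat completion that also accumulates call and token counts into `totals`."""
    response = client.chat.completions.create(model=model, messages=messages)
    totals["calls"] = totals.get("calls", 0) + 1
    if response.usage is not None:
        totals["tokens"] = totals.get("tokens", 0) + response.usage.total_tokens
    return response.choices[0].message.content

# Pass the same dict through every generate/critique/improve call,
# then compare totals with and without reflection.
totals = {"calls": 0, "tokens": 0}
answer = tracked_call([{"role": "user", "content": "Write a haiku about code review."}], totals)
print(totals)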

Optimizing Reflection

1. Early Exit

Stop as soon as output is good enough:

python
def optimized_reflect(task: str) -> str:
    output = generate(task)

    # Quick check - is it obviously good?
    quick_check = generate(f"Rate this output from 1-10. Reply with only the number:\n{output}")
    if int(quick_check.strip()) >= 9:
        return output  # Skip detailed critique

    # Full critique only if needed
    critique = detailed_critique(output)
    # ...

2. Targeted Critique

Don't critique everything—focus on what matters:

python
def targeted_critique(task: str, output: str) -> str:
    # Determine what's important for this task
    if "code" in task.lower():
        focus = "correctness, edge cases, efficiency"
    elif "write" in task.lower():
        focus = "clarity, accuracy, engagement"
    else:
        focus = "relevance, completeness"

    return generate(f"Critique focusing on {focus}:\n{output}")

3. Parallel Critique

Run multiple critiques in parallel:

python
import concurrent.futures

def parallel_critique(output: str) -> list[str]:
    aspects = [
        "correctness and bugs",
        "performance and efficiency",
        "readability and style",
        "security vulnerabilities"
    ]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(critique_aspect, output, aspect)
            for aspect in aspects
        ]
        return [f.result() for f in futures]

Measuring Reflection Effectiveness

Track these metrics:

python
from dataclasses import dataclass

@dataclass
class ReflectionMetrics:
    initial_score: float      # Quality before reflection
    final_score: float        # Quality after reflection
    iterations_used: int      # How many loops
    tokens_used: int          # Cost
    time_taken: float         # Latency

    @property
    def improvement(self) -> float:
        return (self.final_score - self.initial_score) / self.initial_score

    @property
    def efficiency(self) -> float:
        return self.improvement / self.tokens_used

If reflection isn't improving outputs by at least 15-20%, reconsider your critique prompts.
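
As a quick sanity check with made-up numbers:

python
metrics = ReflectionMetrics(
    initial_score=6.0,   # illustrative scores, e.g. from an LLM judge or a rubric
    final_score=8.1,
    iterations_used=2,
    tokens_used=5400,
    time_taken=11.3,
)
print(f"Improvement: {metrics.improvement:.0%}")        # 35% -- clears the 15-20% bar
print(f"Per-token efficiency: {metrics.efficiency:.2e}")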

Conclusion

Reflection is one of the highest-impact patterns you can add to AI systems:

  • Simple to implement — Just add a critique step
  • Significant quality gains — 20-50% improvement is common
  • Works everywhere — Code, writing, analysis, decisions
  • Compounds with other patterns — Combine with prompt chaining for even better results

Start with basic self-reflection. Add verified reflection (with code execution) for code tasks. Measure the improvement, and tune your critique prompts.

The AI that reviews its work beats the AI that doesn't. Every time.


Ready to add verified reflection with code execution? Get started with HopX — sandboxes that let you test AI-generated code safely.

Further Reading