
The Reflection Pattern: Building Self-Correcting AI Systems

AI Agents · Alin Dobra · 12 min read


Here's an uncomfortable truth: your LLM's first answer is rarely its best answer.

Ask GPT-4 to write code, and it works—mostly. Ask it to review that same code, and it finds bugs. Ask it to fix those bugs, and you get better code. This isn't magic. It's the reflection pattern.

Reflection is simple: make the AI critique its own work, then improve based on that critique. The result? Dramatically better outputs with minimal extra cost.

What Is the Reflection Pattern?

Reflection adds a self-review loop to AI generation:

text
  Generate ──> Critique ──> Good enough? ──Yes──> Output
                   ▲              │
                   │              No
                   │              ▼
                   └────────── Improve
                       (iterate)

Instead of:

text
User → LLM → Output

You get:

text
User → LLM → Draft → LLM (critic) → Feedback → LLM → Improved → ... → Final Output

The same model that makes mistakes can often catch those mistakes when asked to look again with fresh eyes.

Why Reflection Works

1. Different Prompts Activate Different Capabilities

When you ask an LLM to "write code," it's in generation mode—optimizing for producing something that looks right. When you ask it to "review this code for bugs," it's in analysis mode—optimizing for finding problems.

These are different cognitive tasks that activate different patterns in the model.
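
You can see the shift with two calls to the same model: one framed as production, one framed as review. A minimal sketch using the OpenAI Python client (the buggy snippet is purely illustrative):

python
import openai

client = openai.OpenAI()

# Generation framing: produce something that looks right
draft = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a Python function that adds two numbers."}],
).choices[0].message.content

# Analysis framing: hunt for problems in something that already exists
buggy = "def add(a, b):\n    return a - b"  # deliberately wrong, for illustration
review = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Review this code for bugs:\n{buggy}"}],
).choices[0].message.content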

2. Reduced Cognitive Load

Generating AND critiquing simultaneously is hard. Separating them lets the model focus:

Single Pass               | With Reflection
--------------------------|--------------------------
Generate correct code     | Generate code (any code)
While avoiding bugs       | Then: Find bugs
While being efficient     | Then: Optimize
While handling edge cases | Then: Check edge cases
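
In code, the right-hand column is just a chain of focused passes. A rough sketch, assuming a generate(prompt) helper that wraps a single LLM call (as in the later snippets):

python
def staged_generation(task: str) -> str:
    # Pass 1: just get something working
    code = generate(f"Write code for this task:\n{task}")
    # Pass 2: focus only on bugs
    code = generate(f"Find and fix any bugs in this code. Return the corrected code:\n{code}")
    # Pass 3: focus only on efficiency
    code = generate(f"Optimize this code without changing its behavior. Return the code:\n{code}")
    # Pass 4: focus only on edge cases
    code = generate(f"Add handling for edge cases this code misses. Return the code:\n{code}")
    return code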

3. Explicit Reasoning

Reflection forces the model to articulate what's wrong and why. This explicit reasoning often surfaces issues that implicit reasoning misses.
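
One way to get that explicit reasoning is to ask for it before the verdict. A small sketch (the prompt wording is just one option, again assuming a generate helper):

python
CRITIQUE_TEMPLATE = """For every problem you find, state:
1. What is wrong (quote the relevant part)
2. Why it is wrong
3. How to fix it
Only after listing all problems, give an overall verdict."""

def explicit_critique(output: str) -> str:
    return generate(f"{CRITIQUE_TEMPLATE}\n\nOutput to review:\n{output}")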

Basic Reflection Implementation

Here's a minimal but complete implementation:

python
import openai
from dataclasses import dataclass

@dataclass
class ReflectionResult:
    final_output: str
    iterations: int
    critiques: list[str]
    improvements: list[str]

class ReflectionAgent:
    def __init__(self, max_iterations: int = 3):
        self.client = openai.OpenAI()
        self.max_iterations = max_iterations

    def generate(self, task: str) -> ReflectionResult:
        """Generate with reflection loop"""

        # Initial generation
        current_output = self._initial_generate(task)

        critiques = []
        improvements = []

        for i in range(self.max_iterations):
            # Critique the current output
            critique = self._critique(task, current_output)
            critiques.append(critique)

            # Check if good enough
            if self._is_satisfactory(critique):
                break

            # Improve based on critique
            improved = self._improve(task, current_output, critique)
            improvements.append(improved)
            current_output = improved

        return ReflectionResult(
            final_output=current_output,
            iterations=i + 1,
            critiques=critiques,
            improvements=improvements
        )

    def _initial_generate(self, task: str) -> str:
        """First attempt at the task"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": task
            }]
        )
        return response.choices[0].message.content

    def _critique(self, task: str, output: str) -> str:
        """Critique the current output"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """You are a critical reviewer. Analyze the output for:
1. Correctness - Are there any errors or bugs?
2. Completeness - Does it fully address the task?
3. Quality - Could it be clearer, more efficient, or better structured?
4. Edge cases - Are there scenarios not handled?

Be specific and actionable. If the output is excellent, say "APPROVED" and explain why."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nOutput to review:\n{output}"
            }]
        )
        return response.choices[0].message.content

    def _is_satisfactory(self, critique: str) -> bool:
        """Check if the critique indicates approval (guarding against "NOT APPROVED")"""
        text = critique.upper()
        return "APPROVED" in text and "NOT APPROVED" not in text

    def _improve(self, task: str, current: str, critique: str) -> str:
        """Improve based on critique"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Improve the output based on the critique. Address all issues raised."
            }, {
                "role": "user",
                "content": f"""Original task: {task}

Current output:
{current}

Critique:
{critique}

Please provide an improved version that addresses all the issues."""
            }]
        )
        return response.choices[0].message.content


# Usage
agent = ReflectionAgent(max_iterations=3)
result = agent.generate(
    "Write a Python function to find the longest palindromic substring in a string."
)

print(f"Final output after {result.iterations} iterations:")
print(result.final_output)

Reflection Patterns

Pattern 1: Self-Reflection (Single Model)

The same model generates and critiques:

python
def self_reflect(task: str) -> str:
    # Generate
    output = generate(task)

    # Self-critique
    critique = generate(f"Review this output for issues:\n{output}")

    # Self-improve
    if needs_improvement(critique):
        output = generate(f"Improve this based on feedback:\n{output}\n\nFeedback:\n{critique}")

    return output

Pros: Simple, cheap, fast
Cons: Same blind spots in generation and critique
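
The snippet above leans on two helpers that the post doesn't define: generate and needs_improvement. One possible sketch of each (illustrative, not prescriptive):

python
import openai

client = openai.OpenAI()

def generate(prompt: str, model: str = "gpt-4o") -> str:
    """Single LLM call; shared by the snippets in this post."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def needs_improvement(critique: str) -> bool:
    """Crude check: treat anything not explicitly approved as needing work."""
    text = critique.upper()
    return not ("APPROVED" in text and "NOT APPROVED" not in text)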

Pattern 2: Critic Model (Different Persona)

Use different system prompts to create distinct "personas":

python
def critic_reflect(task: str) -> str:
    # Generator persona
    output = call_llm(
        system="You are an expert programmer. Write clean, efficient code.",
        user=task
    )

    # Critic persona (different mindset)
    critique = call_llm(
        system="""You are a senior code reviewer known for finding subtle bugs.
        You never approve code without thorough analysis.
        Look for: bugs, edge cases, performance issues, security vulnerabilities.""",
        user=f"Review this code:\n{output}"
    )

    # Improver persona
    if not is_approved(critique):
        output = call_llm(
            system="You are a developer responding to code review feedback.",
            user=f"Address this feedback:\n{critique}\n\nOriginal code:\n{output}"
        )

    return output

Pros: Different perspectives, catches more issues
Cons: More prompt engineering required
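
Both this pattern and the next assume a small call_llm helper (and an is_approved check) that the post doesn't define. One minimal sketch that accepts the calling styles used here, reusing the client from the generate sketch above:

python
def call_llm(user: str = "", system: str = "", prompt: str = "", model: str = "gpt-4o") -> str:
    """Thin wrapper over a chat completion; accepts either system/user or a bare prompt."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user or prompt})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

def is_approved(critique: str) -> bool:
    """Inverse of the needs_improvement sketch from Pattern 1."""
    return not needs_improvement(critique)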

Pattern 3: Multi-Model Reflection

Use different models for generation and critique:

python
def multi_model_reflect(task: str) -> str:
    # Fast model for generation
    output = call_llm(model="gpt-4o-mini", prompt=task)

    # Powerful model for critique
    critique = call_llm(
        model="gpt-4o",
        prompt=f"Carefully review this for correctness:\n{output}"
    )

    # Fast model implements fixes
    if needs_improvement(critique):
        output = call_llm(
            model="gpt-4o-mini",
            prompt=f"Fix these issues:\n{critique}\n\nCode:\n{output}"
        )

    return output

Pros: Cost-effective, leverages model strengths
Cons: More complex orchestration

Pattern 4: Verified Reflection (with Code Execution)

Don't just critique—actually test:

python
from hopx import Sandbox

def verified_reflect(task: str) -> str:
    output = generate_code(task)

    for attempt in range(3):
        # Actually run the code
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            sandbox.files.write("/app/solution.py", output)
            result = sandbox.commands.run("python /app/solution.py")

            if result.exit_code == 0:
                # Code runs - but is it correct?
                verification = verify_output(result.stdout, task)
                if verification.passed:
                    return output
                critique = verification.feedback
            else:
                critique = f"Code failed with error:\n{result.stderr}"

            # Improve based on actual execution feedback
            output = improve_code(output, critique)

        finally:
            sandbox.kill()

    return output

Pros: Ground truth verification, catches runtime errors
Cons: Requires sandboxed execution, slower
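
Here, generate_code and improve_code are thin prompt wrappers in the spirit of generate; verify_output is the interesting one, since it decides whether output that runs is also output that's right. One way to sketch it is with another LLM call (you could equally compare against known expected values):

python
from dataclasses import dataclass

@dataclass
class Verification:
    passed: bool
    feedback: str

def verify_output(stdout: str, task: str) -> Verification:
    """Ask the model whether the program's output actually satisfies the task."""
    verdict = generate(
        f"Task: {task}\n\nProgram output:\n{stdout}\n\n"
        "Does this output satisfy the task? Start your reply with PASS or FAIL, then explain."
    )
    return Verification(passed=verdict.strip().upper().startswith("PASS"), feedback=verdict)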

Advanced: Structured Reflection

For complex tasks, use structured critique formats:

python
import json
import openai
from pydantic import BaseModel
from typing import Literal

client = openai.OpenAI()

class CritiqueItem(BaseModel):
    category: Literal["correctness", "completeness", "efficiency", "style", "security"]
    severity: Literal["critical", "major", "minor", "suggestion"]
    description: str
    location: str  # Line number or section
    suggested_fix: str

class StructuredCritique(BaseModel):
    approved: bool
    summary: str
    issues: list[CritiqueItem]

def structured_reflect(task: str, output: str) -> StructuredCritique:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Analyze the output and provide structured feedback.
            Return JSON matching this schema:
            {
                "approved": boolean,
                "summary": "overall assessment",
                "issues": [
                    {
                        "category": "correctness|completeness|efficiency|style|security",
                        "severity": "critical|major|minor|suggestion",
                        "description": "what's wrong",
                        "location": "where in the code",
                        "suggested_fix": "how to fix it"
                    }
                ]
            }"""
        }, {
            "role": "user",
            "content": f"Task: {task}\n\nOutput:\n{output}"
        }],
        response_format={"type": "json_object"}
    )

    return StructuredCritique(**json.loads(response.choices[0].message.content))


# Usage with prioritized fixes
def reflect_with_priority(task: str) -> str:
    output = generate(task)

    for _ in range(3):
        critique = structured_reflect(task, output)

        if critique.approved:
            break

        # Fix critical issues first
        critical = [i for i in critique.issues if i.severity == "critical"]
        major = [i for i in critique.issues if i.severity == "major"]

        if critical:
            output = fix_issues(output, critical)
        elif major:
            output = fix_issues(output, major)
        else:
            break  # Only minor issues remain

    return output

Real-World Example: Code Generation with Testing

Here's a complete example that generates code, writes tests, runs them, and iterates:

python
from hopx import Sandbox
import openai

class TestDrivenReflection:
    def __init__(self):
        self.client = openai.OpenAI()

    def generate_with_tests(self, task: str) -> dict:
        """Generate code that passes tests"""

        # Step 1: Generate initial code
        code = self._generate_code(task)

        # Step 2: Generate tests
        tests = self._generate_tests(task, code)

        # Step 3: Run and iterate
        for attempt in range(5):
            result = self._run_tests(code, tests)

            if result["passed"]:
                return {
                    "code": code,
                    "tests": tests,
                    "attempts": attempt + 1,
                    "status": "success"
                }

            # Reflect and improve
            code = self._improve_from_failure(task, code, tests, result["error"])

        return {
            "code": code,
            "tests": tests,
            "attempts": 5,
            "status": "max_attempts_reached"
        }

    def _generate_code(self, task: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Write clean, well-documented Python code. Include type hints."
            }, {
                "role": "user",
                "content": task
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _generate_tests(self, task: str, code: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Write pytest tests for this code. Include:
- Happy path tests
- Edge cases (empty input, large input, invalid input)
- Boundary conditions
Make tests thorough but not excessive."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nCode:\n```python\n{code}\n```"
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _run_tests(self, code: str, tests: str) -> dict:
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            # Install pytest
            sandbox.commands.run("pip install pytest -q")

            # Write code and tests
            sandbox.files.write("/app/solution.py", code)
            sandbox.files.write("/app/test_solution.py", f"from solution import *\n\n{tests}")

            # Run tests
            result = sandbox.commands.run("cd /app && python -m pytest test_solution.py -v")

            return {
                "passed": result.exit_code == 0,
                "output": result.stdout,
                # pytest reports failure details on stdout, so include both streams
                "error": f"{result.stdout}\n{result.stderr}" if result.exit_code != 0 else None
            }

        finally:
            sandbox.kill()

    def _improve_from_failure(self, task: str, code: str, tests: str, error: str) -> str:
        prompt = f"Task: {task}\n\nCurrent code:\n{code}\n\nTests:\n{tests}\n\nTest error:\n{error}\n\nProvide the fixed code only."

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "The code failed tests. Analyze the error and fix the code. Focus on the specific failure."
            }, {
                "role": "user",
                "content": prompt
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _extract_code(self, content: str) -> str:
        if "```python" in content:
            return content.split("```python")[1].split("```")[0].strip()
        elif "```" in content:
            return content.split("```")[1].split("```")[0].strip()
        return content.strip()


# Usage
agent = TestDrivenReflection()
result = agent.generate_with_tests(
    "Write a function `merge_sorted_lists(list1, list2)` that merges two sorted lists into one sorted list."
)

print(f"Status: {result['status']}")
print(f"Attempts: {result['attempts']}")
print(f"\nFinal code:\n{result['code']}")

Reflection for Different Tasks

Writing Tasks

python
def reflect_on_writing(draft: str, requirements: str) -> str:
    critique_prompt = f"""Review this writing for:
1. Clarity - Is it easy to understand?
2. Accuracy - Are all facts correct?
3. Completeness - Does it cover all requirements?
4. Tone - Is it appropriate for the audience?
5. Structure - Is it well-organized?
6. Grammar - Any errors?

Requirements: {requirements}

Draft:
{draft}"""

    critique = generate(critique_prompt)

    if needs_revision(critique):
        improved = generate(f"Revise based on this feedback:\n{critique}\n\nDraft:\n{draft}")
        return improved

    return draft

Data Analysis

python
def reflect_on_analysis(analysis: str, data_description: str) -> str:
    critique_prompt = f"""Review this data analysis for:
1. Statistical validity - Are methods appropriate?
2. Interpretation - Are conclusions supported by data?
3. Completeness - Are there unexplored angles?
4. Clarity - Would a non-expert understand?
5. Visualization - Are charts appropriate and clear?

Data: {data_description}

Analysis:
{analysis}"""

    critique = generate(critique_prompt)
    # ... improve based on critique

Decision Making

python
def reflect_on_decision(decision: str, context: str) -> str:
    critique_prompt = f"""Play devil's advocate on this decision:
1. What could go wrong?
2. What alternatives weren't considered?
3. What assumptions might be wrong?
4. Who might be negatively affected?
5. What's the worst-case scenario?

Context: {context}

Proposed decision:
{decision}"""

    critique = generate(critique_prompt)

    # Generate balanced view
    balanced = generate(f"""
Given this decision and critique, provide a balanced recommendation.

Decision: {decision}
Critique: {critique}

Should we proceed, modify, or reconsider?""")

    return balanced

When NOT to Use Reflection

Reflection isn't always worth the cost:

Skip Reflection When     | Why
-------------------------|------------------------------------------------------
Simple factual queries   | "What's the capital of France?" doesn't need review
Time-critical responses  | Latency matters more than perfection
Creative brainstorming   | Critique can kill creativity
The task is trivial      | Overhead exceeds benefit
You're already using CoT | Chain-of-thought includes implicit reflection

Cost Consideration

Reflection typically multiplies your token usage by 2-3x:

text
Without reflection: 1 LLM call
With 2 iterations:  5 LLM calls (generate + critique + improve + critique + improve)

Use reflection when quality matters more than cost.
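
If you want real numbers for your workload instead of a rule of thumb, total up the usage the API reports on each call. A small sketch, reusing the client from the earlier snippets:

python
def tracked_call(messages: list[dict], totals: dict, model: str = "gpt-4o") -> str:
    """Chat completion that also accumulates call and token counts into `totals`."""
    response = client.chat.completions.create(model=model, messages=messages)
    totals["calls"] = totals.get("calls", 0) + 1
    if response.usage is not None:
        totals["tokens"] = totals.get("tokens", 0) + response.usage.total_tokens
    return response.choices[0].message.content

# Pass the same dict through every generate/critique/improve call,
# then compare totals with and without reflection.
totals = {"calls": 0, "tokens": 0}
answer = tracked_call([{"role": "user", "content": "Write a haiku about code review."}], totals)
print(totals)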

Optimizing Reflection

1. Early Exit

Stop as soon as output is good enough:

python
def optimized_reflect(task: str) -> str:
    output = generate(task)

    # Quick check - is it obviously good?
    quick_check = generate(f"Rate this output from 1-10. Reply with only the number:\n{output}")
    if int(quick_check.strip()) >= 9:
        return output  # Skip detailed critique

    # Full critique only if needed
    critique = detailed_critique(output)
    # ...

2. Targeted Critique

Don't critique everything—focus on what matters:

python
def targeted_critique(task: str, output: str) -> str:
    # Determine what's important for this task
    if "code" in task.lower():
        focus = "correctness, edge cases, efficiency"
    elif "write" in task.lower():
        focus = "clarity, accuracy, engagement"
    else:
        focus = "relevance, completeness"

    return generate(f"Critique focusing on {focus}:\n{output}")

3. Parallel Critique

Run multiple critiques in parallel:

python
import concurrent.futures

def parallel_critique(output: str) -> list[str]:
    aspects = [
        "correctness and bugs",
        "performance and efficiency",
        "readability and style",
        "security vulnerabilities"
    ]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(critique_aspect, output, aspect)
            for aspect in aspects
        ]
        return [f.result() for f in futures]

Measuring Reflection Effectiveness

Track these metrics:

python
from dataclasses import dataclass

@dataclass
class ReflectionMetrics:
    initial_score: float      # Quality before reflection
    final_score: float        # Quality after reflection
    iterations_used: int      # How many loops
    tokens_used: int          # Cost
    time_taken: float         # Latency

    @property
    def improvement(self) -> float:
        return (self.final_score - self.initial_score) / self.initial_score

    @property
    def efficiency(self) -> float:
        return self.improvement / self.tokens_used

If reflection isn't improving outputs by at least 15-20%, reconsider your critique prompts.
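
As a quick sanity check with made-up numbers:

python
metrics = ReflectionMetrics(
    initial_score=6.0,   # illustrative scores, e.g. from an LLM judge or a rubric
    final_score=8.1,
    iterations_used=2,
    tokens_used=5400,
    time_taken=11.3,
)
print(f"Improvement: {metrics.improvement:.0%}")        # 35% -- clears the 15-20% bar
print(f"Per-token efficiency: {metrics.efficiency:.2e}")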

Conclusion

Reflection is one of the highest-impact patterns you can add to AI systems:

  • Simple to implement — Just add a critique step
  • Significant quality gains — 20-50% improvement is common
  • Works everywhere — Code, writing, analysis, decisions
  • Compounds with other patterns — Combine with prompt chaining for even better results

Start with basic self-reflection. Add verified reflection (with code execution) for code tasks. Measure the improvement, and tune your critique prompts.

The AI that reviews its work beats the AI that doesn't. Every time.


Ready to add verified reflection with code execution? Get started with HopX — sandboxes that let you test AI-generated code safely.

Further Reading