The Reflection Pattern: Building Self-Correcting AI Systems
Here's an uncomfortable truth: your LLM's first answer is rarely its best answer.
Ask GPT-4 to write code, and it works—mostly. Ask it to review that same code, and it finds bugs. Ask it to fix those bugs, and you get better code. This isn't magic. It's the reflection pattern.
Reflection is simple: make the AI critique its own work, then improve based on that critique. The result? Dramatically better outputs with minimal extra cost.
What Is the Reflection Pattern?
Reflection adds a self-review loop to AI generation:
```
┌──────────┐     ┌──────────┐     ┌──────────────┐
│ Generate │────▶│ Critique │────▶│ Good enough? │── Yes ──▶ Output
└──────────┘     └──────────┘     └──────────────┘
                      ▲                  │ No
                      │                  ▼
                      │            ┌──────────┐
                      └────────────│ Improve  │
                         (iterate) └──────────┘
```
Instead of:
```
User → LLM → Output
```
You get:
```
User → LLM → Draft → LLM (critic) → Feedback → LLM → Improved → ... → Final Output
```
The same model that makes mistakes can often catch those mistakes when asked to look again with fresh eyes.
Why Reflection Works
1. Different Prompts Activate Different Capabilities
When you ask an LLM to "write code," it's in generation mode—optimizing for producing something that looks right. When you ask it to "review this code for bugs," it's in analysis mode—optimizing for finding problems.
These are different cognitive tasks that activate different patterns in the model.
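You can see this for yourself with two calls on the same snippet. A minimal sketch, assuming the OpenAI Python SDK and a deliberately buggy `average` function as the example:
```python
import openai

client = openai.OpenAI()

# A deliberately flawed snippet: crashes on an empty list
snippet = "def average(nums): return sum(nums) / len(nums)"

# Generation mode: produce something that looks right
generation = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a Python function that averages a list of numbers."}]
)

# Analysis mode: same model, now hunting for problems
review = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Review this code for bugs and unhandled edge cases:\n{snippet}"}]
)
print(review.choices[0].message.content)  # usually flags the empty-list ZeroDivisionError
```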
2. Reduced Cognitive Load
Generating AND critiquing simultaneously is hard. Separating them lets the model focus, as the sketch after this table shows:
| Single Pass | With Reflection |
|---|---|
| Generate correct code | Generate code (any code) |
| While avoiding bugs | Then: Find bugs |
| While being efficient | Then: Optimize |
| While handling edge cases | Then: Check edge cases |
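One way to act on that split is to run each "Then:" column as its own focused pass. A rough sketch (the function name, pass prompts, and injected `generate` callable are placeholders):
```python
from typing import Callable

def staged_reflection(task: str, generate: Callable[[str], str]) -> str:
    """Generate once, then run separate focused critique-and-revise passes."""
    draft = generate(task)  # Generate code (any code)
    for focus in ("find bugs", "optimize for efficiency", "check edge cases"):
        feedback = generate(f"Review the following and {focus}:\n{draft}")
        draft = generate(f"Revise to address this feedback:\n{feedback}\n\nCurrent version:\n{draft}")
    return draft
```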
3. Explicit Reasoning
Reflection forces the model to articulate what's wrong and why. This explicit reasoning often surfaces issues that implicit reasoning misses.
Basic Reflection Implementation
Here's a minimal but complete implementation:
```python
import openai
from dataclasses import dataclass

@dataclass
class ReflectionResult:
    final_output: str
    iterations: int
    critiques: list[str]
    improvements: list[str]

class ReflectionAgent:
    def __init__(self, max_iterations: int = 3):
        self.client = openai.OpenAI()
        self.max_iterations = max_iterations

    def generate(self, task: str) -> ReflectionResult:
        """Generate with reflection loop"""

        # Initial generation
        current_output = self._initial_generate(task)

        critiques = []
        improvements = []

        for i in range(self.max_iterations):
            # Critique the current output
            critique = self._critique(task, current_output)
            critiques.append(critique)

            # Check if good enough
            if self._is_satisfactory(critique):
                break

            # Improve based on critique
            improved = self._improve(task, current_output, critique)
            improvements.append(improved)
            current_output = improved

        return ReflectionResult(
            final_output=current_output,
            iterations=i + 1,
            critiques=critiques,
            improvements=improvements
        )

    def _initial_generate(self, task: str) -> str:
        """First attempt at the task"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": task
            }]
        )
        return response.choices[0].message.content

    def _critique(self, task: str, output: str) -> str:
        """Critique the current output"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """You are a critical reviewer. Analyze the output for:
1. Correctness - Are there any errors or bugs?
2. Completeness - Does it fully address the task?
3. Quality - Could it be clearer, more efficient, or better structured?
4. Edge cases - Are there scenarios not handled?

Be specific and actionable. If the output is excellent, say "APPROVED" and explain why."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nOutput to review:\n{output}"
            }]
        )
        return response.choices[0].message.content

    def _is_satisfactory(self, critique: str) -> bool:
        """Check if the critique indicates approval"""
        # Note: a plain substring check can false-positive on phrases like
        # "NOT APPROVED"; the structured critique shown later is more robust.
        return "APPROVED" in critique.upper()

    def _improve(self, task: str, current: str, critique: str) -> str:
        """Improve based on critique"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Improve the output based on the critique. Address all issues raised."
            }, {
                "role": "user",
                "content": f"""Original task: {task}

Current output:
{current}

Critique:
{critique}

Please provide an improved version that addresses all the issues."""
            }]
        )
        return response.choices[0].message.content


# Usage
agent = ReflectionAgent(max_iterations=3)
result = agent.generate(
    "Write a Python function to find the longest palindromic substring in a string."
)

print(f"Final output after {result.iterations} iterations:")
print(result.final_output)
```
Reflection Patterns
Pattern 1: Self-Reflection (Single Model)
The same model generates and critiques:
```python
def self_reflect(task: str) -> str:
    # Generate
    output = generate(task)

    # Self-critique
    critique = generate(f"Review this output for issues:\n{output}")

    # Self-improve
    if needs_improvement(critique):
        output = generate(f"Improve this based on feedback:\n{output}\n\nFeedback:\n{critique}")

    return output
```
Pros: Simple, cheap, fast
Cons: Same blind spots in generation and critique
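The sketches in this section lean on helpers like `generate()` and `needs_improvement()` without defining them. One minimal way to fill them in, assuming the OpenAI Python SDK (the yes/no verdict call is just one option):
```python
import openai

client = openai.OpenAI()

def generate(prompt: str, system: str | None = None, model: str = "gpt-4o") -> str:
    """Single chat completion returning plain text."""
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

def needs_improvement(critique: str) -> bool:
    """Ask the model whether the critique found issues worth fixing."""
    verdict = generate(
        f"Does this critique identify issues that should be fixed? Answer YES or NO only.\n\n{critique}"
    )
    return "YES" in verdict.upper()
```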
Pattern 2: Critic Model (Different Persona)
Use different system prompts to create distinct "personas":
```python
def critic_reflect(task: str) -> str:
    # Generator persona
    output = call_llm(
        system="You are an expert programmer. Write clean, efficient code.",
        user=task
    )

    # Critic persona (different mindset)
    critique = call_llm(
        system="""You are a senior code reviewer known for finding subtle bugs.
You never approve code without thorough analysis.
Look for: bugs, edge cases, performance issues, security vulnerabilities.""",
        user=f"Review this code:\n{output}"
    )

    # Improver persona
    if not is_approved(critique):
        output = call_llm(
            system="You are a developer responding to code review feedback.",
            user=f"Address this feedback:\n{critique}\n\nOriginal code:\n{output}"
        )

    return output
```
Pros: Different perspectives, catches more issues
Cons: More prompt engineering required
Pattern 3: Multi-Model Reflection
Use different models for generation and critique:
```python
def multi_model_reflect(task: str) -> str:
    # Fast model for generation
    output = call_llm(model="gpt-4o-mini", prompt=task)

    # Powerful model for critique
    critique = call_llm(
        model="gpt-4o",
        prompt=f"Carefully review this for correctness:\n{output}"
    )

    # Fast model implements fixes
    if needs_improvement(critique):
        output = call_llm(
            model="gpt-4o-mini",
            prompt=f"Fix these issues:\n{critique}\n\nCode:\n{output}"
        )

    return output
```
Pros: Cost-effective, leverages model strengths
Cons: More complex orchestration
Pattern 4: Verified Reflection (with Code Execution)
Don't just critique—actually test:
```python
from hopx import Sandbox

def verified_reflect(task: str) -> str:
    output = generate_code(task)

    for attempt in range(3):
        # Actually run the code
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            sandbox.files.write("/app/solution.py", output)
            result = sandbox.commands.run("python /app/solution.py")

            if result.exit_code == 0:
                # Code runs - but is it correct?
                verification = verify_output(result.stdout, task)
                if verification.passed:
                    return output
                critique = verification.feedback
            else:
                critique = f"Code failed with error:\n{result.stderr}"

            # Improve based on actual execution feedback
            output = improve_code(output, critique)

        finally:
            sandbox.kill()

    return output
```
Pros: Ground truth verification, catches runtime errors
Cons: Requires sandboxed execution, slower
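The sketch above assumes a `verify_output()` helper that judges whether the program's output actually satisfies the task. A minimal LLM-as-judge version (the `Verification` shape and prompt wording are assumptions, not a fixed API):
```python
from dataclasses import dataclass
import json
import openai

client = openai.OpenAI()

@dataclass
class Verification:
    passed: bool
    feedback: str

def verify_output(stdout: str, task: str) -> Verification:
    """Judge whether the program's output satisfies the original task."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nProgram output:\n{stdout}\n\n"
                       'Does the output satisfy the task? Reply as JSON: {"passed": true/false, "feedback": "..."}'
        }],
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    return Verification(passed=bool(data["passed"]), feedback=data.get("feedback", ""))
```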
Advanced: Structured Reflection
For complex tasks, use structured critique formats:
```python
import json
import openai
from pydantic import BaseModel
from typing import Literal

client = openai.OpenAI()

class CritiqueItem(BaseModel):
    category: Literal["correctness", "completeness", "efficiency", "style", "security"]
    severity: Literal["critical", "major", "minor", "suggestion"]
    description: str
    location: str  # Line number or section
    suggested_fix: str

class StructuredCritique(BaseModel):
    approved: bool
    summary: str
    issues: list[CritiqueItem]

def structured_reflect(task: str, output: str) -> StructuredCritique:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Analyze the output and provide structured feedback.
Return JSON matching this schema:
{
  "approved": boolean,
  "summary": "overall assessment",
  "issues": [
    {
      "category": "correctness|completeness|efficiency|style|security",
      "severity": "critical|major|minor|suggestion",
      "description": "what's wrong",
      "location": "where in the code",
      "suggested_fix": "how to fix it"
    }
  ]
}"""
        }, {
            "role": "user",
            "content": f"Task: {task}\n\nOutput:\n{output}"
        }],
        response_format={"type": "json_object"}
    )

    return StructuredCritique(**json.loads(response.choices[0].message.content))


# Usage with prioritized fixes
def reflect_with_priority(task: str) -> str:
    output = generate(task)

    for _ in range(3):
        critique = structured_reflect(task, output)

        if critique.approved:
            break

        # Fix critical issues first
        critical = [i for i in critique.issues if i.severity == "critical"]
        major = [i for i in critique.issues if i.severity == "major"]

        if critical:
            output = fix_issues(output, critical)
        elif major:
            output = fix_issues(output, major)
        else:
            break  # Only minor issues remain

    return output
```
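`fix_issues()` is also left undefined above. One way to write it is to flatten the selected issues back into a repair prompt; a sketch that reuses the `client` and `CritiqueItem` from the block above:
```python
def fix_issues(output: str, issues: list[CritiqueItem]) -> str:
    """Rewrite the output so it addresses the selected issues."""
    issue_list = "\n".join(
        f"- [{i.severity}] {i.category} at {i.location}: {i.description} (fix: {i.suggested_fix})"
        for i in issues
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Revise the output to resolve every listed issue. Return only the revised output."
        }, {
            "role": "user",
            "content": f"Issues:\n{issue_list}\n\nCurrent output:\n{output}"
        }]
    )
    return response.choices[0].message.content
```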
Real-World Example: Code Generation with Testing
Here's a complete example that generates code, writes tests, runs them, and iterates:
````python
from hopx import Sandbox
import openai

class TestDrivenReflection:
    def __init__(self):
        self.client = openai.OpenAI()

    def generate_with_tests(self, task: str) -> dict:
        """Generate code that passes tests"""

        # Step 1: Generate initial code
        code = self._generate_code(task)

        # Step 2: Generate tests
        tests = self._generate_tests(task, code)

        # Step 3: Run and iterate
        for attempt in range(5):
            result = self._run_tests(code, tests)

            if result["passed"]:
                return {
                    "code": code,
                    "tests": tests,
                    "attempts": attempt + 1,
                    "status": "success"
                }

            # Reflect and improve
            code = self._improve_from_failure(task, code, tests, result["error"])

        return {
            "code": code,
            "tests": tests,
            "attempts": 5,
            "status": "max_attempts_reached"
        }

    def _generate_code(self, task: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Write clean, well-documented Python code. Include type hints."
            }, {
                "role": "user",
                "content": task
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _generate_tests(self, task: str, code: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Write pytest tests for this code. Include:
- Happy path tests
- Edge cases (empty input, large input, invalid input)
- Boundary conditions
Make tests thorough but not excessive."""
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nCode:\n```python\n{code}\n```"
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _run_tests(self, code: str, tests: str) -> dict:
        sandbox = Sandbox.create(template="code-interpreter")

        try:
            # Install pytest
            sandbox.commands.run("pip install pytest -q")

            # Write code and tests
            sandbox.files.write("/app/solution.py", code)
            sandbox.files.write("/app/test_solution.py", f"from solution import *\n\n{tests}")

            # Run tests
            result = sandbox.commands.run("cd /app && python -m pytest test_solution.py -v")

            return {
                "passed": result.exit_code == 0,
                "output": result.stdout,
                # pytest reports assertion failures on stdout, so include both streams
                "error": f"{result.stdout}\n{result.stderr}" if result.exit_code != 0 else None
            }

        finally:
            sandbox.kill()

    def _improve_from_failure(self, task: str, code: str, tests: str, error: str) -> str:
        prompt = f"Task: {task}\n\nCurrent code:\n{code}\n\nTests:\n{tests}\n\nTest error:\n{error}\n\nProvide the fixed code only."

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "The code failed tests. Analyze the error and fix the code. Focus on the specific failure."
            }, {
                "role": "user",
                "content": prompt
            }]
        )
        return self._extract_code(response.choices[0].message.content)

    def _extract_code(self, content: str) -> str:
        if "```python" in content:
            return content.split("```python")[1].split("```")[0].strip()
        elif "```" in content:
            return content.split("```")[1].split("```")[0].strip()
        return content.strip()


# Usage
agent = TestDrivenReflection()
result = agent.generate_with_tests(
    "Write a function `merge_sorted_lists(list1, list2)` that merges two sorted lists into one sorted list."
)

print(f"Status: {result['status']}")
print(f"Attempts: {result['attempts']}")
print(f"\nFinal code:\n{result['code']}")
````
Reflection for Different Tasks
Writing Tasks
```python
def reflect_on_writing(draft: str, requirements: str) -> str:
    critique_prompt = f"""Review this writing for:
1. Clarity - Is it easy to understand?
2. Accuracy - Are all facts correct?
3. Completeness - Does it cover all requirements?
4. Tone - Is it appropriate for the audience?
5. Structure - Is it well-organized?
6. Grammar - Any errors?

Requirements: {requirements}

Draft:
{draft}"""

    critique = generate(critique_prompt)

    if needs_revision(critique):
        improved = generate(f"Revise based on this feedback:\n{critique}\n\nDraft:\n{draft}")
        return improved

    return draft
```
Data Analysis
```python
def reflect_on_analysis(analysis: str, data_description: str) -> str:
    critique_prompt = f"""Review this data analysis for:
1. Statistical validity - Are methods appropriate?
2. Interpretation - Are conclusions supported by data?
3. Completeness - Are there unexplored angles?
4. Clarity - Would a non-expert understand?
5. Visualization - Are charts appropriate and clear?

Data: {data_description}

Analysis:
{analysis}"""

    critique = generate(critique_prompt)
    # ... improve based on critique
```
Decision Making
```python
def reflect_on_decision(decision: str, context: str) -> str:
    critique_prompt = f"""Play devil's advocate on this decision:
1. What could go wrong?
2. What alternatives weren't considered?
3. What assumptions might be wrong?
4. Who might be negatively affected?
5. What's the worst-case scenario?

Context: {context}

Proposed decision:
{decision}"""

    critique = generate(critique_prompt)

    # Generate balanced view
    balanced = generate(f"""
Given this decision and critique, provide a balanced recommendation.

Decision: {decision}
Critique: {critique}

Should we proceed, modify, or reconsider?""")

    return balanced
```
When NOT to Use Reflection
Reflection isn't always worth the cost. The table lists the common cases where you should skip it, and a rough routing heuristic follows:
| Skip Reflection When | Why |
|---|---|
| Simple factual queries | "What's the capital of France?" doesn't need review |
| Time-critical responses | Latency matters more than perfection |
| Creative brainstorming | Critique can kill creativity |
| The task is trivial | Overhead exceeds benefit |
| You're already using CoT | Chain-of-thought includes implicit reflection |
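If you want to turn the table into a default policy, a rough routing heuristic might look like this (the keyword checks and thresholds are placeholders, not something to rely on verbatim in production):
```python
def should_reflect(task: str, latency_budget_s: float) -> bool:
    """Rough heuristic for when a reflection loop is worth the extra calls."""
    if latency_budget_s < 5:    # time-critical: answer fast
        return False
    if len(task.split()) < 8:   # likely a simple factual query
        return False
    if any(w in task.lower() for w in ("brainstorm", "ideas", "creative")):
        return False            # critique can kill creativity
    return True                 # code, analysis, long-form writing, etc.
```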
Cost Consideration
Reflection typically multiplies your token usage by 2-3x:
```
Without reflection: 1 LLM call
With 2 iterations:  5 LLM calls (generate + critique + improve + critique + improve)
```
Use reflection when quality matters more than cost.
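The arithmetic is worth sanity-checking before you ship a loop. A quick helper (this assumes the full critique-plus-improve pair runs every iteration; an early APPROVED exit skips the final improvement, so treat it as an upper bound):
```python
def reflection_call_count(iterations: int) -> int:
    """1 initial generation, plus one critique and one improvement per iteration."""
    return 1 + 2 * iterations

print(reflection_call_count(0))  # 1 call  (no reflection)
print(reflection_call_count(2))  # 5 calls (matches the example above)
```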
Optimizing Reflection
1. Early Exit
Stop as soon as output is good enough:
```python
def optimized_reflect(task: str) -> str:
    output = generate(task)

    # Quick check - is it obviously good?
    # (Assumes the prompt constrains the model to reply with a bare number.)
    quick_check = generate(f"Rate this output 1-10. Reply with only the number:\n{output}")
    if int(quick_check.strip()) >= 9:
        return output  # Skip detailed critique

    # Full critique only if needed
    critique = detailed_critique(output)
    # ...
```
2. Targeted Critique
Don't critique everything—focus on what matters:
```python
def targeted_critique(task: str, output: str) -> str:
    # Determine what's important for this task
    if "code" in task.lower():
        focus = "correctness, edge cases, efficiency"
    elif "write" in task.lower():
        focus = "clarity, accuracy, engagement"
    else:
        focus = "relevance, completeness"

    return generate(f"Critique focusing on {focus}:\n{output}")
```
3. Parallel Critique
Run multiple critiques in parallel:
```python
import concurrent.futures

def parallel_critique(output: str) -> list[str]:
    aspects = [
        "correctness and bugs",
        "performance and efficiency",
        "readability and style",
        "security vulnerabilities"
    ]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(critique_aspect, output, aspect)
            for aspect in aspects
        ]
        return [f.result() for f in futures]
```
Measuring Reflection Effectiveness
Track these metrics:
```python
from dataclasses import dataclass

@dataclass
class ReflectionMetrics:
    initial_score: float   # Quality before reflection
    final_score: float     # Quality after reflection
    iterations_used: int   # How many loops
    tokens_used: int       # Cost
    time_taken: float      # Latency

    @property
    def improvement(self) -> float:
        return (self.final_score - self.initial_score) / self.initial_score

    @property
    def efficiency(self) -> float:
        return self.improvement / self.tokens_used
```
If reflection isn't improving outputs by at least 15-20%, reconsider your critique prompts.
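For example, with made-up scores from an internal rubric, the threshold check is one line:
```python
m = ReflectionMetrics(
    initial_score=6.0,   # hypothetical rubric score before reflection
    final_score=7.5,     # hypothetical rubric score after reflection
    iterations_used=2,
    tokens_used=4200,
    time_taken=11.3,
)
print(f"improvement: {m.improvement:.0%}")  # 25%, comfortably above the bar
print(f"efficiency:  {m.efficiency:.2e}")   # improvement per token spent
```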
Conclusion
Reflection is one of the highest-impact patterns you can add to AI systems:
- Simple to implement — Just add a critique step
- Significant quality gains — 20-50% improvement is common
- Works everywhere — Code, writing, analysis, decisions
- Compounds with other patterns — Combine with prompt chaining for even better results
Start with basic self-reflection. Add verified reflection (with code execution) for code tasks. Measure the improvement, and tune your critique prompts.
The AI that reviews its work beats the AI that doesn't. Every time.
Ready to add verified reflection with code execution? Get started with HopX — sandboxes that let you test AI-generated code safely.
Further Reading
- What Is an AI Agent? — The fundamentals of agentic systems
- Prompt Chaining — Combine with reflection for powerful pipelines
- Multi-Agent Architectures — Use separate agents for generation and critique
- Reflexion Paper — Academic foundation for reflection in LLMs