The Orchestrator Pattern: Coordinating Complex AI Agent Workflows

Single agents hit walls. They run out of context, lack specialized skills, and struggle with complex multi-step tasks. The solution? Don't build one super-agent—build an orchestrator that coordinates many specialized agents.

The Orchestrator Pattern is how you build AI systems that tackle enterprise-grade complexity: routing tasks to the right specialists, managing dependencies, handling failures, and synthesizing results.

This guide shows you how to build orchestrators that turn chaos into coordination.

What Is the Orchestrator Pattern?

An orchestrator is a meta-agent that doesn't do the work itself—it decides who should do the work and when:

text

┌─────────────────────────────────────────────────────────────┐
│                        ORCHESTRATOR                         │
│                                                             │
│         "Analyze sales data, create visualizations,         │
│               and write an executive summary"               │
│                                                             │
│                              │                              │
│                              ▼                              │
│                   ┌─────────────────────┐                   │
│                   │   Task Decomposer   │                   │
│                   └──────────┬──────────┘                   │
│                              │                              │
│              ┌───────────────┼───────────────┐              │
│              ▼               ▼               ▼              │
│        ┌───────────┐   ┌───────────┐   ┌───────────┐        │
│        │   Data    │   │    Viz    │   │  Writer   │        │
│        │  Analyst  │   │   Agent   │   │   Agent   │        │
│        └─────┬─────┘   └─────┬─────┘   └─────┬─────┘        │
│              │               │               │              │
│              └───────────────┼───────────────┘              │
│                              ▼                              │
│                   ┌─────────────────────┐                   │
│                   │ Result Synthesizer  │                   │
│                   └─────────────────────┘                   │
│                              │                              │
│                              ▼                              │
│                        Final Output                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The orchestrator handles:

Task decomposition: Breaking complex tasks into subtasks
Agent selection: Routing each subtask to the right specialist
Dependency management: Ensuring correct execution order
Result synthesis: Combining outputs into a coherent whole
Error handling: Retrying, rerouting, or escalating failures
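
To make the dependency-management step concrete, here is a minimal sketch that groups tasks into batches that could run in parallel. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The task names and the `plan_batches` helper are hypothetical and are not part of the implementation later in this guide.

```python
from graphlib import TopologicalSorter


def plan_batches(dependencies: dict[str, list[str]]) -> list[list[str]]:
    """Group task ids into batches; every task in a batch has all of its
    dependencies satisfied by earlier batches."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical plan: task id -> ids of the tasks it depends on
plan = {
    "analyze": [],
    "fetch_benchmarks": [],
    "visualize": ["analyze"],
    "summarize": ["analyze", "visualize"],
}
batches = plan_batches(plan)
print(batches)
# [['analyze', 'fetch_benchmarks'], ['visualize'], ['summarize']]
```

Each inner list is a set of tasks with no unmet dependencies, so an orchestrator could hand a whole batch to a thread pool at once.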

Why Orchestration Matters

1. Specialization Beats Generalization

One agent trying to do everything:

text

❌ Jack of all trades, master of none
❌ Context window filled with irrelevant instructions
❌ Conflicting objectives in one prompt

Specialized agents with orchestration:

text

✅ Each agent masters its domain
✅ Focused context for each task
✅ Clear, single-purpose prompts

2. Scalability

text

Single Agent            Orchestrated System
     │                           │
     ▼                           ▼
┌─────────┐               ┌─────────────┐
│ One LLM │               │Orchestrator │
│  Call   │               └──────┬──────┘
└─────────┘                      │
                       ┌─────────┼─────────┐
                       ▼         ▼         ▼
                   ┌───────┐ ┌───────┐ ┌───────┐
                   │Agent 1│ │Agent 2│ │Agent 3│
                   └───────┘ └───────┘ └───────┘
                       │         │         │
                       └─────────┼─────────┘
                                 ▼
      Independent tasks run in parallel = up to 3x faster
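
The speedup in the diagram only holds when the three tasks are independent and mostly waiting on I/O, such as LLM API calls. The sketch below illustrates the effect with a hypothetical `fake_agent` that sleeps instead of calling a model; the timings it prints will vary from machine to machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_agent(name: str, seconds: float = 0.2) -> str:
    """Stand-in for an LLM call: sleeps to simulate network latency."""
    time.sleep(seconds)
    return f"{name} done"


names = ["agent_1", "agent_2", "agent_3"]

# One after another: total time is roughly the sum of the three calls
start = time.perf_counter()
sequential = [fake_agent(n) for n in names]
sequential_time = time.perf_counter() - start

# All at once: total time is roughly the slowest single call
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    parallel = list(pool.map(fake_agent, names))  # map preserves input order
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, parallel: {parallel_time:.2f}s")
```

Tasks that depend on each other's output still have to run in sequence, so real workflows usually land somewhere between the two numbers.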

3. Fault Isolation

When one agent fails:

Without orchestration: Entire task fails
With orchestration: Retry, use backup agent, or gracefully degrade

Basic Orchestrator Implementation

Here's a complete, minimal orchestrator:

python

import openai
import json
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional
import concurrent.futures

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class Task:
    id: str
    description: str
    agent_type: str               # key into the orchestrator's agent registry
    dependencies: list[str]       # ids of tasks that must complete first
    status: TaskStatus = TaskStatus.PENDING
    result: Optional[str] = None  # set when the task completes
    error: Optional[str] = None   # set when the task fails
 
@dataclass 
class Agent:
    name: str
    description: str
    execute: Callable[[str, dict], str]
 
class Orchestrator:
    def __init__(self, agents: dict[str, Agent]):
        self.client = openai.OpenAI()
        self.agents = agents
        self.tasks: dict[str, Task] = {}
        self.results: dict[str, str] = {}
    
    def run(self, goal: str) -> dict:
        """Orchestrate agents to achieve the goal"""
        
        # Phase 1: Decompose into tasks
        tasks = self._decompose(goal)
        self.tasks = {t.id: t for t in tasks}
        
        print(f"Created {len(tasks)} tasks")
        
        # Phase 2: Execute tasks respecting dependencies
        while not self._all_complete():
            # Find tasks ready to run
            ready = self._get_ready_tasks()
            
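            if not ready:
                # Nothing can run: either a dependency failed or the plan
                # contains a circular dependency. Stop rather than spin forever.
                if not self._has_failures():
                    for t in self.tasks.values():
                        if t.status == TaskStatus.PENDING:
                            t.status = TaskStatus.FAILED
                            t.error = "Dependencies can never be satisfied"
                break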
            
            # Execute ready tasks in parallel
            with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
                futures = {
                    executor.submit(self._execute_task, task): task
                    for task in ready
                }
                
                for future in concurrent.futures.as_completed(futures):
                    task = futures[future]
                    try:
                        result = future.result()
                        task.status = TaskStatus.COMPLETED
                        task.result = result
                        self.results[task.id] = result
                    except Exception as e:
                        task.status = TaskStatus.FAILED
                        task.error = str(e)
        
        # Phase 3: Synthesize results
        if self._has_failures():
            return {
                "success": False,
                "completed": [t.id for t in self.tasks.values() if t.status == TaskStatus.COMPLETED],
                "failed": [t.id for t in self.tasks.values() if t.status == TaskStatus.FAILED],
                "partial_results": self.results
            }
        
        final_result = self._synthesize(goal, self.results)
        
        return {
            "success": True,
            "result": final_result,
            "tasks_completed": len(self.tasks)
        }
    
    def _decompose(self, goal: str) -> list[Task]:
        """Break goal into tasks with dependencies"""
        
        agent_descriptions = "\n".join([
            f"- {name}: {agent.description}"
            for name, agent in self.agents.items()
        ])
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": f"""Decompose this goal into tasks for available agents.
 
Available agents:
{agent_descriptions}
 
Return JSON:
{{
    "tasks": [
        {{
            "id": "task_1",
            "description": "What to do",
            "agent_type": "agent_name",
            "dependencies": []
        }},
        {{
            "id": "task_2",
            "description": "Next task",
            "agent_type": "agent_name",
            "dependencies": ["task_1"]
        }}
    ]
}}
 
Rules:
- Break into 2-8 tasks
- Each task should be focused and achievable
- List dependencies (tasks that must complete first)
- Assign to the most appropriate agent"""
            }, {
                "role": "user",
                "content": goal
            }],
            response_format={"type": "json_object"}
        )
        
        data = json.loads(response.choices[0].message.content)
        
        return [
            Task(
                id=t["id"],
                description=t["description"],
                agent_type=t["agent_type"],
                dependencies=t.get("dependencies", [])
            )
            for t in data["tasks"]
        ]
    
    def _get_ready_tasks(self) -> list[Task]:
        """Get tasks whose dependencies are all complete"""
        ready = []
        for task in self.tasks.values():
            if task.status != TaskStatus.PENDING:
                continue
            
            deps_complete = all(
                self.tasks[dep].status == TaskStatus.COMPLETED
                for dep in task.dependencies
            )
            
            if deps_complete:
                ready.append(task)
        
        return ready
    
    def _execute_task(self, task: Task) -> str:
        """Execute a single task using the appropriate agent"""
        task.status = TaskStatus.RUNNING
        
        agent = self.agents.get(task.agent_type)
        if not agent:
            raise ValueError(f"Unknown agent type: {task.agent_type}")
        
        # Gather context from dependencies
        context = {
            dep: self.results[dep]
            for dep in task.dependencies
        }
        
        return agent.execute(task.description, context)
    
    def _all_complete(self) -> bool:
        return all(
            t.status in [TaskStatus.COMPLETED, TaskStatus.FAILED]
            for t in self.tasks.values()
        )
    
    def _has_failures(self) -> bool:
        return any(t.status == TaskStatus.FAILED for t in self.tasks.values())
    
    def _synthesize(self, goal: str, results: dict) -> str:
        """Combine task results into final output"""
        
        results_text = "\n\n".join([
            f"=== {task_id} ===\n{result}"
            for task_id, result in results.items()
        ])
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Synthesize these task results into a coherent final response."
            }, {
                "role": "user",
                "content": f"Goal: {goal}\n\nTask Results:\n{results_text}"
            }]
        )
        
        return response.choices[0].message.content
 
 
# Define specialized agents
def create_data_analyst():
    client = openai.OpenAI()
    
    def execute(task: str, context: dict) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "You are a data analyst. Analyze data and provide insights."
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nContext: {json.dumps(context)}"
            }]
        )
        return response.choices[0].message.content
    
    return Agent(
        name="data_analyst",
        description="Analyzes data, finds patterns, calculates statistics",
        execute=execute
    )
 
def create_writer():
    client = openai.OpenAI()
    
    def execute(task: str, context: dict) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "You are a professional writer. Create clear, engaging content."
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nContext: {json.dumps(context)}"
            }]
        )
        return response.choices[0].message.content
    
    return Agent(
        name="writer",
        description="Writes reports, summaries, and documentation",
        execute=execute
    )
 
def create_coder():
    from hopx import Sandbox
    client = openai.OpenAI()
    
    def execute(task: str, context: dict) -> str:
        # Generate code
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Write Python code to accomplish the task. Output only code."
            }, {
                "role": "user",
                "content": f"Task: {task}\n\nContext: {json.dumps(context)}"
            }]
        )
        
        code = response.choices[0].message.content
        
        # Execute in sandbox
        sandbox = Sandbox.create(template="code-interpreter")
        try:
            sandbox.files.write("/app/task.py", code)
            result = sandbox.commands.run("python /app/task.py")
            return result.stdout if result.exit_code == 0 else f"Error: {result.stderr}"
        finally:
            sandbox.kill()
    
    return Agent(
        name="coder",
        description="Writes and executes Python code for data processing and analysis",
        execute=execute
    )
 
 
# Usage
orchestrator = Orchestrator({
    "data_analyst": create_data_analyst(),
    "writer": create_writer(),
    "coder": create_coder()
})
 
result = orchestrator.run(
    "Analyze our Q4 sales data, identify the top 3 trends, "
    "create visualizations, and write an executive summary."
)
 
print(result)
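
The orchestrator above trusts whatever plan the model returns. One possible safeguard is to check the decomposed plan before executing anything. The sketch below is an assumption layered on top of the listing, not part of it: `validate_plan` is a hypothetical helper that works on the raw task dicts in the JSON shape requested by `_decompose`.

```python
def validate_plan(tasks: list[dict], agent_names: set[str]) -> list[str]:
    """Return a list of problems found in a decomposed plan (empty if valid)."""
    problems = []
    ids = [t["id"] for t in tasks]
    known = set(ids)

    if len(ids) != len(known):
        problems.append("duplicate task ids")

    for t in tasks:
        if t["agent_type"] not in agent_names:
            problems.append(f"{t['id']}: unknown agent '{t['agent_type']}'")
        for dep in t.get("dependencies", []):
            if dep not in known:
                problems.append(f"{t['id']}: unknown dependency '{dep}'")

    # Detect cycles by repeatedly removing tasks whose dependencies are resolved.
    remaining = {t["id"]: set(t.get("dependencies", [])) & known for t in tasks}
    while remaining:
        resolved = [tid for tid, deps in remaining.items() if not deps]
        if not resolved:
            problems.append(f"circular dependency among: {sorted(remaining)}")
            break
        for tid in resolved:
            del remaining[tid]
        for deps in remaining.values():
            deps.difference_update(resolved)

    return problems


# Hypothetical plan with two mistakes a model might make
plan = [
    {"id": "task_1", "agent_type": "data_analyst", "dependencies": []},
    {"id": "task_2", "agent_type": "designer", "dependencies": ["task_1", "task_9"]},
]
issues = validate_plan(plan, {"data_analyst", "writer", "coder"})
print(issues)
```

If the list is non-empty, the orchestrator could ask the model to re-plan or fail fast with a readable error, rather than hitting a `KeyError` midway through execution.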
 

Orchestration Patterns

Pattern 1: Sequential Pipeline

Tasks flow in a fixed order:

text

Input → Agent A → Agent B → Agent C → Output

python

class PipelineOrchestrator:
    def __init__(self, stages: list[Agent]):
        self.stages = stages
    
    def run(self, input_data: str) -> str:
        current = input_data
        
        for stage in self.stages:
            print(f"Running stage: {stage.name}")
            current = stage.execute(current, {})
        
        return current
 
 
# Usage
pipeline = PipelineOrchestrator([
    extract_agent,    # Extract key information
    transform_agent,  # Transform data
    analyze_agent,    # Analyze patterns
    report_agent      # Generate report
])
 
result = pipeline.run(raw_document)
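
The usage above assumes `extract_agent` and the other stages already exist. To see the data flow without any API calls, here is a small sketch with hypothetical stand-in agents written as plain functions. The `Agent` dataclass repeats the shape defined earlier so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:  # same shape as the Agent dataclass defined earlier
    name: str
    description: str
    execute: Callable[[str, dict], str]


def run_pipeline(stages: list[Agent], input_data: str) -> str:
    """Feed each stage's output into the next, as PipelineOrchestrator.run does."""
    current = input_data
    for stage in stages:
        current = stage.execute(current, {})
    return current


# Hypothetical stand-ins for LLM-backed agents
def keep_non_empty(text: str, context: dict) -> str:
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def count_lines(text: str, context: dict) -> str:
    return str(len(text.splitlines()))


def format_report(count: str, context: dict) -> str:
    return f"Found {count} entries"


stages = [
    Agent("extract", "Keeps non-empty lines", keep_non_empty),
    Agent("count", "Counts lines", count_lines),
    Agent("report", "Formats a report", format_report),
]
report = run_pipeline(stages, "a\n\n b \nc\n")
print(report)  # Found 3 entries
```

Because every stage takes a string and returns a string, stages can be reordered or swapped without touching the pipeline itself.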
 

Pattern 2: Router/Dispatcher

Route tasks to specialized agents based on content:

python

class RouterOrchestrator:
    def __init__(self, agents: dict[str, Agent]):
        self.client = openai.OpenAI()
        self.agents = agents
    
    def run(self, task: str) -> str:
        # Classify the task
        agent_name = self._route(task)
        
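        # Execute with selected agent; the model may reply with a name
        # that is not registered, so check before indexing
        agent = self.agents.get(agent_name)
        if agent is None:
            raise ValueError(f"Router returned unknown agent: {agent_name!r}")
        return agent.execute(task, {})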
    
    def _route(self, task: str) -> str:
        agent_options = "\n".join([
            f"- {name}: {agent.description}"
            for name, agent in self.agents.items()
        ])
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Which agent should handle this task?
 
Task: {task}
 
Agents:
{agent_options}
 
Reply with just the agent name."""
            }]
        )
        
        return response.choices[0].message.content.strip()
 
 
# Usage
router = RouterOrchestrator({
    "code": code_agent,
    "writing": writing_agent,
    "research": research_agent,
    "math": math_agent
})
 
# Automatically routes to appropriate agent
result = router.run("Write a function to calculate compound interest")
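
Models do not always reply with just the agent name, even when asked to. A router can be made more forgiving by normalizing the reply before looking it up. The sketch below shows one way to do that; `resolve_agent_name` and its `default` parameter are hypothetical and not part of `RouterOrchestrator`.

```python
def resolve_agent_name(reply: str, valid_names: list[str], default: str) -> str:
    """Map a free-form model reply onto one of the registered agent names."""
    cleaned = reply.strip().strip("\"'`.").lower()
    lookup = {name.lower(): name for name in valid_names}
    if cleaned in lookup:
        return lookup[cleaned]
    # Tolerate replies such as "The code agent should handle this."
    mentioned = [name for key, name in lookup.items() if key in cleaned]
    if len(mentioned) == 1:
        return mentioned[0]
    return default  # ambiguous or unrecognized reply


names = ["code", "writing", "research", "math"]
print(resolve_agent_name(" Code.\n", names, default="research"))
print(resolve_agent_name("The code agent should handle this.", names, default="research"))
print(resolve_agent_name("writing or research", names, default="research"))
```

The first two calls resolve to `code`; the third is ambiguous, so it falls back to the default. Whether to fall back or raise an error is a design choice that depends on how costly a misrouted task is.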
 

Pattern 3: Hierarchical Orchestration

Orchestrators managing other orchestrators:

text

                    ┌─────────────────┐
                    │     Master      │
                    │  Orchestrator   │
                    └────────┬────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
    │  Research   │   │ Development │   │     QA      │
    │ Orchestrator│   │ Orchestrator│   │ Orchestrator│
    └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
           │                 │                 │
       ┌───┼───┐         ┌───┼───┐         ┌───┼───┐
       ▼   ▼   ▼         ▼   ▼   ▼         ▼   ▼   ▼
       A1  A2  A3        A4  A5  A6        A7  A8  A9

python

class HierarchicalOrchestrator:
    def __init__(self, sub_orchestrators: dict[str, Orchestrator]):
        self.client = openai.OpenAI()
        self.sub_orchestrators = sub_orchestrators
    
    def run(self, goal: str) -> dict:
        # Decompose into high-level phases
        phases = self._plan_phases(goal)
        
        results = {}
        for phase in phases:
            sub_orch = self.sub_orchestrators[phase["orchestrator"]]
            result = sub_orch.run(phase["goal"])
            results[phase["name"]] = result
        
        return self._synthesize(goal, results)
    
    def _plan_phases(self, goal: str) -> list[dict]:
        orchestrator_list = ", ".join(self.sub_orchestrators.keys())
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Break this goal into phases.
 
Goal: {goal}
 
Available orchestrators: {orchestrator_list}
 
Return JSON:
{{"phases": [{{"name": "phase_1", "orchestrator": "name", "goal": "sub-goal"}}]}}"""
            }],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)["phases"]
    
    def _synthesize(self, goal: str, results: dict) -> dict:
        """Combine phase results (minimal version, mirrors Orchestrator._synthesize)"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Synthesize these phase results into a coherent final response."
            }, {
                "role": "user",
                "content": f"Goal: {goal}\n\nPhase Results:\n{json.dumps(results, indent=2)}"
            }]
        )
        
        return {"result": response.choices[0].message.content, "phases": results}
 

Pattern 4: Dynamic Agent Creation

Create agents on-the-fly based on task requirements:

python

class DynamicOrchestrator:
    def __init__(self):
        self.client = openai.OpenAI()
        self.agent_cache = {}
    
    def run(self, goal: str) -> dict:
        # Determine what agents we need
        agent_specs = self._design_agents(goal)
        
        # Create or retrieve agents
        agents = {}
        for spec in agent_specs:
            agent = self._get_or_create_agent(spec)
            agents[spec["name"]] = agent
        
        # Create orchestrator with these agents
        orchestrator = Orchestrator(agents)
        return orchestrator.run(goal)
    
    def _design_agents(self, goal: str) -> list[dict]:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Design specialized agents for this goal.
 
Goal: {goal}
 
Return JSON:
{{
    "agents": [
        {{
            "name": "agent_name",
            "role": "expert role description",
            "capabilities": ["capability1", "capability2"],
            "system_prompt": "You are..."
        }}
    ]
}}"""
            }],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)["agents"]
    
    def _get_or_create_agent(self, spec: dict) -> Agent:
        cache_key = spec["name"]
        
        if cache_key in self.agent_cache:
            return self.agent_cache[cache_key]
        
        def create_execute(system_prompt):
            def execute(task: str, context: dict) -> str:
                response = self.client.chat.completions.create(
                    model="gpt-4o",
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": f"Task: {task}\nContext: {context}"}
                    ]
                )
                return response.choices[0].message.content
            return execute
        
        agent = Agent(
            name=spec["name"],
            description=spec["role"],
            execute=create_execute(spec["system_prompt"])
        )
        
        self.agent_cache[cache_key] = agent
        return agent
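
One thing to watch in `_get_or_create_agent`: the cache is keyed by name alone, so two specs that share a name but carry different system prompts would both get the first agent. A possible alternative is to derive the key from the fields that determine behavior. The `agent_cache_key` helper below is a hypothetical sketch of that idea.

```python
import hashlib
import json


def agent_cache_key(spec: dict) -> str:
    """Build a cache key from the fields that determine an agent's behavior."""
    relevant = {"name": spec["name"], "system_prompt": spec["system_prompt"]}
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


spec_a = {"name": "analyst", "role": "Data expert", "system_prompt": "You are a data analyst."}
spec_b = {"name": "analyst", "role": "Data expert", "system_prompt": "You are a financial analyst."}

print(agent_cache_key(spec_a) == agent_cache_key(spec_a))  # True
print(agent_cache_key(spec_a) == agent_cache_key(spec_b))  # False
```

Fields such as `role` are deliberately left out of the key here because they only feed the agent's description, not its prompt.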
 

Error Handling and Recovery

Retry with Backoff

python

import time

class ResilientOrchestrator(Orchestrator):
    def __init__(self, agents, max_retries=3):
        super().__init__(agents)
        self.max_retries = max_retries
    
    def _execute_task(self, task: Task) -> str:
        last_error = None
        
        for attempt in range(self.max_retries):
            try:
                return super()._execute_task(task)
            except Exception as e:
                last_error = e
                if attempt == self.max_retries - 1:
                    break  # Out of attempts: don't sleep before giving up
                wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
                print(f"Task {task.id} failed, retrying in {wait_time}s...")
                time.sleep(wait_time)
        
        raise last_error
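
Two common refinements to plain exponential backoff are a cap on the delay and random jitter, so that many failing tasks do not all retry at the same moment. It also usually makes sense to retry only errors that are likely to be transient. The sketch below is a standalone illustration under those assumptions; `retry_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the function can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient errors with capped, jittered backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # "full jitter": anywhere up to the cap
    raise ValueError("max_retries must be at least 1")


# Hypothetical flaky call: fails twice, then succeeds
attempts = {"count": 0}

def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

delays = []  # record the delays instead of actually sleeping
outcome = retry_with_backoff(flaky_call, sleep=delays.append)
print(outcome, attempts["count"], len(delays))
```

Errors outside `retry_on`, such as a `ValueError` from a malformed plan, propagate immediately because retrying them would not help.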
 

Fallback Agents

python

class FallbackOrchestrator(Orchestrator):
    def __init__(self, agents, fallback_agents):
        super().__init__(agents)
        self.fallback_agents = fallback_agents
    
    def _execute_task(self, task: Task) -> str:
        try:
            return super()._execute_task(task)
        except Exception as primary_error:
            # Try fallback agent, passing the same dependency results
            fallback = self.fallback_agents.get(task.agent_type)
            if fallback:
                print(f"Primary agent failed, using fallback for {task.id}")
                context = {dep: self.results[dep] for dep in task.dependencies}
                return fallback.execute(task.description, context)
            raise primary_error
 

Partial Results

python

class GracefulOrchestrator(Orchestrator):
    def run(self, goal: str) -> dict:
        result = super().run(goal)
        
        if not result["success"]:
            # Return what we could complete
            completed_results = {
                t.id: t.result 
                for t in self.tasks.values() 
                if t.status == TaskStatus.COMPLETED
            }
            
            return {
                "success": False,
                "partial_result": self._synthesize(goal, completed_results),
                "completed_tasks": list(completed_results.keys()),
                "failed_tasks": [t.id for t in self.tasks.values() if t.status == TaskStatus.FAILED],
                "note": "Some tasks failed. Partial results provided."
            }
        
        return result
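
A partial-result report is more useful when it also names the tasks that never ran. In the basic orchestrator, tasks downstream of a failure stay pending and appear in neither the completed nor the failed list. The sketch below shows one way to compute them; `find_blocked_tasks` is a hypothetical helper and the task names are made up.

```python
def find_blocked_tasks(dependencies: dict[str, list[str]], failed: set[str]) -> set[str]:
    """Return ids of tasks that can never run because they depend,
    directly or transitively, on a failed task."""
    blocked: set[str] = set()
    changed = True
    while changed:  # repeat until no new blocked task is discovered
        changed = False
        for task_id, deps in dependencies.items():
            if task_id in failed or task_id in blocked:
                continue
            if any(d in failed or d in blocked for d in deps):
                blocked.add(task_id)
                changed = True
    return blocked


# Hypothetical plan: task id -> ids of the tasks it depends on
deps = {
    "load": [],
    "analyze": ["load"],
    "chart": ["analyze"],
    "summary": ["analyze", "chart"],
    "appendix": ["load"],
}
blocked = find_blocked_tasks(deps, failed={"analyze"})
print(sorted(blocked))  # ['chart', 'summary']
```

Here `appendix` is unaffected because it depends only on `load`, so its result can still be included in the partial output.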
 

Production Orchestrator

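A more production-oriented orchestrator adds async execution, per-task timeouts, logging, and metrics. The listing does not show `_decompose` and `_synthesize`; they are assumed to be async counterparts of the methods in the basic orchestrator, with tasks represented as dicts keyed by `id`, `agent`, `description`, and `dependencies`: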

python

from hopx import Sandbox
import openai
import json
from datetime import datetime
from dataclasses import dataclass, field
import asyncio
from typing import Optional
import logging
 
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orchestrator")
 
@dataclass
class ExecutionMetrics:
    start_time: datetime
    end_time: Optional[datetime] = None
    tasks_total: int = 0
    tasks_completed: int = 0
    tasks_failed: int = 0
    total_tokens: int = 0
    
    @property
    def duration_seconds(self) -> float:
        if self.end_time:
            return (self.end_time - self.start_time).total_seconds()
        return 0
 
class ProductionOrchestrator:
    def __init__(
        self,
        agents: dict,
        max_parallel: int = 5,
        task_timeout: int = 300,
        enable_monitoring: bool = True
    ):
        self.client = openai.OpenAI()
        self.agents = agents
        self.max_parallel = max_parallel
        self.task_timeout = task_timeout
        self.enable_monitoring = enable_monitoring
        self.metrics = None
    
    async def run(self, goal: str, metadata: dict = None) -> dict:
        """Execute orchestrated workflow"""
        
        self.metrics = ExecutionMetrics(start_time=datetime.now())
        
        logger.info(f"Starting orchestration: {goal[:100]}...")
        
        try:
            # Decompose
            tasks = await self._decompose(goal)
            self.metrics.tasks_total = len(tasks)
            logger.info(f"Decomposed into {len(tasks)} tasks")
            
            # Execute
            results = await self._execute_all(tasks)
            
            # Synthesize
            final = await self._synthesize(goal, results)
            
            self.metrics.end_time = datetime.now()
            
            return {
                "success": True,
                "result": final,
                "metrics": self._get_metrics_dict(),
                "trace": self._get_execution_trace(tasks)
            }
            
        except Exception as e:
            logger.error(f"Orchestration failed: {e}")
            self.metrics.end_time = datetime.now()
            
            return {
                "success": False,
                "error": str(e),
                "metrics": self._get_metrics_dict()
            }
    
    async def _execute_all(self, tasks: list) -> dict:
        """Execute all tasks respecting dependencies"""
        
        task_map = {t["id"]: t for t in tasks}
        results = {}
        completed = set()
        
        while len(completed) < len(tasks):
            # Find ready tasks
            ready = [
                t for t in tasks
                if t["id"] not in completed
                and all(dep in completed for dep in t.get("dependencies", []))
            ]
            
            if not ready:
                pending = [t["id"] for t in tasks if t["id"] not in completed]
                raise RuntimeError(f"Deadlock detected. Pending: {pending}")
            
            # Execute batch in parallel
            batch_results = await asyncio.gather(*[
                self._execute_single(t, results)
                for t in ready[:self.max_parallel]
            ], return_exceptions=True)
            
            # Process results
            for task, result in zip(ready[:self.max_parallel], batch_results):
                if isinstance(result, Exception):
                    self.metrics.tasks_failed += 1
                    logger.error(f"Task {task['id']} failed: {result}")
                    raise result
                
                results[task["id"]] = result
                completed.add(task["id"])
                self.metrics.tasks_completed += 1
                logger.info(f"Completed: {task['id']}")
        
        return results
    
    async def _execute_single(self, task: dict, context: dict) -> str:
        """Execute single task with timeout"""
        
        agent = self.agents.get(task["agent"])
        if not agent:
            raise ValueError(f"Unknown agent: {task['agent']}")
        
        # Build context from dependencies
        dep_context = {
            dep: context[dep]
            for dep in task.get("dependencies", [])
            if dep in context
        }
        
        try:
            result = await asyncio.wait_for(
                asyncio.to_thread(agent.execute, task["description"], dep_context),
                timeout=self.task_timeout
            )
            return result
        except asyncio.TimeoutError:
            raise TimeoutError(f"Task {task['id']} timed out after {self.task_timeout}s")
    
    def _get_metrics_dict(self) -> dict:
        return {
            "duration_seconds": self.metrics.duration_seconds,
            "tasks_total": self.metrics.tasks_total,
            "tasks_completed": self.metrics.tasks_completed,
            "tasks_failed": self.metrics.tasks_failed,
            "success_rate": self.metrics.tasks_completed / max(self.metrics.tasks_total, 1)
        }
    
    def _get_execution_trace(self, tasks: list) -> list:
        return [
            {
                "id": t["id"],
                "agent": t["agent"],
                "description": t["description"][:100],
                "dependencies": t.get("dependencies", [])
            }
            for t in tasks
        ]
 
 
# Usage
async def main():
    orchestrator = ProductionOrchestrator(
        agents={
            "researcher": research_agent,
            "analyst": analyst_agent,
            "writer": writer_agent,
            "coder": coder_agent
        },
        max_parallel=3,
        task_timeout=120
    )
    
    result = await orchestrator.run(
        "Research the latest AI agent frameworks, analyze their features, "
        "create a comparison table, and write a recommendation report."
    )
    
    print(f"Success: {result['success']}")
    print(f"Duration: {result['metrics']['duration_seconds']:.1f}s")
    print(f"Tasks: {result['metrics']['tasks_completed']}/{result['metrics']['tasks_total']}")
    
    if result['success']:
        print(f"\nResult:\n{result['result']}")
 
# asyncio.run(main())
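
`_execute_all` limits concurrency by slicing the ready list into batches of `max_parallel`, which means one slow task holds back the start of the next batch. An alternative worth considering is an `asyncio.Semaphore`, which lets a new task start as soon as any slot frees up. The sketch below illustrates the idea in isolation; `gather_bounded` and `fake_task` are hypothetical names, not part of `ProductionOrchestrator`.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    jobs: list[Callable[[], Awaitable[str]]], max_parallel: int
) -> list:
    """Run all jobs concurrently, but never more than max_parallel at once.
    Exceptions are returned in place of results (return_exceptions=True)."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def run_one(job):
        async with semaphore:  # wait here until a slot is free
            return await job()

    return await asyncio.gather(*(run_one(j) for j in jobs), return_exceptions=True)


async def demo() -> tuple[list, int]:
    running = 0
    peak = 0

    async def fake_task(name: str) -> str:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)  # track the highest concurrency observed
        await asyncio.sleep(0.05)  # stands in for an agent call
        running -= 1
        return f"{name} done"

    jobs = [lambda n=n: fake_task(f"task_{n}") for n in range(6)]
    results = await gather_bounded(jobs, max_parallel=2)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

`asyncio.gather` returns results in the order the jobs were passed in, so they can still be zipped back to their task ids as the original loop does.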
 

Best Practices

1. Keep Orchestrator Logic Simple

python

# ❌ Orchestrator doing too much
class BadOrchestrator:
    def run(self, goal):
        # Orchestrator shouldn't contain domain logic
        if "sales" in goal:
            return self._analyze_sales()
        elif "marketing" in goal:
            return self._analyze_marketing()
 
# ✅ Orchestrator focuses on coordination
class GoodOrchestrator:
    def run(self, goal):
        tasks = self._decompose(goal)  # What to do
        agents = self._select_agents(tasks)  # Who does it
        results = self._execute(tasks, agents)  # Coordination
        return self._synthesize(results)  # Combine results
 

2. Design Clear Agent Interfaces

python

# All agents should follow the same interface
class AgentInterface:
    def execute(self, task: str, context: dict) -> str:
        """
        Args:
            task: What to do
            context: Results from dependency tasks
        
        Returns:
            Result as string (or structured data as JSON string)
        """
        raise NotImplementedError
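
If you would rather not force every agent to inherit from a base class, `typing.Protocol` offers a structural alternative: any object with a matching `execute` method qualifies. The sketch below is one possible take; `SupportsExecute`, `EchoAgent`, `BrokenAgent`, and `register` are hypothetical names. Note that the runtime `isinstance` check only confirms the method exists, not that its signature matches.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsExecute(Protocol):
    def execute(self, task: str, context: dict) -> str: ...


class EchoAgent:
    """Hypothetical agent; note that it does not inherit from anything."""

    def execute(self, task: str, context: dict) -> str:
        return f"echo: {task} (context keys: {sorted(context)})"


class BrokenAgent:
    def run(self, task: str) -> str:  # wrong method name on purpose
        return task


def register(agents: dict, name: str, agent: object) -> None:
    """Reject objects without an execute method before they reach the orchestrator."""
    if not isinstance(agent, SupportsExecute):
        raise TypeError(f"{name} does not implement execute(task, context)")
    agents[name] = agent


registry: dict = {}
register(registry, "echo", EchoAgent())
print(registry["echo"].execute("hi", {"b": 1, "a": 2}))
```

Registering `BrokenAgent()` raises `TypeError` at setup time, which is easier to debug than an `AttributeError` in the middle of a workflow.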
 

3. Monitor Everything

python

import time

def _execute_task(self, task):
    start = time.time()
    # Results of the tasks this one depends on
    context = {dep: self.results[dep] for dep in task.dependencies}
    
    try:
        result = self.agents[task.agent_type].execute(task.description, context)
        
        # self.metrics is assumed to be any recorder exposing record(dict)
        self.metrics.record({
            "task_id": task.id,
            "agent": task.agent_type,
            "duration": time.time() - start,
            "success": True,
            "result_size": len(result)
        })
        
        return result
    except Exception as e:
        self.metrics.record({
            "task_id": task.id,
            "agent": task.agent_type,
            "duration": time.time() - start,
            "success": False,
            "error": str(e)
        })
        raise
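
The snippet above calls `self.metrics.record(...)` without showing what sits behind it. As a rough sketch, an in-memory recorder could look like the following. `MetricsRecorder` and its `summary` method are hypothetical; a real deployment would more likely forward these events to a logging or metrics backend.

```python
from collections import defaultdict
from statistics import mean


class MetricsRecorder:
    """In-memory recorder exposing the record() method used above."""

    def __init__(self):
        self.events: list[dict] = []

    def record(self, event: dict) -> None:
        self.events.append(event)

    def summary(self) -> dict:
        """Per-agent call count, failure count, and mean duration in seconds."""
        by_agent = defaultdict(list)
        for event in self.events:
            by_agent[event["agent"]].append(event)
        return {
            agent: {
                "calls": len(events),
                "failures": sum(1 for e in events if not e["success"]),
                "avg_duration": mean(e["duration"] for e in events),
            }
            for agent, events in by_agent.items()
        }


recorder = MetricsRecorder()
recorder.record({"task_id": "task_1", "agent": "writer", "duration": 1.0, "success": True})
recorder.record({"task_id": "task_2", "agent": "writer", "duration": 3.0, "success": False,
                 "error": "timeout"})
stats = recorder.summary()
print(stats)
```

Grouping by agent makes it easy to spot which specialist is slow or failing often.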
 

4. Enable Graceful Degradation

python

def run(self, goal: str) -> dict:
    try:
        return self._full_execution(goal)
    except Exception as e:
        logger.warning(f"Full execution failed: {e}")
        
        # Try simpler approach
        try:
            return self._simplified_execution(goal)
        except Exception:
            # Last resort: single agent
            return self._single_agent_fallback(goal)
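
The nested try/except above can be generalized into an ordered list of strategies. The sketch below is one way to do it, assuming each strategy is a callable that takes the goal and returns a dict; `run_with_degradation` and the strategy names are hypothetical. It also records which tier produced the answer, so callers can tell a degraded result from a full one.

```python
import logging
from typing import Callable

logger = logging.getLogger("orchestrator")


def run_with_degradation(
    goal: str, strategies: list[tuple[str, Callable[[str], dict]]]
) -> dict:
    """Try each strategy in order and return the first result, tagged with
    the name of the strategy that produced it."""
    errors = {}
    for name, strategy in strategies:
        try:
            result = strategy(goal)
            return {**result, "strategy": name, "earlier_errors": errors}
        except Exception as exc:
            logger.warning("Strategy %s failed: %s", name, exc)
            errors[name] = str(exc)
    return {"success": False, "strategy": None, "earlier_errors": errors}


# Hypothetical strategies standing in for the three tiers above
def full_execution(goal: str) -> dict:
    raise RuntimeError("decomposition failed")

def simplified_execution(goal: str) -> dict:
    return {"success": True, "result": f"Short answer for: {goal}"}

outcome = run_with_degradation(
    "Summarize Q4 sales",
    [("full", full_execution), ("simplified", simplified_execution)],
)
print(outcome["strategy"], outcome["earlier_errors"])
```

Here the full workflow fails, the simplified one answers, and the returned dict says so explicitly.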
 

Conclusion

The Orchestrator Pattern is how you scale AI agents to enterprise complexity:

Task decomposition breaks big problems into manageable pieces
Agent specialization ensures each task is handled by an expert
Parallel execution maximizes throughput
Dependency management ensures correct ordering
Fault tolerance keeps systems running despite failures

Start with a simple pipeline orchestrator. Add routing when you have diverse task types. Move to hierarchical orchestration for truly complex workflows.

The system that orchestrates specialists outperforms the generalist. Every time.

Ready to orchestrate agents with secure code execution? Get started with HopX — sandboxes that give each agent isolated environments.

47	while not self._all_complete():
48	# Find tasks ready to run
49	ready = self._get_ready_tasks()
50
51	if not ready:
52	if self._has_failures():
53	break
54	continue
55
56	# Execute ready tasks in parallel
57	with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
58	futures = {
59	executor.submit(self._execute_task, task): task
60	for task in ready
61	}
62
63	for future in concurrent.futures.as_completed(futures):
64	task = futures[future]
65	try:
66	result = future.result()
67	task.status = TaskStatus.COMPLETED
68	task.result = result
69	self.results[task.id] = result
70	except Exception as e:
71	task.status = TaskStatus.FAILED
72	task.error = str(e)
73
74	# Phase 3: Synthesize results
75	if self._has_failures():
76	return {
77	"success": False,
78	"completed": [t.id for t in self.tasks.values() if t.status == TaskStatus.COMPLETED],
79	"failed": [t.id for t in self.tasks.values() if t.status == TaskStatus.FAILED],
80	"partial_results": self.results
81	}
82
83	final_result = self._synthesize(goal, self.results)
84
85	return {
86	"success": True,
87	"result": final_result,
88	"tasks_completed": len(self.tasks)
89	}
90
91	def _decompose(self, goal: str) -> list[Task]:
92	"""Break goal into tasks with dependencies"""
93
94	agent_descriptions = "\n".join([
95	f"- {name}: {agent.description}"
96	for name, agent in self.agents.items()
97	])
98
99	response = self.client.chat.completions.create(
100	model="gpt-4o",
101	messages=[{
102	"role": "system",
103	"content": f"""Decompose this goal into tasks for available agents.
104
105	Available agents:
106	{agent_descriptions}
107
108	Return JSON:
109	{{
110	"tasks": [
111	{{
112	"id": "task_1",
113	"description": "What to do",
114	"agent_type": "agent_name",
115	"dependencies": []
116	}},
117	{{
118	"id": "task_2",
119	"description": "Next task",
120	"agent_type": "agent_name",
121	"dependencies": ["task_1"]
122	}}
123	]
124	}}
125
126	Rules:
127	- Break into 2-8 tasks
128	- Each task should be focused and achievable
129	- List dependencies (tasks that must complete first)
130	- Assign to the most appropriate agent"""
131	}, {
132	"role": "user",
133	"content": goal
134	}],
135	response_format={"type": "json_object"}
136	)
137
138	data = json.loads(response.choices[0].message.content)
139
140	return [
141	Task(
142	id=t["id"],
143	description=t["description"],
144	agent_type=t["agent_type"],
145	dependencies=t.get("dependencies", [])
146	)
147	for t in data["tasks"]
148	]
149
150	def _get_ready_tasks(self) -> list[Task]:
151	"""Get tasks whose dependencies are all complete"""
152	ready = []
153	for task in self.tasks.values():
154	if task.status != TaskStatus.PENDING:
155	continue
156
157	deps_complete = all(
158	self.tasks[dep].status == TaskStatus.COMPLETED
159	for dep in task.dependencies
160	)
161
162	if deps_complete:
163	ready.append(task)
164
165	return ready
166
167	def _execute_task(self, task: Task) -> str:
168	"""Execute a single task using the appropriate agent"""
169	task.status = TaskStatus.RUNNING
170
171	agent = self.agents.get(task.agent_type)
172	if not agent:
173	raise ValueError(f"Unknown agent type: {task.agent_type}")
174
175	# Gather context from dependencies
176	context = {
177	dep: self.results[dep]
178	for dep in task.dependencies
179	}
180
181	return agent.execute(task.description, context)
182
183	def _all_complete(self) -> bool:
184	return all(
185	t.status in [TaskStatus.COMPLETED, TaskStatus.FAILED]
186	for t in self.tasks.values()
187	)
188
189	def _has_failures(self) -> bool:
190	return any(t.status == TaskStatus.FAILED for t in self.tasks.values())
191
192	def _synthesize(self, goal: str, results: dict) -> str:
193	"""Combine task results into final output"""
194
195	results_text = "\n\n".join([
196	f"=== {task_id} ===\n{result}"
197	for task_id, result in results.items()
198	])
199
200	response = self.client.chat.completions.create(
201	model="gpt-4o",
202	messages=[{
203	"role": "system",
204	"content": "Synthesize these task results into a coherent final response."
205	}, {
206	"role": "user",
207	"content": f"Goal: {goal}\n\nTask Results:\n{results_text}"
208	}]
209	)
210
211	return response.choices[0].message.content
212
213
214	# Define specialized agents
215	def create_data_analyst():
216	client = openai.OpenAI()
217
218	def execute(task: str, context: dict) -> str:
219	response = client.chat.completions.create(
220	model="gpt-4o",
221	messages=[{
222	"role": "system",
223	"content": "You are a data analyst. Analyze data and provide insights."
224	}, {
225	"role": "user",
226	"content": f"Task: {task}\n\nContext: {json.dumps(context)}"
227	}]
228	)
229	return response.choices[0].message.content
230
231	return Agent(
232	name="data_analyst",
233	description="Analyzes data, finds patterns, calculates statistics",
234	execute=execute
235	)
236
237	def create_writer():
238	client = openai.OpenAI()
239
240	def execute(task: str, context: dict) -> str:
241	response = client.chat.completions.create(
242	model="gpt-4o",
243	messages=[{
244	"role": "system",
245	"content": "You are a professional writer. Create clear, engaging content."
246	}, {
247	"role": "user",
248	"content": f"Task: {task}\n\nContext: {json.dumps(context)}"
249	}]
250	)
251	return response.choices[0].message.content
252
253	return Agent(
254	name="writer",
255	description="Writes reports, summaries, and documentation",
256	execute=execute
257	)
258
259	def create_coder():
260	from hopx import Sandbox
261	client = openai.OpenAI()
262
263	def execute(task: str, context: dict) -> str:
264	# Generate code
265	response = client.chat.completions.create(
266	model="gpt-4o",
267	messages=[{
268	"role": "system",
269	"content": "Write Python code to accomplish the task. Output only code."
270	}, {
271	"role": "user",
272	"content": f"Task: {task}\n\nContext: {json.dumps(context)}"
273	}]
274	)
275
276	code = response.choices[0].message.content
277
278	# Execute in sandbox
279	sandbox = Sandbox.create(template="code-interpreter")
280	try:
281	sandbox.files.write("/app/task.py", code)
282	result = sandbox.commands.run("python /app/task.py")
283	return result.stdout if result.exit_code == 0 else f"Error: {result.stderr}"
284	finally:
285	sandbox.kill()
286
287	return Agent(
288	name="coder",
289	description="Writes and executes Python code for data processing and analysis",
290	execute=execute
291	)
292
293
294	# Usage
295	orchestrator = Orchestrator({
296	"data_analyst": create_data_analyst(),
297	"writer": create_writer(),
298	"coder": create_coder()
299	})
300
301	result = orchestrator.run(
302	"Analyze our Q4 sales data, identify the top 3 trends, "
303	"create visualizations, and write an executive summary."
304	)
305
306	print(result)
307
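The basic orchestrator trusts whatever task graph the model returns, so a dependency on a task id that does not exist, or two tasks that depend on each other, only shows up at run time. One way to catch this earlier is a pre-flight check on the decomposed plan. The sketch below is an illustration rather than part of the original implementation: `execution_order` is a hypothetical helper that takes a plain `{task_id: [dependency_ids]}` mapping and applies Kahn's topological sort.

```python
from collections import deque


def execution_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Return task ids in an order that respects dependencies (Kahn's algorithm).

    Raises ValueError if a task depends on an unknown id or if the graph has a cycle.
    """
    for task_id, deps in dependencies.items():
        for dep in deps:
            if dep not in dependencies:
                raise ValueError(f"{task_id} depends on unknown task {dep}")

    # Number of unfinished dependencies per task, and the reverse edges
    remaining = {task_id: len(set(deps)) for task_id, deps in dependencies.items()}
    dependents: dict[str, list[str]] = {task_id: [] for task_id in dependencies}
    for task_id, deps in dependencies.items():
        for dep in set(deps):
            dependents[dep].append(task_id)

    # Start with tasks that have no dependencies at all
    queue = deque(task_id for task_id, count in remaining.items() if count == 0)
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for child in dependents[current]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)

    # Any task never reached is part of a cycle (or depends on one)
    if len(order) != len(dependencies):
        stuck = sorted(set(dependencies) - set(order))
        raise ValueError(f"Circular dependency among: {stuck}")
    return order


plan = {"task_1": [], "task_2": ["task_1"], "task_3": ["task_1", "task_2"]}
print(execution_order(plan))  # ['task_1', 'task_2', 'task_3']
```

Running a check like this right after decomposition lets the orchestrator reject or re-request a bad plan before any agent call is made.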

# Pattern 1: Sequential Pipeline
# Each stage's output becomes the next stage's input.
class PipelineOrchestrator:
    def __init__(self, stages: list[Agent]):
        self.stages = stages

    def run(self, input_data: str) -> str:
        current = input_data

        for stage in self.stages:
            print(f"Running stage: {stage.name}")
            current = stage.execute(current, {})

        return current


# Usage (the four agents and raw_document are assumed to be defined elsewhere)
pipeline = PipelineOrchestrator([
    extract_agent,    # Extract key information
    transform_agent,  # Transform data
    analyze_agent,    # Analyze patterns
    report_agent      # Generate report
])

result = pipeline.run(raw_document)

# Pattern 2: Router/Dispatcher
class RouterOrchestrator:
    def __init__(self, agents: dict[str, Agent]):
        self.client = openai.OpenAI()
        self.agents = agents

    def run(self, task: str) -> str:
        # Classify the task
        agent_name = self._route(task)

        # Execute with selected agent. The model may reply with a name that
        # is not registered, so fail with a clear message instead of KeyError.
        agent = self.agents.get(agent_name)
        if agent is None:
            raise ValueError(f"Router selected unknown agent: {agent_name!r}")
        return agent.execute(task, {})

    def _route(self, task: str) -> str:
        agent_options = "\n".join([
            f"- {name}: {agent.description}"
            for name, agent in self.agents.items()
        ])

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Which agent should handle this task?

Task: {task}

Agents:
{agent_options}

Reply with just the agent name."""
            }]
        )

        return response.choices[0].message.content.strip()


# Usage (the four agents are assumed to be defined elsewhere)
router = RouterOrchestrator({
    "code": code_agent,
    "writing": writing_agent,
    "research": research_agent,
    "math": math_agent
})

# Automatically routes to appropriate agent
result = router.run("Write a function to calculate compound interest")
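The router depends on the model replying with exactly one registered agent name, but free-text replies often arrive with different casing, trailing punctuation, or a short sentence around the name. A small normalization step can absorb most of that. The sketch below is one possible approach, not the article's method: `resolve_agent_name` and its `default` parameter are hypothetical names introduced for illustration.

```python
import re


def resolve_agent_name(reply: str, known: list[str], default: str) -> str:
    """Map a free-text model reply onto one of the registered agent names.

    Falls back to `default` when the reply names no agent or more than one.
    """
    cleaned = reply.strip().strip("\"'`.").lower()
    lookup = {name.lower(): name for name in known}

    # Exact match after trimming whitespace, quotes, and a trailing period
    if cleaned in lookup:
        return lookup[cleaned]

    # Otherwise accept the reply only if exactly one known name appears as a word
    words = set(re.findall(r"[a-z0-9_]+", cleaned))
    matches = [name for key, name in lookup.items() if key in words]
    return matches[0] if len(matches) == 1 else default


agents = ["code", "writing", "research", "math"]
print(resolve_agent_name(" Code.\n", agents, default="research"))          # code
print(resolve_agent_name("code or writing", agents, default="research"))   # research
```

Falling back to a default agent is a design choice; raising an error and re-asking the model is an equally reasonable alternative when a wrong route is costly.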

# Pattern 3: Hierarchical Orchestration
class HierarchicalOrchestrator:
    def __init__(self, sub_orchestrators: dict[str, Orchestrator]):
        self.client = openai.OpenAI()
        self.sub_orchestrators = sub_orchestrators

    def run(self, goal: str) -> dict:
        # Decompose into high-level phases
        phases = self._plan_phases(goal)

        results = {}
        for phase in phases:
            sub_orch = self.sub_orchestrators[phase["orchestrator"]]
            result = sub_orch.run(phase["goal"])
            results[phase["name"]] = result

        return self._synthesize(goal, results)

    def _plan_phases(self, goal: str) -> list[dict]:
        orchestrator_list = ", ".join(self.sub_orchestrators.keys())

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Break this goal into phases.

Goal: {goal}

Available orchestrators: {orchestrator_list}

Return JSON:
{{"phases": [{{"name": "phase_1", "orchestrator": "name", "goal": "sub-goal"}}]}}"""
            }],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)["phases"]

    def _synthesize(self, goal: str, results: dict) -> dict:
        # The original listing calls _synthesize without defining it.
        # Assumed minimal version: report each phase and overall success.
        return {
            "success": all(r.get("success") for r in results.values()),
            "goal": goal,
            "phases": results
        }

# Pattern 4: Dynamic Agent Creation
class DynamicOrchestrator:
    def __init__(self):
        self.client = openai.OpenAI()
        self.agent_cache = {}

    def run(self, goal: str) -> dict:
        # Determine what agents we need
        agent_specs = self._design_agents(goal)

        # Create or retrieve agents
        agents = {}
        for spec in agent_specs:
            agent = self._get_or_create_agent(spec)
            agents[spec["name"]] = agent

        # Create orchestrator with these agents.
        # Orchestrator.run returns a result dict, not a string.
        orchestrator = Orchestrator(agents)
        return orchestrator.run(goal)

    def _design_agents(self, goal: str) -> list[dict]:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"""Design specialized agents for this goal.

Goal: {goal}

Return JSON:
{{
  "agents": [
    {{
      "name": "agent_name",
      "role": "expert role description",
      "capabilities": ["capability1", "capability2"],
      "system_prompt": "You are..."
    }}
  ]
}}"""
            }],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)["agents"]

    def _get_or_create_agent(self, spec: dict) -> Agent:
        cache_key = spec["name"]

        if cache_key in self.agent_cache:
            return self.agent_cache[cache_key]

        # Factory function so each agent closes over its own system prompt
        def create_execute(system_prompt):
            def execute(task: str, context: dict) -> str:
                response = self.client.chat.completions.create(
                    model="gpt-4o",
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": f"Task: {task}\nContext: {context}"}
                    ]
                )
                return response.choices[0].message.content
            return execute

        agent = Agent(
            name=spec["name"],
            description=spec["role"],
            execute=create_execute(spec["system_prompt"])
        )

        self.agent_cache[cache_key] = agent
        return agent

# Retry with Backoff
import time


class ResilientOrchestrator(Orchestrator):
    def __init__(self, agents, max_retries=3):
        super().__init__(agents)
        self.max_retries = max_retries

    def _execute_task(self, task: Task) -> str:
        last_error = None

        for attempt in range(self.max_retries):
            try:
                return super()._execute_task(task)
            except Exception as e:
                last_error = e
                # Only wait if another attempt follows
                if attempt < self.max_retries - 1:
                    wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
                    print(f"Task {task.id} failed, retrying in {wait_time}s...")
                    time.sleep(wait_time)

        raise last_error
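When several tasks fail at the same moment (for example on a shared rate limit), plain exponential backoff makes them all retry at the same moment too. A common variation adds a cap and random jitter to spread the retries out. The helper below is a standalone sketch under that assumption; `retry_with_backoff` and its parameters are illustrative names, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or max_attempts is reached.

    Waits a random time between 0 and min(max_delay, base_delay * 2**attempt)
    after each failed attempt except the last, then re-raises the last error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # "Full jitter" delay


# Usage: a call that fails twice, then succeeds on the third attempt
calls = {"count": 0}

def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

print(retry_with_backoff(flaky, sleep=lambda seconds: None))  # ok
```

Inside an orchestrator, the same helper could wrap the call to the agent's `execute` function rather than living in a subclass.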

# Fallback Agents
class FallbackOrchestrator(Orchestrator):
    def __init__(self, agents, fallback_agents):
        super().__init__(agents)
        self.fallback_agents = fallback_agents

    def _execute_task(self, task: Task) -> str:
        try:
            return super()._execute_task(task)
        except Exception as primary_error:
            # Try fallback agent
            fallback = self.fallback_agents.get(task.agent_type)
            if fallback:
                print(f"Primary agent failed, using fallback for {task.id}")
                # Rebuild the dependency context the same way the base class does
                # (the base Orchestrator has no _get_context helper)
                context = {dep: self.results[dep] for dep in task.dependencies}
                return fallback.execute(task.description, context)
            raise primary_error

# Partial Results
class GracefulOrchestrator(Orchestrator):
    def run(self, goal: str) -> dict:
        result = super().run(goal)

        if not result["success"]:
            # Return what we could complete
            completed_results = {
                t.id: t.result
                for t in self.tasks.values()
                if t.status == TaskStatus.COMPLETED
            }

            # The original listing calls an undefined _synthesize_partial;
            # here the base _synthesize is reused on the completed tasks only.
            partial = self._synthesize(goal, completed_results) if completed_results else None

            return {
                "success": False,
                "partial_result": partial,
                "completed_tasks": list(completed_results.keys()),
                "failed_tasks": [t.id for t in self.tasks.values() if t.status == TaskStatus.FAILED],
                "note": "Some tasks failed. Partial results provided."
            }

        return result

# Best practice 1: Keep Orchestrator Logic Simple
# (illustrative only; the helper methods are not defined here)

# ❌ Orchestrator doing too much
class BadOrchestrator:
    def run(self, goal):
        # Orchestrator shouldn't contain domain logic
        if "sales" in goal:
            return self._analyze_sales()
        elif "marketing" in goal:
            return self._analyze_marketing()


# ✅ Orchestrator focuses on coordination
class GoodOrchestrator:
    def run(self, goal):
        tasks = self._decompose(goal)           # What to do
        agents = self._select_agents(tasks)     # Who does it
        results = self._execute(tasks, agents)  # Coordination
        return self._synthesize(results)        # Combine results

# Best practice 2: Design Clear Agent Interfaces
# All agents should follow the same interface
class AgentInterface:
    def execute(self, task: str, context: dict) -> str:
        """
        Args:
            task: What to do
            context: Results from dependency tasks

        Returns:
            Result as string (or structured data as JSON string)
        """
        raise NotImplementedError

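# Best practice 3: Monitor Everything
import time


def _execute_task(self, task):
    # Orchestrator method. Assumes self.metrics is an object with a
    # record(dict) method; it is not defined in the basic Orchestrator.
    start = time.time()

    # Gather context from dependencies, as in the basic Orchestrator
    context = {dep: self.results[dep] for dep in task.dependencies}

    try:
        result = self.agents[task.agent_type].execute(task.description, context)

        self.metrics.record({
            "task_id": task.id,
            "agent": task.agent_type,
            "duration": time.time() - start,
            "success": True,
            "result_size": len(result)
        })

        return result
    except Exception as e:
        self.metrics.record({
            "task_id": task.id,
            "agent": task.agent_type,
            "duration": time.time() - start,
            "success": False,
            "error": str(e)
        })
        raise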
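The monitoring snippet records events through `self.metrics.record(...)` but does not show what the recorder looks like. As a starting point, and purely as an assumption about its shape, an in-memory recorder can be a few lines; `InMemoryMetrics` and its `summary` method are hypothetical names, and a production system would more likely forward these events to an existing metrics or tracing backend.

```python
from collections import defaultdict


class InMemoryMetrics:
    """Collects task events in a list and aggregates them per agent."""

    def __init__(self):
        self.events: list[dict] = []

    def record(self, event: dict) -> None:
        # Store a copy so later changes by the caller do not alter history
        self.events.append(dict(event))

    def summary(self) -> dict:
        """Return call count, failure count, and total duration per agent."""
        per_agent = defaultdict(
            lambda: {"calls": 0, "failures": 0, "total_duration": 0.0}
        )
        for event in self.events:
            stats = per_agent[event["agent"]]
            stats["calls"] += 1
            stats["failures"] += 0 if event["success"] else 1
            stats["total_duration"] += event["duration"]
        return dict(per_agent)


metrics = InMemoryMetrics()
metrics.record({"task_id": "task_1", "agent": "writer", "duration": 1.5, "success": True})
metrics.record({"task_id": "task_2", "agent": "writer", "duration": 0.5, "success": False})
print(metrics.summary())
# {'writer': {'calls': 2, 'failures': 1, 'total_duration': 2.0}}
```

The event keys used here (`agent`, `duration`, `success`) match the dictionaries recorded in the snippet above.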

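# Best practice 4: Enable Graceful Degradation
import logging

logger = logging.getLogger(__name__)


def run(self, goal: str) -> dict:
    # Orchestrator method; the three execution helpers are assumed to exist
    try:
        return self._full_execution(goal)
    except Exception as e:
        logger.warning(f"Full execution failed: {e}")

        # Try simpler approach
        try:
            return self._simplified_execution(goal)
        except Exception as simplified_error:
            # Catch Exception rather than using a bare except, so that
            # KeyboardInterrupt and SystemExit still propagate
            logger.warning(f"Simplified execution failed: {simplified_error}")

            # Last resort: single agent
            return self._single_agent_fallback(goal)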