How Do AI Agents Communicate With Each Other?
Through messages, tool calls, and events, coordinated by shared memory and policy. Start with direct messages, add an event bus as agents multiply, and govern everything with schemas, auth, and audit trails.
Executive Summary
Agent-to-agent communication is just structured I/O. One agent emits a message or event (intent + schema); another consumes it, optionally calls tools/APIs, and replies with results and rationale. Use a shared memory layer for context and an event bus for fan-out or long-running workflows. Keep contracts explicit, authenticated, and observable.
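To make "structured I/O" concrete, here is a minimal sketch of a message envelope in Python. The field names, the `summarize` intent, and the schema identifiers are assumptions chosen for illustration, not a standard; a real system would likely validate payloads against a schema registry rather than a hard-coded set.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

# Hypothetical schema identifiers; a real system would load these from a registry.
SUPPORTED_SCHEMAS = {"summarize.request/v1", "summarize.result/v1"}


@dataclass(frozen=True)
class AgentMessage:
    intent: str            # what the sender wants done
    schema: str            # versioned contract the payload follows
    sender: str            # id of the emitting agent
    payload: dict          # small body; large artifacts should be referenced, not embedded
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "AgentMessage":
        data = json.loads(raw)
        # Reject anything that does not declare a known contract.
        if data.get("schema") not in SUPPORTED_SCHEMAS:
            raise ValueError(f"unsupported schema: {data.get('schema')!r}")
        return cls(**data)


def make_reply(request: AgentMessage, sender: str, result: str, rationale: str) -> AgentMessage:
    # The reply reuses the request's correlation id so the exchange can be traced.
    return AgentMessage(
        intent="summarize.result",
        schema="summarize.result/v1",
        sender=sender,
        payload={"result": result, "rationale": rationale},
        correlation_id=request.correlation_id,
    )
```

The consuming agent parses with `AgentMessage.from_json`, does its work, and answers with `make_reply`, so both sides of the exchange share one correlation id.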
Protocols & Channels
Mechanism | Use When | Pros | Cons | Notes |
---|---|---|---|---|
Direct Message (HTTP/gRPC) | Few agents; request/response | Simple; low latency | Tight coupling | Great for tool-call style tasks |
Event Bus (Kafka, Pub/Sub, SNS/SQS) | Fan-out; async orchestration | Decoupled; scalable | Eventual consistency | Emit domain events, subscribe by intent |
Shared Memory (Vector DB/Cache) | Context reuse; long tasks | Stateful; searchable | Staleness risk | Add TTLs, ownership, provenance |
Workflow Orchestrator | Multi-step dependencies | Observability; retries | More plumbing | Great for SLAs and approvals |
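The event bus row can be sketched with a small in-memory stand-in. This is an illustration of "subscribe by intent" with a dead-letter list, not a replacement for Kafka, Pub/Sub, or SNS/SQS; the class name `InMemoryEventBus` and the event shape are assumptions, and delivery here is synchronous for simplicity.

```python
from collections import defaultdict
from typing import Callable

Event = dict
Handler = Callable[[Event], None]


class InMemoryEventBus:
    """Toy synchronous bus: fan-out by intent, failed deliveries go to a dead-letter list."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.dead_letters: list[tuple[Event, str]] = []

    def subscribe(self, intent: str, handler: Handler) -> None:
        self._subscribers[intent].append(handler)

    def publish(self, event: Event) -> int:
        """Deliver to every subscriber of event['intent']; return the success count."""
        delivered = 0
        for handler in self._subscribers.get(event["intent"], []):
            try:
                handler(event)
                delivered += 1
            except Exception as exc:
                # One failing consumer must not block the others.
                self.dead_letters.append((event, repr(exc)))
        return delivered
```

Publishers only know the intent name, never the consumers, which is the decoupling the table refers to.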
Coordination Patterns
Pattern | Best For | How it Works | Guardrails |
---|---|---|---|
Blackboard | Shared problem solving | Agents write/read to a common memory | Ownership, TTL, versioning |
Supervisor/Worker | Task decomposition | Supervisor creates jobs; workers report back | Quotas, approvals on sensitive tools |
Market (Bidding) | Choosing the best plan among agents | Agents propose plans; the lowest-cost or highest-utility plan wins | Scoring rubric; cost caps |
Event-Driven Saga | Long-running, multi-system flows | Local steps emit events; compensating actions on failure | Idempotency; DLQs; audits |
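As one illustration from the table above, the event-driven saga can be sketched as a list of steps, each paired with a compensating action. The function name `run_saga` and the step tuple layout are assumptions for this sketch; idempotency and dead-letter handling, which the guardrails column calls for, are left out to keep it short.

```python
from typing import Callable

# (step name, action, compensating action); both callables mutate shared state.
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list[Step], state: dict, log: list[str]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: list[tuple[str, Callable[[dict], None]]] = []
    for name, action, compensate in steps:
        try:
            action(state)
            log.append(f"{name}.done")          # stands in for an emitted event
            completed.append((name, compensate))
        except Exception:
            log.append(f"{name}.failed")
            for done_name, undo in reversed(completed):
                undo(state)
                log.append(f"{done_name}.compensated")
            return False
    return True
```

The `log` list plays the role of the audit trail: every step, failure, and compensation leaves an entry that can be inspected afterwards.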
Decision Matrix: Picking a Communication Style
Context | Recommended | Pros | Cons | TPG POV |
---|---|---|---|---|
2–3 agents; synchronous tool use | Direct messages + JSON schemas | Minimal infra | Coupling grows fast | Great start; add event bus later |
Many agents; cross-team workflows | Event bus + orchestrator | Scale; observability | More setup | Default for enterprises |
Knowledge-heavy tasks | Shared memory (vector + cache) | Reusable context | Staleness risk | Require provenance & TTLs |
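The "provenance & TTLs" requirement for shared memory can be sketched as follows. This uses a plain key-value store as a stand-in for a vector database or cache; the names `SharedMemory` and `MemoryEntry` are hypothetical, and the clock is injectable only so that expiry can be tested deterministically.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MemoryEntry:
    value: str
    owner: str         # agent that wrote the entry
    source: str        # provenance: where the content came from
    expires_at: float  # clock reading after which the entry is stale


class SharedMemory:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, MemoryEntry] = {}

    def put(self, key: str, value: str, *, owner: str, source: str, ttl_seconds: float) -> None:
        self._entries[key] = MemoryEntry(value, owner, source, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[MemoryEntry]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            # Expired entries are evicted lazily on read to limit staleness.
            del self._entries[key]
            return None
        return entry
```

Because every read returns the owner and source alongside the value, a consuming agent can decide whether it trusts the context before using it.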
Rollout Playbook (Raise Complexity Safely)
Step | What to do | Output | Owner | Timeframe |
---|---|---|---|---|
1 — Contracts | Define intents, JSON schemas, and auth scopes | API/spec docs | Platform Owner | 1–2 weeks |
2 — Direct | Wire 2 agents via HTTP tool-calls | Working POC with traces | AI Lead | 1–2 weeks |
3 — Events | Introduce event bus and DLQs | Decoupled message flow | MLOps | 2–4 weeks |
4 — Memory | Add shared memory with provenance | Searchable context store | Data Ops | 2–4 weeks |
5 — Orchestrate | Add workflow engine, SLAs, approvals | Observable multi-agent system | Platform Owner | Ongoing |
Deeper Detail
In practice, agents exchange three things: (1) intents (what to do), (2) artifacts (content, code, data), and (3) state (IDs, status, confidence). Keep payloads small and reference larger artifacts in object storage with signed URLs. Require correlation IDs so you can trace a decision across agents. For safety, layer policy validators on both ingress and egress, and rate-limit tool calls per agent. Finally, make autonomy a deployable setting: raise or lower it per agent based on KPIs and escalation rates.
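Per-agent rate limiting of tool calls could look like the token-bucket sketch below. The class name `ToolCallLimiter` and its parameters are assumptions for illustration; a production limiter would normally live in a gateway or shared store rather than in process memory.

```python
import time
from typing import Callable


class ToolCallLimiter:
    """Token bucket per agent: `capacity` burst, refilled at `refill_per_second`."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = float(capacity)
        self._rate = refill_per_second
        self._clock = clock
        # agent_id -> (tokens remaining, time of last update)
        self._buckets: dict[str, tuple[float, float]] = {}

    def allow(self, agent_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(agent_id, (self._capacity, now))
        # Refill for the elapsed time, never above capacity.
        tokens = min(self._capacity, tokens + (now - last) * self._rate)
        if tokens >= 1.0:
            self._buckets[agent_id] = (tokens - 1.0, now)
            return True
        self._buckets[agent_id] = (tokens, now)
        return False
```

An agent's tool wrapper would call `allow(agent_id)` before each tool invocation and escalate or back off when it returns `False`.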
TPG treats multi-agent systems as "governed services": each agent is a product with contracts, SLOs, and owners. That framing aligns AI work with platform engineering and finance controls.
For patterns and governance, see Agentic AI, autonomy guidance in Autonomy Levels, and implementation in AI Agents & Automation. Or contact us to design contracts and an event-driven backbone.
Frequently Asked Questions
Message format: JSON with versioned schemas is the most practical choice. Include the intent, payload, correlation ID, and auth claims in every message.
Shared memory: add it only when tasks benefit from context reuse. Use a vector store or cache with TTLs, and record provenance for transparency.
Cost control: emit per-message cost traces, set quotas per agent, and reference large artifacts rather than embedding them in messages.
Security: use signed service-to-service auth, scoped tokens per tool, and encryption in transit and at rest, and redact PII on ingress and egress.
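As a rough sketch of signed messages with per-tool scopes, the example below uses an HMAC over the claims with a shared key. This is a standard-library stand-in chosen for brevity; real deployments more commonly rely on mTLS or signed tokens such as JWTs, and the claim names (`agent`, `scopes`) and function names are assumptions.

```python
import hashlib
import hmac
import json


def sign(claims: dict, key: bytes) -> str:
    # Canonical JSON so that sender and receiver hash identical bytes.
    body = json.dumps(claims, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify(claims: dict, signature: str, key: bytes) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign(claims, key), signature)


def authorize_tool_call(claims: dict, signature: str, key: bytes, tool: str) -> bool:
    """Allow the call only if the signature is valid and the tool is in scope."""
    return verify(claims, signature, key) and tool in claims.get("scopes", [])
```

The receiving agent checks the signature first and the scope second, so a tampered scope list fails before it is ever consulted.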
Testing: mock tools, replay events, and create golden traces for regression. Promote an agent only when its SLOs and KPI gates are met.
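A golden-trace regression check can be sketched in a few lines: replay recorded events through the agent's handler with a mocked tool, then compare the outputs against a stored trace. All names here (`mock_search_tool`, `agent_handler`, `replay`, `diff_traces`) and the event shape are hypothetical.

```python
from itertools import zip_longest
from typing import Callable

_MISSING = object()


def mock_search_tool(query: str) -> str:
    # Deterministic stand-in for a real tool, so replays are repeatable.
    return f"stub result for {query}"


def agent_handler(event: dict, tool: Callable[[str], str] = mock_search_tool) -> dict:
    return {
        "correlation_id": event["correlation_id"],
        "intent": event["intent"] + ".result",
        "output": tool(event["payload"]["query"]),
    }


def replay(events: list[dict], handler: Callable[[dict], dict]) -> list[dict]:
    return [handler(event) for event in events]


def diff_traces(actual: list[dict], golden: list[dict]) -> list[int]:
    """Return the positions where the replayed trace departs from the golden one."""
    pairs = zip_longest(actual, golden, fillvalue=_MISSING)
    return [i for i, (a, g) in enumerate(pairs) if a != g]
```

An empty list from `diff_traces` means the replay matches the golden trace; any index it returns points at the first place to look when a change alters agent behavior.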