
Deep Dive into AI Agent Implementation: Addressing Autonomy, Protocols, and Governance Through Architecture in the Copilot/Agent Studio Era
Author: Be A Racer Team
1. Executive Summary

An AI agent is an execution architecture that externalizes LLM reasoning into business data (RAG/Graph), tool execution (APIs/workflows), state management (short-/long-term memory), and safety controls (policy/audit), closing the loop until a task is completed. Business-optimized designs such as Microsoft 365 Copilot/Agent 365 build on identity, permissions, and audit logs to deliver “autonomy that can actually run inside the enterprise.” Meanwhile, the evolution of protocols such as MCP and A2A has made interoperability and the division of responsibility between agents practical. This article dissects the design decisions, with concrete numbers, across implementation, performance, security, and scale.
2. Technical Background and Challenges
Traditional “chatbot + RAG” setups can generate answers, but often fail to complete the work. The reasons can be summarized into four points: (1) unclear delegation of execution privileges, (2) no retries or exception handling on failure, (3) weak auditing and accountability, and (4) multi-system integration tends to become “spaghetti code” of one-off implementations. The Copilot/Studio emphasis on “what it knows (data/memory),” “what it processes (reasoning/planning),” and “what it executes (actions)” is precisely the decomposition that fills these gaps.
Technical flow (verbalizing the diagram) 🔧: User instruction → (A) context collection (Graph/RAG/history/policy) → (B) plan generation (decomposition, priority, stop conditions) → (C) tool selection and execution (CRM updates, ticket creation, expense submission, etc.) → (D) result verification (schema validation/consistency/policy) → (E) audit logs and memory updates → human approval if needed → completion. The key point is that the LLM is not treated as “central control,” but as one component of orchestration.
[User]
   |
   v
[Agent Orchestrator]
   |--(A) Context Builder: RAG + Graph + Memory + Policy
   |--(B) Planner: task decomposition / stop conditions
   |--(C) Tool Router: MCP/REST/Workflow
   |--(D) Verifier: schema + business rules + safety
   |--(E) Audit+Telemetry: traces/PII redaction
   v
[Systems: M365/CRM/ERP/ITSM/Data Lake]
Existing issues ⚙️: prompt dependence (requirements buried in natural language), unclear permission boundaries (over-privilege/impersonation), lack of observability (can’t trace why it failed), and exploding cost and latency (huge context + multi-hop tool calls). As a 2025 trend, “context engineering,” “MCP,” and “governance features” are becoming baseline technologies.
3. Technical Section ①: Splitting Agent Responsibilities (Orchestrator/Planner/Tools/Memory)
3.1 Separate read paths from write paths
The first decision in enterprise agent design is separating read from write. Read paths use RAG/Graph retrieval to understand the current state, while write paths cause side effects such as CRM updates or ticket creation. If both are granted under the same tool permissions, prompt injection and similar attacks can immediately lead to incidents. The recommended approach is a three-layer structure: (a) read-only tool set, (b) write tool set (approval/conditions required), and (c) high-risk actions (payments/contracts) isolated into a separate route.
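The three-layer structure can be sketched as a small gate that sits in front of tool execution. This is a minimal illustration under stated assumptions, not a product API: `Risk`, `ToolSpec`, `REGISTRY`, `authorize`, and the tool names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ = "read"    # (a) read-only tool set
    WRITE = "write"  # (b) write tool set: approval/conditions required
    HIGH = "high"    # (c) high-risk actions: isolated route


@dataclass(frozen=True)
class ToolSpec:
    name: str
    risk: Risk


# Hypothetical registry; real entries would come from your own tool catalog.
REGISTRY = {
    "crm.search_leads": ToolSpec("crm.search_leads", Risk.READ),
    "crm.update_lead": ToolSpec("crm.update_lead", Risk.WRITE),
    "payments.issue_refund": ToolSpec("payments.issue_refund", Risk.HIGH),
}


def authorize(tool_name: str, approved: bool = False) -> str:
    """Return the route a call may take, or raise if the tool is unknown."""
    spec = REGISTRY.get(tool_name)
    if spec is None:
        raise PermissionError(f"unknown tool: {tool_name}")
    if spec.risk is Risk.READ:
        return "execute"
    if spec.risk is Risk.WRITE:
        return "execute" if approved else "needs_approval"
    # High-risk actions never run on the normal agent path, approved or not.
    return "isolated_route"
```

Because the decision depends only on the registry entry and the approval flag, an injected instruction cannot upgrade a read into a write by rewording the request.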
3.2 Don’t make the Planner “all-powerful”
The Planner outputs decomposition, ordering, and stop conditions, but fully delegating this to an LLM introduces plan variance that degrades operational quality. In practice, fixed templates + constrained generation is more stable. Example: define a Plan JSON Schema and validate the LLM output before execution. This is close to Copilot Studio-style flow design (topics/flows), where critical branches are preserved declaratively.
3.3 Split memory by purpose
Memory is not just “chat history.” Recommended separation: (1) short-term session memory (Redis, TTL=30–120 minutes), (2) user preferences/roles (persistent, derived from Directory/Graph), (3) task state (workflow state, persistent), and (4) knowledge (vector DB). Mixing them causes conflicts between deletion requirements (GDPR/internal policies) and audit requirements.
Example Plan JSON Schema (the schema referred to in 3.2):
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "AgentPlan",
  "type": "object",
  "required": ["goal", "steps", "stop_conditions"],
  "properties": {
    "goal": {"type": "string"},
    "steps": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id", "tool", "input", "risk"],
        "properties": {
          "id": {"type": "string"},
          "tool": {"type": "string"},
          "input": {"type": "object"},
          "risk": {"type": "string", "enum": ["read", "write", "high"]}
        }
      }
    },
    "stop_conditions": {"type": "array", "items": {"type": "string"}}
  }
}
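Validating the LLM's plan before execution might look like the sketch below. It is a hand-written check that mirrors only the required fields and the `risk` enum of the AgentPlan schema, using the standard library; a full JSON Schema validator would likely replace it in a real system. The function name `validate_plan` is an assumption.

```python
import json

ALLOWED_RISK = {"read", "write", "high"}


def validate_plan(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the plan may run."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(plan, dict):
        return ["plan must be an object"]

    errors = []
    for key in ("goal", "steps", "stop_conditions"):
        if key not in plan:
            errors.append(f"missing required field: {key}")

    steps = plan.get("steps", [])
    if not isinstance(steps, list):
        return errors + ["steps must be an array"]
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            errors.append(f"steps[{i}] must be an object")
            continue
        for key in ("id", "tool", "input", "risk"):
            if key not in step:
                errors.append(f"steps[{i}] missing {key}")
        risk = step.get("risk")
        if "risk" in step and (not isinstance(risk, str) or risk not in ALLOWED_RISK):
            errors.append(f"steps[{i}] has invalid risk: {risk!r}")
    return errors
```

The orchestrator would refuse to execute any plan for which this returns a non-empty list, rather than asking the LLM to interpret the problem.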
3. Technical Section ②: Context Engineering (RAG + Graph + situational inputs)
3.1 Prefer “normalized context” over “long context”
Even as million-token contexts become feasible, what matters in enterprise systems is not “length” but “formatting.” Concretely: (a) bundle referenced evidence with IDs, (b) explicitly state roles/permissions/tenant boundaries, and (c) normalize deadlines, units, and currencies. This shifts failure modes from LLM hallucinations to data inconsistencies—which are debuggable by design.
3.2 A Graph-first retrieval strategy
In Microsoft 365 scenarios, it’s effective to pull Graph first as structured data (users, groups, documents, calendars, email). Vector search is strong for fuzzy matching, but Graph is stronger for constraints such as permission inheritance, ownership, and last-updated timestamps. Recommended strategy: “narrow the candidate set with Graph → supplement full-text evidence with RAG.”
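The "narrow with Graph, then supplement with RAG" order can be illustrated with in-memory data. This is only a sketch: the permission filter stands in for a Graph query, the term-overlap score stands in for vector search, and `graph_first_retrieve` and the document fields are hypothetical.

```python
def graph_first_retrieve(user, query, documents, top_k=3):
    """Stage 1: filter on structured metadata. Stage 2: rank the survivors."""
    # Stage 1 (stand-in for a Graph query): keep only documents the user may read.
    candidates = [d for d in documents if user in d["readers"]]

    # Stage 2 (stand-in for vector search): naive term-overlap score.
    terms = set(query.lower().split())
    scored = []
    for doc in candidates:
        score = len(terms & set(doc["text"].lower().split()))
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc["id"] for _, doc in scored[:top_k]]
```

Because the permission filter runs first, a document the user cannot read never reaches the ranking stage, however well it matches the query.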
3.3 Minimizing context (cost/latency control)
Agents tend to increase tool calls, and context growth directly impacts cost and latency. Stable implementation patterns include: (1) fixing “summary memory” to 1–2KB, (2) clipping evidence documents to a maximum of N items (e.g., N=5) and 200–400 tokens each, and (3) fetching more only on failure. Before increasing context, prioritize locking down input schemas.
# context-policy.yaml
context:
  max_documents: 5
  max_tokens_per_doc: 350
  session_summary_max_tokens: 300
  prefer_sources:
    - graph
    - rag
  pii_redaction: true
  citation_required: true
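Applying the `max_documents` and `max_tokens_per_doc` limits from this policy could be as simple as the following sketch. Tokens are approximated by whitespace splitting, which is an assumption; a real system would count with the model's own tokenizer. `clip_context` is a hypothetical helper name.

```python
def clip_context(docs, max_documents=5, max_tokens_per_doc=350):
    """Keep the top-ranked documents and truncate each to a token budget.

    `docs` is a list of (doc_id, text) pairs already sorted by relevance.
    The ID travels with the text so every piece of evidence stays citable.
    """
    clipped = []
    for doc_id, text in docs[:max_documents]:
        tokens = text.split()  # rough approximation of tokenization
        clipped.append((doc_id, " ".join(tokens[:max_tokens_per_doc])))
    return clipped
```

Fetching more evidence only on failure then becomes a second call with a larger `max_documents`, rather than a larger default.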
3. Technical Section ③: Tool Integration and Protocols (MCP/A2A)
3.1 Why externalize “tool definitions” with MCP
MCP (Model Context Protocol) provides a common integration surface for LLMs/agents to access external tools and data sources. The key technical benefits are: (a) clear input/output schemas for tools, (b) easy swapping of backends, and (c) shifting audit and authorization to the gateway side. As a result, agent implementation moves from “prompt craftsmanship” to “integration architecture.”
3.2 Fix responsibility boundaries with A2A (Agent-to-Agent)
A single massive agent tends to make more routing mistakes as the number of tools grows. Operationally, it’s stronger to split into domain agents (sales development, accounting, ITSM, etc.) and delegate via a top-level router in an A2A style. The key is to specify “allowed operations,” “expected output schema,” and “deadline” at delegation time to avoid black-box integrations.
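One way to make the delegation contract explicit is to attach it to every hand-off and check the reply against it. The sketch below is not the A2A wire format; `Delegation`, `accept_reply`, and the field names are hypothetical and only illustrate the three items named above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    target_agent: str
    allowed_operations: frozenset  # operations the delegate may perform
    output_required: tuple         # keys the reply must contain
    deadline_epoch_s: float        # absolute deadline, seconds since epoch


def accept_reply(d: Delegation, operations_used, reply: dict, now=None) -> bool:
    """Accept a delegate's reply only if it stayed inside the contract."""
    now = time.time() if now is None else now
    if now > d.deadline_epoch_s:
        return False  # too late: the router should already have timed out
    if not set(operations_used) <= d.allowed_operations:
        return False  # the delegate did something it was not allowed to do
    return all(key in reply for key in d.output_required)
```

Rejecting a reply outright is deliberately blunt; the point is that the router, not the delegate, decides whether the contract was honored.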
3.3 Example: Exposing an internal CRM via an MCP server
// mcp-tooling.json (conceptual example; schema keys use the camelCase names from MCP tool definitions)
{
  "tools": [
    {
      "name": "crm.search_leads",
      "description": "Search leads by firmographics and last activity",
      "inputSchema": {
        "type": "object",
        "properties": {
          "industry": {"type": "string"},
          "min_score": {"type": "number"},
          "updated_after": {"type": "string", "format": "date-time"}
        },
        "required": ["min_score"]
      },
      "outputSchema": {
        "type": "object",
        "properties": {
          "leads": {"type": "array", "items": {"type": "object"}}
        },
        "required": ["leads"]
      }
    }
  ]
}
3. Technical Section ④: Execution Controls (approvals, idempotency, exception handling, retries)
3.1 “Human-in-the-loop” is a control point, not a feature
The practical enterprise approach is graduated autonomy, not “full autonomy.” For write operations, define per operation type: (a) auto-execute, (b) post-notify, (c) pre-approve, or (d) prohibit. Copilot/Studio can lead with “business optimization” and “security protection” precisely because these control points connect to existing identity and audit systems.
3.2 Idempotency keys and workflow state
Agents retry tool calls. To prevent duplicate records in CRM updates or ticket creation, pass an idempotency_key (e.g., hash(user, task_id, step_id)) to every write API. In addition, persist per-step state transitions (PENDING/RUNNING/SUCCEEDED/FAILED) so execution can resume. Without this, humans end up cleaning up after “it crashed mid-way.”
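A minimal sketch of both ideas follows, assuming SHA-256 as the hash and an in-memory stand-in for the write API. `idempotency_key`, `transition`, and `FakeTicketApi` are hypothetical names; in production the state would live in a persistent store, not in process memory.

```python
import hashlib

# Allowed per-step transitions; FAILED -> RUNNING models a retry or resume.
TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED"},
    "FAILED": {"RUNNING"},
    "SUCCEEDED": set(),
}


def idempotency_key(user: str, task_id: str, step_id: str) -> str:
    """Derive a stable key so a retried step maps to the same write."""
    raw = "\x1f".join((user, task_id, step_id)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


class FakeTicketApi:
    """In-memory stand-in for a write API that honors idempotency keys."""

    def __init__(self):
        self._by_key = {}
        self.created = 0

    def create_ticket(self, key: str, title: str) -> dict:
        if key in self._by_key:  # replayed request: return the stored result
            return self._by_key[key]
        self.created += 1
        ticket = {"id": self.created, "title": title}
        self._by_key[key] = ticket
        return ticket
```

Calling `create_ticket` twice with the same key creates one record, which is exactly what a retry after a timeout needs.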
3.3 Don’t let the LLM “interpret” exceptions
Return exceptions in structured form (HTTP 409/422, etc.). If you pass through “natural-language error messages” as-is, the LLM may misinterpret them and invent risky workarounds. Recommended order: error code → handler (fixed logic) → only then, if needed, have the LLM generate an explanation.
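# pseudo-code: tool, retry_with_backoff, request_human_approval, and
# fail_fast_and_log are placeholders for your own implementations
try:
    res = tool.call(input, idempotency_key=key)
except ToolError as e:
    if e.code in {409, 429}:
        # conflict or rate limit: retrying is safe because the key deduplicates writes
        retry_with_backoff()
    elif e.code == 403:
        # permission problem: escalate to a human instead of working around it
        request_human_approval(reason="permission")
    else:
        # anything else: stop and record the structured error
        fail_fast_and_log(e.to_struct())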
3. Technical Section ⑤: Performance Design (latency, cost, quality)
3.1 The critical path is “I/O,” not the LLM
In business agents, latency is often dominated not by LLM inference but by the accumulation of Graph/DB/API calls. Effective design levers include: (1) parallelizing reads, (2) caching tool results (TTL 30–300 seconds), and (3) capping the number of steps (e.g., max_steps=8). Also, switch models by task difficulty (lightweight models for classification/extraction; higher-tier models for final decisions).
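Levers (1) and (2) can be combined in a few lines of `asyncio`. In this sketch `fetch` simulates a Graph/DB/API call with a sleep, and `TTLCache`, `gather_context`, and `CALLS` are hypothetical names introduced for illustration.

```python
import asyncio
import time

CALLS = []  # records every simulated backend call, for illustration only


class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._items = {}

    def get(self, key):
        hit = self._items.get(key)
        if hit is not None and time.monotonic() - hit[0] < self.ttl:
            return hit[1]
        return None

    def put(self, key, value):
        self._items[key] = (time.monotonic(), value)


async def fetch(source: str) -> str:
    CALLS.append(source)
    await asyncio.sleep(0.01)  # stands in for network latency
    return f"{source}:data"


async def gather_context(sources, cache: TTLCache):
    """Run all reads concurrently, serving repeated reads from the cache."""

    async def cached(source):
        value = cache.get(source)
        if value is None:
            value = await fetch(source)
            cache.put(source, value)
        return value

    return await asyncio.gather(*(cached(s) for s in sources))
```

With three sources the reads overlap instead of queuing, and a second call within the TTL touches no backend at all. Only read tools should be cached this way; write results belong to the idempotency path.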
3.2 Benchmark (reference values) 📊
The following is a reference benchmark assuming a “sales development agent (lead extraction → email draft → CRM update)” flow (same network, API p95=250ms, LLM in the same region). In real environments, expect +10–30% due to security gateways and auditing.
| Configuration | Model | Tool calls | p50 latency | p95 latency | Estimated cost/task | Success rate (auto-complete) |
|---|---|---|---|---|---|---|
| Single-shot RAG (answer only) | GPT-4.1 class | 2 | 3.8s | 7.2s | $0.06 | — (no execution) |
| Agent (no approvals) | GPT-4.1 class + lightweight extraction | 6 | 9.5s | 18.4s | $0.14 | 78% |
| Agent (pre-approval for writes) | Same as above | 6 | 10.2s | 20.1s | $0.15 | 92% (after approval) |
3.3 Shift quality metrics from “accuracy” to business KPIs
Agent evaluation should not rely on NLP metrics like BLEU. Measure: (a) on-time completion rate, (b) rework rate, (c) audit finding rate, and (d) erroneous update rate (incident rate for writes). Copilot-style value is not “artifact generation” but “process execution,” so the evaluation axis must shift accordingly.
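The four metrics reduce to simple ratios over task records. The record fields below (`completed_on_time`, `reworked`, `audit_finding`, `wrote`, `bad_write`) are hypothetical; substitute whatever your workflow store actually captures.

```python
def business_kpis(tasks):
    """Compute the four business KPIs from a list of task records."""
    total = len(tasks)
    if total == 0:
        raise ValueError("no tasks to evaluate")
    writes = [t for t in tasks if t["wrote"]]
    return {
        "on_time_completion_rate": sum(t["completed_on_time"] for t in tasks) / total,
        "rework_rate": sum(t["reworked"] for t in tasks) / total,
        "audit_finding_rate": sum(t["audit_finding"] for t in tasks) / total,
        # Erroneous updates are measured against write tasks only.
        "erroneous_update_rate": (
            sum(t["bad_write"] for t in writes) / len(writes) if writes else 0.0
        ),
    }
```

Measuring erroneous updates against writes only keeps read-heavy workloads from hiding a poor write-side incident rate.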
3. Technical Section ⑥: Security and Governance (identity, permissions, auditing, data boundaries)
3.1 Least privilege + delegation (On-behalf-of)
Early on, decide whether the enterprise agent runs under the end user’s permissions or a service account. The recommended approach is On-behalf-of (OBO), which preserves the user’s permission boundary while issuing short-lived tokens to the agent. Long-lived tokens or shared secrets significantly increase blast radius if leaked.
3.2 Don’t “detect” prompt injection—contain it
Detection has limits. Containment requires: (1) schema validation for tool inputs, (2) policy checks before tool execution (DLP/classification/recipient controls), (3) mandatory citations for outputs, and (4) sanitizing external content (email/HTML). In particular, cut off the path where “instructions embedded in email bodies” propagate into tool execution.
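Containment point (2) can be sketched as a deterministic check that runs before any tool call. The tool names, the allowed domain, and the `triggered_by` label are assumptions made for illustration; the check itself involves no LLM.

```python
WRITE_TOOLS = {"mail.send", "crm.update_lead"}  # hypothetical tool names
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}     # assumption: internal domain


def policy_check(tool: str, args: dict, triggered_by: str):
    """Return (allowed, reason).

    `triggered_by` is "user" when the step comes from the user's own request
    and "external_content" when it was derived from an email body, web page,
    or other untrusted input.
    """
    if tool in WRITE_TOOLS and triggered_by != "user":
        return False, "write requested by untrusted content"
    if tool == "mail.send":
        for recipient in args.get("to", []):
            domain = recipient.rsplit("@", 1)[-1].lower()
            if domain not in ALLOWED_RECIPIENT_DOMAINS:
                return False, f"recipient outside allowed domains: {recipient}"
    return True, "ok"
```

An instruction hidden in an email body can still reach the model, but under this rule it cannot reach a write tool.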
3.3 Audit logs are about reproducibility, not after-the-fact reporting
What audits need is not only “what happened,” but the ability to reproduce “why that decision was made.” At minimum, store: (a) inputs (mask PII), (b) referenced data source IDs, (c) the generated plan, (d) executed tools and arguments (tokenize secrets), (e) approver and timestamp, and (f) model/prompt/policy versions. If this is weak, incident response becomes “not reproducible.”
{
  "trace_id": "01J...",
  "model": "gpt-4.1",
  "policy_version": "2026-01-15",
  "user": {"id": "aad:...", "role": "Sales"},
  "citations": ["doc:sharepoint:123", "crm:lead:889"],
  "plan": {"goal": "...", "steps": ["..."]},
  "actions": [
    {"tool": "crm.update_lead", "idempotency_key": "...", "status": "SUCCEEDED"}
  ]
}
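Building such a record with the input masked before it is stored might look like this sketch. The regular expression masks e-mail addresses only, which is a deliberate simplification; `mask_pii` and `build_audit_record` are hypothetical helpers, and the `versions` field is an assumed place for the model/prompt/policy versions.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(text: str) -> str:
    """Mask e-mail addresses only; real PII detection needs far more than this."""
    return EMAIL.sub("[EMAIL]", text)


def build_audit_record(trace_id, user_input, citations, plan, actions, versions):
    """Serialize one reproducible audit entry as a JSON string."""
    record = {
        "trace_id": trace_id,
        "input": mask_pii(user_input),  # mask before the text is ever stored
        "citations": list(citations),
        "plan": plan,
        "actions": actions,
        "versions": versions,           # model / prompt / policy versions
    }
    return json.dumps(record, sort_keys=True)
```

Masking at write time, rather than at read time, means the raw value never lands in the audit store in the first place.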
3. Technical Section ⑦: Scalability and Operations (multi-tenant, observability, evaluation)
3.1 The enemies of scale are “state” and downstream rate limits
Agents tend to become stateful. To scale, keep the orchestrator as stateless as possible and offload state to external stores (Redis/PostgreSQL/Cosmos DB, etc.). Next, rate limits on integrations like Graph/CRM/ITSM become bottlenecks, so smooth traffic with queues (e.g., Azure Service Bus) and separate execution capacity by priority (P0/P1).
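Priority separation combined with a downstream cap can be sketched with a heap. In production a managed queue would play this role; `PriorityDispatcher` and its per-tick limit are hypothetical and only show the ordering logic.

```python
import heapq
import itertools


class PriorityDispatcher:
    """Release at most `per_tick_limit` jobs per tick, lowest priority number first."""

    def __init__(self, per_tick_limit: int):
        self.per_tick_limit = per_tick_limit
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, priority: int, job: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def drain_tick(self) -> list:
        """Pop the jobs allowed to hit the downstream API during this tick."""
        batch = []
        while self._heap and len(batch) < self.per_tick_limit:
            batch.append(heapq.heappop(self._heap)[2])
        return batch
```

Setting `per_tick_limit` just under the downstream rate limit lets P0 work pass first while P1 work waits instead of failing with HTTP 429.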
3.2 Observability: unify tracing with OpenTelemetry
When LLM calls, tool calls, approval waits, and retries are all on the same trace, you can pinpoint the cause of p95 degradation immediately. Minimum metrics include: (a) tool_call_count, (b) tokens_in/out, (c) retry_count, (d) policy_denied_rate, and (e) human_approval_rate. With these, improvement shifts from “intuition” to “engineering.”
3.3 Continuous evaluation (Evals) is part of production
From 2025 onward, model updates are frequent, and yesterday’s optimum can regress today. Recommended approach: (1) fix a set of golden tasks (50–200), (2) validate structure + business rules, (3) replay write flows in a sandbox environment, and (4) auto-rollback when scores drop below thresholds—integrated into CI/CD. Agents are applications; beyond MLOps, you need AgentOps.
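# otel-collector.yaml (excerpt example)
receivers:
  otlp:
    protocols:
      http:
exporters:
  otlphttp:
    # traces_endpoint takes the full URL; plain "endpoint" is a base URL
    # to which the exporter appends /v1/traces itself
    traces_endpoint: "https://otel-gateway.internal/v1/traces"
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]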
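The evaluation gate described in 3.3 can be reduced to a single decision function that a CI/CD job calls after replaying the golden tasks. The result fields (`schema_ok`, `rules_ok`), the `max_drop` tolerance, and the name `evaluate_release` are assumptions for illustration.

```python
def evaluate_release(results, baseline_pass_rate: float, max_drop: float = 0.02):
    """Decide whether a new model/prompt/policy version may be promoted.

    A golden task passes only if both its structure and its business rules
    validate. The release is rolled back when the pass rate falls more than
    `max_drop` below the recorded baseline.
    """
    if not results:
        raise ValueError("no golden-task results")
    passed = sum(1 for r in results if r["schema_ok"] and r["rules_ok"])
    pass_rate = passed / len(results)
    decision = "rollback" if pass_rate < baseline_pass_rate - max_drop else "promote"
    return decision, pass_rate
```

Keeping the gate this small makes the threshold an explicit, reviewable number rather than a judgment call made during an incident.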
4. Comparative Analysis Table
Microsoft 365 Copilot/Agent Studio’s “business optimization with security by default” is powerful, but you don’t need to lock everything into a single vendor. Design options broadly fall into three categories: (1) suite-embedded, (2) low-code integration, and (3) custom orchestration.
| Option | Examples | Strengths | Weaknesses | Best fit | Governance fit |
|---|---|---|---|---|---|
| ① Suite-embedded | Microsoft 365 Copilot + Agents | 🔧 Easier integration of identity/audit/data boundaries. Rich business surfaces | Deep integration with external SaaS requires extension design. Flexibility is constrained | M365-centric knowledge and business automation | High (easy to connect to existing controls) |
| ② Low-code integration | Copilot Studio / various iPaaS | ⚙️ Fast time-to-value. Business teams can iterate improvements quickly | Limited for complex idempotency/exception handling/evaluation hardening | Standardized flows, connector-driven automation | Medium (varies by design) |
| ③ Custom orchestration | In-house Agent Orchestrator + MCP + OTel | 📊 Optimize performance/control/auditing to requirements. Easier A2A responsibility splitting | Higher upfront cost. Operations (AgentOps) is mandatory | Core systems, multi-domain workflows, strict controls | Highest (assumes deep customization) |
5. Best Practices and Anti-Patterns
Best Practices ✅
- ⚙️ Separate read/write/high-risk by permissions, routing, and approvals
- 🔧 Lock Plan/Tool I/O with JSON Schema and require validation
- 📊 Keep audit logs with “model/policy/evidence IDs/execution arguments (masked)” to ensure reproducibility
- Introduce idempotency keys for all write APIs to prevent duplicate execution on retries
- Use two-stage retrieval (Graph → RAG) to balance permission constraints and fuzzy matching
- Unify LLM/tool/approval waits into a single trace with OpenTelemetry
Anti-Patterns ❌
- Embedding requirements in “just a long prompt” (hard to change, impossible to audit)
- Running the agent with a fully privileged service account
- Passing errors to the LLM as free-form text and letting it execute ad-hoc workarounds
- Feeding unlimited RAG results, increasing cost/latency/data leakage risk
- Evaluating with “does it sound right” checks and missing write-side incidents
6. Implementation Roadmap and Checklist
Phase 0: Requirements definition (1–2 weeks)
- Classify target workflows as “read-heavy” or “write-heavy”
- Document prohibitions and approval conditions for high-risk actions (payments/contracts/external sending)
- Define success criteria using business KPIs (completion rate, rework rate, audit finding rate)
Phase 1: Minimal agent (2–4 weeks)
- 🔧 Schema-ize Tool I/O (JSON Schema)
- Constrain RAG/Graph sources (e.g., max_documents=5)
- Implement audit logs (trace_id, citations, policy_version)
Phase 2: Stronger execution controls (4–8 weeks)
- ⚙️ Idempotency keys, workflow state management, and resume mechanisms
- Approval flows (pre-approval/two-person approval) and policy engine integration
- Rate-limit countermeasures (queues, backoff, circuit breakers)
Phase 3: Collaboration and scale (ongoing)
- Split domains with A2A (sales/accounting/ITSM)
- 📊 Integrate OpenTelemetry + Evals into CI/CD (regression detection)
- Externalize tool definitions with MCP to make backends swappable
Final checklist
- Authorization: OBO/short-lived tokens and least privilege are enforced
- Data: PII masking, DLP, and tenant boundaries are designed
- Execution: idempotency, exception handling, and stop conditions (max_steps, etc.) exist
- Audit: evidence IDs, policy/model versions, and approvers are traceable
- Operations: p95 latency, cost ceilings, and automated regression tests are in place
7. Reference Resources and Next Steps
- Microsoft 365 Copilot Agents / Agent 365 (understand product philosophy and adoption path)
- Copilot Studio (low-code flow design, adding actions, iterative testing)
- Tracing foundation: OpenTelemetry Collector v0.103+ (standardize internal observability)
- Data foundation: PostgreSQL 16 / Redis 7.2 (separate state and short-term memory)
- Next steps: ① start with a read-only agent to harden auditing and citations → ② enable writes with approvals → ③ split domains with A2A → ④ standardize the integration surface with MCP