Deep Dive into AI Agent Implementation: Solving Autonomy, Protocols, and Governance in the Copilot/Agent Studio Era with Architecture

1. Executive Summary (Technical Summary)

An AI agent is an execution architecture that connects LLM reasoning to business data (RAG/Graph), tool execution (APIs/workflows), state management (short-/long-term memory), and safety controls (policy/audit), and keeps looping until a task is completed. Business-optimized designs such as Microsoft 365 Copilot/Agent 365 build on identity, permissions, and audit logs to deliver “autonomy that can actually run inside the enterprise.” Meanwhile, protocols such as MCP and A2A have matured enough to make interoperability and the division of responsibility between agents practical. This article examines the design decisions, with concrete numbers, across implementation, performance, security, and scale.

2. Technical Background and Challenges (Architecture explanation, existing issues)

Traditional “chatbot + RAG” setups can generate answers but often fail to complete the work. There are four main reasons: (1) delegation of execution privileges is unclear, (2) failures are not retried or handled as exceptions, (3) auditing and accountability are weak, and (4) multi-system integration degenerates into a spaghetti of one-off implementations. The emphasis in Copilot and Copilot Studio on “what it knows (data/memory),” “what it processes (reasoning/planning),” and “what it executes (actions)” is precisely the decomposition that fills these gaps.

Technical flow (verbalizing the diagram) 🔧: User instruction → (A) context collection (Graph/RAG/history/policy) → (B) plan generation (decomposition, priority, stop conditions) → (C) tool selection and execution (CRM updates, ticket creation, expense submission, etc.) → (D) result verification (schema validation/consistency/policy) → (E) audit logs and memory updates → human approval if needed → completion. The key point is that the LLM is not treated as “central control,” but as one component of orchestration.

[User]
  |
  v
[Agent Orchestrator]
  |--(A) Context Builder: RAG + Graph + Memory + Policy
  |--(B) Planner: task decomposition / stop conditions
  |--(C) Tool Router: MCP/REST/Workflow
  |--(D) Verifier: schema + business rules + safety
  |--(E) Audit+Telemetry: traces/PII redaction
  v
[Systems: M365/CRM/ERP/ITSM/Data Lake]

Existing issues ⚙️: prompt dependence (requirements buried in natural language), unclear permission boundaries (over-privilege/impersonation), lack of observability (can’t trace why it failed), and exploding cost and latency (huge context + multi-hop tool calls). As a 2025 trend, “context engineering,” “MCP,” and “governance features” are becoming baseline technologies.

3. Technical Section ①: Splitting Agent Responsibilities (Orchestrator/Planner/Tools/Memory)

3.1 Separate read paths from write paths

The first decision in enterprise agent design is separating read from write. Read paths use RAG/Graph retrieval to understand the current state, while write paths cause side effects such as CRM updates or ticket creation. If both are granted under the same tool permissions, prompt injection and similar attacks can immediately lead to incidents. The recommended approach is a three-layer structure: (a) read-only tool set, (b) write tool set (approval/conditions required), and (c) high-risk actions (payments/contracts) isolated into a separate route.
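
The three-layer structure can be sketched as a small gate in front of the tool registry. This is a minimal illustration under stated assumptions, not a product API: `Risk`, `Tool`, `ToolGate`, and the tool name `erp.pay_invoice` are hypothetical, and the lambdas stand in for real backends.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Risk(Enum):
    READ = "read"
    WRITE = "write"
    HIGH = "high"


@dataclass(frozen=True)
class Tool:
    name: str
    risk: Risk
    fn: Callable[[dict], dict]


class ToolGate:
    """Routes tool calls by risk tier (hypothetical helper for illustration)."""

    def __init__(self, tools: Dict[str, Tool]):
        self._tools = tools

    def call(self, name: str, args: dict, approved: bool = False) -> dict:
        tool = self._tools[name]
        if tool.risk is Risk.HIGH:
            # High-risk actions never run on the normal route, even if approved.
            raise PermissionError(f"{name}: high-risk actions use a separate route")
        if tool.risk is Risk.WRITE and not approved:
            raise PermissionError(f"{name}: write requires approval")
        return tool.fn(args)


# Illustrative registry; the lambdas are stand-ins for real integrations.
gate = ToolGate({
    "crm.search_leads": Tool("crm.search_leads", Risk.READ, lambda a: {"leads": []}),
    "crm.update_lead": Tool("crm.update_lead", Risk.WRITE, lambda a: {"updated": a["id"]}),
    "erp.pay_invoice": Tool("erp.pay_invoice", Risk.HIGH, lambda a: {"paid": True}),
})
```

Because the gate, not the prompt, enforces the tier, an injected instruction can at most request a write; it cannot grant itself the approval.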

3.2 Don’t make the Planner “all-powerful”

The Planner outputs decomposition, ordering, and stop conditions, but fully delegating this to an LLM introduces plan variance that degrades operational quality. In practice, fixed templates + constrained generation is more stable. Example: define a Plan JSON Schema and validate the LLM output before execution. This is close to Copilot Studio-style flow design (topics/flows), where critical branches are preserved declaratively.

3.3 Split memory by purpose

Memory is not just “chat history.” Recommended separation: (1) short-term session memory (Redis, TTL=30–120 minutes), (2) user preferences/roles (persistent, derived from Directory/Graph), (3) task state (workflow state, persistent), and (4) knowledge (vector DB). Mixing them causes conflicts between deletion requirements (GDPR/internal policies) and audit requirements.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "AgentPlan",
  "type": "object",
  "required": ["goal", "steps", "stop_conditions"],
  "properties": {
    "goal": {"type": "string"},
    "steps": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id", "tool", "input", "risk"],
        "properties": {
          "id": {"type": "string"},
          "tool": {"type": "string"},
          "input": {"type": "object"},
          "risk": {"type": "string", "enum": ["read", "write", "high"]}
        }
      }
    },
    "stop_conditions": {"type": "array", "items": {"type": "string"}}
  }
}
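
Validating planner output against the AgentPlan schema above might look like the following sketch. It hand-rolls the checks with the standard library so it stays self-contained; in practice a JSON Schema validator library would likely replace `validate_plan`, which is a hypothetical name.

```python
import json

ALLOWED_RISK = {"read", "write", "high"}


def validate_plan(raw: str) -> dict:
    """Parse LLM output and enforce the required fields of the AgentPlan schema.

    Raises ValueError (json.JSONDecodeError is a subclass) on any violation.
    """
    plan = json.loads(raw)
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    for key in ("goal", "steps", "stop_conditions"):
        if key not in plan:
            raise ValueError(f"missing required field: {key}")
    if not isinstance(plan["goal"], str):
        raise ValueError("goal must be a string")
    if not isinstance(plan["steps"], list):
        raise ValueError("steps must be an array")
    for i, step in enumerate(plan["steps"]):
        if not isinstance(step, dict):
            raise ValueError(f"steps[{i}] must be an object")
        for key in ("id", "tool", "input", "risk"):
            if key not in step:
                raise ValueError(f"steps[{i}] missing required field: {key}")
        if not isinstance(step["id"], str) or not isinstance(step["tool"], str):
            raise ValueError(f"steps[{i}].id and .tool must be strings")
        if not isinstance(step["input"], dict):
            raise ValueError(f"steps[{i}].input must be an object")
        if not isinstance(step["risk"], str) or step["risk"] not in ALLOWED_RISK:
            raise ValueError(f"steps[{i}].risk must be one of {sorted(ALLOWED_RISK)}")
    stops = plan["stop_conditions"]
    if not isinstance(stops, list) or not all(isinstance(s, str) for s in stops):
        raise ValueError("stop_conditions must be an array of strings")
    return plan
```

The orchestrator would execute only plans that pass this check and treat a ValueError as a planning failure rather than asking the LLM to explain it.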

3. Technical Section ②: Context Engineering (RAG + Graph + situational inputs)

3.1 Prefer “normalized context” over “long context”

Even as million-token contexts become feasible, what matters in enterprise systems is not “length” but “formatting.” Concretely: (a) bundle referenced evidence with IDs, (b) explicitly state roles/permissions/tenant boundaries, and (c) normalize deadlines, units, and currencies. This shifts failure modes from LLM hallucinations to data inconsistencies, which are debuggable by design.

3.2 A Graph-first retrieval strategy

In Microsoft 365 scenarios, it’s effective to pull Graph first as structured data (users, groups, documents, calendars, email). Vector search is strong for fuzzy matching, but Graph is stronger for constraints such as permission inheritance, ownership, and last-updated timestamps. Recommended strategy: “narrow the candidate set with Graph → supplement full-text evidence with RAG.”
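
The “narrow with Graph, then supplement with RAG” order can be illustrated with an in-memory sketch. Everything here is an assumption made for illustration: `Doc`, `graph_filter`, `rank_by_overlap`, and `retrieve` are hypothetical names, the group check stands in for a real Graph permission query, and word overlap stands in for vector similarity.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Doc:
    doc_id: str
    allowed_groups: Set[str]
    text: str


def graph_filter(docs: List[Doc], user_groups: Set[str]) -> List[Doc]:
    """Stage 1: keep only documents the user may read (stand-in for a Graph query)."""
    return [d for d in docs if d.allowed_groups & user_groups]


def rank_by_overlap(docs: List[Doc], query: str, top_k: int = 5) -> List[Doc]:
    """Stage 2: toy relevance via shared words (stand-in for vector search)."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.text.lower().split())), d) for d in docs]
    scored = [s for s in scored if s[0] > 0]  # drop documents with no overlap
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def retrieve(docs: List[Doc], user_groups: Set[str], query: str) -> List[Doc]:
    # Permission filtering happens before ranking, so a highly relevant but
    # forbidden document can never reach the context.
    return rank_by_overlap(graph_filter(docs, user_groups), query)
```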

3.3 Minimizing context (cost/latency control)

Agents tend to increase tool calls, and context growth directly impacts cost and latency. Stable implementation patterns include: (1) fixing “summary memory” to 1–2KB, (2) clipping evidence documents to a maximum of N items (e.g., N=5) and 200–400 tokens each, and (3) fetching more only on failure. Before increasing context, prioritize locking down input schemas.

# context-policy.yaml
context:
  max_documents: 5
  max_tokens_per_doc: 350
  session_summary_max_tokens: 300
  prefer_sources:
    - graph
    - rag
  pii_redaction: true
  citation_required: true
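
Applying the `max_documents` and `max_tokens_per_doc` limits from the policy above could look like this sketch. `build_context` is a hypothetical helper, and tokens are approximated by whitespace-separated words, which is an assumption; a real implementation would count with the model's tokenizer.

```python
def build_context(docs, max_documents=5, max_tokens_per_doc=350):
    """Clip ranked evidence to the policy limits, keeping IDs for citations.

    `docs` is a list of {"id": str, "text": str}, already ordered by relevance.
    """
    clipped = []
    for doc in docs[:max_documents]:
        words = doc["text"].split()  # crude token approximation
        clipped.append({"id": doc["id"], "text": " ".join(words[:max_tokens_per_doc])})
    return clipped
```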

3. Technical Section ③: Tool Integration and Protocols (MCP/A2A)

3.1 Why externalize “tool definitions” with MCP

MCP (Model Context Protocol) provides a common integration surface for LLMs/agents to access external tools and data sources. The key technical benefits are: (a) clear input/output schemas for tools, (b) easy swapping of backends, and (c) shifting audit and authorization to the gateway side. As a result, agent implementation moves from “prompt craftsmanship” to “integration architecture.”

3.2 Fix responsibility boundaries with A2A (Agent-to-Agent)

A single massive agent tends to make more routing mistakes as the number of tools grows. Operationally, it’s stronger to split into domain agents (sales development, accounting, ITSM, etc.) and delegate via a top-level router in an A2A style. The key is to specify “allowed operations,” “expected output schema,” and “deadline” at delegation time to avoid black-box integrations.
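
One way to make “allowed operations,” “expected output schema,” and “deadline” explicit at delegation time is a small envelope that the router checks when the result comes back. This is not the A2A wire format; `Delegation` and `accept_result` are hypothetical names, and the output check is reduced to required keys for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Delegation:
    target_agent: str
    task: str
    allowed_operations: FrozenSet[str]
    required_output_keys: Tuple[str, ...]
    deadline: datetime  # timezone-aware


def accept_result(
    d: Delegation,
    used_operations: Iterable[str],
    output: dict,
    now: Optional[datetime] = None,
) -> dict:
    """Accept a delegated result only if it stayed inside the contract."""
    now = now or datetime.now(timezone.utc)
    if now > d.deadline:
        raise TimeoutError(f"{d.target_agent} missed the deadline")
    extra = set(used_operations) - d.allowed_operations
    if extra:
        raise PermissionError(f"operations outside the contract: {sorted(extra)}")
    missing = [k for k in d.required_output_keys if k not in output]
    if missing:
        raise ValueError(f"output missing keys: {missing}")
    return output
```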

3.3 Example: Exposing an internal CRM via an MCP server

// mcp-tooling.json (conceptual example; MCP itself uses camelCase field names such as inputSchema)
{
  "tools": [
    {
      "name": "crm.search_leads",
      "description": "Search leads by firmographics and last activity",
      "input_schema": {
        "type": "object",
        "properties": {
          "industry": {"type": "string"},
          "min_score": {"type": "number"},
          "updated_after": {"type": "string", "format": "date-time"}
        },
        "required": ["min_score"]
      },
      "output_schema": {
        "type": "object",
        "properties": {
          "leads": {"type": "array", "items": {"type": "object"}}
        },
        "required": ["leads"]
      }
    }
  ]
}

3. Technical Section ④: Execution Controls (approvals, idempotency, exception handling, retries)

3.1 “Human-in-the-loop” is a control point, not a feature

The practical enterprise approach is graduated autonomy, not “full autonomy.” For each type of write operation, choose one of four modes: (a) auto-execute, (b) post-notify, (c) pre-approve, or (d) prohibit. Copilot and Copilot Studio can take “business optimization” and “security protection” as givens because these control points connect to existing identity and audit systems.
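
The four modes can be expressed as a declarative table that the orchestrator consults before every write. A minimal sketch follows; the `Mode` enum, the `POLICY` entries, and the returned action strings are hypothetical, and unknown operations default to “prohibit” as a conservative assumption.

```python
from enum import Enum


class Mode(Enum):
    AUTO = "auto-execute"
    POST_NOTIFY = "post-notify"
    PRE_APPROVE = "pre-approve"
    PROHIBIT = "prohibit"


# Illustrative policy table; operation names are assumptions.
POLICY = {
    "itsm.create_ticket": Mode.AUTO,
    "crm.update_lead": Mode.POST_NOTIFY,
    "mail.send_external": Mode.PRE_APPROVE,
    "erp.pay_invoice": Mode.PROHIBIT,
}


def decide(operation: str, approved: bool = False) -> str:
    """Return the action the orchestrator should take for a write operation."""
    mode = POLICY.get(operation, Mode.PROHIBIT)  # unknown operations are denied
    if mode is Mode.PROHIBIT:
        return "deny"
    if mode is Mode.PRE_APPROVE and not approved:
        return "wait_for_approval"
    if mode is Mode.POST_NOTIFY:
        return "execute_then_notify"
    return "execute"
```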

3.2 Idempotency keys and workflow state

Agents retry tool calls. To prevent duplicate records in CRM updates or ticket creation, pass an idempotency_key (e.g., hash(user, task_id, step_id)) to every write API. In addition, persist per-step state transitions (PENDING/RUNNING/SUCCEEDED/FAILED) so execution can resume. Without this, humans end up cleaning up after “it crashed mid-way.”
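
The key derivation and per-step state can be sketched as below. SHA-256 is one reasonable choice for the `hash(...)` mentioned above, not something the text prescribes; `StepStore` is an in-memory stand-in for a persistent table, and the transition map is an assumption.

```python
import hashlib

# Assumed legal transitions between the four states named in the text.
TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED"},
    "FAILED": {"RUNNING"},
    "SUCCEEDED": set(),
}


def idempotency_key(user: str, task_id: str, step_id: str) -> str:
    """Derive a stable key; the unit separator avoids ambiguous concatenation."""
    raw = "\x1f".join((user, task_id, step_id)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class StepStore:
    """In-memory stand-in for a persistent workflow-state table."""

    def __init__(self):
        self._state = {}

    def state(self, key: str) -> str:
        return self._state.get(key, "PENDING")

    def transition(self, key: str, new_state: str) -> None:
        current = self.state(key)
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._state[key] = new_state


def run_step(store: StepStore, key: str, write) -> str:
    """Execute a write at most once per key; a retry after success is a no-op."""
    if store.state(key) == "SUCCEEDED":
        return "skipped"
    # A step left RUNNING by a crash is re-sent as-is; the downstream API is
    # expected to deduplicate on the idempotency key.
    if store.state(key) != "RUNNING":
        store.transition(key, "RUNNING")
    try:
        write(key)
    except Exception:
        store.transition(key, "FAILED")
        raise
    store.transition(key, "SUCCEEDED")
    return "executed"
```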

3.3 Don’t let the LLM “interpret” exceptions

Return exceptions in structured form (HTTP 409/422, etc.). If you pass through “natural-language error messages” as-is, the LLM may misinterpret them and invent risky workarounds. Recommended order: error code → handler (fixed logic) → only then, if needed, have the LLM generate an explanation.

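# pseudo-code: fixed handlers keyed on structured error codes
try:
    res = tool.call(payload, idempotency_key=key)
except ToolError as e:
    if e.code in {409, 429}:
        # conflict / rate limit: retry is safe because the call carries an idempotency key
        retry_with_backoff()
    elif e.code == 403:
        # permission problem: escalate to a human instead of working around it
        request_human_approval(reason="permission")
    else:
        # anything else (e.g., 422): stop and record the structured error
        fail_fast_and_log(e.to_struct())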
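
A runnable variant of the pseudo-code above is sketched here. `ToolError`, `call_with_handlers`, the retry limits, and the returned status dictionaries are all assumptions for illustration; the `sleep` parameter is injectable so the backoff can be tested without waiting.

```python
import time


class ToolError(Exception):
    """Structured tool failure carrying an HTTP-style status code."""

    def __init__(self, code: int, message: str = ""):
        super().__init__(message)
        self.code = code

    def to_struct(self) -> dict:
        return {"code": self.code, "message": str(self)}


def call_with_handlers(call, payload, key, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Dispatch on the error code with fixed logic; no LLM interpretation."""
    for attempt in range(max_retries + 1):
        try:
            return {"status": "ok", "result": call(payload, idempotency_key=key)}
        except ToolError as e:
            if e.code in {409, 429} and attempt < max_retries:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
                continue
            if e.code == 403:
                return {"status": "needs_approval", "reason": "permission"}
            # Exhausted retries or a non-retryable code: fail fast.
            return {"status": "failed", "error": e.to_struct()}
```

Only the structured result would be handed to the LLM afterwards, and only if a user-facing explanation is needed.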

3. Technical Section ⑤: Performance Design (latency, cost, quality)

3.1 The critical path is “I/O,” not the LLM

In business agents, latency is often dominated not by LLM inference but by the accumulation of Graph/DB/API calls. Effective design levers include: (1) parallelizing reads, (2) caching tool results (TTL 30–300 seconds), and (3) capping the number of steps (e.g., max_steps=8). Also, switch models by task difficulty (lightweight models for classification/extraction; higher-tier models for final decisions).
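
Levers (1) and (2) combine naturally: independent reads run concurrently and their results are cached for a short TTL. The sketch below uses only `asyncio`; `TTLCache`, `cached_read`, and `gather_context` are hypothetical helpers, and the readers are any zero-argument coroutines.

```python
import asyncio
import time


class TTLCache:
    """Tiny in-process cache with a fixed time-to-live (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: force a fresh fetch
            return None
        return value

    def put(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)


async def cached_read(cache: TTLCache, key: str, fetch):
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = await fetch()
    cache.put(key, value)
    return value


async def gather_context(cache: TTLCache, readers: dict) -> dict:
    """Run all read tools concurrently; `readers` maps a name to a coroutine function."""
    keys = list(readers)
    values = await asyncio.gather(*(cached_read(cache, k, readers[k]) for k in keys))
    return dict(zip(keys, values))
```

With concurrent reads, the read phase costs roughly the slowest call rather than the sum of all calls.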

3.2 Benchmark (reference values) 📊

The following is a reference benchmark assuming a “sales development agent (lead extraction → email draft → CRM update)” flow (same network, API p95=250ms, LLM in the same region). In real environments, expect latency to be 10–30% higher because of security gateways and auditing.

Configuration	Model	Tool calls	p50 latency	p95 latency	Estimated cost/task	Success rate (auto-complete)
Single-shot RAG (answer only)	GPT-4.1 class	2	3.8s	7.2s	$0.06	— (no execution)
Agent (no approvals)	GPT-4.1 class + lightweight extraction	6	9.5s	18.4s	$0.14	78%
Agent (pre-approval for writes)	Same as above	6	10.2s	20.1s	$0.15	92% (after approval)

3.3 Shift quality metrics from “accuracy” to business KPIs

Agent evaluation should not rely on NLP metrics like BLEU. Measure: (a) on-time completion rate, (b) rework rate, (c) audit finding rate, and (d) erroneous update rate (incident rate for writes). Copilot-style value is not “artifact generation” but “process execution,” so the evaluation axis must shift accordingly.
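
The four metrics reduce to simple ratios over task records. The sketch below assumes a hypothetical record shape with boolean fields (`completed_on_time`, `reworked`, `audit_finding`, `wrote`, `erroneous_update`); the erroneous update rate is computed over tasks that actually wrote.

```python
def business_kpis(tasks):
    """Compute the four business KPIs from a list of task records."""
    if not tasks:
        raise ValueError("at least one task record is required")
    n = len(tasks)
    writes = [t for t in tasks if t["wrote"]]
    return {
        "on_time_completion_rate": sum(t["completed_on_time"] for t in tasks) / n,
        "rework_rate": sum(t["reworked"] for t in tasks) / n,
        "audit_finding_rate": sum(t["audit_finding"] for t in tasks) / n,
        # Denominator is write tasks only, since reads cannot corrupt data.
        "erroneous_update_rate": (
            sum(t["erroneous_update"] for t in writes) / len(writes) if writes else 0.0
        ),
    }
```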

3. Technical Section ⑥: Security and Governance (identity, permissions, auditing, data boundaries)

3.1 Least privilege + delegation (On-behalf-of)

Early on, decide whether the enterprise agent runs under the end user’s permissions or a service account. The recommended approach is On-behalf-of (OBO), which preserves the user’s permission boundary while issuing short-lived tokens to the agent. Long-lived tokens or shared secrets significantly increase blast radius if leaked.

3.2 Don’t “detect” prompt injection—contain it

Detection has limits. Containment requires: (1) schema validation for tool inputs, (2) policy checks before tool execution (DLP/classification/recipient controls), (3) mandatory citations for outputs, and (4) sanitizing external content (email/HTML). In particular, cut off the path where “instructions embedded in email bodies” propagate into tool execution.
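
Item (4) can be sketched with the standard library's HTML parser: strip markup and script/style content, then label the result as data so downstream code never treats it as instructions. `sanitize_external` and the `untrusted_data` role label are hypothetical, and this sketch does not by itself neutralize instructions written in plain text; that is what the schema and policy checks in (1) and (2) are for.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def sanitize_external(html_body: str, max_chars: int = 2000) -> dict:
    """Reduce external HTML to bounded plain text tagged as untrusted data."""
    parser = _TextOnly()
    parser.feed(html_body)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())  # collapse whitespace
    return {"role": "untrusted_data", "text": text[:max_chars]}
```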

3.3 Audit logs are about reproducibility, not after-the-fact reporting

What audits need is not only “what happened,” but the ability to reproduce “why that decision was made.” At minimum, store: (a) inputs (mask PII), (b) referenced data source IDs, (c) the generated plan, (d) executed tools and arguments (tokenize secrets), (e) approver and timestamp, and (f) model/prompt/policy versions. If any of these is missing, the incident cannot be reproduced during response.

{
  "trace_id": "01J...",
  "model": "gpt-4.1",
  "policy_version": "2026-01-15",
  "user": {"id": "aad:...", "role": "Sales"},
  "citations": ["doc:sharepoint:123", "crm:lead:889"],
  "plan": {"goal": "...", "steps": ["..."]},
  "actions": [
    {"tool": "crm.update_lead", "idempotency_key": "...", "status": "SUCCEEDED"}
  ]
}
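
Assembling such a record with masking applied before it is stored might look like the sketch below. The email regex is a deliberately simple stand-in for a real PII detector, and `mask_pii`, `audit_record`, and `SECRET_KEYS` are hypothetical names.

```python
import re

# Simplistic email pattern; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SECRET_KEYS = {"api_key", "token", "password"}


def mask_pii(text: str) -> str:
    return EMAIL.sub("[EMAIL]", text)


def audit_record(trace_id, user_input, citations, plan, actions, model, policy_version):
    """Build a storable audit entry with PII masked and secrets redacted."""
    safe_actions = []
    for action in actions:
        args = {
            k: "[REDACTED]" if k in SECRET_KEYS else v
            for k, v in action.get("args", {}).items()
        }
        safe_actions.append({**action, "args": args})
    return {
        "trace_id": trace_id,
        "model": model,
        "policy_version": policy_version,
        "input": mask_pii(user_input),
        "citations": list(citations),
        "plan": plan,
        "actions": safe_actions,
    }
```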

3. Technical Section ⑦: Scalability and Operations (multi-tenant, observability, evaluation)

3.1 The enemies of scale are “state” and downstream rate limits

Agents tend to become stateful. To scale, first keep the orchestrator as stateless as possible and offload state to external stores (Redis/PostgreSQL/Cosmos DB, etc.). Second, because rate limits on integrations such as Graph/CRM/ITSM become the bottleneck, smooth traffic with queues (e.g., Azure Service Bus) and separate execution capacity by priority (P0/P1).
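
Priority separation in front of a rate-limited backend can be illustrated with an in-process heap. This is only a stand-in for a real broker such as the Azure Service Bus mentioned above; `PriorityDispatcher` and its `budget` parameter (calls allowed in the current rate-limit window) are assumptions.

```python
import heapq
import itertools


class PriorityDispatcher:
    """Releases queued jobs in priority order, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving arrival order

    def submit(self, priority: int, job) -> None:
        # Lower number means higher priority: 0 for P0, 1 for P1.
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def drain(self, budget: int) -> list:
        """Pop at most `budget` jobs, i.e. what the downstream limit allows now."""
        released = []
        while self._heap and len(released) < budget:
            _, _, job = heapq.heappop(self._heap)
            released.append(job)
        return released
```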

3.2 Observability: unify tracing with OpenTelemetry

When LLM calls, tool calls, approval waits, and retries are all on the same trace, you can pinpoint the cause of p95 degradation immediately. Minimum metrics include: (a) tool_call_count, (b) tokens_in/out, (c) retry_count, (d) policy_denied_rate, and (e) human_approval_rate. With these, improvement shifts from “intuition” to “engineering.”

3.3 Continuous evaluation (Evals) is part of production

From 2025 onward, model updates are frequent, and yesterday’s optimum can regress today. Recommended approach: (1) fix a set of golden tasks (50–200), (2) validate structure + business rules, (3) replay write flows in a sandbox environment, and (4) auto-rollback when scores drop below thresholds—integrated into CI/CD. Agents are applications; beyond MLOps, you need AgentOps.
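
The golden-task gate in steps (1), (2), and (4) can be sketched as a function a CI job would call. `run_evals`, the task record shape (`input` plus a `check` callable), and the 0.9 threshold are assumptions; sandbox replay of write flows is out of scope for this sketch.

```python
def run_evals(agent, golden_tasks, threshold=0.9):
    """Score an agent on fixed golden tasks and decide promote vs. rollback.

    `agent` is any callable mapping a task input to an output; each task
    supplies a `check` callable that validates structure and business rules.
    """
    if not golden_tasks:
        raise ValueError("golden task set must not be empty")
    passed = sum(1 for t in golden_tasks if t["check"](agent(t["input"])))
    score = passed / len(golden_tasks)
    return {"score": score, "action": "promote" if score >= threshold else "rollback"}
```

A pipeline would run this against every model, prompt, or policy version change and block deployment when the action is "rollback".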

# otel-collector.yaml (excerpt example)
receivers:
  otlp:
    protocols:
      http:
exporters:
  otlphttp:
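    # traces_endpoint takes the full URL; plain `endpoint` is a base URL to which /v1/traces is appended
    traces_endpoint: "https://otel-gateway.internal/v1/traces"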
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]

4. Comparative Analysis Table

Microsoft 365 Copilot and Copilot Studio offer powerful “business optimization with security by default,” but you don’t need to lock everything into a single vendor. Design options broadly fall into three categories: (1) suite-embedded, (2) low-code integration, and (3) custom orchestration.

Option	Examples	Strengths	Weaknesses	Best fit	Governance fit
① Suite-embedded	Microsoft 365 Copilot + Agents	🔧 Easier integration of identity/audit/data boundaries. Rich business surfaces	Deep integration with external SaaS requires extension design. Flexibility is constrained	M365-centric knowledge and business automation	High (easy to connect to existing controls)
② Low-code integration	Copilot Studio / various iPaaS	⚙️ Fast time-to-value. Business teams can iterate improvements quickly	Limited for complex idempotency/exception handling/evaluation hardening	Standardized flows, connector-driven automation	Medium (varies by design)
③ Custom orchestration	In-house Agent Orchestrator + MCP + OTel	📊 Optimize performance/control/auditing to requirements. Easier A2A responsibility splitting	Higher upfront cost. Operations (AgentOps) is mandatory	Core systems, multi-domain workflows, strict controls	Highest (assumes deep customization)

5. Best Practices and Anti-Patterns

Best Practices ✅

⚙️ Separate read/write/high-risk by permissions, routing, and approvals
🔧 Lock Plan/Tool I/O with JSON Schema and require validation
📊 Keep audit logs with “model/policy/evidence IDs/execution arguments (masked)” to ensure reproducibility
Introduce idempotency keys for all write APIs to prevent duplicate execution on retries
Use two-stage retrieval (Graph → RAG) to balance permission constraints and fuzzy matching
Unify LLM/tool/approval waits into a single trace with OpenTelemetry

Anti-Patterns ❌

Embedding requirements in “just a long prompt” (hard to change, impossible to audit)
Running the agent with a fully privileged service account
Passing errors to the LLM as free-form text and letting it execute ad-hoc workarounds
Feeding unlimited RAG results, increasing cost/latency/data leakage risk
Evaluating with “does it sound right” checks and missing write-side incidents

6. Implementation Roadmap and Checklist

Phase 0: Requirements definition (1–2 weeks)

Classify target workflows as “read-heavy” or “write-heavy”
Document prohibitions and approval conditions for high-risk actions (payments/contracts/external sending)
Define success criteria using business KPIs (completion rate, rework rate, audit finding rate)

Phase 1: Minimal agent (2–4 weeks)

🔧 Schema-ize Tool I/O (JSON Schema)
Constrain RAG/Graph sources (e.g., max_documents=5)
Implement audit logs (trace_id, citations, policy_version)

Phase 2: Stronger execution controls (4–8 weeks)

⚙️ Idempotency keys, workflow state management, and resume mechanisms
Approval flows (pre-approval/two-person approval) and policy engine integration
Rate-limit countermeasures (queues, backoff, circuit breakers)

Phase 3: Collaboration and scale (ongoing)

Split domains with A2A (sales/accounting/ITSM)
📊 Integrate OpenTelemetry + Evals into CI/CD (regression detection)
Externalize tool definitions with MCP to make backends swappable

Final checklist

Authorization: OBO/short-lived tokens and least privilege are enforced
Data: PII masking, DLP, and tenant boundaries are designed
Execution: idempotency, exception handling, and stop conditions (max_steps, etc.) exist
Audit: evidence IDs, policy/model versions, and approvers are traceable
Operations: p95 latency, cost ceilings, and automated regression tests are in place

7. Reference Resources and Next Steps

Microsoft 365 Copilot Agents / Agent 365 (understand product philosophy and adoption path)
Copilot Studio (low-code flow design, adding actions, iterative testing)
Tracing foundation: OpenTelemetry Collector v0.103+ (standardize internal observability)
Data foundation: PostgreSQL 16 / Redis 7.2 (separate state and short-term memory)
Next steps: ① start with a read-only agent to harden auditing and citations → ② enable writes with approvals → ③ split domains with A2A → ④ standardize the integration surface with MCP