Deep Dive into Production-Grade AI Agent Implementation: LLM Orchestration, Safe Tool Execution, and Governance Design Patterns

1. Executive Summary (Technical Overview)

An AI agent is software that uses an LLM as its core to iterate a loop of “Plan → Act → Observe → Reflect,” autonomously completing tasks by calling external tools/APIs. While RPA/IA excels at automating predefined procedures, agents can handle exploration, branching, and exception handling under uncertainty. In production, however, bottlenecks include runaway behavior (over-execution and cost spikes), privilege escalation, data leakage, and lack of reproducibility. This article presents production-grade patterns—tool execution sandboxes, state management, evaluation metrics, audit logs, and staged autonomy—along with concrete versions and configuration values. 📊

2. Technical Background and Challenges (Architecture description, existing pain points)


As many reference articles point out, 2025 is often called “the first year of AI agents.” In practice, the real challenge on the ground is not understanding the concept—it’s building an operable architecture. Traditional generative AI (chat) typically ends with a single inference, whereas AI agents have state and side effects. This is the same “accidents happen” territory as RPA/IA, and classic LLM guardrails alone are not enough.

Technical flow diagram (described in text): A user request is (1) received by an API Gateway → (2) a Policy/Prompt Router determines the use case, permissions, and data classification → (3) a Planner (LLM) decomposes the task → (4) a Tool Router selects approved tools → (5) a Sandbox Executor (browser/CLI/HTTP) executes → (6) an Observation Collector gathers results and evidence → (7) a Verifier (rules/LLM/tests) validates correctness → (8) state is persisted to a State Store → (9) re-plan if needed; if complete, record to audit logs—forming a closed loop.

Existing pain points:

🔧 Tool calling is “too unconstrained”: side effects such as mis-sends, duplicate orders, or mass bookings
⚙️ Ambiguous state management: retries after partial failures re-run the same operation, causing duplicate updates
📊 Hard to evaluate: in business workflows without ground-truth labels, SLO definition is still unclear
🔐 Security: handling PII/confidential data, unauthorized access, prompt injection
💸 Cost: autonomous exploration compounds token usage and external API fees, because every extra loop iteration re-sends context and may trigger more tool calls

3. Technical Section 1: Reference Architecture for Agents (Separating Plan/Act/Observe) ⚙️

3.1 Component decomposition: Planner / Executor / Verifier

In production, don’t treat the LLM as the whole system (“LLM = everything”). Let the Planner focus on reasoning (decomposition, prioritization, stop conditions). Make the Executor run side-effecting operations deterministically. Have the Verifier validate outputs using rules, tests, or a second (redundant) LLM. This confines the LLM’s probabilistic behavior to well-defined component boundaries, where it can be checked.
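The separation can be sketched as a small loop with three injected callables. This is a minimal illustration, not a reference implementation: the `Action` type, the callable signatures, and the `max_steps` budget are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    tool: str
    args: dict


# Hypothetical signatures: the planner returns the next Action, or None when done.
Planner = Callable[[str, list], Optional[Action]]
Executor = Callable[[Action], dict]
Verifier = Callable[[Action, dict], bool]


def run_agent(goal: str, plan: Planner, execute: Executor, verify: Verifier,
              max_steps: int = 10) -> list:
    """Plan -> Act -> Observe -> Verify loop with a hard step budget."""
    history = []  # (action, observation, verified) tuples fed back to the planner
    for _ in range(max_steps):
        action = plan(goal, history)
        if action is None:                 # planner signals completion
            break
        observation = execute(action)      # the only place side effects happen
        ok = verify(action, observation)   # rules, tests, or a second LLM
        history.append((action, observation, ok))
    return history


def demo_plan(goal, history):
    # Stub planner: one search step, then stop.
    return None if history else Action("search", {"q": goal})


demo = run_agent("reset password", demo_plan,
                 execute=lambda a: {"hits": 3},
                 verify=lambda a, obs: obs["hits"] > 0)
```

The step budget is what stops a planner that never returns None; without it, the loop has no defense against runaway execution.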

3.2 State model: Run / Step / Artifact

Split agent execution into Run (one request) → Step (one tool execution) → Artifact (outputs/evidence). In particular, require an idempotency_key for every Step to prevent duplicate side effects on retries.
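One way to model this, shown as a sketch under assumptions: the field names and the choice to derive the key from a SHA-256 hash of run, step, tool, and arguments are illustrative, not something the stack above prescribes.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Artifact:
    kind: str   # e.g. "screenshot", "api_response"
    uri: str


@dataclass
class Step:
    run_id: str
    step_no: int
    tool: str
    args: dict
    artifacts: list = field(default_factory=list)

    @property
    def idempotency_key(self) -> str:
        # Same run + step + tool + args gives the same key, so a retry is recognizable.
        payload = json.dumps([self.run_id, self.step_no, self.tool, self.args],
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:32]


@dataclass
class Run:
    run_id: str
    request: str
    steps: list = field(default_factory=list)
```

Deriving the key from the step's content, rather than generating a random value per attempt, means a retried Step produces the same key and the downstream system can deduplicate it.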

3.3 Example reference stack (version-pinned)

LLM: GPT-4.1 / GPT-4.1-mini (example)
Orchestration: LangGraph 0.2.x or Semantic Kernel 1.2.x
Vector DB: pgvector 0.7 + PostgreSQL 16
Cache/Queue: Redis 7.2 / Kafka 3.7
Observability: OpenTelemetry Collector 0.96 + Prometheus 2.51 + Grafana 10.4
Runtime: Python 3.12 + FastAPI 0.110
Sandbox: Playwright 1.49 (browser automation) + gVisor (container isolation)

4. Technical Section 2: Tool Use Design—“Safely Typed” Function Calling 🔧

4.1 Strict argument constraints with JSON Schema

Tool arguments should not be free-form natural language; lock them down with JSON Schema types, ranges, and enums. Even if you can parse the LLM output, unconstrained value ranges lead to incidents. For example, enforce transfer limits, restrict email recipient domains, and cap search query length at the schema level.

{
  "name": "create_purchase_order",
  "description": "Create PO in ERP (dry-run supported)",
  "parameters": {
    "type": "object",
    "required": ["supplier_id", "items", "dry_run", "idempotency_key"],
    "properties": {
      "supplier_id": {"type": "string", "pattern": "^SUP-[0-9]{6}$"},
      "items": {
        "type": "array",
        "minItems": 1,
        "maxItems": 20,
        "items": {
          "type": "object",
          "required": ["sku", "qty"],
          "properties": {
            "sku": {"type": "string", "pattern": "^[A-Z0-9-]{3,32}$"},
            "qty": {"type": "integer", "minimum": 1, "maximum": 1000}
          }
        }
      },
      "dry_run": {"type": "boolean"},
      "idempotency_key": {"type": "string", "minLength": 16, "maxLength": 64}
    }
  }
}

4.2 Two-phase commit (dry-run → confirm)

Autonomy should be increased gradually. Start with dry-run to show diffs, then commit only after human-in-the-loop (HITL) approval or policy-based approval. (“Two-phase commit” here means this dry-run/confirm sequence, not the distributed-transaction protocol of the same name.) Bringing the RPA/IA culture of “confirmed actions” into agents is the pragmatic approach.

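from jsonschema import validate  # third-party: pip install jsonschema


def execute_po(tool_args, policy, store, erp, schema):
    """Run create_purchase_order with schema, policy, and idempotency checks.

    store (idempotency cache), erp (ERP client), and schema (the "parameters"
    object from 4.1) are injected; their interfaces are assumed here.
    """
    # 1) Validate arguments against the JSON Schema (raises ValidationError).
    validate(instance=tool_args, schema=schema)

    # 2) Enforce policy: a real commit requires explicit permission.
    if not tool_args["dry_run"] and not policy.can_commit_po:
        raise PermissionError("commit not allowed")

    # 3) Idempotency: include dry_run in the cache key so that a cached
    #    dry-run result is never returned in place of the later commit.
    cache_key = f'{tool_args["idempotency_key"]}:{tool_args["dry_run"]}'
    if store.exists(cache_key):
        return store.get(cache_key)

    # 4) Call the ERP and record the result for future retries.
    #    Note: exists/put is not atomic; concurrent retries need a lock or
    #    a unique constraint in the store.
    result = erp.create_po(**tool_args)
    store.put(cache_key, result)
    return result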
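The dry-run → confirm sequence is only safe if the committed arguments are exactly the ones that were reviewed. A minimal sketch of that binding follows; `ApprovalGate` and `args_digest` are hypothetical names, and in-memory storage is assumed purely for illustration.

```python
import hashlib
import json


def args_digest(tool_args: dict) -> str:
    """Digest of the arguments, ignoring the dry_run flag itself."""
    core = {k: v for k, v in tool_args.items() if k != "dry_run"}
    return hashlib.sha256(json.dumps(core, sort_keys=True).encode()).hexdigest()


class ApprovalGate:
    def __init__(self):
        self._approved = set()

    def approve(self, tool_args: dict) -> None:
        # Called by a human reviewer or policy engine after seeing the dry-run diff.
        self._approved.add(args_digest(tool_args))

    def check_commit(self, tool_args: dict) -> None:
        if tool_args.get("dry_run", True):
            return  # dry-runs never need approval
        if args_digest(tool_args) not in self._approved:
            raise PermissionError("commit differs from the approved dry-run")
```

Because the digest ignores only `dry_run`, an agent that changes a quantity or supplier between the dry-run and the commit is rejected rather than silently approved.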

4.3 Required constraints for browser automation (Operator-style)

Browser automation is flexible but brittle when the DOM changes. If implementing with Playwright, set a minimum baseline of: (1) an allowlist of operable domains, (2) click/submit limits (e.g., max 20 actions per run), (3) no file attachments, and (4) credentials delivered via a vault using short-lived tokens.
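Constraints (1) and (2) can be enforced in a small guard that the Executor consults before each navigation or click. This is a sketch: `BrowserGuard` is a hypothetical helper, and wiring it in front of Playwright calls such as `page.goto` or `page.click` is left to the caller.

```python
from urllib.parse import urlparse


class BrowserGuard:
    def __init__(self, allowed_hosts, max_actions=20):
        self.allowed_hosts = set(allowed_hosts)
        self.max_actions = max_actions
        self.actions = 0

    def check_navigation(self, url: str) -> None:
        # Exact host match: suffix matching would let "evil-example.com" through.
        host = urlparse(url).hostname or ""
        if host not in self.allowed_hosts:
            raise PermissionError(f"host not allowlisted: {host!r}")

    def count_action(self) -> None:
        # Call once per click/submit; fails closed when the budget is spent.
        if self.actions >= self.max_actions:
            raise RuntimeError("action budget exhausted for this run")
        self.actions += 1
```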

5. Technical Section 3: Memory and Knowledge Injection (RAG + State)—Where “Remembering” Should Stop 🧠

5.1 Conversational memory vs working memory vs organizational knowledge

“Continuous learning” is easy to misunderstand. In implementation, separate: (a) conversation logs (short-term), (b) Run state (mid-term: ToDo/progress/constraints), and (c) RAG documents (long-term: policies/design docs). If you “dump everything” into the LLM, you increase leakage risk and cost.

5.2 Minimal configuration example: pgvector + PostgreSQL 16

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
  id bigserial PRIMARY KEY,
  doc_id text NOT NULL,
  chunk_no int NOT NULL,
  content text NOT NULL,
  embedding vector(1536) NOT NULL,
  data_class text NOT NULL DEFAULT 'internal',
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX rag_chunks_ivfflat
  ON rag_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 200);

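Configuration guidance: pgvector’s documentation suggests starting with lists ≈ rows / 1000 for up to 1M rows and sqrt(rows) beyond that, so lists=200 fits roughly 200k chunks; a table with a few million chunks would start nearer 1,500 to 2,000. Build the index after loading data, since IVFFlat derives its centroids from existing rows. For queries, start with SET ivfflat.probes=10; (the suggested starting point is sqrt(lists), about 14 here) and tune the recall/latency trade-off.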
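The sizing rules of thumb from the pgvector README (rows / 1000 up to 1M rows, sqrt(rows) above that, and probes near sqrt(lists)) can be written as a small helper. The function name is hypothetical, and the output is only a starting point for tuning, not a guarantee of recall.

```python
import math


def suggest_ivfflat_params(rows: int) -> tuple:
    """Return (lists, probes) starting values from the pgvector rules of thumb."""
    if rows <= 1_000_000:
        lists = max(1, rows // 1000)
    else:
        lists = max(1, round(math.sqrt(rows)))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes
```

For example, `suggest_ivfflat_params(200_000)` returns `(200, 14)`, which matches the index definition above.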

5.3 Apply “data classification” to retrieved documents before prompting

RAG is powerful, but it can accidentally pull in confidential documents. Before retrieval, filter doc_id/data_class using ABAC (attribute-based access control) and remove restricted content before it reaches the LLM. “Don’t output secrets” in a prompt is not governance.
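A simplified version of this check is sketched below. It reduces ABAC to a single attribute (a clearance level compared against `data_class`); the label ordering and the `Chunk` type are assumptions, and a real deployment would push the same predicate into the WHERE clause of the similarity query so restricted rows are never fetched.

```python
from dataclasses import dataclass

# Assumed ordering of classification labels, lowest to highest sensitivity.
CLEARANCE = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass
class Chunk:
    doc_id: str
    content: str
    data_class: str


def filter_for_prompt(chunks, user_clearance: str):
    """Keep only chunks the requesting user is cleared to see."""
    limit = CLEARANCE[user_clearance]
    # Unknown labels are treated as most sensitive (fail closed).
    return [c for c in chunks if CLEARANCE.get(c.data_class, 99) <= limit]
```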

6. Technical Section 4: Evaluation and Benchmarking—Turning Agents into SLOs 📊

6.1 Metric design: success rate alone is not enough

In production, monitor: (1) task success rate, (2) rework rate (how often humans had to fix results), (3) number of external API calls, (4) cost (tokens + external billing), (5) tool failure rate, and (6) safety violations (policy blocks). Note that “zero safety violations” may simply mean the agent did nothing—so track it alongside success rate.
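These six metrics can be aggregated from per-Run records. The sketch below assumes a hypothetical `RunRecord` shape; the field names are illustrative and would map onto whatever the State Store actually persists.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    succeeded: bool
    reworked: bool        # a human had to fix the result
    tool_calls: int
    tool_failures: int
    cost_usd: float       # tokens + external billing
    policy_blocks: int    # safety violations stopped by policy


def summarize(runs):
    """Aggregate per-Run records into the metrics listed above."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    calls = sum(r.tool_calls for r in runs)
    failures = sum(r.tool_failures for r in runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "rework_rate": sum(r.reworked for r in runs) / n,
        "avg_tool_calls": calls / n,
        "tool_failure_rate": failures / calls if calls else 0.0,
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
        "policy_blocks": sum(r.policy_blocks for r in runs),
    }
```

Reporting `policy_blocks` next to `success_rate` in the same summary makes the "did nothing, violated nothing" case visible.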

6.2 Benchmark example (internal help desk automation)

Configuration	Latency (P50)	Latency (P95)	Success rate	Avg tool calls	Estimated cost/Run
Chat (RAG only)	1.8s	4.9s	71%	0.0	$0.006
Single agent (Plan + Tool)	6.4s	22.0s	84%	2.7	$0.028
Multi-agent (Planner/Verifier separated)	7.9s	24.5s	90%	3.1	$0.041

Implication: success rate improves from 71% to 84% and 90%, but in this table P95 latency grows roughly 4.5× to 5× (4.9s to 22.0s and 24.5s) and cost per Run roughly 4.7× to 6.8× ($0.006 to $0.028 and $0.041). Decide which workflows to agentify based on latency tolerance and the value of the side effects. A blanket goal of “automating everything,” RPA/IA style, makes the program unsustainable.

6.3 Regression testing: golden runs and tool mocks

Agents are non-deterministic, so a practical testing approach is: (1) pin the LLM (a fixed model version, temperature=0, top_p=1.0), which reduces but does not fully eliminate run-to-run variation, (2) record/replay tool calls using a VCR-style mechanism, and (3) diff-validate the final artifacts. When updating models, prioritize “side-effect diffs” over raw success rate.
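The record/replay step can be sketched with a small in-memory cassette. `ToolCassette` is a hypothetical helper written for this illustration, not an existing library; persistence to disk and matching on partial arguments are left out.

```python
import json


class ToolCassette:
    """Record tool calls once, then replay them without touching real systems."""

    def __init__(self, recordings=None):
        self.recordings = dict(recordings or {})

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        return json.dumps([tool, args], sort_keys=True)

    def call(self, tool: str, args: dict, live_fn=None):
        key = self._key(tool, args)
        if key in self.recordings:
            return self.recordings[key]      # replay
        if live_fn is None:
            # Replay-only mode: an unrecorded call is itself a regression signal.
            raise KeyError(f"no recording for {tool} {args}")
        result = live_fn(**args)             # record mode
        self.recordings[key] = result
        return result
```

In a golden-run test, a call with no recording means the agent chose a different tool or different arguments than before, which is exactly the "side-effect diff" worth reviewing.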

7. Technical Section 5: Security—From Prompt Injection to Privilege Escalation 🔐

7.1 Threat model: LLMs trust inputs

Prompt injection—where web pages or emails embed instructions like “send confidential data” or “click this URL”—becomes acute in browser-operating agents. The answer is not “be careful,” but defense in depth: (a) minimize tool permissions, (b) label external inputs as untrusted data, (c) require two-phase commit for critical actions, and (d) inspect outputs with DLP.
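Layers (b) and (d) can be illustrated with two small helpers. This is a sketch under assumptions: the `<untrusted>` wrapper convention and the two regex patterns are illustrative only, wrapping does not by itself stop injection, and real DLP relies on far richer detectors.

```python
import re


def wrap_untrusted(source: str, text: str) -> str:
    """Mark external content as data so the prompt never presents it as instructions.

    `source` is assumed to be a system-assigned label, not attacker-controlled.
    """
    body = text.replace("<", "&lt;")   # stop content from closing the wrapper tag
    return f'<untrusted source="{source}">\n{body}\n</untrusted>'


# Illustrative patterns only.
DLP_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card_like": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def dlp_findings(output: str) -> list:
    """Return the names of patterns found in text the agent is about to send."""
    return sorted(name for name, pat in DLP_PATTERNS.items() if pat.search(output))
```

A non-empty result from `dlp_findings` would typically block the tool call or route it to approval, rather than relying on the model to notice the problem.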

7.2 Authorization design: OAuth scopes + short-lived credentials

Do not give agents long-lived API keys. For example, for Google/Microsoft/internal APIs, use OAuth 2.1 (still an IETF draft; in practice, OAuth 2.0 with PKCE) with minimal scopes, set token TTL to ~15 minutes, and manage refresh server-side. On Kubernetes, use IRSA (IAM Roles for Service Accounts, on AWS EKS) or Workload Identity (on GKE/AKS) to grant Pod permissions, and avoid mounting static Secrets into workloads.

7.3 Audit logs: who executed what, and why

What auditors need is not “the full LLM output,” but the decision points: whether approval occurred, selected tool, arguments, target resources, results, idempotency_key, execution time, and actor. Conversation logs that may contain PII should be separated (encrypted, shorter retention).
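A decision-point record along these lines might look as follows. The `AuditEvent` fields mirror the list above, but the exact names and the JSON-lines encoding are assumptions made for this sketch.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    actor: str
    tool: str
    arguments: dict
    target: str
    approved_by: str        # "" when no approval was required
    result: str
    idempotency_key: str
    duration_ms: int
    timestamp: str = ""     # filled in at serialization time when empty


def to_log_line(event: AuditEvent) -> str:
    """Serialize one decision point as a single JSON line."""
    record = asdict(event)
    record["timestamp"] = event.timestamp or datetime.now(timezone.utc).isoformat()
    return json.dumps(record, sort_keys=True)
```

Note that the record holds tool arguments and outcomes, not the conversation transcript, so it can follow a longer retention policy than PII-bearing logs.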

8. Technical Section 6: Scalability (Parallelism Is Bottlenecked by “Side Effects,” Not Agent Count) 📈

8.1 Parallel execution model: Queue + concurrency caps

Because agents hit external APIs and UIs, external I/O—not CPU—becomes the bottleneck. Distribute Runs via Kafka/Redis Queue while limiting concurrency per tool (e.g., ERP max=5, email sending max=20). The LLM side also has RPM/TPM limits, so throttle using a token budget.

8.2 Rate control configuration example (YAML)

rateLimits:
  llm:
    provider: openai
    requests_per_minute: 300
    tokens_per_minute: 200000
  tools:
    erp_api:
      concurrency: 5
      retry:
        max_attempts: 3
        backoff_ms: [200, 800, 2000]
    email_send:
      concurrency: 20
      daily_cap: 2000
      allow_domains: ["example.co.jp"]
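The per-tool `concurrency` values in this configuration could be enforced with one semaphore per tool. The sketch below uses `asyncio`; `ToolLimiter` is a hypothetical helper, and retries, backoff, and daily caps are deliberately left out.

```python
import asyncio

TOOL_CONCURRENCY = {"erp_api": 5, "email_send": 20}  # mirrors the YAML above


class ToolLimiter:
    def __init__(self, caps):
        self._caps = dict(caps)
        self._sems = {}

    def _sem(self, tool):
        # Created lazily so each semaphore is made inside the running event loop.
        if tool not in self._sems:
            self._sems[tool] = asyncio.Semaphore(self._caps[tool])
        return self._sems[tool]

    async def run(self, tool, coro_fn, *args):
        # At most caps[tool] calls to this tool are in flight at once.
        async with self._sem(tool):
            return await coro_fn(*args)


async def _demo():
    limiter = ToolLimiter({"erp_api": 2})
    active = peak = 0

    async def fake_call(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)   # stands in for external I/O
        active -= 1
        return i

    results = await asyncio.gather(
        *(limiter.run("erp_api", fake_call, i) for i in range(6)))
    return results, peak


demo_results, demo_peak = asyncio.run(_demo())
```

An unknown tool name raises `KeyError`, which is intentional here: a tool without a configured cap should not run at all.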

8.3 State store design: closer to event sourcing

Storing Run/Step/Artifact as append-only events makes resume and auditing easier. This is especially important in multi-agent setups, where you must trace each agent’s decisions over time. An RDB is sufficient, but forcing everything into a single update-heavy table often breaks down operationally.
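The resume property is easiest to see in a toy append-only log. This is an in-memory sketch with assumed event type names; a production version would be an insert-only table in the RDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    run_id: str
    seq: int
    type: str        # "step_started" | "step_succeeded" | "step_failed"
    step_no: int


class EventLog:
    def __init__(self):
        self._events = []    # append-only: events are never updated or deleted

    def append(self, run_id, type, step_no):
        seq = len(self._events)
        self._events.append(Event(run_id, seq, type, step_no))

    def completed_steps(self, run_id):
        return {e.step_no for e in self._events
                if e.run_id == run_id and e.type == "step_succeeded"}

    def next_step(self, run_id):
        # Resume point: the first step number with no recorded success.
        done = self.completed_steps(run_id)
        step = 0
        while step in done:
            step += 1
        return step
```

Current state is derived by replaying events rather than stored in a mutable row, so a crashed Run resumes from `next_step` and the same log doubles as the audit trail.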

9. Technical Section 7: Integrating with RPA/IA—Not Replacement, but “Separation of Responsibilities” 🧩

9.1 Role split: decisions by the agent, execution by RPA

Taking the IA concept (AI instructs RPA) and designing it more rigorously, stability improves when you split responsibilities: “agents handle judgment and exception handling,” while “RPA handles deterministic UI operations.” If you delegate UI operations entirely to an agent, it becomes fragile due to UI changes, timeouts, and MFA.

9.2 Call RPA as a “tool”

From an agent’s perspective, RPA (e.g., UiPath/Power Automate) is just another tool. Type its arguments, version-control the RPA workflows, and store return values (success/failure/error codes/screenshots) as Artifacts.

9.3 Fallback design for failures

If an agent fails and you only “hand it to a human,” operations will eventually clog. Define a priority order such as: (1) switch to an existing RPA flow, (2) provide a chat-only response as a temporary measure, (3) create a ticket—and iterate improvements using an SRE-style error budget.

10. Comparative Analysis Table

Option	Primary use	Strengths	Weaknesses/Risks	When to apply
Generative AI chat + RAG	Q&A, knowledge search	Low cost, low risk, fast to deploy	Cannot perform side effects, so it rarely amounts to true automation	Start here; a low-risk entry point for SLA-sensitive work
RPA (rule-based)	Routine UI operations, transcription, batch jobs	Deterministic, easy to audit	Weak on exceptions; brittle to UI changes	Back-office processes with fixed procedures and few exceptions
IA (AI instructs RPA)	Decision + execution coordination	Can reduce human intervention	If responsibilities are unclear, incidents occur (AI “runs wild” into UI)	Requires a design where AI decides and RPA executes
AI agent (Tool Use + state management)	Uncertain work, autonomous task execution	Exploration, branching, exception handling; strong for complex problems	Cost, reproducibility, security, and auditing are difficult	Roll out in stages, limited to high-value side effects

11. Best Practices & Anti-Patterns

✅ Best Practices

⚙️ Separate Planner/Executor/Verifier and make tool execution deterministic
🔧 Type tool arguments with JSON Schema, including value constraints and allowlists
🔐 Use two-phase commit (dry-run → confirm) to control side effects
📊 Turn success rate + cost + safety violations + rework rate into SLOs
🧾 Design audit logs around “decision points” and separate them from PII logs
📈 Set per-tool concurrency caps and daily caps

❌ Anti-Patterns

“It’s safe because the prompt forbids it”: dangerous without DLP/ABAC/scope control
Turning UI operations into a universal tool: making the browser do everything becomes brittle and unmaintainable
Autonomous execution without state: retries cause duplicate execution and double billing
Shipping to production without evaluation: model updates silently degrade until an incident occurs
A KPI of “automate all work”: low-value exploration only increases cost

12. Implementation Roadmap and Checklist

12.1 Roadmap (staged autonomy)

Phase 0: Chat + RAG (no side effects). Lock down data classification, retrieval filters, and logging design
Phase 1: Read-only tools (search, reference APIs). Introduce audit logs and rate control
Phase 2: Write tools in dry-run only. Diff presentation and approval flow (HITL)
Phase 3: Allow commits in a limited scope (limits/allowlists/daily caps). Start SLO operations
Phase 4: Multi-agent (specialized agents). Stronger Verifier, automated regression testing

12.2 Checklist

🔐 OAuth scopes are minimized, and there are no hard-coded Secrets
🔧 Tool definitions are schematized, and idempotency_key is required
📊 Core KPIs (success rate/rework rate/cost/safety violations) are dashboarded
🧪 Regression tests run with golden runs (temperature=0)
📈 Per-tool parallelism, daily caps, and retries are configured
🧾 Audit logs (decision points) are preserved, with retention and masking defined

13. Reference Resources and Next Steps

IPA SDS technical column: AI agents (overview and key issues for societal implementation)
Gartner: characteristics of AI agents (autonomy, adaptability, goal orientation, etc.)
OpenTelemetry (observability for Run/Step via distributed tracing)
LangGraph / Semantic Kernel (implementing agents as state machines)

Next steps: Pick one internal workflow where “high-value side effects” matter (e.g., ticket creation, inventory allocation, quote generation) and roll it out in stages: Read-only → dry-run → commit. If you already have RPA assets, the fastest path is to treat RPA as a tool, separate responsibilities, and avoid importing UI brittleness into the agent layer. ⚙️