Deploying 200+ AI Agents on AWS Bedrock AgentCore: Architecture, Guardrails & Lessons Learned

Ananya Rao · MLOps Lead, codetoday.io · June 2025

When a global pharmaceutical client handed us a brief to automate 20 separate manual workflows for clinical data extraction — with a full HIPAA audit trail and enterprise-grade guardrails — we turned to AWS Bedrock AgentCore. Eight weeks later, 200+ agents were running in production. Here's the unvarnished technical account: the architecture, the three hardest problems, the real code, and the lessons we'd pass on to anyone attempting this at scale.

The Client Problem: 20 Manual Workflows, Zero Scalability

Our client — a top-10 global pharmaceutical company operating across 40 countries — had a clinical data operations team of 34 people whose entire job was to manually extract structured fields from unstructured clinical trial documents: adverse event reports, patient narratives, protocol amendments, and lab result PDFs.

Twenty separate workflows. Each owned by a different sub-team. Each with its own Excel macros, SharePoint folders, and tribal knowledge. Turnaround time averaged 72 hours per document batch. During peak clinical trial season, the team was drowning.

The brief was deceptively simple: automate these workflows with AI. The constraints were brutal. Every extracted field needed a full audit trail traceable to the source document and the model that produced it. Patient data could never leave a HIPAA-compliant boundary. The system needed to process 5,000+ documents per week at launch, scaling to 50,000 within 18 months. And it needed to go live in under 90 days.

We scoped the solution as a multi-agent system: one orchestrator agent per workflow type, each backed by a set of specialized sub-agents handling discrete tasks. That got us to 200+ agents across the 20 workflows.

Why AWS Bedrock AgentCore Over the Alternatives

We evaluated four serious options before committing: Google Vertex AI Agent Builder, Azure AI Agent Service, self-hosted LangChain on ECS, and AWS Bedrock AgentCore. Each has genuine strengths. Here's why AgentCore won for this specific engagement.

Managed Runtime, Not Your Problem

LangChain on ECS is extremely powerful — and extremely operationally expensive. You own the container lifecycle, the memory management, the scaling policies, the cold-start mitigation. For a 90-day delivery, we couldn't afford to spend three weeks on platform engineering before writing a single agent prompt. AgentCore's managed runtime handles all of this. Agents scale to zero when idle and spin up in under 300ms. We didn't write a single Dockerfile.

IAM-Native Design

Every AgentCore invocation is an IAM operation. That means CloudTrail captures it automatically — agent ID, caller identity, timestamp, input/output token counts, guardrail decisions — without any custom logging middleware. For HIPAA audit requirements, this was the single most important architectural property. Vertex AI and Azure AI both require you to build audit logging yourself, routing through their respective logging stacks. We'd been down that road before and didn't want to own it again.
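As a rough illustration of what this looks like at audit time, the sketch below filters CloudTrail-style records down to Bedrock calls and reduces each one to a when/who/what row. The field names (`eventTime`, `eventSource`, `eventName`, `userIdentity.arn`) follow the standard CloudTrail record layout. The `bedrock_audit_rows` helper, the default event source value, and the sample records are assumptions made for illustration and are not taken from the client system.

```python
from typing import Iterable

def bedrock_audit_rows(
    records: Iterable[dict],
    event_source: str = "bedrock.amazonaws.com",  # assumed event source value
) -> list[tuple[str, str, str]]:
    """Return (eventTime, caller ARN, eventName) rows for matching records."""
    rows = []
    for rec in records:
        if rec.get("eventSource") != event_source:
            continue
        caller = rec.get("userIdentity", {}).get("arn", "unknown")
        rows.append((rec["eventTime"], caller, rec["eventName"]))
    # ISO-8601 timestamps sort chronologically as plain strings.
    return sorted(rows)

# Synthetic records shaped like CloudTrail output (all values are made up).
ROLE = "arn:aws:sts::123456789012:assumed-role/agent-exec/session-1"
SAMPLE = [
    {"eventTime": "2025-06-02T10:15:00Z", "eventSource": "bedrock.amazonaws.com",
     "eventName": "InvokeAgent", "userIdentity": {"arn": ROLE}},
    {"eventTime": "2025-06-02T10:14:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "GetObject", "userIdentity": {"arn": ROLE}},
    {"eventTime": "2025-06-02T10:16:30Z", "eventSource": "bedrock.amazonaws.com",
     "eventName": "Retrieve", "userIdentity": {"arn": ROLE}},
]

rows = bedrock_audit_rows(SAMPLE)
for row in rows:
    print(row)
```

In practice the records would come from a CloudTrail trail delivered to S3 or from a lookup query rather than from an in-memory list.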

VPC Integration and Data Residency

Clinical trial data cannot traverse the public internet, full stop. AgentCore supports VPC endpoints, meaning all model calls stay within a private network boundary. We deployed within the client's existing VPC, with PrivateLink endpoints for Bedrock, S3, DynamoDB, and Step Functions. Zero public internet exposure for any patient-adjacent data.

Built-in Guardrails API

Bedrock Guardrails is a first-class product, not an afterthought. Content filtering, PII redaction, topic denial, grounding checks — all configured declaratively via CloudFormation and applied consistently across every agent. When we compared this to building equivalent middleware in LangChain, the Guardrails path saved us an estimated six weeks of development and testing.

Architecture Deep-Dive

The system follows a hierarchical multi-agent pattern. Each of the 20 clinical workflows has a dedicated Orchestrator Agent. Each orchestrator coordinates six specialized sub-agents. Everything runs on a Step Functions Express Workflow backbone (more on why in the next section).


┌─────────────────────────────────────────────────────────────────────┐
│                      API Gateway + Cognito                          │
│                   (Client-facing ingestion endpoint)                │
└───────────────────────────┬─────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│              Step Functions Express Workflow (per workflow)         │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                 Orchestrator Agent (AgentCore)              │   │
│   │            Model: Claude 3.5 Sonnet · Guardrail: pharma-v2  │   │
│   └──┬─────────┬──────────┬──────────┬────────────┬────────────┘   │
│      │         │          │          │            │                 │
│      ▼         ▼          ▼          ▼            ▼                 │
│  ┌───────┐ ┌───────┐ ┌────────┐ ┌────────┐ ┌──────────┐           │
│  │ Doc   │ │ Regex │ │Clinical│ │Validate│ │  Audit   │           │
│  │Classif│ │Extract│ │  NLP   │ │ Agent  │ │  Logger  │           │
│  └───────┘ └───────┘ └────────┘ └────────┘ └──────────┘           │
│                                                    │                │
│                                            ┌──────────────┐        │
│                                            │Report Generat│        │
│                                            └──────────────┘        │
└─────────────────────────────────────────────────────────────────────┘
          │                  │                       │
          ▼                  ▼                       ▼
  ┌──────────────┐  ┌─────────────────┐    ┌────────────────┐
  │  S3 + Bedrock│  │   DynamoDB      │    │  SNS Alerts    │
  │  Knowledge   │  │  (Agent State + │    │  (Errors +     │
  │  Base        │  │   Session Mem.) │    │   Cost Alerts) │
  └──────────────┘  └─────────────────┘    └────────────────┘

Each sub-agent has a single, tightly scoped responsibility:

  • Document Classifier: routes incoming files by type (adverse event, protocol amendment, lab report)
  • Regex Extractor: pulls structured fields using patterns maintained in a DynamoDB config table
  • Clinical NLP Agent: uses Claude 3.5 Sonnet with a specialized system prompt to extract free-text clinical entities
  • Validation Agent: cross-checks extracted fields against a Bedrock Knowledge Base containing formulary and coding standards
  • Audit Logger: writes a complete extraction record to DynamoDB and emits a CloudWatch metric
  • Report Generator: assembles the final structured output in the required downstream format
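To make the Regex Extractor's role concrete, here is a minimal sketch of config-driven field extraction. It assumes the patterns arrive as a simple field-name-to-regex mapping, as they might after being read from the DynamoDB config table. The field names, pattern formats, and sample document are hypothetical.

```python
import re
from typing import Optional

# Hypothetical pattern config, shaped like rows loaded from a config table.
PATTERNS = {
    "report_id": r"Report ID:\s*([A-Z]{2}-\d{6})",
    "onset_date": r"Onset date:\s*(\d{4}-\d{2}-\d{2})",
}

def extract_fields(text: str, patterns: dict[str, str]) -> dict[str, Optional[str]]:
    """Apply each named pattern once; fields with no match come back as None."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result

doc = "Report ID: AE-004217\nOnset date: 2025-03-14\nNarrative: patient reported nausea."
fields = extract_fields(doc, PATTERNS)
print(fields)
```

Keeping the patterns in data rather than in code means a sub-team can adjust a field format without redeploying the agent, which is presumably the reason for the config table.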

⚠ Warning: Don't Skip the Validation Agent. In early prototypes we went straight from extraction to report generation. The hallucination rate on drug names was 8%. After inserting the Validation Agent (which cross-checks every entity against the Bedrock Knowledge Base), the hallucination rate dropped to 0.4%. Never trust a single-pass extraction in a healthcare context.
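The cross-check idea can be sketched without any AWS dependency. In the example below an in-memory vocabulary stands in for the Knowledge Base lookup, which is a deliberate simplification: the real check described above is a retrieval call, not a set membership test. The function names and the sample entities are illustrative assumptions.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(name.lower().split())

def validate_entities(extracted: list[str], reference_vocab: list[str]) -> tuple[list[str], list[str]]:
    """Split extracted entities into (accepted, flagged_for_review)."""
    known = {normalize(v) for v in reference_vocab}
    accepted, flagged = [], []
    for entity in extracted:
        (accepted if normalize(entity) in known else flagged).append(entity)
    return accepted, flagged

# Stand-in vocabulary; a real deployment would query the knowledge base instead.
VOCAB = ["metformin", "atorvastatin", "lisinopril"]
accepted, flagged = validate_entities(["Metformin", "atorvastatin", "Metforminol"], VOCAB)
print("accepted:", accepted)
print("flagged:", flagged)
```

Anything that lands in the flagged list is a candidate for human review rather than silent inclusion in the report.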

The 3 Hardest Technical Problems

Problem 1: Agent-to-Agent Communication Latency

Our first prototype chained agents directly: the Orchestrator called the Document Classifier, waited for a response, then called the Regex Extractor with the classifier's output, and so on. With six sub-agents in sequence, end-to-end latency hit 18–25 seconds per document. That is acceptable for overnight batch jobs but unacceptable for the interactive query workflows the client also needed.

The fix: replace direct agent chaining with Step Functions Express Workflows as the orchestration backbone. This gave us three things we couldn't get from direct chaining: parallelization (Regex Extractor and Clinical NLP Agent now run concurrently), automatic retry with configurable backoff, and full execution history in the Step Functions console. End-to-end latency dropped to 4–7 seconds. Equally important, debugging a failed extraction became a matter of opening the Step Functions execution history — not parsing distributed logs across six separate CloudWatch streams.
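Retry settings interact with the latency budget, so it helps to work out the wait schedule they imply. In Amazon States Language, `IntervalSeconds` is the wait before the first retry, `BackoffRate` multiplies the wait on each subsequent retry, and `MaxAttempts` caps the number of retries. The helper below is a small sketch of that arithmetic, using the values from the workflow fragment later in this article (3 seconds, 2 attempts, rate 2.0). The function name is ours.

```python
def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list[float]:
    """Wait (in seconds) before each retry under exponential backoff."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]

delays = retry_delays(interval_seconds=3, max_attempts=2, backoff_rate=2.0)
worst_case_wait = sum(delays)
print(delays, worst_case_wait)
```

With these settings a task that fails twice adds 9 seconds of waiting on top of the task time itself, which is worth keeping in mind for interactive workflows with a single-digit latency target.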

Problem 2: Guardrails Configuration for HIPAA

Bedrock Guardrails is powerful but requires careful configuration. Out of the box, the default content filters don't cover clinical-specific risk scenarios. We built the guardrail configuration iteratively over three weeks of red-teaming:

  • Denied topics: Treatment recommendations, dosing advice, diagnostic conclusions — agents are explicitly prohibited from making these statements even if the document contains them as source material
  • PII redaction: Patient names, MRNs, dates of birth, addresses, and phone numbers are anonymized before any response leaves Bedrock — configured as ANONYMIZE, not BLOCK, so the extraction workflow still functions
  • Custom word policies: Twelve proprietary drug compound names added to a custom blocklist to prevent the model from discussing off-label applications
  • Grounding checks: All Clinical NLP responses must be grounded in the source document — hallucinated entities trigger a guardrail block event, logged to CloudTrail, and the extraction is flagged for human review
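The word-list, MRN, and grounding items above can be expressed as request fragments for the Guardrails API. The sketch below only builds the dictionaries and does not call AWS. The key names mirror the boto3 `create_guardrail` request shape as we understand it and should be checked against the current API reference. The MRN format, the placeholder compound names, and the 0.75 grounding threshold are assumptions for illustration.

```python
import re

# Hypothetical MRN format; real formats vary by institution.
MRN_PATTERN = r"\bMRN-\d{8}\b"
re.compile(MRN_PATTERN)  # fail fast if the pattern is malformed

def build_guardrail_extras(blocked_terms: list[str], grounding_threshold: float = 0.75) -> dict:
    """Build word-list, custom-regex, and grounding fragments for a guardrail request."""
    return {
        "wordPolicyConfig": {
            "wordsConfig": [{"text": term} for term in blocked_terms],
        },
        "sensitiveInformationPolicyConfig": {
            "regexesConfig": [{
                "name": "MRN",
                "description": "Medical record number (hypothetical format)",
                "pattern": MRN_PATTERN,
                "action": "ANONYMIZE",  # redact but let the extraction continue
            }],
        },
        "contextualGroundingPolicyConfig": {
            "filtersConfig": [{"type": "GROUNDING", "threshold": grounding_threshold}],
        },
    }

# Placeholder names; the real blocklist holds proprietary compound names.
extras = build_guardrail_extras(["COMPOUND-A", "COMPOUND-B"])
print(sorted(extras))
```

These fragments would be merged with the topic, PII, and content-filter settings before being passed to the guardrail creation call.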

Problem 3: Cost Control at Scale

Two hundred agents calling Claude 3.5 Sonnet is expensive if you don't instrument cost controls from the start. We learned this the hard way when a misconfigured prompt sent one agent into a loop processing the same document 847 times before a human noticed. That incident cost $340 in a single afternoon and prompted us to build the cost control layer we should have built on day one.
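A guard against this failure mode can be very small. The sketch below counts how many times the same agent has been handed the same document (identified by content hash) and refuses once a threshold is crossed. It keeps the counter in memory for simplicity. That is an assumption for illustration, and a multi-process deployment would need a shared store instead. The class name and threshold are ours.

```python
import hashlib
from collections import Counter

class RepeatGuard:
    """Refuse to process the same (agent, document) pair more than max_repeats times."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def check(self, agent_id: str, document_text: str) -> None:
        digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
        key = (agent_id, digest)
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise RuntimeError(
                f"agent {agent_id} saw document {digest[:12]} "
                f"{self._seen[key]} times; aborting"
            )

guard = RepeatGuard(max_repeats=2)
guard.check("nlp-agent", "same document")
guard.check("nlp-agent", "same document")  # still within the limit
```

Calling `check` a third time with the same arguments raises, which turns a silent loop into a visible failure after a handful of invocations instead of hundreds.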

The system now has three layers of control: per-agent token quota enforcement via Bedrock Guardrails (max tokens per session); per-workflow CloudWatch budget alarms (crossing 80% of budget triggers an SNS alert, crossing 100% aborts the Step Functions execution); and daily alerts from AWS Cost Anomaly Detection, using a cost monitor scoped to the Bedrock service. Every agent invocation also emits a custom CloudWatch metric with token counts attributed to the specific workflow and team, rather than only to the aggregate Bedrock line item.
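The two-threshold budget rule reduces to a short decision function. This is a sketch of the logic only: the function name and the returned labels are ours, and in the deployed system the thresholds are enforced by alarms rather than by application code.

```python
def budget_action(spent_usd: float, budget_usd: float) -> str:
    """Map spend against a workflow budget to 'ok', 'alert' (>= 80%), or 'abort' (>= 100%)."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spent_usd / budget_usd
    if ratio >= 1.0:
        return "abort"   # stop the workflow execution
    if ratio >= 0.8:
        return "alert"   # notify the owning team
    return "ok"

for spent in (50, 80, 120):
    print(spent, budget_action(spent, 100))
```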

Real Code: IAM Policy, Guardrails Config, Step Functions Fragment

IAM Policy for AgentCore Runtime

Least-privilege IAM for an AgentCore agent execution role. Note the scoped Redshift Serverless statement (specific workgroup ARN, not wildcard) and the explicit Knowledge Base ARN restriction.

# AgentCore execution role — inline policy (JSON)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockInvokeModel",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
    },
    {
      "Sid": "BedrockKnowledgeBase",
      "Effect": "Allow",
      "Action": [
        "bedrock:Retrieve",
        "bedrock:RetrieveAndGenerate"
      ],
      "Resource": "arn:aws:bedrock:us-east-1:123456789012:knowledge-base/KBID123456"
    },
    {
      "Sid": "RedshiftWorkgroupQuery",
      "Effect": "Allow",
      "Action": "redshift-data:ExecuteStatement",
      "Resource": "arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/clinical-wg"
    },
    {
      "Sid": "DynamoDBAgentState",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/agent-state-prod"
    }
  ]
}
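Least-privilege policies tend to drift as teams add statements, so a simple lint step can help. The sketch below flags `Allow` statements whose actions or resources use wildcards. It is a deliberately naive check (it does not evaluate conditions or `NotResource`), and the helper name and the sample policy are illustrative assumptions.

```python
import json

def wildcard_statements(policy: dict) -> list[str]:
    """Return the Sids of Allow statements that use wildcard actions or resources."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = any(r == "*" or r.endswith("/*") for r in resources)
        if broad_action or broad_resource:
            flagged.append(stmt.get("Sid", "<no Sid>"))
    return flagged

# Made-up policy: one scoped statement and one that is too broad.
SAMPLE_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "ScopedTable", "Effect": "Allow", "Action": ["dynamodb:GetItem"],
     "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/agent-state-prod"},
    {"Sid": "TooBroad", "Effect": "Allow", "Action": "bedrock:*", "Resource": "*"}
  ]
}
""")

flagged_sids = wildcard_statements(SAMPLE_POLICY)
print(flagged_sids)
```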

Guardrails Configuration (CloudFormation)

# CloudFormation — Bedrock Guardrail for HIPAA-compliant clinical agents
GuardrailClinicalProd:
  Type: AWS::Bedrock::Guardrail
  Properties:
    Name: pharma-clinical-guardrail-prod
    Description: HIPAA-compliant guardrail for clinical extraction agents
    BlockedInputMessaging: "This query is outside the permitted scope for clinical data processing."
    BlockedOutputsMessaging: "This response was blocked by compliance policy. Please contact your data ops team."
    TopicPolicyConfig:
      TopicsConfig:
        - Name: ClinicalTreatmentAdvice
          Definition: "Any recommendation on patient treatment, dosing, diagnosis, or clinical management"
          Examples:
            - "What dose should this patient receive?"
            - "Is this drug safe for this condition?"
          Type: DENY
        - Name: OffLabelUsage
          Definition: "Discussion of drug compounds for unapproved indications"
          Type: DENY
    SensitiveInformationPolicyConfig:
      PiiEntitiesConfig:
        - Type: NAME
          Action: ANONYMIZE
        - Type: EMAIL
          Action: ANONYMIZE
        - Type: PHONE
          Action: ANONYMIZE
        - Type: ADDRESS
          Action: ANONYMIZE
        - Type: US_SOCIAL_SECURITY_NUMBER
          Action: BLOCK
    ContentPolicyConfig:
      FiltersConfig:
        - Type: HATE
          InputStrength: HIGH
          OutputStrength: HIGH
        - Type: MISCONDUCT
          InputStrength: MEDIUM
          OutputStrength: HIGH
    Tags:
      - Key: Environment
        Value: prod
      - Key: Compliance
        Value: HIPAA

Step Functions Express Workflow Fragment (Python SDK)

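# Python: workflow definition (Amazon States Language as a dict) plus a helper that starts one extraction run.
# The definition must be registered as an Express state machine before it can be started.
import json
import uuid
from datetime import datetime

import boto3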

sfn = boto3.client('stepfunctions', region_name='us-east-1')

WORKFLOW_DEFINITION = {
    "Comment": "Clinical data extraction multi-agent pipeline",
    "StartAt": "ClassifyDocument",
    "States": {
        "ClassifyDocument": {
            "Type": "Task",
            "Resource": "arn:aws:states:::bedrock:invokeAgent",
            "Parameters": {
                "AgentId": "DOC_CLASSIFIER_AGENT_ID",
                "AgentAliasId": "PROD",
                "SessionId.$": "$.sessionId",
                "InputText.$": "$.documentText",
                "EnableTrace": True
            },
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 3,
                "MaxAttempts": 2,
                "BackoffRate": 2.0
            }],
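            # Without ResultPath the task result replaces the whole state,
            # and later states can no longer read $.sessionId or $.documentText.
            "ResultPath": "$.classification",
            "Next": "ParallelExtraction"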
        },
        "ParallelExtraction": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "RegexExtract", "States": {"RegexExtract": {"Type": "Task", "Resource": "arn:aws:states:::bedrock:invokeAgent", "Parameters": {"AgentId": "REGEX_AGENT_ID", "AgentAliasId": "PROD", "SessionId.$": "$.sessionId", "InputText.$": "$.documentText"}, "End": True}}},
                {"StartAt": "ClinicalNLP", "States": {"ClinicalNLP": {"Type": "Task", "Resource": "arn:aws:states:::bedrock:invokeAgent", "Parameters": {"AgentId": "NLP_AGENT_ID", "AgentAliasId": "PROD", "SessionId.$": "$.sessionId", "InputText.$": "$.documentText"}, "End": True}}}
            ],
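            # A Parallel state outputs an array of branch results; attach it under
            # a key so $.sessionId is still resolvable in ValidateAndAudit.
            "ResultPath": "$.extraction",
            "Next": "ValidateAndAudit"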
        },
        "ValidateAndAudit": {
            "Type": "Task",
            "Resource": "arn:aws:states:::bedrock:invokeAgent",
            "Parameters": {
                "AgentId": "VALIDATION_AGENT_ID",
                "AgentAliasId": "PROD",
                "SessionId.$": "$.sessionId",
                "InputText.$": "States.JsonToString($)",
                "EnableTrace": True
            },
            "End": True
        }
    }
}

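# Start one execution for a single document. start_execution is asynchronous:
# it returns the execution ARN and start date, not the extraction result.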
def run_extraction(document_text: str, workflow_id: str) -> dict:
    execution = sfn.start_execution(
        stateMachineArn=f"arn:aws:states:us-east-1:123456789012:stateMachine:clinical-extraction-{workflow_id}",
        name=f"exec-{uuid.uuid4()}",
        input=json.dumps({
            "sessionId": str(uuid.uuid4()),
            "documentText": document_text,
            "workflowId": workflow_id,
            "timestamp": datetime.utcnow().isoformat()
        })
    )
    return execution

Production Results: 8 Weeks, $340K/yr Saved

After eight weeks of development and a two-week phased rollout, all 20 clinical workflows went live on the AgentCore platform. The outcomes were measurably better than the manual baseline on every KPI the client tracked.

  • Extraction accuracy: 94% (vs 87% manual baseline)
  • Annual labor cost reduction: $340K
  • Avg end-to-end latency per document: 4.7s
  • Data leakage incidents in 6 months: 0

The full HIPAA audit trail is delivered via CloudTrail. Every extracted field is traceable to the exact agent invocation, session ID, document hash, and model version that produced it. During a mock regulatory audit in month four, the compliance team was able to reconstruct the provenance of any extracted value in under two minutes — compared to hours of manual log archaeology under the old system.
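One way to picture the provenance record behind each extracted field is a small immutable structure keyed by the four identifiers named above. The sketch below is an assumption about shape, not the client's actual schema: the class name, field names, and sample values are hypothetical, and only the model identifier is taken from the IAM policy earlier in this article.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ExtractionAuditRecord:
    field_name: str
    field_value: str
    agent_id: str
    session_id: str
    document_sha256: str
    model_id: str

def make_record(field_name: str, field_value: str, agent_id: str,
                session_id: str, document_bytes: bytes, model_id: str) -> ExtractionAuditRecord:
    """Hash the source document so the record points at exact content, not a filename."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return ExtractionAuditRecord(field_name, field_value, agent_id, session_id, digest, model_id)

record = make_record(
    field_name="onset_date",
    field_value="2025-03-14",
    agent_id="NLP_AGENT_ID",
    session_id="session-0001",
    document_bytes=b"Onset date: 2025-03-14",
    model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
)
print(json.dumps(asdict(record), indent=2))
```

Hashing the document content rather than storing a path means the record stays valid even if the file is later moved or renamed.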

Guardrails blocked 1,247 out-of-scope queries across the six-month period — a 3.1% block rate. Of these, roughly 40% were legitimate prompts that were too broadly worded, and 60% were genuine scope violations. Every block event was logged, reviewed weekly, and fed back into guardrail refinement.

💡 Tip: Enable CloudTrail Data Events for Bedrock. CloudTrail management events capture API calls. But to capture the actual input/output content of agent invocations (needed for true HIPAA audit trails), you must explicitly enable CloudTrail data events for the Bedrock service. This is not on by default and is easy to miss. Without it, you have a call log but no content log, which is not sufficient for clinical compliance.

4 Lessons We'd Give Our Past Selves

1. Start With 3–5 Agents, Not 200

We were handed a scope of 20 workflows and immediately designed the full 200+ agent architecture. In retrospect, we should have built one workflow end-to-end first (including the observability layer, cost attribution, and audit trail infrastructure) and validated it fully before scaling. Instead, by the time ten workflows were running, we were retrofitting the cost attribution layer and spending two sprints fixing naming inconsistencies. Start small, get the platform right, then scale.

2. Guardrails Are Not Optional in Healthcare

PII redaction, topic denial, and grounding checks must be configured from day one in clinical environments — not added as an afterthought after the first UAT feedback session. In our case, the first QA pass revealed a scenario where the Clinical NLP Agent was surfacing patient names in extraction output. We'd been running those tests for three days before the guardrail was in place. Configure PII redaction before you run a single test with real documents.

3. Step Functions Over Direct Agent Chaining

Direct agent chaining (orchestrator calls sub-agent, sub-agent returns result) is seductive in its simplicity. It is also a debugging nightmare at scale. When an extraction fails at step 4 of 6, you want to see the complete execution graph, the input/output at each step, and the retry history — all in one place. Step Functions gives you this. Direct chaining does not. The extra ~200ms latency from Step Functions Express Workflows was worth every millisecond.

4. Budget Alerts Per-Agent, Not Per-Account

A single AWS budget alarm on the Bedrock service line item tells you when you've spent $X. It tells you nothing about which of 200 agents is responsible. We emit a custom CloudWatch metric for every agent invocation — tagged with agent_id, workflow_id, and team_id — and create a budget alarm per workflow (not per account). When a workflow hits 80% of its monthly budget, the team gets an SNS alert before it becomes a finance surprise.
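The per-invocation metric can be sketched as a single datum in the shape that CloudWatch's `put_metric_data` call accepts. The block below only builds the dictionary and does not call AWS. The metric name `TokensUsed`, the namespace mentioned in the comment, and the sample identifiers are assumptions. The three dimension names come from the paragraph above.

```python
from datetime import datetime, timezone
from typing import Optional

def token_metric_datum(agent_id: str, workflow_id: str, team_id: str,
                       tokens: int, when: Optional[datetime] = None) -> dict:
    """Build one metric datum attributing token usage to an agent, workflow, and team."""
    return {
        "MetricName": "TokensUsed",  # hypothetical metric name
        "Dimensions": [
            {"Name": "agent_id", "Value": agent_id},
            {"Name": "workflow_id", "Value": workflow_id},
            {"Name": "team_id", "Value": team_id},
        ],
        "Timestamp": when or datetime.now(timezone.utc),
        "Value": float(tokens),
        "Unit": "Count",
    }

datum = token_metric_datum("nlp-agent-07", "adverse-events", "safety-ops", tokens=1834)
# With boto3 this would be sent roughly as:
#   cloudwatch.put_metric_data(Namespace="ClinicalAgents", MetricData=[datum])
print(datum["Dimensions"])
```

Because each dimension combination is a separate metric in CloudWatch, a per-workflow alarm can sum over one `workflow_id` without touching the others.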
