Deploying 200+ AI Agents on AWS Bedrock AgentCore: Architecture, Guardrails & Lessons Learned
When a global pharmaceutical client handed us a brief to automate 20 separate manual workflows for clinical data extraction — with a full HIPAA audit trail and enterprise-grade guardrails — we turned to AWS Bedrock AgentCore. Eight weeks later, 200+ agents were running in production. Here's the unvarnished technical account: the architecture, the three hardest problems, the real code, and the lessons we'd pass on to anyone attempting this at scale.
The Client Problem: 20 Manual Workflows, Zero Scalability
Our client — a top-10 global pharmaceutical company operating across 40 countries — had a clinical data operations team of 34 people whose entire job was to manually extract structured fields from unstructured clinical trial documents: adverse event reports, patient narratives, protocol amendments, and lab result PDFs.
Twenty separate workflows. Each owned by a different sub-team. Each with its own Excel macros, SharePoint folders, and tribal knowledge. Turnaround time averaged 72 hours per document batch. During peak clinical trial season, the team was drowning.
The brief was deceptively simple: automate these workflows with AI. The constraints were brutal. Every extracted field needed a full audit trail traceable to the source document and the model that produced it. Patient data could never leave a HIPAA-compliant boundary. The system needed to process 5,000+ documents per week at launch, scaling to 50,000 within 18 months. And it needed to go live in under 90 days.
We scoped the solution as a multi-agent system: one orchestrator agent per workflow type, each backed by a set of specialized sub-agents handling discrete tasks. That got us to 200+ agents across the 20 workflows.
Why AWS Bedrock AgentCore Over the Alternatives
We evaluated four serious options before committing: Google Vertex AI Agent Builder, Azure AI Agent Service, self-hosted LangChain on ECS, and AWS Bedrock AgentCore. Each has genuine strengths. Here's why AgentCore won for this specific engagement.
Managed Runtime, Not Your Problem
LangChain on ECS is extremely powerful — and extremely operationally expensive. You own the container lifecycle, the memory management, the scaling policies, the cold-start mitigation. For a 90-day delivery, we couldn't afford to spend three weeks on platform engineering before writing a single agent prompt. AgentCore's managed runtime handles all of this. Agents scale to zero when idle and spin up in under 300ms. We didn't write a single Dockerfile.
IAM-Native Design
Every AgentCore invocation is an IAM operation. That means CloudTrail captures it automatically — agent ID, caller identity, timestamp, input/output token counts, guardrail decisions — without any custom logging middleware. For HIPAA audit requirements, this was the single most important architectural property. Vertex AI and Azure AI both require you to build audit logging yourself, routing through their respective logging stacks. We'd been down that road before and didn't want to own it again.
VPC Integration and Data Residency
Clinical trial data cannot traverse the public internet, full stop. AgentCore supports VPC endpoints, meaning all model calls stay within a private network boundary. We deployed within the client's existing VPC, with PrivateLink endpoints for Bedrock, S3, DynamoDB, and Step Functions. Zero public internet exposure for any patient-adjacent data.
Built-in Guardrails API
Bedrock Guardrails is a first-class product, not an afterthought. Content filtering, PII redaction, topic denial, grounding checks — all configured declaratively via CloudFormation and applied consistently across every agent. When we compared this to building equivalent middleware in LangChain, the Guardrails path saved us an estimated six weeks of development and testing.
Architecture Deep-Dive
The system follows a hierarchical multi-agent pattern. Each of the 20 clinical workflows has a dedicated Orchestrator Agent. Each orchestrator coordinates six specialized sub-agents. Everything runs on a Step Functions Express Workflow backbone (more on why in the next section).
┌─────────────────────────────────────────────────────────────────────┐
│                        API Gateway + Cognito                        │
│                 (Client-facing ingestion endpoint)                  │
└───────────────────────────┬─────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│           Step Functions Express Workflow (per workflow)            │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │               Orchestrator Agent (AgentCore)                │    │
│  │       Model: Claude 3.5 Sonnet · Guardrail: pharma-v2       │    │
│  └───┬──────────┬──────────┬───────────┬────────────┬──────────┘    │
│      │          │          │           │            │               │
│      ▼          ▼          ▼           ▼            ▼               │
│  ┌───────┐  ┌───────┐  ┌────────┐  ┌────────┐  ┌──────────┐         │
│  │  Doc  │  │ Regex │  │Clinical│  │Validate│  │  Audit   │         │
│  │Classif│  │Extract│  │  NLP   │  │ Agent  │  │  Logger  │         │
│  └───────┘  └───────┘  └────────┘  └────────┘  └──────────┘         │
│                                  │                                  │
│                          ┌────────────────┐                         │
│                          │Report Generator│                         │
│                          └────────────────┘                         │
└─────────────────────────────────────────────────────────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
  ┌──────────────┐     ┌─────────────────┐     ┌────────────────┐
  │ S3 + Bedrock │     │    DynamoDB     │     │   SNS Alerts   │
  │  Knowledge   │     │ (Agent State +  │     │   (Errors +    │
  │     Base     │     │  Session Mem.)  │     │  Cost Alerts)  │
  └──────────────┘     └─────────────────┘     └────────────────┘
Each sub-agent has a single, tightly scoped responsibility. The Document Classifier routes incoming files by type (adverse event, protocol amendment, lab report). The Regex Extractor pulls structured fields using patterns maintained in a DynamoDB config table. The Clinical NLP Agent uses Claude 3.5 Sonnet with a specialized system prompt to extract free-text clinical entities. The Validation Agent cross-checks extracted fields against a Bedrock Knowledge Base containing formulary and coding standards. The Audit Logger writes a complete extraction record to DynamoDB and emits a CloudWatch metric. The Report Generator assembles the final structured output in the required downstream format.
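To make the config-driven Regex Extractor concrete, here is a minimal Python sketch. The pattern names, the regexes, and the in-memory `PATTERN_CONFIG` dict (standing in for the DynamoDB config table) are assumptions for illustration, not the production patterns.

```python
import re

# Hypothetical pattern config. In the system described above this would be
# loaded from the DynamoDB config table rather than hard-coded.
PATTERN_CONFIG = {
    "adverse_event": {
        "report_id": r"Report ID:\s*([A-Z]{2}-\d{6})",
        "onset_date": r"Onset date:\s*(\d{4}-\d{2}-\d{2})",
    },
}


def extract_fields(doc_type: str, text: str) -> dict:
    """Return {field: value or None} for every pattern configured for doc_type."""
    results = {}
    for field, pattern in PATTERN_CONFIG.get(doc_type, {}).items():
        match = re.search(pattern, text)
        results[field] = match.group(1) if match else None
    return results


sample = "Report ID: AE-004217\nOnset date: 2024-03-11\nNarrative: ..."
fields = extract_fields("adverse_event", sample)
```

Keeping patterns in data rather than code means a sub-team can adjust a field's regex without redeploying the agent, which is presumably why the patterns live in a config table.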
The 3 Hardest Technical Problems
Problem 1: Agent-to-Agent Communication Latency
Our first prototype chained agents directly: the Orchestrator called the Document Classifier, waited for a response, then called the Regex Extractor with the classifier's output, and so on. With six sub-agents in sequence, end-to-end latency hit 18–25 seconds per document. Acceptable for batch overnight jobs; unacceptable for the interactive query workflows the client also needed.
The fix: replace direct agent chaining with Step Functions Express Workflows as the orchestration backbone. This gave us three things we couldn't get from direct chaining: parallelization (Regex Extractor and Clinical NLP Agent now run concurrently), automatic retry with configurable backoff, and full execution history in the Step Functions console. End-to-end latency dropped to 4–7 seconds. Equally important, debugging a failed extraction became a matter of opening the Step Functions execution history — not parsing distributed logs across six separate CloudWatch streams.
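The latency gain from running independent sub-agents concurrently can be illustrated with a small Python sketch. The `call_agent` function and its sleep durations are stand-ins (assumptions) for real agent invocations; in the system described here, the Step Functions Parallel state does this job, not a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_agent(name: str, seconds: float) -> str:
    """Hypothetical stand-in for a blocking agent invocation."""
    time.sleep(seconds)
    return f"{name}:done"


def run_sequential(tasks):
    # Total time is roughly the sum of all task durations.
    return [call_agent(name, secs) for name, secs in tasks]


def run_parallel(tasks):
    # Independent tasks overlap, so total time is roughly the slowest task.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(call_agent, name, secs) for name, secs in tasks]
        return [future.result() for future in futures]


tasks = [("regex_extractor", 0.2), ("clinical_nlp", 0.3)]

start = time.perf_counter()
seq_results = run_sequential(tasks)
seq_elapsed = time.perf_counter() - start

start = time.perf_counter()
par_results = run_parallel(tasks)
par_elapsed = time.perf_counter() - start
```

Only steps with no data dependency on each other can be overlapped this way; classification still has to finish before extraction starts.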
Problem 2: Guardrails Configuration for HIPAA
Bedrock Guardrails is powerful but requires careful configuration. Out of the box, the default content filters don't cover clinical-specific risk scenarios. We built the guardrail configuration iteratively over three weeks of red-teaming:
- Denied topics: Treatment recommendations, dosing advice, diagnostic conclusions — agents are explicitly prohibited from making these statements even if the document contains them as source material
- PII redaction: Patient names, MRNs, dates of birth, addresses, and phone numbers are anonymized before any response leaves Bedrock — configured as ANONYMIZE, not BLOCK, so the extraction workflow still functions
- Custom word policies: Twelve proprietary drug compound names added to a custom blocklist to prevent the model from discussing off-label applications
- Grounding checks: All Clinical NLP responses must be grounded in the source document — hallucinated entities trigger a guardrail block event, logged to CloudTrail, and the extraction is flagged for human review
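The difference between ANONYMIZE and BLOCK matters downstream: an anonymized response can continue through the pipeline, while a blocked one should be routed to human review. The Python sketch below shows one way to make that routing decision. The response shape (`outputs`, `assessments`, and per-finding `action` values such as `ANONYMIZED` and `BLOCKED`) is an assumption modeled on a guardrail check result and should be verified against the Bedrock API reference; `route_guardrail_result` is a hypothetical helper.

```python
def _actions(node):
    """Yield every 'action' string found anywhere in a nested assessment."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "action" and isinstance(value, str):
                yield value
            else:
                yield from _actions(value)
    elif isinstance(node, list):
        for item in node:
            yield from _actions(item)


def route_guardrail_result(response: dict) -> dict:
    """Accept anonymized output; send anything blocked to human review."""
    actions = set(_actions(response.get("assessments", [])))
    text = "".join(o.get("text", "") for o in response.get("outputs", []))
    if "BLOCKED" in actions:
        return {"status": "needs_human_review", "text": None}
    return {"status": "accepted", "text": text}


# Assumed example payloads, not captured from a real API call.
anonymized = {
    "outputs": [{"text": "Patient {NAME} reported nausea."}],
    "assessments": [{"sensitiveInformationPolicy": {
        "piiEntities": [{"type": "NAME", "action": "ANONYMIZED"}]}}],
}
blocked = {
    "outputs": [{"text": "This response was blocked by compliance policy."}],
    "assessments": [{"topicPolicy": {
        "topics": [{"name": "ClinicalTreatmentAdvice", "action": "BLOCKED"}]}}],
}
```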
Problem 3: Cost Control at Scale
Two hundred agents calling Claude 3.5 Sonnet is expensive if you don't instrument cost controls from the start. We learned this the hard way when a misconfigured prompt sent one agent into a loop processing the same document 847 times before a human noticed. That incident cost $340 in a single afternoon and prompted us to build the cost control layer we should have built on day one.
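A runaway loop like this one can be caught with a simple attempt counter keyed on the document's content hash. The following Python sketch is illustrative only: `ReprocessGuard` and the cap of three attempts are assumptions, and a production version would need a shared store (for example the DynamoDB state table) rather than in-process memory.

```python
import hashlib
from collections import Counter


class ReprocessGuard:
    """Refuse to process the same document more than max_attempts times per agent."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self._attempts = Counter()

    def allow(self, agent_id: str, document_text: str) -> bool:
        # Hash the content so the key stays small and contains no patient data.
        digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
        key = (agent_id, digest)
        self._attempts[key] += 1
        return self._attempts[key] <= self.max_attempts


guard = ReprocessGuard(max_attempts=3)
decisions = [guard.allow("clinical-nlp", "same document") for _ in range(5)]
other_doc_allowed = guard.allow("clinical-nlp", "a different document")
```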
The system now has three layers: per-agent token quota enforcement via Bedrock Guardrails (max tokens per session); per-workflow CloudWatch budget alarms (the 80% threshold triggers an SNS alert, the 100% threshold aborts the Step Functions execution); and daily alerts from an AWS Cost Anomaly Detection monitor scoped to the Bedrock service. Every agent invocation also emits a custom CloudWatch metric with token counts attributed to the specific workflow and team, not just the aggregate Bedrock line item.
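The 80% and 100% thresholds reduce to a small decision function. This Python sketch is a simplified illustration; `budget_action` and its return labels are hypothetical names, and in the deployed system the alert and abort are triggered by alarms rather than by application code.

```python
def budget_action(spent_usd: float, budget_usd: float,
                  alert_at: float = 0.80, abort_at: float = 1.00) -> str:
    """Map a workflow's spend-to-budget ratio to 'ok', 'alert', or 'abort'."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spent_usd / budget_usd
    if ratio >= abort_at:
        return "abort"   # corresponds to stopping the workflow execution
    if ratio >= alert_at:
        return "alert"   # corresponds to the SNS notification
    return "ok"
```

Checking the abort threshold first ensures a workflow that jumps straight past 100% is stopped rather than merely flagged.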
Real Code: IAM Policy, Guardrails Config, Step Functions Fragment
IAM Policy for AgentCore Runtime
Least-privilege IAM for an AgentCore agent execution role. Note the scoped Redshift Serverless statement (specific workgroup ARN, not wildcard) and the explicit Knowledge Base ARN restriction.
# AgentCore execution role — inline policy (JSON)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockInvokeModel",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
    },
    {
      "Sid": "BedrockKnowledgeBase",
      "Effect": "Allow",
      "Action": [
        "bedrock:Retrieve",
        "bedrock:RetrieveAndGenerate"
      ],
      "Resource": "arn:aws:bedrock:us-east-1:123456789012:knowledge-base/KBID123456"
    },
    {
      "Sid": "RedshiftWorkgroupQuery",
      "Effect": "Allow",
      "Action": "redshift-data:ExecuteStatement",
      "Resource": "arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/clinical-wg"
    },
    {
      "Sid": "DynamoDBAgentState",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/agent-state-prod"
    }
  ]
}
Guardrails Configuration (CloudFormation)
# CloudFormation — Bedrock Guardrail for HIPAA-compliant clinical agents
GuardrailClinicalProd:
  Type: AWS::Bedrock::Guardrail
  Properties:
    Name: pharma-clinical-guardrail-prod
    Description: HIPAA-compliant guardrail for clinical extraction agents
    BlockedInputMessaging: "This query is outside the permitted scope for clinical data processing."
    BlockedOutputsMessaging: "This response was blocked by compliance policy. Please contact your data ops team."
    TopicPolicyConfig:
      TopicsConfig:
        - Name: ClinicalTreatmentAdvice
          Definition: "Any recommendation on patient treatment, dosing, diagnosis, or clinical management"
          Examples:
            - "What dose should this patient receive?"
            - "Is this drug safe for this condition?"
          Type: DENY
        - Name: OffLabelUsage
          Definition: "Discussion of drug compounds for unapproved indications"
          Type: DENY
    SensitiveInformationPolicyConfig:
      PiiEntitiesConfig:
        - Type: NAME
          Action: ANONYMIZE
        - Type: EMAIL
          Action: ANONYMIZE
        - Type: PHONE
          Action: ANONYMIZE
        - Type: ADDRESS
          Action: ANONYMIZE
        - Type: US_SOCIAL_SECURITY_NUMBER
          Action: BLOCK
    ContentPolicyConfig:
      FiltersConfig:
        - Type: HATE
          InputStrength: HIGH
          OutputStrength: HIGH
        - Type: MISCONDUCT
          InputStrength: MEDIUM
          OutputStrength: HIGH
    Tags:
      - Key: Environment
        Value: prod
      - Key: Compliance
        Value: HIPAA
Step Functions Express Workflow Fragment (Python SDK)
# Python: define the state machine (Amazon States Language) and start one execution per extraction run
import json
import uuid
from datetime import datetime, timezone

import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")

# Serialized with json.dumps(WORKFLOW_DEFINITION) when the state machine is created
# (creation not shown). Confirm the task Resource ARN against the Step Functions
# service-integration documentation before deploying.
WORKFLOW_DEFINITION = {
    "Comment": "Clinical data extraction multi-agent pipeline",
    "StartAt": "ClassifyDocument",
    "States": {
        "ClassifyDocument": {
            "Type": "Task",
            "Resource": "arn:aws:states:::bedrock:invokeAgent",
            "Parameters": {
                "AgentId": "DOC_CLASSIFIER_AGENT_ID",
                "AgentAliasId": "PROD",
                "SessionId.$": "$.sessionId",
                "InputText.$": "$.documentText",
                "EnableTrace": True
            },
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 3,
                "MaxAttempts": 2,
                "BackoffRate": 2.0
            }],
            # Attach the result instead of replacing the input, so later states
            # can still read $.sessionId and $.documentText.
            "ResultPath": "$.classification",
            "Next": "ParallelExtraction"
        },
        "ParallelExtraction": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "RegexExtract",
                    "States": {
                        "RegexExtract": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::bedrock:invokeAgent",
                            "Parameters": {
                                "AgentId": "REGEX_AGENT_ID",
                                "AgentAliasId": "PROD",
                                "SessionId.$": "$.sessionId",
                                "InputText.$": "$.documentText"
                            },
                            "End": True
                        }
                    }
                },
                {
                    "StartAt": "ClinicalNLP",
                    "States": {
                        "ClinicalNLP": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::bedrock:invokeAgent",
                            "Parameters": {
                                "AgentId": "NLP_AGENT_ID",
                                "AgentAliasId": "PROD",
                                "SessionId.$": "$.sessionId",
                                "InputText.$": "$.documentText"
                            },
                            "End": True
                        }
                    }
                }
            ],
            # A Parallel state outputs an array of branch results; store it under
            # $.extraction so $.sessionId remains available to the next state.
            "ResultPath": "$.extraction",
            "Next": "ValidateAndAudit"
        },
        "ValidateAndAudit": {
            "Type": "Task",
            "Resource": "arn:aws:states:::bedrock:invokeAgent",
            "Parameters": {
                "AgentId": "VALIDATION_AGENT_ID",
                "AgentAliasId": "PROD",
                "SessionId.$": "$.sessionId",
                "InputText.$": "States.JsonToString($)",
                "EnableTrace": True
            },
            "End": True
        }
    }
}


# Start one execution of the workflow for a single document
def run_extraction(document_text: str, workflow_id: str) -> dict:
    execution = sfn.start_execution(
        stateMachineArn=f"arn:aws:states:us-east-1:123456789012:stateMachine:clinical-extraction-{workflow_id}",
        name=f"exec-{uuid.uuid4()}",
        input=json.dumps({
            "sessionId": str(uuid.uuid4()),
            "documentText": document_text,
            "workflowId": workflow_id,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat()
        })
    )
    return execution
Production Results: 8 Weeks, $340K/yr Saved
After eight weeks of development and a two-week phased rollout, all 20 clinical workflows went live on the AgentCore platform. The outcomes were measurably better than the manual baseline on every KPI the client tracked.
[KPI cards, headline values missing: ... (vs 87% manual baseline) | ... reduction | ... latency per document | ... in 6 months]
The full HIPAA audit trail is delivered via CloudTrail. Every extracted field is traceable to the exact agent invocation, session ID, document hash, and model version that produced it. During a mock regulatory audit in month four, the compliance team was able to reconstruct the provenance of any extracted value in under two minutes — compared to hours of manual log archaeology under the old system.
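As an illustration of what a per-field provenance record might contain, here is a minimal Python sketch. The function name and record keys are hypothetical, chosen to mirror the attributes listed above (agent invocation, session ID, document hash, model version); the actual schema is not shown in this article.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(field_name: str, value: str, document_bytes: bytes,
                       agent_id: str, session_id: str, model_id: str) -> dict:
    """Assemble one provenance record for a single extracted field."""
    return {
        "field_name": field_name,
        "value": value,
        # The hash ties the value to the exact source bytes without storing them.
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        "agent_id": agent_id,
        "session_id": session_id,
        "model_id": model_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(
    field_name="onset_date",
    value="2024-03-11",
    document_bytes=b"Onset date: 2024-03-11",
    agent_id="regex-extractor",
    session_id="session-0001",
    model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
)
serialized = json.dumps(record)
```

Because the record stores a hash rather than the document itself, an auditor can confirm a value came from a given file by re-hashing that file and comparing digests.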
Guardrails blocked 1,247 out-of-scope queries across the six-month period — a 3.1% block rate. Of these, roughly 40% were legitimate prompts that were too broadly worded, and 60% were genuine scope violations. Every block event was logged, reviewed weekly, and fed back into guardrail refinement.
4 Lessons We'd Give Our Past Selves
1. Start With 3–5 Agents, Not 200
We were handed a scope of 20 workflows and immediately designed the full 200+ agent architecture. In retrospect, we should have built one workflow end-to-end first (including the observability layer, cost attribution, and audit trail infrastructure) and validated it fully before scaling. Instead, by the time ten workflows were running, we had to retrofit the cost attribution layer and spend two sprints fixing naming inconsistencies. Start small, get the platform right, then scale.
2. Guardrails Are Not Optional in Healthcare
PII redaction, topic denial, and grounding checks must be configured from day one in clinical environments, not bolted on after the first UAT feedback session. In our case, the first QA pass revealed that the Clinical NLP Agent was surfacing patient names in extraction output, and those tests had been running for three days before the guardrail was in place. Configure PII redaction before you run a single test with real documents.
3. Step Functions Over Direct Agent Chaining
Direct agent chaining (orchestrator calls sub-agent, sub-agent returns result) is seductive in its simplicity. It is also a debugging nightmare at scale. When an extraction fails at step 4 of 6, you want to see the complete execution graph, the input/output at each step, and the retry history — all in one place. Step Functions gives you this. Direct chaining does not. The extra ~200ms latency from Step Functions Express Workflows was worth every millisecond.
4. Budget Alerts Per-Agent, Not Per-Account
A single AWS budget alarm on the Bedrock service line item tells you when you've spent $X. It tells you nothing about which of 200 agents is responsible. We emit a custom CloudWatch metric for every agent invocation, tagged with agent_id, workflow_id, and team_id, and create a budget alarm per workflow (not per account). When a workflow hits 80% of its monthly budget, the team gets an SNS alert before it becomes a finance surprise.
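A sketch of the per-invocation metric payload, in Python, might look like the following. The namespace and metric names are assumptions; the returned dict follows the keyword-argument shape of boto3's CloudWatch `put_metric_data` call, so it could be sent with `cloudwatch.put_metric_data(**kwargs)`.

```python
def token_metric_kwargs(agent_id: str, workflow_id: str, team_id: str,
                        input_tokens: int, output_tokens: int) -> dict:
    """Build put_metric_data keyword arguments for one agent invocation."""
    dimensions = [
        {"Name": "agent_id", "Value": agent_id},
        {"Name": "workflow_id", "Value": workflow_id},
        {"Name": "team_id", "Value": team_id},
    ]
    return {
        "Namespace": "ClinicalAgents/Tokens",  # hypothetical namespace
        "MetricData": [
            {"MetricName": "InputTokens", "Dimensions": dimensions,
             "Value": float(input_tokens), "Unit": "Count"},
            {"MetricName": "OutputTokens", "Dimensions": dimensions,
             "Value": float(output_tokens), "Unit": "Count"},
        ],
    }


kwargs = token_metric_kwargs("clinical-nlp-07", "adverse-events", "safety-ops",
                             input_tokens=1850, output_tokens=420)
```

Attaching all three dimensions to every data point is what lets an alarm be scoped to one workflow while still allowing a drill-down to the individual agent.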