ADR-0003: CloudWatch Alarms for MCP Security Monitoring
Status: Accepted
Date: 2026-04-27
Context
MCP server logs are shipped to CloudWatch via the awslogs driver but no alarms are configured. Security-relevant events — repeated OBO authentication failures, unhandled errors, and abnormal HTTP error rates — go undetected. This was identified as gap H3 in the OWASP Secure MCP Guide compliance analysis (research/secure_mcp_guide_compliance.md, §8.3).
Four MCP servers are currently deployed (SharePoint, Outlook, Databricks, Teams), each writing to a CloudWatch log group following the pattern /ecs/aiconnectors-{env}-aiconnectors/{appname}-main-svc/{appname}. All share the same infrastructure module (infra/modules/mcp_server/) and are managed by a common Terragrunt pattern.
The key signals available in existing logs are:
- OBO auth failures: logger.error("OBO token exchange failed: ...") in auth.py
- Unhandled errors: Python ERROR-level log output
- HTTP errors: uvicorn access log lines containing the HTTP status code (e.g. "POST /mcp HTTP/1.1" 422)
Decision
Add CloudWatch Metric Filters and Alarms to the shared mcp_server Terraform module, and create one SNS topic per environment in infra/main/shared/ to route alarm notifications to the owner group by email.
Alarm definitions (per MCP):
| Alarm | Log Filter Pattern | Threshold | Period |
|---|---|---|---|
| Auth failures | "OBO token exchange failed" | ≥ 5 | 5 min |
| Error rate | "ERROR" | ≥ 20 | 5 min |
| 4xx spike | 4?? | ≥ 50 | 5 min |
| 5xx spike | 5?? | ≥ 10 | 5 min |
All alarms use namespace MCP/Security with dimension MCP = {appname} and treat_missing_data = "notBreaching".
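As a rough illustration of the auth-failure row, a metric filter and alarm pair in the mcp_server module might look like the sketch below. The resource labels, the metric name, local.log_group_name, and evaluation_periods = 1 are assumptions, not taken from the module. The MCP dimension is left out of the sketch because CloudWatch metric filter dimensions take their values from fields extracted by JSON or space-delimited patterns; the app name is carried in the metric name here instead, and how the dimension is actually attached is not specified in this ADR.

```hcl
locals {
  # Log group naming pattern from the Context section; the local itself is hypothetical.
  log_group_name = "/ecs/aiconnectors-${var.environment}-aiconnectors/${var.appname}-main-svc/${var.appname}"
}

# Counts log events containing the OBO failure message.
resource "aws_cloudwatch_log_metric_filter" "auth_failures" {
  name           = "${var.appname}-${var.environment}-auth-failures"
  log_group_name = local.log_group_name
  pattern        = "\"OBO token exchange failed\""

  metric_transformation {
    name      = "${var.appname}-AuthFailures" # assumed metric name
    namespace = "MCP/Security"
    value     = "1"
  }
}

# Fires when 5 or more failures are counted in one 5-minute period.
resource "aws_cloudwatch_metric_alarm" "auth_failures" {
  alarm_name          = "${var.appname}-${var.environment}-auth-failures"
  namespace           = "MCP/Security"
  metric_name         = "${var.appname}-AuthFailures"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1 # assumption: a single breaching period triggers the alarm
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 5
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
```

The other three rows would follow the same shape with a different pattern and threshold.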
SNS: One topic per environment (aiconnectors-alarms-{env}). The subscriber list is a Terraform variable (alarm_email_subscribers, type list(string)) defined in infra/main/shared/ and set per environment in the corresponding _envs/*.hcl file. This keeps the list out of module source and allows it to differ between environments without touching shared code.
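A minimal sketch of the shared-layer SNS resources described above, assuming infra/main/shared/ already has an environment variable; the resource labels and the output name are illustrative.

```hcl
variable "alarm_email_subscribers" {
  type        = list(string)
  description = "Email addresses that receive MCP alarm notifications"
  default     = []
}

resource "aws_sns_topic" "alarms" {
  name = "aiconnectors-alarms-${var.environment}"
}

# One email subscription per address; each recipient must confirm by email.
resource "aws_sns_topic_subscription" "alarm_email" {
  for_each  = toset(var.alarm_email_subscribers)
  topic_arn = aws_sns_topic.alarms.arn
  protocol  = "email"
  endpoint  = each.value
}

# Exposed so that each MCP's terragrunt.hcl can read it through dependency "shared".
output "sns_topic_arn" {
  value = aws_sns_topic.alarms.arn
}
```

Using for_each over a set keeps subscriptions keyed by address, so removing one address from the list does not recreate the others.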
Wiring: Each MCP's terragrunt.hcl receives sns_topic_arn from the existing dependency "shared" block — no new dependency is introduced.
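The wiring could be sketched as follows in an MCP's terragrunt.hcl. The config_path value is a placeholder, and the output name sns_topic_arn on the shared module is an assumption; the existing dependency block is shown only for context.

```hcl
# Existing block in each MCP's terragrunt.hcl (path shown is hypothetical).
dependency "shared" {
  config_path = "../../shared"
}

inputs = {
  # New input: forwards the per-environment alarm topic to the mcp_server module.
  sns_topic_arn = dependency.shared.outputs.sns_topic_arn
}
```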
Alarms fire in both dev and prod. Thresholds are initial values to be tuned after observing baseline traffic.
Consequences
Positive
- Security-relevant events now produce actionable alerts rather than silently accumulating in logs
- Placing the resources in mcp_server means all existing and future MCPs inherit alarms automatically, consistent with how SSM and DynamoDB IAM already work in that module
- No application code changes required; all signal comes from existing log output
- Closes OWASP Secure MCP Guide gap H3 (§8.3 continuous monitoring)
Negative
- SNS email subscriptions require manual confirmation after the first terragrunt apply on main/shared/; alarms will not deliver until all four recipients confirm
- Alarm thresholds are estimates; they will need tuning after observing real traffic baselines to avoid false positives
Risks
- Pattern false positives: The "ERROR" filter matches any log line containing the word "ERROR", not just Python ERROR-level entries. If any INFO-level log message contains the word "ERROR" in its text, it will be counted. Mitigation: monitor the false-positive rate in dev after deployment and tighten the pattern (e.g. "ERROR:") if needed. The 4?? / 5?? patterns match any standalone 3-character word starting with 4 or 5; in uvicorn's access log format ("POST /mcp HTTP/1.1" 404 Not Found) the HTTP status code is the only such word, as port numbers are attached to the IP with : and are not standalone words.
- Log group existence: Metric filters reference a log group name derived from var.appname and var.environment. If the log group does not yet exist when the filter is created, Terraform will fail. The log group is created by ecs_app_single_stage in the same module apply, so ordering holds within a single terragrunt apply as long as the filter depends on it. A name built only from variables does not by itself create a Terraform dependency, so the filter needs an explicit depends_on or a reference to the log group.
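The ordering concern in the log group risk can be made explicit in the module. The sketch below assumes ecs_app_single_stage is invoked as a module block with that same label inside mcp_server; the resource label, metric name, and local are hypothetical.

```hcl
locals {
  # Same naming pattern as in the Context section; the local is an assumption.
  log_group_name = "/ecs/aiconnectors-${var.environment}-aiconnectors/${var.appname}-main-svc/${var.appname}"
}

resource "aws_cloudwatch_log_metric_filter" "error_rate" {
  name           = "${var.appname}-${var.environment}-errors"
  log_group_name = local.log_group_name
  pattern        = "\"ERROR\""

  metric_transformation {
    name      = "${var.appname}-Errors" # assumed metric name
    namespace = "MCP/Security"
    value     = "1"
  }

  # Because the log group name is built from variables, Terraform cannot infer
  # the ordering. This forces the filter to wait for the module that creates it.
  depends_on = [module.ecs_app_single_stage]
}
```

If ecs_app_single_stage exposed the log group name as an output, referencing that output in log_group_name would give the same ordering without depends_on.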
Alternatives Considered
Option B: New mcp_alarms local module
Create infra/modules/mcp_alarms/ and instantiate it separately in each MCP's terragrunt.hcl.
Rejected because: Introduces 4× duplication in terragrunt files and does not auto-apply to future MCPs. The mcp_server module already owns per-MCP IAM and DynamoDB resources; alarm resources belong there for the same reason.
Option C: Alarms in infra/main/shared/
Add all alarms to the shared module with a hardcoded list of MCP names.
Rejected because: Requires updating main/shared/ every time a new MCP is added, coupling the shared layer to per-MCP concerns. Violates the existing pattern where each MCP owns its resources.
ALB-native metrics for HTTP 4xx/5xx
Use AWS/ApplicationELB metrics (HTTPCode_Target_4XX_Count) instead of log-based filters for HTTP error rate alarms.
Not adopted (for now): ALB metrics are per-target-group, requiring the target group ARN to be plumbed through the module. Log-based filters on uvicorn's access log output achieve the same signal with less wiring. If per-MCP HTTP metrics need to be more precise, ALB-based alarms can be added in a follow-up.
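For comparison, a follow-up ALB-based alarm might look roughly like this. The two variables are hypothetical and stand for the extra plumbing mentioned above; the threshold simply reuses the 4xx value from the alarm table.

```hcl
# Hypothetical inputs that would have to be passed into mcp_server.
variable "target_group_arn_suffix" {
  type = string
}

variable "load_balancer_arn_suffix" {
  type = string
}

resource "aws_cloudwatch_metric_alarm" "alb_target_4xx" {
  alarm_name          = "${var.appname}-${var.environment}-alb-4xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_4XX_Count"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1 # assumption, matching the log-based sketches
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 50
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]

  # ALB metrics are addressed by ARN suffixes rather than full ARNs.
  dimensions = {
    TargetGroup  = var.target_group_arn_suffix
    LoadBalancer = var.load_balancer_arn_suffix
  }
}
```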
Related
- GitHub issue #68: original ticket
- research/secure_mcp_guide_compliance.md: H3 gap description
- infra/modules/mcp_server/: module being extended
- connectors/mcps/*/src/*/auth.py: source of the "OBO token exchange failed" log message