ADR-0003: CloudWatch Alarms for MCP Security Monitoring

Status: Accepted
Date: 2026-04-27

Context

MCP server logs are shipped to CloudWatch via the awslogs driver but no alarms are configured. Security-relevant events — repeated OBO authentication failures, unhandled errors, and abnormal HTTP error rates — go undetected. This was identified as gap H3 in the OWASP Secure MCP Guide compliance analysis (research/secure_mcp_guide_compliance.md, §8.3).

Four MCP servers are currently deployed (SharePoint, Outlook, Databricks, Teams), each writing to a CloudWatch log group following the pattern /ecs/aiconnectors-{env}-aiconnectors/{appname}-main-svc/{appname}. All share the same infrastructure module (infra/modules/mcp_server/) and are managed by a common Terragrunt pattern.

The key signals available in existing logs are:

  • OBO auth failures: logger.error("OBO token exchange failed: ...") in auth.py
  • Unhandled errors: Python ERROR-level log output
  • HTTP errors: uvicorn access log lines containing the HTTP status code (e.g. "POST /mcp HTTP/1.1" 422)

Decision

Add CloudWatch Metric Filters and Alarms to the shared mcp_server Terraform module, and create one SNS topic per environment in infra/main/shared/ to route alarm notifications to the owner group by email.

Alarm definitions (per MCP):

Alarm         | Log filter pattern          | Threshold | Period
Auth failures | "OBO token exchange failed" | ≥ 5       | 5 min
Error rate    | "ERROR"                     | ≥ 20      | 5 min
4xx spike     | 4??                         | ≥ 50      | 5 min
5xx spike     | 5??                         | ≥ 10      | 5 min

All metrics are published under namespace MCP/Security with dimension MCP = {appname}, and all alarms set treat_missing_data = "notBreaching".
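As an illustration of the auth-failure row, a metric filter and its alarm might look like the sketch below. The resource labels, variable names, metric name, Sum statistic and single evaluation period are assumptions, not taken from the module. The sketch also carries the app name in the metric name rather than in the MCP dimension, which is a simplification of the definition above.

```hcl
# Sketch only: names below are illustrative, not the module's actual code.
variable "appname" { type = string }
variable "environment" { type = string }
variable "sns_topic_arn" { type = string }

locals {
  # Log group naming pattern from the Context section.
  log_group_name = "/ecs/aiconnectors-${var.environment}-aiconnectors/${var.appname}-main-svc/${var.appname}"
}

resource "aws_cloudwatch_log_metric_filter" "auth_failures" {
  name           = "${var.appname}-${var.environment}-auth-failures"
  log_group_name = local.log_group_name
  pattern        = "\"OBO token exchange failed\""

  metric_transformation {
    # Assumption: app name is encoded in the metric name, not a dimension.
    name      = "${var.appname}-AuthFailures"
    namespace = "MCP/Security"
    value     = "1" # each matching log event adds one to the metric
  }
}

resource "aws_cloudwatch_metric_alarm" "auth_failures" {
  alarm_name          = "${var.appname}-${var.environment}-auth-failures"
  namespace           = "MCP/Security"
  metric_name         = aws_cloudwatch_log_metric_filter.auth_failures.metric_transformation[0].name
  statistic           = "Sum" # assumed: total matches within the period
  period              = 300   # 5 minutes, in seconds
  evaluation_periods  = 1     # assumed: alarm on a single breaching period
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 5
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
```

The other three rows would follow the same shape with a different pattern and threshold.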

SNS: One topic per environment (aiconnectors-alarms-{env}). The subscriber list is a Terraform variable (alarm_email_subscribers, type list(string)) defined in infra/main/shared/ and set per environment in the corresponding _envs/*.hcl file. This keeps the list out of module source and allows it to differ between environments without touching shared code.
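A minimal sketch of the topic and its email subscriptions, assuming one subscription resource per address. The resource labels and the output name sns_topic_arn are illustrative; only the topic name and the alarm_email_subscribers variable come from the text above.

```hcl
# Sketch only: resource labels and the output name are assumptions.
variable "environment" { type = string }

variable "alarm_email_subscribers" {
  type        = list(string)
  description = "Email addresses that receive alarm notifications"
}

resource "aws_sns_topic" "alarms" {
  name = "aiconnectors-alarms-${var.environment}"
}

# One email subscription per address; each must be confirmed by its recipient.
resource "aws_sns_topic_subscription" "alarm_email" {
  for_each  = toset(var.alarm_email_subscribers)
  topic_arn = aws_sns_topic.alarms.arn
  protocol  = "email"
  endpoint  = each.value
}

output "sns_topic_arn" {
  value = aws_sns_topic.alarms.arn
}
```

Using for_each over a set keeps each subscription addressable by its email, so removing one address does not recreate the others.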

Wiring: Each MCP's terragrunt.hcl receives sns_topic_arn from the existing dependency "shared" block — no new dependency is introduced.
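The wiring might look roughly like this in one MCP's terragrunt.hcl. The config_path and the output name sns_topic_arn are hypothetical; the real dependency block already exists and may carry other settings.

```hcl
# Sketch only: config_path and the output name are assumptions.
dependency "shared" {
  config_path = "../shared"
}

inputs = {
  # Pass the environment's alarm topic into the mcp_server module.
  sns_topic_arn = dependency.shared.outputs.sns_topic_arn
}
```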

Alarms are deployed in both dev and prod. Thresholds are initial values to be tuned after observing baseline traffic.

Consequences

Positive

  • Security-relevant events now produce actionable alerts rather than silently accumulating in logs
  • Placing the resources in mcp_server means all existing and future MCPs inherit alarms automatically — consistent with how SSM and DynamoDB IAM already work in that module
  • No application code changes required; all signal comes from existing log output
  • Closes OWASP Secure MCP Guide gap H3 (§8.3 continuous monitoring)

Negative

  • SNS email subscriptions require manual confirmation after the first terragrunt apply on main/shared/; each of the four recipients receives alarm emails only after confirming their own subscription
  • Alarm thresholds are estimates — they will need tuning after observing real traffic baselines to avoid false positives

Risks

  • Pattern false positives: The "ERROR" filter matches any log line containing the word "ERROR", so an INFO-level message that mentions "ERROR" in its text is counted alongside genuine Python ERROR-level entries. Mitigation: monitor the false-positive rate in dev after deployment and tighten the pattern (e.g. "ERROR:") if needed. The 4?? / 5?? patterns are meant to match any standalone 3-character word starting with 4 or 5. In uvicorn's access log format ("POST /mcp HTTP/1.1" 404 Not Found) the HTTP status code is the only such word, because port numbers are attached to the IP address with : and are not standalone words.
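  • Log group existence: Metric filters reference a log group name derived from var.appname and var.environment. If the log group does not yet exist when the filter is created, Terraform will fail. The log group is created by ecs_app_single_stage in the same module apply, but sharing an apply does not by itself fix the order: a name built only from variables gives Terraform no dependency edge. Ordering is guaranteed within a single terragrunt apply only if the metric filters reference an attribute of ecs_app_single_stage or declare depends_on on it.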
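One way to make the ordering in the log group risk explicit is a depends_on on the metric filter, sketched below for the error-rate filter. It assumes ecs_app_single_stage is a module block with that label in the same configuration; the filter and metric names are illustrative.

```hcl
# Sketch only: assumes a module "ecs_app_single_stage" block exists alongside.
variable "appname" { type = string }
variable "environment" { type = string }

locals {
  log_group_name = "/ecs/aiconnectors-${var.environment}-aiconnectors/${var.appname}-main-svc/${var.appname}"
}

resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "${var.appname}-${var.environment}-errors"
  log_group_name = local.log_group_name
  pattern        = "\"ERROR\""

  metric_transformation {
    name      = "${var.appname}-Errors" # illustrative metric name
    namespace = "MCP/Security"
    value     = "1"
  }

  # Explicit edge: create the filter only after the module that owns the
  # log group has been applied.
  depends_on = [module.ecs_app_single_stage]
}
```

If the module exposes the log group name as an output, referencing that output in log_group_name would create the same edge implicitly.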

Alternatives Considered

Option B: New mcp_alarms local module

Create infra/modules/mcp_alarms/ and instantiate it separately in each MCP's terragrunt.hcl.

Rejected because: Introduces 4× duplication in terragrunt files and does not auto-apply to future MCPs. The mcp_server module already owns per-MCP IAM and DynamoDB resources; alarm resources belong there for the same reason.

Option C: Alarms in infra/main/shared/

Add all alarms to the shared module with a hardcoded list of MCP names.

Rejected because: Requires updating main/shared/ every time a new MCP is added, coupling the shared layer to per-MCP concerns. Violates the existing pattern where each MCP owns its resources.

ALB-native metrics for HTTP 4xx/5xx

Use AWS/ApplicationELB metrics (HTTPCode_Target_4XX_Count) instead of log-based filters for HTTP error rate alarms.

Not adopted (for now): ALB metrics are per-target-group, requiring the target group ARN to be plumbed through the module. Log-based filters on uvicorn's access log output achieve the same signal with less wiring. If per-MCP HTTP metrics need to be more precise, ALB-based alarms can be added in a follow-up.
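For comparison, an ALB-based 4xx alarm might look like the sketch below, which shows the extra inputs this alternative would need. The two arn_suffix variables are hypothetical module inputs, and the threshold simply reuses the 4xx value from the table. Note that the metric's dimensions take ARN suffixes rather than full ARNs.

```hcl
# Sketch only: the arn_suffix inputs are hypothetical and not in the module.
variable "appname" { type = string }
variable "sns_topic_arn" { type = string }
variable "target_group_arn_suffix" { type = string }
variable "load_balancer_arn_suffix" { type = string }

resource "aws_cloudwatch_metric_alarm" "alb_target_4xx" {
  alarm_name          = "${var.appname}-alb-target-4xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_4XX_Count"
  statistic           = "Sum"
  period              = 300 # 5 minutes, in seconds
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 50
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]

  # Scope the metric to one MCP's target group behind the load balancer.
  dimensions = {
    TargetGroup  = var.target_group_arn_suffix
    LoadBalancer = var.load_balancer_arn_suffix
  }
}
```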

References

  • GitHub issue #68: original ticket
  • research/secure_mcp_guide_compliance.md: H3 gap description
  • infra/modules/mcp_server/: module being extended
  • connectors/mcps/*/src/*/auth.py: source of the "OBO token exchange failed" log message