ADR-0002: Structured Audit Logging via Kinesis Firehose#

Status: Accepted
Date: 2026-04-27

Context#

Internal compliance requires a structured audit trail for every MCP tool call — recording who invoked which tool, with what parameters, and what the outcome was. The existing infrastructure provides no such trail:

  • Operational logs (stdout → CloudWatch via ECS awslogs driver) are not suitable for audit. The LOG_LEVEL env var is typically WARNING in production, so any audit event emitted through the Python logging hierarchy at INFO would be silently dropped (see the snippet after this list). Audit delivery must be guaranteed regardless of operational log level.
  • Azure AD sign-in logs and M365 Unified Audit Log capture authentication events but not MCP-level tool invocations, parameters, or outcomes.
  • ALB access logs record HTTP requests but all MCP traffic is POST /mcp — individual tool calls are indistinguishable at the load balancer layer.
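
A two-line illustration of the suppression problem; the logger name here is hypothetical:

```python
import logging

logging.basicConfig(level=logging.WARNING)  # production: LOG_LEVEL=WARNING

# An audit event routed through the logging hierarchy at INFO never reaches
# any handler, because INFO < WARNING.
logging.getLogger("audit").info("tool_call user=jdoe tool=search_documents")
```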

The requirement is: user identity (OID + UPN), tool name, key parameters, outcome (success / redacted / error), duration, and timestamp — emitted on every tool call, always, with a minimum 1-year retention, and with zero risk of blocking a tool call if delivery fails.
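
Concretely, one such event might look like the following sketch; the field names are illustrative assumptions, not the actual schema:

```python
# Illustrative audit event, serialized as one JSON line per record.
# Field names are assumptions; the Glue table definition is authoritative.
audit_event = {
    "timestamp": "2026-04-27T09:14:03.512Z",
    "user_oid": "b1f6...",           # Azure AD object ID
    "user_upn": "jdoe@example.com",  # user principal name
    "tool": "search_documents",
    "params": {"query": "q3 budget", "limit": 10},
    "outcome": "success",            # success | redacted | error
    "duration_ms": 182,
}
```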

Decision#

Audit events are written directly to AWS Kinesis Firehose from inside the application, completely bypassing the Python logging hierarchy and CloudWatch. One shared delivery stream per environment (aiconnectors-audit-dev, aiconnectors-audit-prod) buffers records and delivers them as newline-delimited JSON to S3, partitioned by date. A Glue catalog table over the S3 prefix enables ad-hoc Athena queries.
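A minimal sketch of that dispatch path, assuming the stream name is resolved from configuration at startup; the function and variable names are illustrative, not necessarily the nn_mcp_core.audit API:

```python
import json
import logging

import boto3

logger = logging.getLogger(__name__)
firehose = boto3.client("firehose")
STREAM_NAME = "aiconnectors-audit-prod"  # per-environment, from config

def log_tool_call(event: dict) -> None:
    """Best-effort delivery: a failure must never block the tool call."""
    try:
        # Firehose concatenates records as-is, so the trailing newline is
        # what makes the delivered S3 objects newline-delimited JSON.
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
        )
    except Exception as exc:
        logger.warning("audit delivery failed: %s", exc)
```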

The application layer is implemented in nn_mcp_core.audit (connectors/libs/nn-mcp-core/src/nn_mcp_core/audit.py). Every MCP server calls configure_audit() at startup and wraps each @mcp.tool() function with the @audit_tool() decorator, which handles the try/finally timing, outcome classification, user claim extraction, and log_tool_call() dispatch.
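In outline, the decorator looks something like the sketch below. It builds on the log_tool_call sketch above; get_user_claims is a hypothetical stand-in for the real claim extraction, and the names are not the actual nn_mcp_core.audit signatures:

```python
import functools
import time

def audit_tool():
    """Wrap an @mcp.tool() function with timing, outcome classification,
    and audit dispatch. Illustrative sketch, not the real implementation."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            outcome = "success"  # may be reclassified as "redacted" or "error"
            try:
                return await fn(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                # Runs even when the tool raised; log_tool_call itself
                # swallows delivery failures, so this never blocks the call.
                log_tool_call({
                    "tool": fn.__name__,
                    "params": kwargs,  # named parameters only, by convention
                    "outcome": outcome,
                    "duration_ms": int((time.monotonic() - start) * 1000),
                    **get_user_claims(),  # hypothetical: OID + UPN from the auth context
                })
        return wrapper
    return decorator
```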

The ECS task role for every MCP receives firehose:PutRecord on the shared audit stream ARN via infra/modules/mcp_server/iam.tf.

Consequences#

Positive#

  • Compliance requirement is satisfied — every tool invocation produces a structured, durable record regardless of operational log level
  • Audit delivery is completely isolated from the operational log stream; LOG_LEVEL has no effect on audit emission
  • One shared Firehose stream per environment serves all MCPs — no per-MCP audit infrastructure
  • S3 lifecycle policy (Glacier at 90 days, expiry at 365 days) enforces retention at the storage layer, not in application code
  • Athena + Glue enables ad-hoc SQL queries over the full audit history without a dedicated SIEM (see the example query after this list)
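
For instance, a query sketch via boto3, where the database, table, and partition column names are assumptions based on the record shape above:

```python
import boto3

athena = boto3.client("athena")

# Which users called which tools yesterday, and with what outcomes?
# Database, table, and partition column (dt) are assumed names.
athena.start_query_execution(
    QueryString="""
        SELECT user_upn, tool, outcome, COUNT(*) AS calls
        FROM audit.aiconnectors_audit_prod
        WHERE dt = '2026-04-26'
        GROUP BY user_upn, tool, outcome
        ORDER BY calls DESC
    """,
    QueryExecutionContext={"Database": "audit"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-query-results/"},
)
```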

Negative#

  • Firehose buffers records for a minimum of 60 seconds before flushing to S3, so audit events are not queryable in real time
  • The boto3 Firehose put_record call is synchronous and runs in a thread pool executor (see the sketch after this list), adding ~5–20 ms of network latency to every tool call
  • A persistent Firehose or IAM outage causes silent audit gaps — delivery failures are swallowed with a warning rather than surfaced to the caller
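
A minimal sketch of that executor hand-off, assuming an async server and the log_tool_call sketch above; dispatch_audit is a hypothetical name:

```python
import asyncio

async def dispatch_audit(event: dict) -> None:
    # boto3 is blocking, so the synchronous put_record runs on the default
    # thread pool executor; the tool call still awaits the round trip, which
    # is where the ~5-20 ms per call comes from.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, log_tool_call, event)
```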

Risks#

  • Silent audit gaps: If firehose:PutRecord is denied (e.g. due to an IAM policy change) or the Firehose stream is deleted, tool calls succeed but produce no audit records. Mitigation: a CloudWatch alarm on the Firehose DeliveryToS3.Records metric dropping to zero (see the sketch after this list).
  • Parameter data quality: params is populated from **kwargs at the call site — tools must not pass large payloads or sensitive content in named parameters. Convention enforced by code review; no runtime sanitisation today.
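
A sketch of that alarm via boto3; the thresholds and alarm name are assumptions, and in practice the alarm would be defined in Terraform alongside the stream:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the prod audit stream stops delivering records to S3.
# Treating missing data as breaching also catches a deleted stream.
cloudwatch.put_metric_alarm(
    AlarmName="aiconnectors-audit-prod-delivery-stopped",  # hypothetical name
    Namespace="AWS/Firehose",
    MetricName="DeliveryToS3.Records",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "aiconnectors-audit-prod"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
)
```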

Alternatives Considered#

CloudWatch Logs → S3 via subscription filter#

Write audit events to a dedicated CloudWatch log group, then forward to S3 via a Firehose subscription filter.

Rejected because: This adds an extra infrastructure hop (log group + subscription filter) with no benefit over writing directly to Firehose. More critically, if audit events are emitted through Python's logging hierarchy, they are subject to LOG_LEVEL suppression in production — violating the core requirement that audit emission is unconditional.

DynamoDB (already provisioned per MCP)#

Append audit records to the existing per-MCP DynamoDB table used for token caching.

Rejected because: DynamoDB is provisioned as a token cache, not an audit store. It has no native S3 export, no built-in lifecycle/retention policy, and write costs at audit volume (one record per tool call per user) would be significantly higher than Firehose → S3. Long-term querying over a year of audit data is not practical in DynamoDB without additional ETL.

SIEM / external logging pipeline (e.g. Splunk, Azure Monitor)#

Route audit events to an existing enterprise logging platform.

Rejected because: MCP servers run in AWS. Routing audit events to an Azure-hosted service adds cross-cloud network dependency and increases coupling to a platform that may not be available at the time a tool call is made. Firehose → S3 + Athena keeps the audit trail entirely within the same AWS account and environment boundary as the services generating it, with no external dependency on availability or access controls.

References#

  • research/audit-logging-design.md — full design specification
  • connectors/libs/nn-mcp-core/src/nn_mcp_core/audit.py — shared audit module
  • infra/modules/mcp_server/iam.tf:48 — firehose:PutRecord IAM permission
  • infra/main/shared/ — Firehose delivery stream and S3 bucket definitions
  • ADR-0001 — decision to extract audit.py into nn-mcp-core shared library