# ADR-0002: Structured Audit Logging via Kinesis Firehose
Status: Accepted
Date: 2026-04-27
## Context
Internal compliance requires a structured audit trail for every MCP tool call — recording who invoked which tool, with what parameters, and what the outcome was. The existing infrastructure provides no such trail:
- Operational logs (stdout → CloudWatch via the ECS `awslogs` driver) are not suitable for audit. The `LOG_LEVEL` env var is typically `WARNING` in production, so any audit event emitted through the Python logging hierarchy at `INFO` would be silently dropped. Audit delivery must be guaranteed regardless of operational log level.
- Azure AD sign-in logs and the M365 Unified Audit Log capture authentication events but not MCP-level tool invocations, parameters, or outcomes.
- ALB access logs record HTTP requests, but all MCP traffic is `POST /mcp` — individual tool calls are indistinguishable at the load balancer layer.
The requirement is: user identity (OID + UPN), tool name, key parameters, outcome (success / redacted / error), duration, and timestamp — emitted on every tool call, always, with a minimum 1-year retention, and with zero risk of blocking a tool call if delivery fails.
## Decision
Audit events are written directly to AWS Kinesis Firehose from inside the application, completely bypassing the Python logging hierarchy and CloudWatch. One shared delivery stream per environment (`aiconnectors-audit-dev`, `aiconnectors-audit-prod`) buffers records and delivers them as newline-delimited JSON to S3, partitioned by date. A Glue catalog table over the S3 prefix enables ad-hoc Athena queries.
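For illustration, a single record in the delivery stream might look like the sketch below. The field names are assumptions derived from the requirement above (OID + UPN, tool, params, outcome, duration, timestamp), not the documented schema. Firehose does not insert delimiters between records, so the producer appends the newline itself:

```python
import json

# Hypothetical audit event; field names are illustrative, not the actual schema.
event = {
    "timestamp": "2026-04-27T09:15:02Z",
    "user_oid": "00000000-0000-0000-0000-000000000000",
    "user_upn": "jane.doe@example.com",
    "tool": "search_documents",
    "params": {"query": "Q1 policies"},
    "outcome": "success",  # success | redacted | error
    "duration_ms": 42.0,
}

# One newline-delimited JSON record, ready to pass as the Record argument
# of firehose:PutRecord.
record = {"Data": (json.dumps(event) + "\n").encode("utf-8")}
```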
The application layer is implemented in `nn_mcp_core.audit` (`connectors/libs/nn-mcp-core/src/nn_mcp_core/audit.py`). Every MCP server calls `configure_audit()` at startup and wraps each `@mcp.tool()` function with the `@audit_tool()` decorator, which handles the try/finally timing, outcome classification, user claim extraction, and `log_tool_call()` dispatch.
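A minimal sketch of what that decorator shape implies — the real implementation in `nn_mcp_core.audit` differs in detail (user claim extraction and redaction classification are omitted here), and `log_tool_call` is stubbed:

```python
import functools
import time


def log_tool_call(**event):
    """Stub -- the real function serialises the event and dispatches to Firehose."""


def audit_tool():
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            outcome = "success"
            try:
                return await fn(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                # Always runs, whether the tool returned or raised.
                log_tool_call(
                    tool=fn.__name__,
                    params=kwargs,  # populated from **kwargs at the call site
                    outcome=outcome,  # the real module also classifies "redacted"
                    duration_ms=(time.monotonic() - start) * 1000,
                )
        return wrapper
    return decorator
```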
The ECS task role for every MCP receives `firehose:PutRecord` on the shared audit stream ARN via `infra/modules/mcp_server/iam.tf`.
## Consequences
### Positive
- Compliance requirement is satisfied — every tool invocation produces a structured, durable record regardless of operational log level
- Audit delivery is completely isolated from the operational log stream; `LOG_LEVEL` has no effect on audit emission
- One shared Firehose stream per environment serves all MCPs — no per-MCP audit infrastructure
- S3 lifecycle policy (Glacier at 90 days, expiry at 365 days) enforces retention at the storage layer, not in application code
- Athena + Glue enables ad-hoc SQL queries over the full audit history without a dedicated SIEM (see the sketch after this list)
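As a concrete example of the kind of ad-hoc query this enables — the database, table, column, and partition names below are assumptions, not the actual Glue catalog schema:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical: tool/outcome call counts per user over a date-partition range.
athena.start_query_execution(
    QueryString="""
        SELECT user_upn, tool, outcome, count(*) AS calls
        FROM audit_db.tool_calls
        WHERE dt BETWEEN '2026-04-01' AND '2026-04-27'
        GROUP BY user_upn, tool, outcome
        ORDER BY calls DESC
    """,
    QueryExecutionContext={"Database": "audit_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```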
### Negative
- Firehose buffers records for a minimum of 60 seconds before flushing to S3 — audit events are not queryable in real time
- `boto3.put_record` is synchronous and runs in a thread pool executor, adding ~5–20 ms of network latency to every tool call (see the sketch after this list)
- A persistent Firehose or IAM outage causes silent audit gaps — delivery failures are swallowed with a warning rather than surfaced to the caller
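A hedged sketch of that dispatch path, assuming an asyncio-based server: the synchronous `put_record` call is pushed onto the default thread-pool executor, and any failure is downgraded to a warning so it can never fail the tool call. Names here are illustrative, not the module's actual API:

```python
import asyncio
import json
import logging

import boto3

_firehose = boto3.client("firehose")
_logger = logging.getLogger(__name__)


async def emit_audit_event(event: dict, stream_name: str) -> None:
    loop = asyncio.get_running_loop()
    try:
        # Synchronous boto3 call moved off the event loop (~5-20 ms of network I/O).
        await loop.run_in_executor(
            None,
            lambda: _firehose.put_record(
                DeliveryStreamName=stream_name,
                Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
            ),
        )
    except Exception:
        # Swallowed by design: delivery failure must never block the tool call.
        # This is the source of the "silent audit gaps" risk noted below.
        _logger.warning("Audit event delivery failed", exc_info=True)
```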
## Risks
- Silent audit gaps: if `firehose:PutRecord` is denied (e.g. due to an IAM policy change) or the Firehose stream is deleted, tool calls succeed silently without audit records. Mitigation: CloudWatch alarm on the Firehose `DeliveryToS3.Records` metric dropping to zero.
- Parameter data quality: `params` is populated from `**kwargs` at the call site — tools must not pass large payloads or sensitive content in named parameters. Convention enforced by code review; no runtime sanitisation today.
## Alternatives Considered
### CloudWatch Logs → S3 via subscription filter
Write audit events to a dedicated CloudWatch log group, then forward to S3 via a Firehose subscription filter.
Rejected because: this adds an extra infrastructure hop (log group + subscription filter) with no benefit over writing directly to Firehose. More critically, if audit events are emitted through Python's logging hierarchy, they are subject to `LOG_LEVEL` suppression in production — violating the core requirement that audit emission is unconditional.
### DynamoDB (already provisioned per MCP)
Append audit records to the existing per-MCP DynamoDB table used for token caching.
Rejected because: DynamoDB is provisioned as a token cache, not an audit store. It has no native S3 export, no built-in lifecycle/retention policy, and write costs at audit volume (one record per tool call per user) would be significantly higher than Firehose → S3. Long-term querying over a year of audit data is not practical in DynamoDB without additional ETL.
### SIEM / external logging pipeline (e.g. Splunk, Azure Monitor)
Route audit events to an existing enterprise logging platform.
Rejected because: MCP servers run in AWS. Routing audit events to an Azure-hosted service adds cross-cloud network dependency and increases coupling to a platform that may not be available at the time a tool call is made. Firehose → S3 + Athena keeps the audit trail entirely within the same AWS account and environment boundary as the services generating it, with no external dependency on availability or access controls.
## Related
- `research/audit-logging-design.md` — full design specification
- `connectors/libs/nn-mcp-core/src/nn_mcp_core/audit.py` — shared audit module
- `infra/modules/mcp_server/iam.tf:48` — `firehose:PutRecord` IAM permission
- `infra/main/shared/` — Firehose delivery stream and S3 bucket definitions
- ADR-0001 — decision to extract `audit.py` into the `nn-mcp-core` shared library