Backup and Disaster Recovery
Relevant controls: SC.10.01.01, SC.10.01.02, SC.10.02, SC.10.03.01, SC.10.03.02
1. Recovery Objectives
The AI connectors platform is designed as a stateless proxy layer — it does not store business data. The only persistent state is the OAuth token cache (DynamoDB), audit logs (S3), and infrastructure secrets (SSM). This architecture simplifies recovery because the application itself can be redeployed from source at any time.
| Component | RPO | RTO | Justification |
|---|---|---|---|
| DynamoDB token cache | N/A (ephemeral) | N/A | Token cache is ephemeral — loss only means users re-authenticate on next request. No business data is lost. |
| SSM secrets | 0 (no loss acceptable) | 30 minutes | Secrets can be re-generated via terragrunt apply in infra/initial/mcps/{name}/ — Terraform generates a new client secret and writes it to SSM. |
| ECS service (application) | N/A (stateless) | 15 minutes | Stateless containers — redeploy from ECR image. Automated via CI/CD or manual ecs/render.py + aws ecs update-service. |
| ECR images | N/A (rebuildable) | 30 minutes | Images can be rebuilt from source (docker build from any commit). Dev ECR image is always available as source of truth. |
| Audit logs (S3) | 0 (no loss acceptable) | N/A (read-only archive) | S3 versioning protects against accidental overwrite/deletion. Bucket policy denies DeleteObject. |
| Terraform state (S3) | 0 (no loss acceptable) | 1 hour | S3 versioning enabled. State can be reconstructed via terraform import if catastrophically lost. |
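The RTO column above can be encoded as data so that a drill report can be checked against it mechanically. The following is a minimal sketch, not part of the platform: the component keys and the `meets_rto` helper are hypothetical names, and components whose RTO is listed as N/A are treated as having no time limit.

```python
from datetime import timedelta
from typing import Optional

# RTO targets copied from the table above. None marks components whose RTO is N/A.
# The dictionary keys are illustrative labels, not identifiers used by the platform.
RTO_TARGETS: dict[str, Optional[timedelta]] = {
    "dynamodb-token-cache": None,
    "ssm-secrets": timedelta(minutes=30),
    "ecs-service": timedelta(minutes=15),
    "ecr-images": timedelta(minutes=30),
    "audit-logs": None,
    "terraform-state": timedelta(hours=1),
}


def meets_rto(component: str, measured: timedelta) -> bool:
    """Return True if the measured recovery time is within the component's RTO."""
    target = RTO_TARGETS[component]
    return target is None or measured <= target
```

For example, `meets_rto("ecs-service", timedelta(minutes=12))` evaluates to `True`, while a 16-minute recovery of the same component would not meet the target.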
2. What Needs Backing Up
| Component | Backup required? | Method | Frequency | Retention |
|---|---|---|---|---|
| DynamoDB token cache | No | Point-in-time recovery (PITR) enabled as a safety net, but cache loss is non-critical | Continuous (AWS-managed, 35-day window) | 35 days |
| SSM parameters | No (recoverable) | Secrets can be re-generated via terragrunt apply; Terraform generates a new client secret against the Azure AD app registration | N/A (recovery is via Terraform apply) | N/A |
| Terraform state (S3) | Yes | S3 versioning + encryption | Every terragrunt apply creates a new version | All versions retained (S3 versioning, no lifecycle expiry on state bucket) |
| Audit logs (S3) | Yes | S3 versioning + lifecycle (Glacier after 90 days, expire after 365 days) | Continuous (Firehose delivers every 60s or 5 MB) | 365 days (then expired) |
| Source code | Yes | Git (GitHub) | Every commit | Indefinite (Git history) |
| ECR images | No (rebuildable) | Can be rebuilt from any Git commit | N/A | ECR lifecycle policy retains last 10 tagged images |
Why Most Components Don't Need Traditional Backup
- ECS services: Stateless containers defined entirely by code (Dockerfile + ECS task definition YAML). Fully reproducible from Git.
- DynamoDB token cache: Contains short-lived OAuth tokens with TTL-based expiry. PITR is enabled as defence in depth, but losing the cache simply forces users to re-authenticate — no data loss.
- SSM secrets: Can be re-created by running terragrunt apply in infra/initial/mcps/{name}/. Terraform generates a new client secret against the existing Azure AD app registration and writes it to SSM.
- Infrastructure: Fully described in Terraform/Terragrunt. A complete environment rebuild is possible from the infra/ directory.
3. Backup Security
Terraform State Bucket
| Property | Value |
|---|---|
| Bucket name | aiconnectors-terraform-state-{env} |
| Encryption | SSE-S3 (AES-256), encrypt = true in backend config |
| Versioning | Enabled |
| Access | AWS profile ai-connectors-{env} + GitHub Actions OIDC role (Terraform apply only) |
| Public access | Blocked (S3 Block Public Access) |
| State locking | DynamoDB table aiconnectors-terraform-lock-{env} |
Audit Log Bucket
See logging-monitoring.md §6 for full tamper-protection controls (delete protection, versioning, public access block). Backup-relevant properties:
| Property | Value |
|---|---|
| Bucket name | nn-aiconnectors-audit-{env} |
| Versioning | Enabled — overwrites create new versions |
| Lifecycle | Transition to Glacier after 90 days; expire after 365 days |
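To make the lifecycle row concrete, the sketch below builds a lifecycle document in the shape the S3 API expects (transition to Glacier, then expiry). It is an illustration only: the source does not show how the rule is actually defined (most likely in Terraform), and the rule ID and the function name are hypothetical.

```python
import json


def audit_lifecycle_configuration(glacier_after_days: int = 90,
                                  expire_after_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration matching the table above."""
    if not 0 < glacier_after_days < expire_after_days:
        raise ValueError("transition must happen before expiry")
    return {
        "Rules": [
            {
                "ID": "audit-log-retention",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # empty prefix: rule applies to every object
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the document so it can be saved to a file and reviewed.
    print(json.dumps(audit_lifecycle_configuration(), indent=2))
```

The printed JSON could be applied with aws s3api put-bucket-lifecycle-configuration --bucket nn-aiconnectors-audit-{env} --lifecycle-configuration file://lifecycle.json. Note that on a versioned bucket an Expiration rule only adds a delete marker to the current version; whether noncurrent versions also need an expiry rule is a decision this sketch does not cover.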
DynamoDB PITR
| Property | Value |
|---|---|
| Table pattern | mcp-oauth-storage-{appname}-{env} |
| PITR | Enabled (continuous backup, 35-day recovery window) |
| Encryption | AWS-managed KMS key (default DynamoDB encryption) |
| Access | ECS task role only (per-MCP, no cross-MCP access) |
Cross-Region Replication
Cross-region replication is not currently configured. All resources reside in eu-central-1. This is acceptable given:
- The platform is stateless (rebuild from source is faster than cross-region failover)
- Audit logs are the only truly irreplaceable data, and S3 is designed for 99.999999999% (11 nines) durability, storing objects redundantly across multiple Availability Zones within a single region
- If cross-region DR becomes a requirement, S3 Cross-Region Replication can be added to the audit bucket
4. IT Recovery Plan
4.1 Full ECS Service Redeploy
Scenario: ECS service deleted or unhealthy, but infrastructure (ALB, cluster, ECR) remains intact.
1. Verify the latest Docker image exists in ECR: aws ecr describe-images --repository-name mcp-{name} --region eu-central-1
2. Render the task definition: cd connectors/mcps/{name}/ecs && python render.py {env}
3. Register the new task definition: aws ecs register-task-definition --cli-input-json file://task-definition.json
4. Update the ECS service to use the new task definition: aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --task-definition mcp-{name} --force-new-deployment
5. Wait for healthy targets on the ALB: aws ecs wait services-stable --cluster aiconnectors-{env}-aiconnectors --services mcp-{name}-main-svc
6. Verify health check: curl -s https://{name}.{domain}/health
Alternatively, re-run the GitHub Actions deploy workflow (deploy-{name}-mcp.yml) which performs all of the above automatically.
Time estimate: 5–15 minutes (automated), 15–30 minutes (manual).
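The final verification step can be automated with a small polling loop instead of a single curl call. This is a minimal sketch under stated assumptions: the source does not specify what /health returns, so HTTP 200 is assumed to mean healthy, and the attempt count and interval are illustrative defaults. The `probe` and `sleep` parameters exist only so the loop can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code            # server answered with a 4xx/5xx status
    except OSError:
        return 0                   # DNS failure, refused connection, timeout


def wait_for_health(url: str,
                    attempts: int = 30,
                    interval: float = 10.0,
                    probe: Callable[[str], int] = http_status,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll url until it returns 200 (assumed healthy) or attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(interval)        # no pause after the final attempt
    return False
```

With the defaults, `wait_for_health("https://{name}.{domain}/health")` gives the service roughly five minutes to come up before reporting failure.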
4.2 SSM Secret Recovery
Scenario: Azure AD client secret deleted from SSM or compromised.
See secrets-management.md §3 for the full rotation and emergency rotation procedures. Summary:
1. Run terragrunt apply in infra/initial/mcps/{name}/. Terraform generates a new client secret and writes it to SSM.
2. Force a new ECS deployment: aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --force-new-deployment
3. Verify health check passes and a test tool call succeeds.
Time estimate: 15–30 minutes.
4.3 DynamoDB Restore
Scenario: Token cache table accidentally deleted or corrupted.
Option A — Recreate empty table (preferred, faster):
1. Run TG_ENV={env} terragrunt apply in infra/main/mcps/{name}/. Terraform will recreate the table with the correct schema and PITR enabled.
2. Force a new ECS deployment: aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --force-new-deployment
3. Users will re-authenticate on their next request. No data loss (tokens are ephemeral).
Option B — Restore from PITR (if specific data must be recovered):
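1. In the AWS Console or CLI, initiate a restore: aws dynamodb restore-table-to-point-in-time --source-table-name mcp-oauth-storage-{appname}-{env} --target-table-name mcp-oauth-storage-{appname}-{env}-restored --restore-date-time {ISO-8601}. This command requires the source table to still exist; if the table was deleted, restore instead from the system backup that DynamoDB retains for 35 days for PITR-enabled tables (aws dynamodb restore-table-from-backup).
2. Verify the restored table contents.
3. Re-point the application to the restored table by updating STORAGE_DYNAMODB_TABLE_NAME in the ECS environment. DynamoDB tables cannot be renamed in place.
4. Force a new ECS deployment.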
Time estimate: Option A: 5–10 minutes. Option B: 15–30 minutes.
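For Option B, it helps to validate the restore timestamp before calling the CLI, since a time outside the 35-day PITR window will be rejected. The sketch below assembles the command from section 4.3 as an argument list; `pitr_restore_command` is a hypothetical helper name, and the check is deliberately simple (DynamoDB's latest restorable time usually trails the current time by a few minutes, which this sketch does not model).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

PITR_WINDOW = timedelta(days=35)  # recovery window stated in section 3


def pitr_restore_command(appname: str, env: str, restore_at: datetime,
                         now: Optional[datetime] = None) -> list[str]:
    """Build the restore-table-to-point-in-time command after a window check."""
    if restore_at.tzinfo is None:
        raise ValueError("restore_at must be timezone-aware")
    now = now or datetime.now(timezone.utc)
    if not (now - PITR_WINDOW <= restore_at <= now):
        raise ValueError("restore_at is outside the 35-day PITR window")
    table = f"mcp-oauth-storage-{appname}-{env}"
    # Normalise to UTC and format as ISO 8601, which the AWS CLI accepts.
    timestamp = restore_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        "aws", "dynamodb", "restore-table-to-point-in-time",
        "--source-table-name", table,
        "--target-table-name", f"{table}-restored",
        "--restore-date-time", timestamp,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building it as a list rather than a string avoids shell-quoting problems.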
4.4 Full Environment Rebuild
Scenario: Catastrophic loss of an entire AWS environment (all ECS services, ALB, VPC, etc.). Terraform state is intact in S3.
Recovery is split into two phases. CI/CD is the primary path for both; manual steps are fallback only if GitHub Actions itself is unavailable.
Phase 1: Rebuild infrastructure
Primary (CI/CD): Push to main (or re-run the latest run) to trigger infra-apply.yml. This applies main/shared/ and all per-MCP modules in sequence, with plan preview and a full audit trail in GitHub Actions.
Fallback (local): If GitHub Actions is unavailable, run the following from the repository root. Each command runs in a subshell, so the working directory is unchanged afterwards and the relative paths stay valid:
(cd infra/main/shared && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/sharepoint && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/outlook-mcp && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/teams && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/databricks && TG_ENV={env} terragrunt apply)
Note: ACM certificate DNS validation is automatic but takes 5–30 minutes. Wait for the certificate to reach ISSUED status before proceeding.
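The local fallback sequence for Phase 1 can also be scripted so that the apply order is fixed and a failure stops the run. This is a minimal sketch, assuming it is invoked with the repository root as `repo_root`; `MODULES`, `module_dirs`, and `apply_all` are hypothetical names, and the `run` parameter is injectable only so the sequence can be checked without invoking Terragrunt.

```python
import os
import subprocess
from pathlib import Path

# Apply order taken from the fallback steps: shared infrastructure first, then each MCP.
MODULES = ["shared", "mcps/sharepoint", "mcps/outlook-mcp", "mcps/teams", "mcps/databricks"]


def module_dirs(repo_root: str) -> list[Path]:
    """Return the Terragrunt module directories under infra/main in apply order."""
    return [Path(repo_root) / "infra" / "main" / module for module in MODULES]


def apply_all(repo_root: str, env: str, run=subprocess.run) -> None:
    """Run terragrunt apply in each module; stop at the first failure."""
    for directory in module_dirs(repo_root):
        # cwd= takes the place of `cd`, so one module's path never affects the next.
        run(["terragrunt", "apply"],
            cwd=directory,
            env={**os.environ, "TG_ENV": env},
            check=True)
```

Because stdin is inherited, Terragrunt's interactive approval prompt still appears for each module, which keeps a human in the loop during a manual rebuild.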
Phase 2: Rebuild images and deploy services
Primary (CI/CD): Trigger the deploy workflow for each MCP (deploy-{name}-mcp.yml). Each workflow builds the Docker image from source, pushes to ECR, and deploys the ECS service in a single run. This is always preferable to manual steps — it uses OIDC auth, applies the correct task definition, and leaves a full audit trail.
Fallback (local): If GitHub Actions is unavailable:
cd connectors/mcps/{name}
docker build -t mcp-{name}:latest .
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin {account}.dkr.ecr.eu-central-1.amazonaws.com
docker tag mcp-{name}:latest {account}.dkr.ecr.eu-central-1.amazonaws.com/mcp-{name}:latest
docker push {account}.dkr.ecr.eu-central-1.amazonaws.com/mcp-{name}:latest
python ecs/render.py {env}
aws ecs register-task-definition --cli-input-json file://ecs/task-definition.json
aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --task-definition mcp-{name} --force-new-deployment
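The image naming used in the manual build and push steps can be captured in one place, which avoids typos when the same URI is needed for the tag and push commands. The sketch below is illustrative only: `ecr_image_uri` and `build_and_push_commands` are hypothetical helpers, the `tag` parameter is an addition that would allow tagging by commit instead of latest, and the ECR login step is left to the shell because it relies on a pipe.

```python
def ecr_image_uri(account: str, name: str, tag: str = "latest",
                  region: str = "eu-central-1") -> str:
    """Return the ECR image URI in the form used by the fallback commands."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/mcp-{name}:{tag}"


def build_and_push_commands(account: str, name: str,
                            tag: str = "latest") -> list[list[str]]:
    """Return the docker build, tag, and push commands as argument lists."""
    local = f"mcp-{name}:{tag}"
    remote = ecr_image_uri(account, name, tag)
    return [
        ["docker", "build", "-t", local, "."],   # run from connectors/mcps/{name}
        ["docker", "tag", local, remote],
        ["docker", "push", remote],
    ]
```

Each returned list can be executed in order with `subprocess.run(cmd, check=True)` after logging in to ECR.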
Verification
1. Confirm DNS: {name}.{env-prefix}connectors.novo-genai.com resolves to the new ALB.
2. Smoke test each MCP: curl -s https://{name}.{domain}/health
3. Make a test tool call and confirm audit records appear in nn-aiconnectors-audit-{env} within 60 seconds.
Time estimate: 1–2 hours (assuming Terraform state is intact).
4.5 Terraform State Loss
Scenario: State bucket accidentally deleted or corrupted, S3 versioning insufficient.
1. Create a new state bucket with the expected name: aiconnectors-terraform-state-{env}
2. Enable versioning and encryption.
3. For each Terragrunt module, run terragrunt import to re-import existing AWS resources into fresh state.
4. Alternatively, if resources are fully destroyed, follow section 4.4 (full rebuild). Terraform will create everything from scratch.
Time estimate: 4–8 hours (manual import of all resources).
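The first two steps (recreate the bucket, then enable versioning and encryption) map onto a short sequence of aws s3api calls. The sketch below only assembles those commands so they can be reviewed before anything is run; `state_bucket_commands` is a hypothetical helper, and the public access block is included on the assumption that the replacement bucket should match the properties listed in section 3.

```python
import json


def state_bucket_commands(env: str, region: str = "eu-central-1") -> list[list[str]]:
    """Return the aws s3api commands that recreate the Terraform state bucket."""
    bucket = f"aiconnectors-terraform-state-{env}"
    # SSE-S3 (AES-256) default encryption, as listed for the state bucket.
    encryption = {
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    }
    return [
        # LocationConstraint is required for regions other than us-east-1.
        ["aws", "s3api", "create-bucket", "--bucket", bucket, "--region", region,
         "--create-bucket-configuration", f"LocationConstraint={region}"],
        ["aws", "s3api", "put-bucket-versioning", "--bucket", bucket,
         "--versioning-configuration", "Status=Enabled"],
        ["aws", "s3api", "put-bucket-encryption", "--bucket", bucket,
         "--server-side-encryption-configuration", json.dumps(encryption)],
        ["aws", "s3api", "put-public-access-block", "--bucket", bucket,
         "--public-access-block-configuration",
         "BlockPublicAcls=true,IgnorePublicAcls=true,"
         "BlockPublicPolicy=true,RestrictPublicBuckets=true"],
    ]
```

Running each list through `subprocess.run(cmd, check=True)` in order would recreate the bucket; the resource imports in step 3 still have to be done per module.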
5. Recovery Testing
Recovery testing validates that documented procedures work in practice. Tests should be performed in the dev environment to avoid production impact.
| Test type | Frequency | Last tested | Outcome | Evidence location |
|---|---|---|---|---|
| ECS redeploy drill | Quarterly | — | — | GitHub Actions workflow run logs |
| DynamoDB recovery test (table recreate) | Annually | — | — | Terraform apply logs + ECS deploy logs |
| Full environment rebuild (dev) | Annually | — | — | Terraform plan/apply logs + smoke test output |
| Secret rotation drill | Annually | — | — | Terraform apply logs + ECS deploy logs |
Test Procedures
ECS Redeploy Drill (Quarterly):
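1. Scale one MCP service in dev down to zero tasks: aws ecs update-service --cluster aiconnectors-dev-aiconnectors --service mcp-{name}-main-svc --desired-count 0
2. Wait for service to reach 0 running tasks.
3. Restore: aws ecs update-service --cluster aiconnectors-dev-aiconnectors --service mcp-{name}-main-svc --desired-count 1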
4. Verify health check passes within 5 minutes.
5. Record time-to-recovery.
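Recording time-to-recovery is easier to keep consistent if the drill result has a fixed shape. The following sketch is one possible record, assuming recovery is timed from the restore command in step 3 to the first passing health check in step 4; the `DrillResult` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    """Outcome of one ECS redeploy drill."""
    service: str
    restore_started_at: datetime          # when desired count was set back to 1
    healthy_at: datetime                  # when the health check first passed
    limit: timedelta = timedelta(minutes=5)   # threshold from step 4

    @property
    def time_to_recovery(self) -> timedelta:
        return self.healthy_at - self.restore_started_at

    @property
    def passed(self) -> bool:
        return self.time_to_recovery <= self.limit
```

A drill that restores at 10:00:00 and sees a passing health check at 10:03:30 has a `time_to_recovery` of 3 minutes 30 seconds and `passed` set to `True`.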
DynamoDB Recovery Test (Annually):
1. Delete the dev token cache table for one MCP.
2. Run TG_ENV=dev terragrunt apply in infra/main/mcps/{name}/ — Terraform recreates the table with the correct schema and PITR enabled.
3. Force a new ECS deployment: aws ecs update-service --cluster aiconnectors-dev-aiconnectors --service mcp-{name}-main-svc --force-new-deployment
4. Verify the health check passes and a test tool call succeeds (users will re-authenticate on next request — expected behaviour).
5. Record time-to-recovery.
Full Environment Rebuild (Annually):
1. In a separate Terraform workspace or using a temporary TG_ENV=dr-test, run a full terragrunt apply for main/shared/ and one MCP.
2. Push a Docker image and verify the service comes up healthy.
3. Destroy the test environment after validation.
Secret Rotation Drill (Annually):
1. Run terragrunt apply in infra/initial/mcps/{name}/ for a dev MCP — Terraform generates a new client secret and updates SSM.
2. Force a new ECS deployment to pick up the new secret.
3. Verify tool calls succeed with the new secret.
4. Confirm the old secret has expired or remove it from Azure AD.
Evidence Retention
All recovery test evidence (workflow logs, Terraform output, health check screenshots) must be retained for a minimum of 3 years in accordance with Novo Nordisk quality records policy. Evidence is stored in this repository under docs/compliance/dr-test-evidence/, committed after each test run.