Backup and Disaster Recovery

Relevant controls: SC.10.01.01, SC.10.01.02, SC.10.02, SC.10.03.01, SC.10.03.02


1. Recovery Objectives

The AI connectors platform is designed as a stateless proxy layer that does not store business data. The only persistent state consists of the OAuth token cache (DynamoDB), audit logs (S3), and infrastructure secrets (SSM). This architecture simplifies recovery because the application itself can be redeployed from source at any time.

| Component | RPO | RTO | Justification |
|---|---|---|---|
| DynamoDB token cache | N/A (ephemeral) | N/A | Token cache is ephemeral; loss only means users re-authenticate on next request. No business data is lost. |
| SSM secrets | 0 (no loss acceptable) | 30 minutes | Secrets can be re-generated via terragrunt apply in infra/initial/mcps/{name}/: Terraform generates a new client secret and writes it to SSM. |
| ECS service (application) | N/A (stateless) | 15 minutes | Stateless containers; redeploy from ECR image. Automated via CI/CD or manual ecs/render.py + aws ecs update-service. |
| ECR images | N/A (rebuildable) | 30 minutes | Images can be rebuilt from source (docker build from any commit). Dev ECR image is always available as source of truth. |
| Audit logs (S3) | 0 (no loss acceptable) | N/A (read-only archive) | S3 versioning protects against accidental overwrite/deletion. Bucket policy denies DeleteObject. |
| Terraform state (S3) | 0 (no loss acceptable) | 1 hour | S3 versioning enabled. State can be reconstructed via terraform import if catastrophically lost. |

2. What Needs Backing Up

| Component | Backup required? | Method | Frequency | Retention |
|---|---|---|---|---|
| DynamoDB token cache | No | Point-in-time recovery (PITR) enabled as a safety net, but cache loss is non-critical | Continuous (AWS-managed, 35-day window) | 35 days |
| SSM parameters | No (recoverable) | Secrets can be re-generated via terragrunt apply: Terraform generates a new client secret against the Azure AD app registration | N/A (recovery is via Terraform apply) | N/A |
| Terraform state (S3) | Yes | S3 versioning + encryption | Every terragrunt apply creates a new version | All versions retained (S3 versioning, no lifecycle expiry on state bucket) |
| Audit logs (S3) | Yes | S3 versioning + lifecycle (Glacier after 90 days, expire after 365 days) | Continuous (Firehose delivers every 60s or 5 MB) | 365 days (then expired) |
| Source code | Yes | Git (GitHub) | Every commit | Indefinite (Git history) |
| ECR images | No (rebuildable) | Can be rebuilt from any Git commit | N/A | ECR lifecycle policy retains last 10 tagged images |

Why Most Components Don't Need Traditional Backup

  • ECS services: Stateless containers defined entirely by code (Dockerfile + ECS task definition YAML). Fully reproducible from Git.
  • DynamoDB token cache: Contains short-lived OAuth tokens with TTL-based expiry. PITR is enabled as defence in depth, but losing the cache simply forces users to re-authenticate — no data loss.
  • SSM secrets: Can be re-created by running terragrunt apply in infra/initial/mcps/{name}/ — Terraform generates a new client secret against the existing Azure AD app registration and writes it to SSM.
  • Infrastructure: Fully described in Terraform/Terragrunt. A complete environment rebuild is possible from the infra/ directory.

3. Backup Security

Terraform State Bucket

| Property | Value |
|---|---|
| Bucket name | aiconnectors-terraform-state-{env} |
| Encryption | SSE-S3 (AES-256), encrypt = true in backend config |
| Versioning | Enabled |
| Access | AWS profile ai-connectors-{env} + GitHub Actions OIDC role (Terraform apply only) |
| Public access | Blocked (S3 Block Public Access) |
| State locking | DynamoDB table aiconnectors-terraform-lock-{env} |
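
The properties in this table can be spot-checked from the command line. The following is a minimal sketch, assuming the caller's AWS credentials are allowed to read the bucket configuration. The function name check_state_bucket is illustrative and is not part of the repository.

```shell
# Hypothetical helper: prints the versioning, encryption, and public-access
# settings of the Terraform state bucket for one environment.
check_state_bucket() (
  env="$1"
  bucket="aiconnectors-terraform-state-${env}"

  # Expected output: Enabled
  aws s3api get-bucket-versioning --bucket "$bucket" \
    --query 'Status' --output text

  # Expected output for SSE-S3: AES256
  aws s3api get-bucket-encryption --bucket "$bucket" \
    --query 'ServerSideEncryptionConfiguration.Rules[0].ApplyServerSideEncryptionByDefault.SSEAlgorithm' \
    --output text

  # All four Block Public Access flags should be true
  aws s3api get-public-access-block --bucket "$bucket" \
    --query 'PublicAccessBlockConfiguration' --output json
)
```

For example, check_state_bucket dev prints the three settings for the dev state bucket, which can then be compared against the table by eye.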

Audit Log Bucket

See logging-monitoring.md §6 for full tamper-protection controls (delete protection, versioning, public access block). Backup-relevant properties:

| Property | Value |
|---|---|
| Bucket name | nn-aiconnectors-audit-{env} |
| Versioning | Enabled; overwrites create new versions |
| Lifecycle | Transition to Glacier after 90 days; expire after 365 days |

DynamoDB PITR

| Property | Value |
|---|---|
| Table pattern | mcp-oauth-storage-{appname}-{env} |
| PITR | Enabled (continuous backup, 35-day recovery window) |
| Encryption | AWS-managed KMS key (default DynamoDB encryption) |
| Access | ECS task role only (per-MCP, no cross-MCP access) |
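
Whether PITR is actually enabled on a given table can be confirmed with describe-continuous-backups. The sketch below assumes the table naming pattern above. The helper names pitr_status and require_pitr are illustrative only.

```shell
# Hypothetical helper: prints the PITR status (ENABLED or DISABLED)
# for one MCP's token cache table.
pitr_status() (
  appname="$1"
  env="$2"
  aws dynamodb describe-continuous-backups \
    --table-name "mcp-oauth-storage-${appname}-${env}" \
    --query 'ContinuousBackupsDescription.PointInTimeRecoveryDescription.PointInTimeRecoveryStatus' \
    --output text
)

# Returns non-zero unless PITR is enabled, so it can gate a drill or CI step.
require_pitr() {
  [ "$(pitr_status "$1" "$2")" = "ENABLED" ]
}
```

A call such as require_pitr sharepoint dev || echo "PITR is off" could run at the start of the annual DynamoDB recovery test.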

Cross-Region Replication

Cross-region replication is not currently configured. All resources reside in eu-central-1. This is acceptable given:

  • The platform is stateless (rebuild from source is faster than cross-region failover)
  • Audit logs are the only truly irreplaceable data, and S3 is designed for 99.999999999% (11 nines) durability, storing objects redundantly across multiple Availability Zones within the region
  • If cross-region DR becomes a requirement, S3 Cross-Region Replication can be added to the audit bucket

4. IT Recovery Plan

4.1 Full ECS Service Redeploy

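Scenario: ECS service deleted or unhealthy, but infrastructure (ALB, cluster, ECR) remains intact. Note that aws ecs update-service only works on an existing service; if the service itself was deleted, recreate it first (through the Terraform modules or the deploy workflow, depending on where the service is defined).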

  1. Verify the latest Docker image exists in ECR: aws ecr describe-images --repository-name mcp-{name} --region eu-central-1
  2. Render the task definition: cd connectors/mcps/{name}/ecs && python render.py {env}
  3. Register the new task definition: aws ecs register-task-definition --cli-input-json file://task-definition.json
  4. Update the ECS service to use the new task definition: aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --task-definition mcp-{name} --force-new-deployment
  5. Wait for the deployment to stabilise (running count equals desired count): aws ecs wait services-stable --cluster aiconnectors-{env}-aiconnectors --services mcp-{name}-main-svc
  6. Verify health check: curl -s https://{name}.{domain}/health

Alternatively, re-run the GitHub Actions deploy workflow (deploy-{name}-mcp.yml), which performs all of the above automatically.

Time estimate: 5–15 minutes (automated), 15–30 minutes (manual).
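
The final health check in the manual procedure may need a few retries while targets register with the ALB. The following is a minimal polling sketch, not the project's own tooling. The function name wait_healthy and its default retry values are assumptions chosen for illustration.

```shell
# Hypothetical helper: polls a health endpoint until it answers successfully
# or the attempt budget is exhausted.
wait_healthy() (
  url="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-10}"      # assumed default: 10 seconds between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (status 400 or above)
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after ${i} attempt(s)"
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after ${attempts} attempts" >&2
  exit 1
)
```

It would typically run after the services-stable wait, for example wait_healthy "https://{name}.{domain}/health".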

4.2 SSM Secret Recovery

Scenario: Azure AD client secret deleted from SSM or compromised.

See secrets-management.md §3 for the full rotation and emergency rotation procedures. Summary:

  1. Run terragrunt apply in infra/initial/mcps/{name}/ — Terraform generates a new client secret and writes it to SSM.
  2. Force a new ECS deployment: aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --force-new-deployment
  3. Verify health check passes and a test tool call succeeds.

Time estimate: 15–30 minutes.

4.3 DynamoDB Restore

Scenario: Token cache table accidentally deleted or corrupted.

Option A — Recreate empty table (preferred, faster):

  1. Run TG_ENV={env} terragrunt apply in infra/main/mcps/{name}/ — Terraform will recreate the table with the correct schema and PITR enabled.
  2. Force a new ECS deployment: aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --force-new-deployment
  3. Users will re-authenticate on their next request. No data loss (tokens are ephemeral).

Option B — Restore from PITR (if specific data must be recovered):

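  1. In the AWS Console or CLI, initiate a restore: aws dynamodb restore-table-to-point-in-time --source-table-name mcp-oauth-storage-{appname}-{env} --target-table-name mcp-oauth-storage-{appname}-{env}-restored --restore-date-time {ISO-8601} (this requires the source table to still exist; if the table was deleted while PITR was enabled, DynamoDB keeps a system backup for 35 days, which can be restored with aws dynamodb restore-table-from-backup instead)
  2. Verify the restored table contents.
  3. Re-point the application to the restored table by updating STORAGE_DYNAMODB_TABLE_NAME in the ECS environment (DynamoDB tables cannot be renamed). Check that the ECS task role is allowed to access the new table name, and re-enable TTL and PITR on the restored table, because a restore does not carry those settings over.
  4. Force a new ECS deployment.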

Time estimate: Option A: 5–10 minutes. Option B: 15–30 minutes.
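
A table restored from PITR does not inherit the TTL or PITR settings of its source, so Option B usually needs a follow-up step before the application is re-pointed. The sketch below shows one way to do that. The function name finish_pitr_restore is illustrative, and the TTL attribute name ttl is an assumption, since the real attribute name is defined in the Terraform module and is not stated here.

```shell
# Hypothetical helper for Option B: waits for the restored table, then
# re-applies settings that a PITR restore does not copy from the source.
finish_pitr_restore() (
  table="$1"            # e.g. mcp-oauth-storage-{appname}-{env}-restored
  ttl_attr="${2:-ttl}"  # ASSUMPTION: the real TTL attribute name may differ

  # Block until the restored table reports ACTIVE
  aws dynamodb wait table-exists --table-name "$table" || exit 1

  # Re-enable point-in-time recovery on the restored table
  aws dynamodb update-continuous-backups --table-name "$table" \
    --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true || exit 1

  # Re-enable TTL so that expired tokens continue to be purged
  aws dynamodb update-time-to-live --table-name "$table" \
    --time-to-live-specification "Enabled=true,AttributeName=${ttl_attr}" || exit 1
)
```

Because Option A recreates the table through Terraform with the correct settings, this extra step applies only to the PITR path.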

4.4 Full Environment Rebuild

Scenario: Catastrophic loss of an entire AWS environment (all ECS services, ALB, VPC, etc.). Terraform state is intact in S3.

Recovery is split into two phases. CI/CD is the primary path for both; manual steps are fallback only if GitHub Actions itself is unavailable.

Phase 1 — Rebuild infrastructure

Primary (CI/CD): Push to main (or re-run the latest run) to trigger infra-apply.yml. This applies main/shared/ and all per-MCP modules in sequence, with plan preview and a full audit trail in GitHub Actions.

Fallback (local): If GitHub Actions is unavailable:

# Run from the repository root; each subshell keeps the relative paths valid
(cd infra/main/shared && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/sharepoint && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/outlook-mcp && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/teams && TG_ENV={env} terragrunt apply)
(cd infra/main/mcps/databricks && TG_ENV={env} terragrunt apply)

Note: ACM certificate DNS validation is automatic but takes 5–30 minutes. Wait for the certificate to reach ISSUED status before proceeding.
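
Rather than checking the console repeatedly, the certificate status can be polled from the CLI. This is a minimal sketch that assumes the certificate ARN is known, for example from the Terraform output. The helper names cert_status and wait_for_cert are illustrative.

```shell
# Hypothetical helper: prints the ACM certificate status, for example
# PENDING_VALIDATION or ISSUED.
cert_status() {
  aws acm describe-certificate --certificate-arn "$1" \
    --region eu-central-1 \
    --query 'Certificate.Status' --output text
}

# Blocks until validation succeeds; the built-in waiter returns non-zero
# if validation has not completed when its own timeout is reached.
wait_for_cert() {
  aws acm wait certificate-validated --certificate-arn "$1" \
    --region eu-central-1
}
```

Running wait_for_cert followed by cert_status on the same ARN gives a simple gate before Phase 2 starts.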

Phase 2 — Rebuild images and deploy services

Primary (CI/CD): Trigger the deploy workflow for each MCP (deploy-{name}-mcp.yml). Each workflow builds the Docker image from source, pushes to ECR, and deploys the ECS service in a single run. This is always preferable to manual steps — it uses OIDC auth, applies the correct task definition, and leaves a full audit trail.

Fallback (local): If GitHub Actions is unavailable:

cd connectors/mcps/{name}
docker build -t mcp-{name}:latest .
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin {account}.dkr.ecr.eu-central-1.amazonaws.com
docker tag mcp-{name}:latest {account}.dkr.ecr.eu-central-1.amazonaws.com/mcp-{name}:latest
docker push {account}.dkr.ecr.eu-central-1.amazonaws.com/mcp-{name}:latest
python ecs/render.py {env}
aws ecs register-task-definition --cli-input-json file://ecs/task-definition.json
aws ecs update-service --cluster aiconnectors-{env}-aiconnectors --service mcp-{name}-main-svc --task-definition mcp-{name} --force-new-deployment

Verification

  1. Confirm DNS: {name}.{env-prefix}connectors.novo-genai.com resolves to the new ALB.
  2. Smoke test each MCP: curl -s https://{name}.{domain}/health
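  3. Make a test tool call and confirm audit records appear in nn-aiconnectors-audit-{env} shortly afterwards. Firehose buffers for up to 60 seconds (or 5 MB) before delivering, so allow slightly more than a minute.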

Time estimate: 1–2 hours (assuming Terraform state is intact).
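
The smoke test in step 2 can be run across all MCPs in one pass. The following is a sketch only: the function name smoke_test is illustrative, and the hostnames passed to it are assumed to match the {name} part of each service URL, which may differ from the infra directory names.

```shell
# Hypothetical helper: requests /health on each MCP and reports pass or fail.
# Usage: smoke_test <domain> <name> [<name> ...]
smoke_test() (
  domain="$1"
  shift
  failed=0
  for name in "$@"; do
    # -f treats HTTP errors as failures; --max-time bounds each request
    if curl -fsS -o /dev/null --max-time 10 "https://${name}.${domain}/health"; then
      echo "OK   ${name}"
    else
      echo "FAIL ${name}"
      failed=1
    fi
  done
  # Non-zero exit status if any endpoint failed
  exit "$failed"
)
```

For example, smoke_test {domain} sharepoint teams databricks prints one line per service and exits non-zero if any endpoint is down.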

4.5 Terraform State Loss

Scenario: State bucket accidentally deleted or corrupted, S3 versioning insufficient.

  1. Create a new state bucket with the expected name: aiconnectors-terraform-state-{env}
  2. Enable versioning, encryption, and S3 Block Public Access to match the properties listed in section 3.
  3. For each Terragrunt module, run terragrunt import to re-import existing AWS resources into fresh state.
  4. Alternatively, if resources are fully destroyed, follow section 4.4 (full rebuild) — Terraform will create everything from scratch.

Time estimate: 4–8 hours (manual import of all resources).
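
Before committing to a manual import, it may be worth confirming that no earlier version of the state object can be recovered, since a version restore is much faster. The sketch below assumes the bucket still exists with versioning enabled. The helper names are illustrative, and the state object key is an assumption, because it depends on the Terragrunt backend configuration.

```shell
# Hypothetical helper: lists stored versions of a state object.
# S3 returns versions of the same key newest first.
# Usage: list_state_versions <env> <state-key-prefix>
list_state_versions() {
  aws s3api list-object-versions \
    --bucket "aiconnectors-terraform-state-$1" --prefix "$2" \
    --query 'Versions[].[VersionId,LastModified,IsLatest]' --output text
}

# Restores an earlier version by copying it over the current object, which
# creates a new latest version and keeps the existing history.
# Usage: restore_state_version <env> <state-key> <version-id>
restore_state_version() (
  bucket="aiconnectors-terraform-state-$1"
  aws s3api copy-object --bucket "$bucket" --key "$2" \
    --copy-source "${bucket}/$2?versionId=$3"
)
```

After a version restore, Terraform may report a checksum mismatch, because the S3 backend keeps a digest of the state in the DynamoDB lock table. That digest item may need to be updated or removed before the next plan succeeds.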


5. Recovery Testing

Recovery testing validates that documented procedures work in practice. Tests should be performed in the dev environment to avoid production impact.

| Test type | Frequency | Last tested | Outcome | Evidence location |
|---|---|---|---|---|
| ECS redeploy drill | Quarterly | | | GitHub Actions workflow run logs |
| DynamoDB recovery test (table recreate) | Annually | | | Terraform apply logs + ECS deploy logs |
| Full environment rebuild (dev) | Annually | | | Terraform plan/apply logs + smoke test output |
| Secret rotation drill | Annually | | | Terraform apply logs + ECS deploy logs |

Test Procedures

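ECS Redeploy Drill (Quarterly):

  1. Scale one MCP service in dev down to zero tasks: aws ecs update-service --cluster aiconnectors-dev-aiconnectors --service mcp-{name}-main-svc --desired-count 0
  2. Wait for the service to reach 0 running tasks.
  3. Restore the service: aws ecs update-service --cluster aiconnectors-dev-aiconnectors --service mcp-{name}-main-svc --desired-count 1
  4. Verify that the health check passes within 5 minutes.
  5. Record the time-to-recovery.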

DynamoDB Recovery Test (Annually):

  1. Delete the dev token cache table for one MCP.
  2. Run TG_ENV=dev terragrunt apply in infra/main/mcps/{name}/ so that Terraform recreates the table with the correct schema and PITR enabled.
  3. Force a new ECS deployment: aws ecs update-service --cluster aiconnectors-dev-aiconnectors --service mcp-{name}-main-svc --force-new-deployment
  4. Verify that the health check passes and a test tool call succeeds (users will re-authenticate on their next request, which is expected behaviour).
  5. Record the time-to-recovery.

Full Environment Rebuild (Annually):

  1. In a separate Terraform workspace or using a temporary TG_ENV=dr-test, run a full terragrunt apply for main/shared/ and one MCP.
  2. Push a Docker image and verify that the service comes up healthy.
  3. Destroy the test environment after validation.

Secret Rotation Drill (Annually):

  1. Run terragrunt apply in infra/initial/mcps/{name}/ for a dev MCP so that Terraform generates a new client secret and updates SSM.
  2. Force a new ECS deployment to pick up the new secret.
  3. Verify that tool calls succeed with the new secret.
  4. Confirm that the old secret has expired, or remove it from Azure AD.
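
Several drills end with recording the time-to-recovery. One way to capture it consistently is to wrap the recovery command in a timer that also compares the result with the RTO target from section 1. This is a sketch; the function name time_recovery is illustrative and is not part of the repository.

```shell
# Hypothetical helper: runs a recovery command, prints the elapsed seconds,
# and returns non-zero if the run exceeded the RTO target given in minutes.
# Usage: time_recovery <rto_minutes> <command> [args...]
time_recovery() (
  rto_minutes="$1"
  shift
  start="$(date +%s)"
  # Run the recovery command exactly as given
  "$@" || { echo "recovery command failed" >&2; exit 2; }
  elapsed=$(( $(date +%s) - start ))
  echo "time-to-recovery: ${elapsed}s (target: ${rto_minutes}m)"
  [ "$elapsed" -le $(( rto_minutes * 60 )) ]
)
```

For the ECS drill this could look like time_recovery 15 aws ecs wait services-stable --cluster aiconnectors-dev-aiconnectors --services mcp-{name}-main-svc, where 15 is the ECS service RTO in minutes.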

Evidence Retention

All recovery test evidence (workflow logs, Terraform output, health check screenshots) must be retained for a minimum of 3 years in accordance with Novo Nordisk quality records policy. Evidence is stored in this repository under docs/compliance/dr-test-evidence/, committed after each test run.
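
Capturing evidence can be made routine by piping drill output straight into the evidence directory. The sketch below assumes a date-stamped file naming scheme, which is an illustration and not an established convention of this repository. The function name save_evidence is likewise hypothetical.

```shell
# Hypothetical helper: saves drill output read from stdin under the evidence
# directory, using a date-stamped file name (naming scheme is an assumption).
# Usage: some_command 2>&1 | save_evidence <test-name> [<evidence-dir>]
save_evidence() (
  test_name="$1"
  dir="${2:-docs/compliance/dr-test-evidence}"
  file="${dir}/$(date -u +%Y-%m-%d)-${test_name}.log"
  mkdir -p "$dir" || exit 1
  # Write everything received on stdin to the evidence file
  cat > "$file" || exit 1
  # Print the path so the caller can add it to a commit
  echo "$file"
)
```

The printed path can then be passed to git add and committed after the test run.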