Design: Databricks Access#
Overview#
Purpose: Expose Databricks workspaces to AI assistants via the MCP protocol, enabling read-only SQL queries, Genie Space natural language queries, and Unity Catalog browsing — all scoped to the authenticated user's existing Databricks permissions.
Scope:
- Eleven MCP tools across two categories: discovery tools (list_genie_spaces, list_sql_warehouses, list_catalogs, list_schemas, list_tables, get_table, search_tables) and proxy tools (list_genie_tools, proxy_to_genie, list_sql_tools, proxy_to_sql)
- Read-only access to Databricks via the Databricks REST API and managed MCP endpoints (Genie Space MCP, SQL MCP)
- OAuth2 On-Behalf-Of (OBO) authentication — see DESIGN:OAuthAuthentication
- SCI pre-filtering: discovery tool responses exclude assets tagged as Strictly Confidential Information (SCI) in Unity Catalog before returning results to the AI assistant
- Multi-workspace support via a static JSON workspace registry configured at deploy time
Out of Scope:
- SQL write operations (INSERT, UPDATE, DELETE, CREATE, DROP, ALTER) — blocked at application layer
- Databricks Vector Search MCP and Unity Catalog Functions MCP (not in scope for v1)
- Dynamic workspace discovery via Azure ARM or Databricks Accounts API
- Per-tool argument schema validation against Databricks tool inputSchemas (open gap — see issue #125)
Architecture#
┌────────────────────┐     ┌─────────────────────────────┐     ┌─────────────────────────────┐
│ AI Assistant       │────▶│ Databricks MCP Connector    │────▶│ Databricks REST API         │
│ (Claude / MCP      │     │ Server (FastMCP)            │     │ /api/2.x/...                │
│  client)           │◀────│                             │◀────│ Unity Catalog, Genie, SQL   │
└────────────────────┘     │                             │     └─────────────────────────────┘
                           │ OBO Exchange                │     ┌─────────────────────────────┐
                           │ Azure AD token A            │────▶│ Databricks Managed MCP      │
                           │ → Databricks token B        │     │ /api/2.0/mcp/genie/{id}     │
                           │                             │◀────│ /api/2.0/mcp/sql            │
                           └─────────────────────────────┘     └─────────────────────────────┘
                                          │
                                          ▼
                           ┌─────────────────────────────┐
                           │ Azure AD (Entra ID)         │
                           │ OBO token exchange          │
                           │ → Databricks-scoped token   │
                           └─────────────────────────────┘
Components:
- Databricks MCP Connector — FastMCP server running on AWS ECS Fargate; the thin proxy layer where OBO token exchange, SCI pre-filtering, and read-only enforcement are applied before results reach the AI assistant
- OBOAuthenticator — exchanges the user's Azure AD token for a Databricks-scoped token (2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default) via the standard OBO flow; ensures every downstream call carries the user's own Databricks identity
- DatabricksClient — per-user httpx wrapper; holds the OBO-exchanged token and handles 401 retry-on-expiry
- mcp_proxy — uses FastMCP Client + StreamableHttpTransport to forward MCP JSON-RPC calls to the Databricks managed MCP endpoints with the Databricks token injected as a header
- TTLCache — in-process per-user client cache (keyed by Azure AD oid; 3000s TTL) to avoid repeated OBO exchanges across tool calls
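A minimal sketch of this per-user cache, using the `_client()` / `_client_cache` names from the token lifecycle below; the cache size and the `build_databricks_client()` helper are illustrative assumptions, not the actual implementation.

```python
# Sketch only: per-user client cache keyed by the Azure AD oid claim.
# maxsize and build_databricks_client() are illustrative assumptions.
import jwt  # pyjwt
from cachetools import TTLCache

_client_cache: TTLCache = TTLCache(maxsize=1024, ttl=3000)  # ~50 min, within the ~60 min token lifetime


def _oid(user_token: str) -> str:
    # Signature is not re-validated here; FastMCP's AzureProvider already verified it on ingress.
    claims = jwt.decode(user_token, options={"verify_signature": False})
    return claims["oid"]


async def _client(user_token: str) -> "DatabricksClient":
    oid = _oid(user_token)
    if oid in _client_cache:
        return _client_cache[oid]
    # Cache miss: workspace probe + OBO exchange happen here (see Authentication below).
    client = await build_databricks_client(user_token)
    _client_cache[oid] = client
    return client
```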
Tech Stack#
Runtime: Python 3.13
Framework: FastMCP ≥ 3.0.0
HTTP client: httpx ≥ 0.27.0
Auth: Azure AD OBO via msal (wrapped in OBOAuthenticator); FastMCP AzureProvider for incoming token validation
Token cache: py-key-value-aio[dynamodb] — DynamoDB-backed OAuth token storage in production
Client cache: cachetools.TTLCache — in-process per-user Databricks client cache
JWT decode: pyjwt — extracts oid claim for cache keying (signature not re-validated here; AzureProvider validates on ingress)
Configuration: pydantic-settings — all config from environment variables
Infrastructure: AWS ECS Fargate, ALB (TLS termination), ECR, DynamoDB (token store), SSM Parameter Store (secrets)
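For illustration, a pydantic-settings sketch of the configuration surface; the field and environment variable names below are assumptions rather than the deployed variable names.

```python
# Illustrative settings model; variable names are assumptions.
from pydantic import Field
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Azure AD app registration (incoming token validation + OBO exchange)
    azure_tenant_id: str
    azure_client_id: str
    azure_client_secret: str
    # Static workspace registry: JSON list of workspace base URLs set at deploy time, e.g.
    # DATABRICKS_WORKSPACES='["https://adb-111.1.azuredatabricks.net","https://adb-222.2.azuredatabricks.net"]'
    databricks_workspaces: list[str] = Field(default_factory=list)
    # Per-user client cache TTL in seconds (~50 min)
    client_cache_ttl: int = 3000


settings = Settings()  # pydantic-settings reads every field from environment variables
```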
Tool Surface#
Discovery Tools (implemented directly)#
| Tool | Databricks API | Returns |
|---|---|---|
| list_genie_spaces | GET /api/2.0/genie/spaces | space_id, title, warehouse_id, description |
| list_sql_warehouses | GET /api/2.0/sql/warehouses | id, name, state |
| list_catalogs | GET /api/2.1/unity-catalog/catalogs | name, full_name |
| list_schemas | GET /api/2.1/unity-catalog/schemas?catalog_name={c} | name, full_name |
| list_tables | GET /api/2.1/unity-catalog/tables?catalog_name={c}&schema_name={s} | name, full_name, table_type |
| get_table | GET /api/2.1/unity-catalog/tables/{full_name} | name, full_name, table_type, columns |
| search_tables | GET /api/2.1/unity-catalog/tables?catalog_name=...&name_pattern=... | name, full_name, table_type |
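To show the discovery-tool pattern end to end, here is a hedged sketch of list_catalogs. It assumes the `_client()` helper sketched under Components, an `exclude_sci()` filter (sketched under Security), and FastMCP's `get_access_token` dependency for reading the validated caller token; the actual signatures and import paths in the codebase may differ.

```python
# Sketch of a discovery tool; helper names and import paths are assumptions.
from fastmcp import FastMCP
from fastmcp.server.dependencies import get_access_token

mcp = FastMCP("databricks")


@mcp.tool
async def list_catalogs() -> list[dict]:
    """List Unity Catalog catalogs visible to the calling user (SCI-tagged assets excluded)."""
    token = get_access_token()           # Azure AD token already validated by AzureProvider
    client = await _client(token.token)  # per-user DatabricksClient (OBO-exchanged token inside)
    resp = await client.get("/api/2.1/unity-catalog/catalogs")
    catalogs = resp.json().get("catalogs", [])
    return [
        {"name": c["name"], "full_name": c.get("full_name", c["name"])}
        for c in exclude_sci(catalogs)   # SCI pre-filtering before results reach the assistant
    ]
```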
Proxy Tools (forwarded to Databricks managed MCPs)#
| Tool | Proxies to | Access |
|---|---|---|
| list_genie_tools | <workspace>/api/2.0/mcp/genie/{genie_space_id} | Read |
| proxy_to_genie | <workspace>/api/2.0/mcp/genie/{genie_space_id} | Read |
| list_sql_tools | <workspace>/api/2.0/mcp/sql | Read (write tools filtered out) |
| proxy_to_sql | <workspace>/api/2.0/mcp/sql | Read-only (execute_sql blocked) |
Read-only enforcement: execute_sql is listed in _BLOCKED_SQL_TOOLS; list_sql_tools filters it from the advertised tool list and proxy_to_sql raises ValueError if it is submitted directly.
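A sketch of how the blocklist and the forwarding via FastMCP Client + StreamableHttpTransport fit together; apart from `_BLOCKED_SQL_TOOLS` and `execute_sql`, the parameter names are illustrative.

```python
# Sketch of the read-only guard plus managed-MCP forwarding; parameter names are illustrative.
from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

_BLOCKED_SQL_TOOLS = {"execute_sql"}


async def proxy_to_sql(tool_name: str, tool_args: dict, *, workspace_url: str, databricks_token: str):
    if tool_name in _BLOCKED_SQL_TOOLS:
        raise ValueError(f"Tool '{tool_name}' is write-capable and blocked by this connector")
    transport = StreamableHttpTransport(
        url=f"{workspace_url}/api/2.0/mcp/sql",
        headers={"Authorization": f"Bearer {databricks_token}"},  # OBO-exchanged user token
    )
    async with Client(transport) as mcp_client:
        return await mcp_client.call_tool(tool_name, tool_args)
```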
Expected AI assistant call chain#
- Genie path: list_genie_spaces → list_genie_tools(genie_space_id) → proxy_to_genie(genie_space_id, tool_name, tool_args)
- SQL path: list_sql_tools → proxy_to_sql(tool_name, tool_args)
- Catalog path: list_catalogs → list_schemas(catalog_name) → list_tables(catalog_name, schema_name) → get_table(full_name)
Authentication#
Authentication follows the shared OBO pattern documented in DESIGN:OAuthAuthentication. The Databricks-specific additions are:
- OBO target scope: 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default (Azure Databricks fixed resource ID, workspace-agnostic)
- Azure AD app registration requirement: delegated AzureDatabricks / user_impersonation permission; all Azure Databricks workspaces in the tenant accept this token
- Workspace auto-detection: on first tool call, the server probes each configured workspace (GET /api/2.0/preview/scim/v2/Me) and caches the first accessible one per user
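A sketch of the auto-detection probe, assuming the OBO-exchanged Databricks token is already in hand; the helper name and error handling are illustrative.

```python
# Sketch of workspace auto-detection; helper name and error handling are illustrative.
import httpx


async def detect_workspace(workspace_urls: list[str], databricks_token: str) -> str:
    """Return the first configured workspace that accepts the user's Databricks token."""
    headers = {"Authorization": f"Bearer {databricks_token}"}
    async with httpx.AsyncClient(timeout=10.0) as http:
        for url in workspace_urls:
            resp = await http.get(f"{url}/api/2.0/preview/scim/v2/Me", headers=headers)
            if resp.status_code == 200:
                return url
    raise RuntimeError("No configured Databricks workspace is accessible for this user")
```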
Token lifecycle:
1. User authenticates with the MCP server via Azure AD OAuth2 (FastMCP AzureProvider)
2. On first tool call, _client() extracts oid from JWT, checks _client_cache
3. Cache miss: probes configured workspaces; exchanges user token for Databricks token via OBO
4. DatabricksClient caches the Databricks token in-instance; re-exchanges on 401
5. _client_cache retains the client for 3000s (TTL ~50 min, within the ~60 min Azure AD token lifetime)
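The exchange in step 3 is the standard msal on-behalf-of call; a minimal sketch follows, with class and method names illustrative beyond OBOAuthenticator and the fixed Databricks scope.

```python
# Sketch of the OBO exchange; only the scope value and the msal call are load-bearing.
import msal

DATABRICKS_SCOPE = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"  # fixed Azure Databricks resource ID


class OBOAuthenticator:
    def __init__(self, tenant_id: str, client_id: str, client_secret: str):
        self._app = msal.ConfidentialClientApplication(
            client_id,
            authority=f"https://login.microsoftonline.com/{tenant_id}",
            client_credential=client_secret,
        )

    def exchange(self, user_azure_ad_token: str) -> str:
        """Exchange the caller's Azure AD token (token A) for a Databricks-scoped token (token B)."""
        result = self._app.acquire_token_on_behalf_of(
            user_assertion=user_azure_ad_token,
            scopes=[DATABRICKS_SCOPE],
        )
        if "access_token" not in result:
            raise RuntimeError(f"OBO exchange failed: {result.get('error')}")
        return result["access_token"]
```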
Security#
Authentication: Azure AD OAuth2 AzureProvider on all MCP tool endpoints; /health endpoint unprotected (returns {"status": "healthy"} only)
Authorization: Unity Catalog enforces per-user data permissions on all Databricks API calls; MCP connector cannot grant access beyond what Databricks permits
Read-only enforcement: Application-level blocklist on SQL write tools (defence-in-depth; Unity Catalog read-only grants are the primary control)
SCI pre-filtering: Discovery tools (list_catalogs, list_schemas, list_tables, get_table, search_tables) exclude assets tagged as Strictly Confidential Information in Unity Catalog metadata before returning responses to the AI assistant. This is defence-in-depth on top of Unity Catalog permission enforcement; correct SCI tagging of data assets remains DataCore's responsibility. Genie Space responses are not content-filtered — DataCore is responsible for ensuring Genie Spaces accessible via this MCP do not have access to SCI data sources (see RISK:DatabricksAccess RISK-006).
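For illustration, a minimal sketch of such a filter; how the SCI tag is actually surfaced on an asset (the tags field and the key/value pair below) is an assumption and depends on how DataCore applies Unity Catalog tags.

```python
# Sketch of the SCI pre-filter; the tag field, key, and value are assumptions.
SCI_TAG_KEY = "classification"            # assumed Unity Catalog tag key
SCI_TAG_VALUE = "strictly_confidential"   # assumed tag value marking SCI assets


def exclude_sci(assets: list[dict]) -> list[dict]:
    """Drop assets tagged as Strictly Confidential Information before they reach the AI assistant."""

    def is_sci(asset: dict) -> bool:
        tags = asset.get("tags") or []
        return any(
            t.get("key") == SCI_TAG_KEY and str(t.get("value", "")).lower() == SCI_TAG_VALUE
            for t in tags
        )

    return [a for a in assets if not is_sci(a)]
```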
Credential storage: Azure AD client ID and secret stored in AWS SSM Parameter Store (SecureString); injected as ECS secrets at runtime
Secrets in transit: All external calls use HTTPS (Azure AD OBO endpoint, Databricks REST API, Databricks managed MCP endpoints)
Audit logging: @audit_tool decorator applied to all tools; records tool name, user identity, timestamp, and outcome to Kinesis Firehose → S3
Open gaps (tracked in GitHub issues):
- Container runs as root — #122 (E6.2)
- No rate limiting on proxy tools — #123 (D3.3)
- OBO error body logged verbatim — #124 (E6.6)
- Proxy tool_args not schema-validated — #125 (D3.1)
- No bandit/pip-audit in CI — #126 (V8.1/V8.2)
Data Model#
No persistent data storage of Databricks query results or catalog metadata. The only persisted state is OAuth2 tokens:
| Store | What | Backend | TTL |
|---|---|---|---|
| DynamoDB (mcp-oauth-storage-mcp-databricks-{env}) | MCP server OAuth2 session tokens (user ↔ MCP server) | DynamoDB | Azure AD session lifetime |
| In-process TTLCache | Per-user DatabricksClient (holds OBO-exchanged Databricks token) | Memory | 3000s (~50 min) |
Deployment#
Environments:
| Env | URL | ECS Cluster | AWS Account |
|---|---|---|---|
| Dev | https://databricks.dev.connectors.novo-genai.com | aiconnectors-dev-aiconnectors | 094069622854 |
| Prod | https://databricks.connectors.novo-genai.com | aiconnectors-prod-aiconnectors | 673034950531 |
Infrastructure: Managed by Terragrunt at infra/main/mcps/databricks/terragrunt.hcl using the shared mcp_server module (ECS service, ECR repo, ALB listener rule, Route53 record, DynamoDB table).
CI/CD: deploy-mcp.yml → tmpl-deploy-mcp.yml:
1. lint — ruff + black (blocks build on failure)
2. build — Docker multi-stage build, push to dev ECR
3. apply-infra-dev — Terragrunt apply (dev)
4. deploy-dev — ECS task definition render + service update
5. test-dev — BDD tests against live dev endpoint
6. deploy-prod — promote image to prod ECR + update prod ECS (requires Production environment gate)
Monitoring#
Logs: AWS CloudWatch (/ecs/aiconnectors-{env}-aiconnectors/mcp-databricks-main-svc/mcp-databricks)
Audit trail: Kinesis Firehose stream (aiconnectors-audit-{env}) → S3
Health check: GET /health → {"status": "healthy"} (ECS target group health check, 30s interval)
Alarms: CloudWatch alarms on ECS CPU, memory, and ALB 5xx rate (shared alarm SNS topic)
Decisions#
Why a thin proxy layer?#
A thin proxy over the Databricks managed MCPs gives us flexibility to modify, filter, or augment whatever Databricks exposes — while still being able to selectively reuse capabilities as their managed MCP offering evolves. Critically, this layer is also where we perform OBO token exchange (so every downstream call carries the user's scoped Databricks identity) and SCI pre-filtering (so tagged assets are excluded before the AI assistant sees them). Neither of those would be possible if we delegated directly to the Databricks SDK or the databricks-mcp PyPI package, which assume Databricks CLI auth and have no place for application-level filtering.
Why static workspace registry instead of dynamic discovery?#
Dynamic workspace discovery via Azure ARM requires additional API permissions and adds latency. The workspace list is stable; static JSON config via ECS environment variable is simple and sufficient for the current scale. Revisit when workspace count exceeds ~10.
Version: 1.0
Date: 2026-04-28
Approved by: Pending Review
Related Artifacts:
- User Requirements: requirements/features/databricks-access.feature
- Risk Assessment: docs/risks/risk-assessment-databricks-access.md
- Intended Use: docs/requirements/01-intended-use.md
- OAuth Design: docs/design-oauth-authentication.md