Design: SharePoint Access

Overview

Purpose: Expose SharePoint and OneDrive file search and retrieval capabilities to AI assistants via the MCP protocol, enabling users to search for documents, browse sites and folders, and read document content using delegated read-only permissions.

Scope:
- Seven MCP tools: search_documents, search_pages, list_sites, list_folder, read_document, read_site_page, get_metadata
- Read-only access to SharePoint sites and OneDrive files via Microsoft Graph API
- Cross-site search using the Microsoft Search API with KQL (Keyword Query Language)
- Document parsing for DOCX, PDF, XLSX, PPTX to plain text (platform-independent, no server-side conversion)
- Sensitivity label gate: items above the configured maximum sensitivity are redacted but remain visible in results
- Site and folder whitelist: restricts access to approved corporate content only
- OAuth2 On-Behalf-Of (OBO) authentication (see DESIGN:OAuthAuthentication)

Out of Scope:
- Document creation, modification, or deletion
- File upload or version management
- Permission management or sharing operations
- SharePoint lists or list item operations (only document libraries)
- SharePoint workflows or approvals
- Personal OneDrive access (whitelist enforces corporate sites only)

Architecture

┌────────────────────┐     ┌──────────────────────┐     ┌──────────────────────┐
│  AI Assistant      │────▶│  SharePoint MCP      │────▶│  Microsoft Graph API │
│  (Claude / MCP     │     │  Server (FastMCP)    │     │  /v1.0/search/query  │
│   client)          │◀────│                      │◀────│  /v1.0/sites/...     │
└────────────────────┘     └──────────────────────┘     └──────────────────────┘
                                      │
                                      ▼
                            ┌──────────────────────┐
                            │  Filter Pipeline     │
                            │  (pipeline.py)       │
                            │  1. Whitelist        │
                            │  2. Metadata filters │
                            │  3. Sensitivity gate │
                            └──────────────────────┘
                                      │
                                      ▼
                            ┌──────────────────────┐
                            │  Document Parsers    │
                            │  (parsers.py)        │
                            │  DOCX, PDF, XLSX,    │
                            │  PPTX → plain text   │
                            └──────────────────────┘
                                      │
                                      ▼
                            ┌──────────────────────┐
                            │  OBOAuthenticator    │
                            │  (auth.py)           │
                            └──────────────────────┘
                                      │
                                      ▼
                            ┌──────────────────────┐
                            │  Token Storage       │
                            │  (DynamoDB / Memory) │
                            └──────────────────────┘

Components:
- FastMCP Server (server.py): ASGI HTTP server, MCP protocol handler, tool registration
- Search Tools (tools/search.py): search_documents, search_pages tool wrappers; KQL query construction
- Browse Tools (tools/browse.py): list_sites, list_folder tool wrappers; site/folder navigation
- Read Tools (tools/read.py): read_document, read_site_page, get_metadata tool wrappers; content retrieval
- Document Parsers (tools/parsers.py): Platform-independent text extraction (DOCX, PDF, XLSX, PPTX → plain text)
- Graph Search Client (graph/search.py): Microsoft Search API calls (POST /search/query)
- Graph Sites Client (graph/sites.py): Sites and folder browsing Graph client
- Graph Files Client (graph/files.py): Document download and metadata Graph client
- Graph Client (graph/client.py): Authenticated HTTP client (OBO token injection)
- Filter Pipeline (filters/pipeline.py): Orchestrates whitelist, metadata, and sensitivity filters
- Whitelist Filter (filters/whitelist.py): Site/folder allowlist enforcement
- Sensitivity Filter (filters/sensitivity.py): Sensitivity label gate; redacts content above max label
- Metadata Filter (filters/metadata.py): File type, date range, author filters
- Filter Rules (filters/rules.py): Loads filter configuration from environment/file
- OBOAuthenticator (auth.py): On-Behalf-Of token exchange; see DESIGN:OAuthAuthentication
- Microsoft Graph API: Upstream data source; enforces user-scoped authorization

Tech Stack

- Backend: Python 3.13, FastMCP ≥3.0.0, Starlette (ASGI)
- HTTP Client: httpx (async)
- OAuth: MSAL 1.28+, OBO flow via auth.py
- Token Storage: py-key-value-aio (DynamoDB in production, memory in dev)
- Document Parsing: python-docx, pypdf, openpyxl, python-pptx (all platform-independent, no native dependencies)
- Infrastructure: AWS ECS Fargate, Application Load Balancer, ECR
- Secret Management: AWS SSM Parameter Store (KMS encrypted)
- Deployment: Docker (multi-stage), GitHub Actions CI/CD

MCP Tool API

`search_documents`

Search files (DOCX, PDF, PPTX, XLSX, etc.) across SharePoint and OneDrive using KQL full-text search.

Parameters:
- query (str, required): KQL search query (e.g., "project plan" or "author:jane")
- file_types (list[str], optional): Only use when you specifically need a particular file format (e.g., ["docx", "pdf"]). Omit for general topic or content searches, because filtering by type excludes site pages, wikis, and other formats where the answer may actually live.
- modified_after (datetime, optional): Return only items modified after this ISO-8601 datetime
- modified_before (datetime, optional): Return only items modified before this ISO-8601 datetime
- author_email (str, optional): Return only items created by this email address
- page_size (int, default: 25): Number of results to return (1–500); use 50+ for broad queries

Returns: List of { id, name, webUrl, _drive_id, lastModifiedDateTime, createdBy, size, sensitivityLabel, _redacted (bool) }

Graph endpoint: POST /v1.0/search/query with entityTypes: ["driveItem"]
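As a minimal sketch of the request this endpoint expects, the helper below assembles a Microsoft Search API body for driveItem results. The function name `build_search_body` and its validation rules are illustrative assumptions, not taken from tools/search.py; the `requests`, `entityTypes`, `query.queryString`, `from`, `size`, and `fields` keys follow the public Search API request shape.

```python
def build_search_body(query, page_size=25, offset=0, fields=None):
    """Build a POST /v1.0/search/query body for driveItem results (illustrative)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if not 1 <= page_size <= 500:
        raise ValueError("page_size must be between 1 and 500")
    request = {
        "entityTypes": ["driveItem"],
        "query": {"queryString": query},  # KQL is passed through unchanged
        "from": offset,
        "size": page_size,
    }
    if fields:
        # Requesting explicit fields keeps the response small.
        request["fields"] = list(fields)
    return {"requests": [request]}


body = build_search_body('"project plan" author:jane', page_size=50)
```

The resulting dict can be sent as the JSON payload of the POST request; search_pages would presumably differ only in the entity type.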

Filter pipeline: Results pass through whitelist → metadata filters → sensitivity gate before being returned. Sensitivity-blocked items remain in results with _redacted: true and content replaced by a redaction notice (transparency: user knows document exists but is restricted).

`search_pages`

Search SharePoint site pages, news posts, and wiki content using KQL full-text search.

Parameters:
- query (str, required): KQL search query (e.g., "project plan" or "Amazon SES")
- page_size (int, default: 25): Number of results to return (1–500); use 50+ for broad queries

Returns: List of { id, webUrl, title, siteId, lastModifiedDateTime, createdBy, contentType }

Graph endpoint: POST /v1.0/search/query with entityTypes: ["sitePage"]

Note: Use siteId and title with read_site_page to fetch full page content.

`list_sites`

List SharePoint sites accessible to the current user, filtered by the configured site whitelist.

Parameters:
- search_query (str, default: "*"): Keyword to filter site names; use "*" to list all accessible sites

Returns: List of { id, displayName, name, webUrl }

Graph endpoint: GET /v1.0/sites?$search={query}

Note: Use the id value as site_id in list_folder and read_document.

`list_folder`

Browse direct children of a folder in a SharePoint site's default drive.

Parameters:
- site_id (str, required): SharePoint site Graph ID (from list_sites)
- folder_path (str, default: ""): Folder path relative to the drive root (e.g., "Shared Documents/Reports"); empty string = root

Returns: List of { id, name, webUrl, file (present on files), folder (present on folders), lastModifiedDateTime, size }

Graph endpoint:
- Root: GET /v1.0/sites/{site_id}/drive/root/children
- Subfolder: GET /v1.0/sites/{site_id}/drive/root:/{path}:/children
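The choice between the root and subfolder endpoints can be sketched as follows. `folder_children_url` is a hypothetical helper name; the percent-encoding of the path is an assumption about how folder names with spaces should be handled.

```python
from urllib.parse import quote

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def folder_children_url(site_id: str, folder_path: str = "") -> str:
    """Return the Graph URL listing the children of a folder (illustrative)."""
    path = folder_path.strip("/")
    if not path:
        # Empty path means the drive root.
        return f"{GRAPH_BASE}/sites/{site_id}/drive/root/children"
    # quote() keeps "/" as a separator and encodes spaces as %20.
    return f"{GRAPH_BASE}/sites/{site_id}/drive/root:/{quote(path)}:/children"
```

For example, `folder_children_url("site123", "Shared Documents/Reports")` yields a URL ending in `/drive/root:/Shared%20Documents/Reports:/children`.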

`read_document`

Download and parse a document to plain text. Supports DOCX, PDF, XLSX, PPTX, and common plain-text formats.

Parameters:
- drive_id (str, required): Drive ID (from search_documents _drive_id or list_folder parentReference.driveId)
- item_id (str, required): Drive item ID (from search_documents id or list_folder id)

Returns: { metadata: {...}, content: "...", redacted: bool }

Graph endpoint: GET /v1.0/drives/{drive_id}/items/{item_id}/content (raw file bytes)

Parsing: Raw bytes are passed to parsers.parse_document() which dispatches to the correct parser based on file extension. Parsing is synchronous (CPU-bound) and wrapped with asyncio.to_thread().
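A minimal sketch of extension-based dispatch wrapped in asyncio.to_thread() is shown below. The signature of `parse_document`, the `PARSERS` registry, and the plain-text stub are assumptions made for illustration; the real parsers would wrap python-docx, pypdf, openpyxl, and python-pptx.

```python
import asyncio
from pathlib import PurePosixPath


def _parse_plain_text(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


# Hypothetical registry mapping a lowercase extension to a parser callable.
PARSERS = {".txt": _parse_plain_text, ".md": _parse_plain_text}


def parse_document(filename: str, data: bytes) -> str:
    """Dispatch on file extension; never raise on bad input."""
    ext = PurePosixPath(filename).suffix.lower()
    parser = PARSERS.get(ext)
    if parser is None:
        return "[Unsupported file type]"
    try:
        return parser(data)
    except Exception:
        return "[Parsing failed]"


async def parse_in_thread(filename: str, data: bytes) -> str:
    # Run the CPU-bound parser in a worker thread so the event loop stays free.
    return await asyncio.to_thread(parse_document, filename, data)
```

Keeping the synchronous function separate from the async wrapper lets the parsers be unit-tested without an event loop.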

Sensitivity gate: Content is withheld (replaced with REDACTED_CONTENT) if the item is above the configured maximum sensitivity label. The item metadata is still returned so agents can see the document exists.

Site page handling: If the item is a SharePoint site page (.aspx) or a site page asset folder, the response will instruct the caller to use read_site_page instead, with site_id and page_name pre-filled from the item's metadata.

`read_site_page`

Fetch and parse a SharePoint modern site page to plain text.

Parameters:
- site_id (str, required): SharePoint site ID (from list_sites id or drive item parentReference.siteId)
- page_name (str, required): Page name or title to look up (e.g., "Power-BI-Best-practices" or "Power BI Best Practices"); case-insensitive

Returns: { metadata: {...}, content: "...", available_pages: [...] (only present when no match found) }

Graph endpoint: GET /v1.0/sites/{site_id}/pages?$expand=canvasLayout

Parsing: The canvasLayout is parsed by parsers.extract_site_page_text(), which strips HTML tags and unescapes entities from each text web part. Title and description are prepended.
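The tag stripping and entity unescaping described above might look like the sketch below. It assumes the HTML of each text web part has already been collected into a list; `html_to_text` and `page_to_text` are hypothetical names, not the actual internals of parsers.extract_site_page_text().

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(inner_html: str) -> str:
    """Strip tags, then unescape entities, then collapse whitespace."""
    without_tags = _TAG_RE.sub(" ", inner_html)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


def page_to_text(title: str, description: str, html_parts: list[str]) -> str:
    """Prepend title and description to the text of each web part."""
    lines = [title.strip(), description.strip()]
    lines += [html_to_text(part) for part in html_parts]
    return "\n".join(line for line in lines if line)
```

Stripping tags before unescaping matters: text such as `&lt;DAX&gt;` is preserved as literal `<DAX>` rather than being mistaken for a tag and removed.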

Note: Use this tool for SharePoint site pages (.aspx) — read_document cannot read these as they have no file content stream. If read_document returns a redirect notice for a site page item, call this tool with the site_id and page_name from that notice.

`get_metadata`

Fetch full metadata for a drive item, including sensitivity label.

Parameters:
- drive_id (str, required): Drive ID
- item_id (str, required): Drive item ID

Returns: Drive item metadata dict including { name, webUrl, size, lastModifiedDateTime, createdBy, sensitivityLabel }

Graph endpoint: GET /v1.0/drives/{drive_id}/items/{item_id}?$select=...

Filter Pipeline

The filter pipeline (filters/pipeline.py) applies three filters in sequence to all search results before returning them to the MCP client.

1. Whitelist Filter

Purpose: Restricts access to approved corporate content only; prevents access to personal OneDrive or unapproved sites.

Mechanism: Each drive item's webUrl is checked against a list of allowed prefixes (site URLs and folder paths). Items whose webUrl does not start with at least one allowed prefix are dropped from results.

Configuration: FILTER_RULES_JSON environment variable or filters.json file. Example:

{
  "whitelist": {
    "sites": [
      "https://novonordisk.sharepoint.com/sites/ApprovedSite1",
      "https://novonordisk.sharepoint.com/sites/ApprovedSite2"
    ],
    "folder_prefixes": [
      "https://novonordisk.sharepoint.com/sites/SomeOtherSite/Shared Documents/Public"
    ]
  }
}

Default behavior: If no whitelist is configured, all items are allowed through (permissive default).
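A minimal sketch of the prefix check follows, assuming the `sites` and `folder_prefixes` lists are merged into one list of allowed prefixes. The function name, the case-insensitive comparison, the percent-decoding, and the path-boundary check are all assumptions added for illustration; the source only states that webUrl must start with an allowed prefix.

```python
from urllib.parse import unquote


def is_whitelisted(web_url: str, allowed_prefixes: list[str]) -> bool:
    """Return True if web_url falls under at least one allowed prefix."""
    if not allowed_prefixes:
        return True  # permissive default when no whitelist is configured
    url = unquote(web_url).rstrip("/").lower()
    for prefix in allowed_prefixes:
        base = unquote(prefix).rstrip("/").lower()
        # Require an exact match or a "/" boundary so that
        # ".../sites/ApprovedSite1" does not also admit ".../sites/ApprovedSite10".
        if url == base or url.startswith(base + "/"):
            return True
    return False
```

The boundary check is a design choice worth considering: a plain `startswith` on the configured prefix would let sibling sites with a longer name slip through.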

Rationale: Scopes access to approved corporate content only. Without this filter, users could access any SharePoint site or OneDrive file they have permissions for, including personal OneDrive files. The whitelist enforces organizational policy at the MCP server layer (defense in depth).

2. Metadata Filter

Purpose: Filters search results by file type, modification date range, and author email.

Mechanism:
- file_types: Checks file extension against allowed list; drops non-matching items
- modified_after / modified_before: Checks lastModifiedDateTime; drops items outside range
- author_email: Checks createdBy.user.email; drops items not authored by specified email

Configuration: Passed as parameters to the search_documents tool; no static config needed.

Note: Metadata filters are applied after the whitelist but before the sensitivity gate, so later stages never spend cycles on items that will be dropped anyway.
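The three metadata checks can be sketched as a single predicate. The function name and the treatment of range boundaries as exclusive are assumptions; the date bounds are expected to be timezone-aware datetimes, since comparing them with naive values would raise a TypeError.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def _parse_graph_time(value: str) -> datetime:
    # Graph timestamps end in "Z"; normalise so fromisoformat() accepts them.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def passes_metadata(item, file_types=None, modified_after=None,
                    modified_before=None, author_email=None) -> bool:
    """Return True if a drive item survives all requested metadata filters."""
    if file_types:
        ext = PurePosixPath(item.get("name", "")).suffix.lstrip(".").lower()
        if ext not in {t.lstrip(".").lower() for t in file_types}:
            return False
    if modified_after or modified_before:
        modified = _parse_graph_time(item["lastModifiedDateTime"])
        if modified_after and modified <= modified_after:
            return False
        if modified_before and modified >= modified_before:
            return False
    if author_email:
        email = item.get("createdBy", {}).get("user", {}).get("email", "")
        if email.lower() != author_email.lower():
            return False
    return True
```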

3. Sensitivity Label Gate

Purpose: Prevents disclosure of highly sensitive content while maintaining transparency (users can see that a document exists but is restricted).

Mechanism: Each drive item's sensitivityLabel.displayName is checked against the configured maximum label level. Items above the max level have their _content field replaced with [REDACTED — sensitivity label exceeds configured maximum] and _redacted: true is added to the result dict. Items are not dropped from results — they remain visible so agents know the document exists.

Sensitivity levels (ordered lowest → highest):
1. Public
2. General
3. Confidential
4. Highly Confidential

Configuration: FILTER_RULES_JSON environment variable or filters.json file. Example:

{
  "sensitivity": {
    "max_label": "General",
    "block_unlabeled": false
  }
}

Default behavior: If no sensitivity rules are configured, all items are allowed through.

Unlabeled items: Controlled by block_unlabeled setting. If false (default), unlabeled items are allowed. If true, unlabeled items are redacted.

Unknown labels: Items with unrecognized label names are treated as unlabeled.
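The gate's decision rules (ordered levels, unlabeled handling, unknown labels treated as unlabeled) can be sketched as below. The function name is hypothetical, and `REDACTED_CONTENT` is shortened to a placeholder here; the real notice text is the one quoted in the Mechanism paragraph.

```python
LEVELS = ["Public", "General", "Confidential", "Highly Confidential"]
REDACTED_CONTENT = "[REDACTED]"  # placeholder for the real redaction notice


def apply_sensitivity_gate(item: dict, max_label: str,
                           block_unlabeled: bool = False) -> dict:
    """Return the item unchanged, or a redacted copy if it exceeds max_label."""
    label = (item.get("sensitivityLabel") or {}).get("displayName")
    if label in LEVELS:
        # LEVELS.index(max_label) raises ValueError on a misconfigured max_label.
        blocked = LEVELS.index(label) > LEVELS.index(max_label)
    else:
        # Missing or unrecognized labels are treated as unlabeled.
        blocked = block_unlabeled
    if not blocked:
        return item
    # Redact rather than drop, so the item stays visible in results.
    return {**item, "_content": REDACTED_CONTENT, "_redacted": True}
```

Returning a copy instead of mutating the input keeps the original Graph response intact for logging of filter statistics.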

Rationale: Defense in depth — even if Graph API permissions allow access to a sensitive file, the MCP server enforces an additional content policy. Redacting rather than dropping items provides transparency: the AI assistant can inform the user "I found a document titled X but cannot access it due to sensitivity restrictions" rather than silently omitting it.

Security

Authentication: OAuth2 OBO flow — see DESIGN:OAuthAuthentication

Authorization:
- All Graph API calls are made with the user's OBO-exchanged delegated token against the /search/query, /sites/..., and /drives/... endpoints, so access is scoped to the authenticated user
- Microsoft Graph API provides a secondary authorization layer based on SharePoint permissions
- No user ID parameters are accepted from MCP clients (prevents cross-user access)

Graph Permission Scopes: Sites.Read.All, Files.Read.All, offline_access

Data Protection:
- Document content never persisted: stateless HTTP mode (stateless_http=True)
- DynamoDB stores OAuth tokens only; no document content, file names, or metadata
- File names, document content, and author names excluded from application logs
- LOG_LEVEL=INFO in production; DEBUG logging disabled
- Transport: TLS 1.3 (enforced at ALB layer)

Whitelist Enforcement:
- Site/folder whitelist scopes access to approved corporate content only
- Prevents access to personal OneDrive files or unapproved SharePoint sites
- Enforced at application layer (not just Graph API permissions) as defense in depth

Sensitivity Gate:
- Content redaction for items above configured max sensitivity label
- Prevents disclosure of highly sensitive content even if Graph API permissions allow access
- Transparency: redacted items remain visible in results (not silently dropped)

Document Parsing:
- All parsers are platform-independent pure Python libraries; no native code execution
- No server-side conversion via Graph API (avoids leaking document content to Microsoft's conversion service)
- Parsing is synchronous and CPU-bound; wrapped with asyncio.to_thread() to avoid blocking the event loop

Data Model

No persistent data model for SharePoint content. The only data stored is OAuth tokens:

DynamoDB table: mcp-oauth-storage-sharepoint-mcp-{env}
  Partition key: user_id (string)
  Sort key:      session_id (string)
  Attributes:    access_token, refresh_token, expires_at
  TTL:           expires_at + 90 days
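As a small worked example of the TTL rule, the sketch below computes the value to store. DynamoDB TTL attributes hold a Unix epoch time in seconds; the helper name is hypothetical, and whether the storage library computes this itself is not stated in this design.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch(expires_at: datetime, grace_days: int = 90) -> int:
    """Return the DynamoDB TTL (epoch seconds) for a token record."""
    return int((expires_at + timedelta(days=grace_days)).timestamp())


# A token expiring on 2026-01-01 UTC is eligible for deletion on 2026-04-01 UTC.
example = ttl_epoch(datetime(2026, 1, 1, tzinfo=timezone.utc))
```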

All SharePoint data is fetched on-demand from Microsoft Graph API and returned directly to the MCP client. No caching, no secondary storage, no logging of document content or file names.

Integrations

Microsoft Graph API
- Protocol: REST over HTTPS
- Auth: OAuth2 Bearer token (OBO-exchanged, user-scoped)
- Scopes: Sites.Read.All, Files.Read.All, offline_access
- Base URL: https://graph.microsoft.com/v1.0
- Rate limits: 10,000 requests per 10 minutes per user
- Error handling: HTTP 429 propagated to client (respects Retry-After); HTTP 401/403 re-raised; transient 5xx errors surfaced as MCP tool errors with user-friendly message
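Respecting Retry-After means reading a header that may carry either a number of seconds or an HTTP date. The sketch below shows one way to normalise it to seconds before surfacing it to the client; the function name and the one-second fallback are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0) -> float:
    """Convert a Retry-After header value to a delay in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "30"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```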

Microsoft Search API
- Endpoint: POST /v1.0/search/query
- Query language: KQL (Keyword Query Language); supports full-text search, property filters (e.g., author:jane), and boolean operators
- Entity types: driveItem (files), sitePage (SharePoint pages)
- Pagination: from (offset) and size (page size); max 500 results per request
- Fields: Explicitly requested via fields array to minimize response size
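The from/size pagination can be illustrated with a small generator that splits a desired result count into request windows of at most 500. `page_windows` is a hypothetical helper, not part of the documented tools, which expose only page_size.

```python
def page_windows(total_wanted: int, max_size: int = 500):
    """Yield (from, size) pairs covering total_wanted results."""
    offset = 0
    while offset < total_wanted:
        size = min(max_size, total_wanted - offset)
        yield offset, size
        offset += size


windows = list(page_windows(1200))
```

Each pair maps directly onto the `from` and `size` fields of one search request.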

Azure AD (authentication)
- See DESIGN:OAuthAuthentication for full flow
- Tenant: Novo Nordisk single-tenant

Performance

Expected Load: Low; per-user assistant sessions, not batch operations

Response Time Target:
- Search operations: < 3s for search_documents / search_pages (depends on search complexity and result count)
- Browse operations: < 1.5s for list_sites / list_folder
- Read operations: < 5s for read_document (depends on file size and format), < 2s for read_site_page

OBO Exchange Overhead: ~200–500ms per tool call (see DESIGN:OAuthAuthentication)

Document Parsing Overhead:
- DOCX: ~50–200ms for typical document (10–50 pages)
- PDF: ~100–500ms for typical document (10–50 pages); highly variable depending on PDF structure
- XLSX: ~100–300ms for typical spreadsheet (< 10 sheets, < 1000 rows)
- PPTX: ~50–150ms for typical presentation (10–20 slides)

Optimization Notes:
- No Graph token caching (security > latency)
- Document parsing is synchronous (CPU-bound); wrapped with asyncio.to_thread() to avoid blocking the event loop
- Filter pipeline runs sequentially (whitelist → metadata → sensitivity); later filters avoid wasting cycles on already-dropped items
- Microsoft Search API pagination: use page_size parameter to control result count; larger page sizes reduce roundtrips but increase response time

Deployment

Environments:

| Env | URL | ECS sizing | Storage |
|------|-----|------------|---------|
| Dev | `sharepoint.dev.connectors.novo-genai.com` | 0.25 vCPU / 512 MB | Memory |
| Prod | `sharepoint.connectors.novo-genai.com` | 0.5 vCPU / 1 GB | DynamoDB |

CI/CD: .github/workflows/deploy-sharepoint-mcp.yml
- lint → build (push to dev ECR) → deploy-dev → deploy-prod (gated: environment: Production)
- Prod promotes image from dev ECR (no rebuild)

Secrets (SSM Parameter Store):

/aiconnectors/{env}/sharepoint-mcp/client_id
/aiconnectors/{env}/sharepoint-mcp/client_secret
/aiconnectors/{env}/sharepoint-mcp/dynamodb_table_name

Filter Rules Configuration:
- Dev: FILTER_RULES_JSON env var (set in ECS task def)
- Prod: FILTER_RULES_JSON env var (set in ECS task def)
- Alternatively: mount filters.json file via EFS or bake into container image

Monitoring

Logs: JSON structured logs → CloudWatch Logs (/ecs/sharepoint-mcp-{env})
Retention: 90 days (GDPR Article 30 compliance)

Key Events Logged:
- documents_searched: user identity, query, result count, filter stats (whitelisted/redacted counts)
- pages_searched: user identity, query, result count
- sites_listed: user identity, search query, result count
- folder_listed: user identity, site_id, folder_path, result count
- document_read: user identity, drive_id, item_id, file extension, parse time, redacted (bool)
- site_page_read: user identity, site_id, page_name, parse time
- metadata_fetched: user identity, drive_id, item_id
- graph_request: method, endpoint, HTTP status (no content)
- obo_exchange_failed: error code, user identity
- whitelist_filter_applied: dropped count, total count
- sensitivity_gate_applied: redacted count, total count

NOT Logged:
- File names or document titles
- Document content or page text
- Author names or email addresses (only logged as generic "createdBy" field for filtering audit)
- OAuth tokens or authorization codes
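Field exclusion in structured logs can be sketched as a filter applied before serialization. The helper name `log_event` and the exact set of excluded keys are assumptions chosen to mirror the list above, not the project's actual logging code.

```python
import json

# Hypothetical deny-list mirroring the "NOT Logged" categories.
EXCLUDED_FIELDS = {
    "name", "title", "content", "author", "author_email",
    "access_token", "refresh_token", "authorization_code",
}


def log_event(event: str, **fields) -> str:
    """Serialize an event as JSON, dropping any excluded field."""
    safe = {k: v for k, v in fields.items() if k not in EXCLUDED_FIELDS}
    return json.dumps({"event": event, **safe}, sort_keys=True)


line = log_event("document_read", drive_id="d1", item_id="i1",
                 name="secret.docx", redacted=False)
```

A deny-list is the simplest form; an allow-list per event type would be stricter, at the cost of more maintenance when events gain fields.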

Alerts:
- Error rate > 5% over 5 minutes
- Graph API 5xx rate > 10 per minute
- ECS health check failures
- Document parsing failures > 10% over 5 minutes

Key Decisions

Why Microsoft Search API (`/search/query`) instead of site-scoped search?

Decision: search_documents and search_pages use POST /v1.0/search/query with entityTypes filtering rather than site-specific search endpoints like GET /sites/{id}/drive/search.

Rationale: Microsoft Search API enables cross-site search in a single request. Site-scoped search would require enumerating all allowed sites and issuing one query per site, then merging and de-duplicating results. The Search API is designed for enterprise-wide content discovery and provides better relevance ranking, KQL support, and result aggregation. It is the correct abstraction for "find me documents about X across the organization."

Why redact sensitive items rather than filter them entirely?

Decision: The sensitivity gate replaces content with a redaction notice and marks items as _redacted: true, but does not drop them from results.

Rationale: Transparency. If a highly sensitive document is silently omitted, the AI assistant will respond "I couldn't find any documents about X" — which is misleading if a relevant document exists but is restricted. With redaction, the assistant can inform the user "I found a document titled Y but cannot access it due to sensitivity restrictions; please contact your admin or open it directly in SharePoint." This improves user experience and reduces confusion. It also provides a trail for auditing: users can see that the system is working correctly and respecting sensitivity labels.

Why site/folder whitelist?

Decision: All search results are filtered by a configurable whitelist of allowed site URLs and folder paths. Items outside the whitelist are dropped before being returned to the MCP client.

Rationale: Scopes access to approved corporate content only. Without a whitelist, users could access any SharePoint site or OneDrive file they have permissions for, including personal OneDrive files or shadow IT sites. The whitelist enforces organizational policy at the MCP server layer (defense in depth — even if Graph API permissions allow broader access, the MCP server narrows it). This also prevents accidental data leakage from personal OneDrive files being indexed by AI assistants.

Why platform-independent document parsers?

Decision: Document parsing (DOCX, PDF, XLSX, PPTX → plain text) is performed in-process using pure Python libraries rather than delegating to Microsoft Graph API's conversion service (e.g., GET /items/{id}/content?format=pdf).

Rationale: Platform independence (no native dependencies), no external API dependencies for core parsing logic, and avoids leaking document content to Microsoft's conversion service (defense in depth — content never leaves the MCP server). Pure Python libraries are well-tested, actively maintained, and handle edge cases (malformed PDFs, complex DOCX layouts) gracefully. They also support offline/air-gapped deployments if needed in the future. The trade-off is slightly higher CPU usage, but ECS autoscaling handles this transparently.

Why separate `search_documents` and `search_pages` tools?

Decision: Document search (files like DOCX/PDF/XLSX) and site page search (modern SharePoint pages, news posts, wikis) are exposed as separate MCP tools rather than a single unified search tool.

Rationale: Different entity types (driveItem vs sitePage) return different metadata schemas and require different follow-up actions. Files are read with read_document (which downloads and parses raw bytes), while site pages are read with read_site_page (which fetches structured canvasLayout JSON). Exposing them as separate tools provides clearer parameter descriptions, better error messages, and avoids the complexity of a polymorphic result type. AI assistants can also reason more clearly about "search for a file" vs "search for a wiki page" as distinct operations.

URS: @URS:SharePointAccess
Risk Assessment: @RISK:SharePointAccess
OAuth Design: @DESIGN:OAuthAuthentication

Risks Addressed by This Design:
- R1 (token compromise): OBO flow, read-only scopes, TLS
- R2 (data in logs): structured logging with field exclusion (no file names, content, or author names), LOG_LEVEL=INFO
- R3 (cross-user access): user-scoped Graph API endpoints, no user ID parameters, per-user whitelist enforcement
- R4 (sensitivity breach): sensitivity label gate, content redaction, no persistent storage
- R5 (Graph API availability): graceful error messages, 429 Retry-After respected, transient 5xx errors surfaced to client
- R6 (whitelist bypass): enforced at application layer (defense in depth), applied before content parsing
- R7 (document parsing errors): graceful fallback to [Unsupported file type] or [Parsing failed] messages, no crash on malformed files

Version: 1.0
Date: 2026-04-22
Author: AILab
Approved by: Pending PR review