Runbooks
This chapter provides detailed operational runbooks for common incidents. Each runbook follows the same structure, omitting sections that do not apply:
- Symptom
- Scope assessment
- Immediate checks
- Root cause isolation
- Remediation steps
- Validation
- Escalation and what to collect
- Prevention
Runbook 1: Transcription backlog growing
Symptom
- The time from call ingestion to transcript availability is increasing, and the transcription backlog is growing.
Scope assessment
- Affects: ☐ all tenants ☐ subset of tenants ☐ specific region/provider
Immediate checks
- transcription worker health and queue depth
- provider status/quota/rate limits
- recent deployments/config changes
Remediation
- scale transcription workers
- switch to alternate transcription engine/provider (if supported)
- temporarily throttle ingestion if the backlog threatens system stability
Validation
- backlog decreases; latency returns to baseline
- new transcripts appear for test tenant
Escalation / collect
- provider error codes
- sample conversation IDs
- queue depth metrics
- worker logs
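A rough way to size the "scale transcription workers" step is to estimate the net backlog growth from two queue depth readings and work out how many workers would drain the queue in a target time. The sketch below is illustrative only: the DepthSample type, the function names, and the per-worker throughput figure are assumptions rather than part of the product, and the estimate assumes the existing workers are fully busy.

```python
import math
from dataclasses import dataclass


@dataclass
class DepthSample:
    ts: float   # sample time, in seconds
    depth: int  # queued transcription jobs at that time


def backlog_growth_per_min(samples):
    """Net queue growth (jobs per minute) between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    first, last = samples[0], samples[-1]
    elapsed_min = (last.ts - first.ts) / 60.0
    if elapsed_min <= 0:
        raise ValueError("samples must be in increasing time order")
    return (last.depth - first.depth) / elapsed_min


def workers_needed(samples, current_workers, jobs_per_worker_per_min, drain_minutes):
    """Estimate total workers needed to clear the backlog within drain_minutes.

    Assumes current workers are saturated, so the arrival rate is
    (net growth + current throughput).
    """
    growth = backlog_growth_per_min(samples)
    arrival = growth + current_workers * jobs_per_worker_per_min
    required = arrival + samples[-1].depth / drain_minutes
    # Never recommend fewer workers than are already running.
    return max(current_workers, math.ceil(required / jobs_per_worker_per_min))
```

For example, a queue that grew from 100 to 400 jobs in 10 minutes with 4 workers each handling 5 jobs per minute would need about 12 workers to clear in 40 minutes under these assumptions.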
Runbook 2: AI Assistant job stopped / not processing
Symptom
- No new insights appear, the job heartbeat is missing, and processing lag is increasing.
Immediate checks
- job process health (service status)
- queue depth and worker availability
- whether credentials for the AI engines are still valid
Remediation
- restart job workers
- roll back recent changes
- temporarily disable expensive/unstable tasks
Validation
- job heartbeat restored
- test tenant conversation gets processed
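The "job heartbeat missing" symptom and the "job heartbeat restored" validation can both be expressed as a staleness check. The following is a minimal sketch, assuming heartbeats are available as a mapping from job name to last-seen UTC timestamp; the function name, the mapping, and the five-minute threshold are hypothetical and should be adapted to however heartbeats are actually recorded.

```python
from datetime import datetime, timedelta, timezone


def stale_jobs(heartbeats, now, max_age=timedelta(minutes=5)):
    """Return sorted names of jobs whose heartbeat is missing or older than max_age.

    heartbeats maps job name -> last heartbeat (timezone-aware datetime) or None.
    """
    stale = []
    for name, last_seen in heartbeats.items():
        if last_seen is None or now - last_seen > max_age:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = {
        "insights-job": now - timedelta(minutes=12),  # stalled
        "summary-job": now - timedelta(seconds=30),   # healthy
        "new-job": None,                              # never reported
    }
    print(stale_jobs(example, now))
```

Running the same check before and after a worker restart gives a simple pass or fail signal for the validation step.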
Runbook 3: Spike in invalid JSON responses
Symptom
- Task failure rate spikes; schema validation errors increase.
Immediate checks
- identify which task(s) and engine(s)
- check recent prompt/schema updates
- examine a sample of about 10 raw outputs from the failing task(s)
Remediation
- roll back prompt/schema
- switch engine/model for affected tasks
- tighten prompt to force JSON-only output
- reduce transcript length or enforce truncation
Validation
- failure rate returns to baseline
- outputs pass schema validation
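When examining raw outputs, it helps to bucket failures by cause, because each cause points to a different remediation: prose or code fences around the JSON suggest tightening the prompt, truncated objects suggest an output or transcript length problem, and missing keys suggest a prompt/schema mismatch. The sketch below uses only the Python standard library; the category names, the required key list, and the heuristics are assumptions for illustration, not the product's validation logic.

```python
import json
from collections import Counter

FENCE = "`" * 3  # markdown code fence marker that models sometimes emit


def classify_output(raw, required_keys):
    """Roughly classify one raw engine output. The heuristics are approximate."""
    text = raw.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        if text.startswith(FENCE) or ("{" in text and not text.startswith("{")):
            return "wrapped_in_text"  # JSON is surrounded by prose or a code fence
        if text.startswith("{") and not text.endswith("}"):
            return "truncated"        # object started but never closed
        return "not_json"
    if not isinstance(data, dict):
        return "not_object"
    missing = [key for key in required_keys if key not in data]
    return "missing_keys" if missing else "ok"


def summarize(raw_outputs, required_keys):
    """Count outputs per category so the dominant failure mode stands out."""
    return Counter(classify_output(raw, required_keys) for raw in raw_outputs)
```

A summary dominated by one category narrows the choice between rolling back the prompt, switching engines, and enforcing truncation.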
Runbook 4: Suspected cross-tenant data exposure (SEV-1)
Symptom
- Any report of one tenant seeing another tenant’s data.
Immediate response
- treat as SEV-1
- freeze changes and restrict access
- collect evidence and preserve logs (do not delete)
Investigation
- confirm tenant_id on affected records
- review access logs and query paths
- isolate the exposure vector (UI/API/search index)
Remediation
- hotfix enforcement at the boundary
- invalidate caches/indexes if needed
- communicate to affected parties per policy
Validation
- confirm isolation restored with synthetic tests
- run full regression suite on tenant access boundaries
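A synthetic isolation test can be as simple as querying every exposure surface as every test tenant and flagging any returned record owned by someone else. The sketch below illustrates the idea with an in-memory stand-in: fake_fetch, RECORDS, the surface names, and the record shape are all hypothetical, and a real test would replace fake_fetch with authenticated calls against the UI, API, and search index.

```python
def find_leaks(fetch_records, tenants, surfaces):
    """Query each surface as each tenant and report records owned by another tenant.

    Returns a list of (surface, requesting_tenant, owning_tenant, record_id).
    """
    leaks = []
    for surface in surfaces:
        for tenant in tenants:
            for record in fetch_records(surface, tenant):
                if record["tenant_id"] != tenant:
                    leaks.append((surface, tenant, record["tenant_id"], record["id"]))
    return leaks


# Hypothetical synthetic data: one conversation per test tenant.
RECORDS = [
    {"id": "c1", "tenant_id": "tenant-a"},
    {"id": "c2", "tenant_id": "tenant-b"},
]


def fake_fetch(surface, tenant):
    """Stand-in backend where the 'search' surface deliberately omits the tenant filter."""
    if surface == "search":
        return list(RECORDS)
    return [r for r in RECORDS if r["tenant_id"] == tenant]


if __name__ == "__main__":
    for leak in find_leaks(fake_fetch, ["tenant-a", "tenant-b"], ["api", "search"]):
        print("LEAK:", leak)
```

An empty result across all surfaces is the expected state once isolation is restored; any entry identifies both the surface and the tenant pair involved.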
Runbook 5: Cost anomaly / runaway spend
Symptom
- Token usage or cost exceeds the alert threshold, often shortly after tasks are enabled or prompts are edited.
Immediate checks
- which tenant(s) and which task(s)
- retry rate
- transcript size distribution
- any unintended enablement across many tenants
Remediation
- disable the task (globally or for impacted tenants)
- tighten filters
- throttle backfill
- roll back prompt changes
- if provider rate limiting is driving retries, temporarily fail fast instead of retrying
Validation
- usage returns to baseline within expected time window
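The immediate checks above amount to grouping usage by tenant and task, comparing each group against its baseline, and looking at the retry rate. The following is a minimal sketch under stated assumptions: the usage row fields (tenant, task, tokens, attempt), the baseline mapping, and the 3x threshold are hypothetical and would need to match whatever usage data is actually exported.

```python
from collections import defaultdict


def spend_by_tenant_task(usage_rows):
    """Aggregate tokens, calls, and retried calls per (tenant, task) pair."""
    totals = defaultdict(lambda: {"tokens": 0, "calls": 0, "retries": 0})
    for row in usage_rows:
        agg = totals[(row["tenant"], row["task"])]
        agg["tokens"] += row["tokens"]
        agg["calls"] += 1
        if row.get("attempt", 1) > 1:  # any attempt after the first counts as a retry
            agg["retries"] += 1
    return dict(totals)


def anomalies(totals, baseline_tokens, factor=3.0):
    """Return (key, tokens, retry_rate) for pairs above factor x baseline, largest first.

    Pairs with no baseline entry are treated as baseline 0, so any usage is flagged.
    """
    flagged = []
    for key, agg in totals.items():
        if agg["tokens"] > factor * baseline_tokens.get(key, 0):
            flagged.append((key, agg["tokens"], agg["retries"] / agg["calls"]))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A flagged pair with a high retry rate points toward provider rate limiting, while a flagged pair with a low retry rate points toward larger transcripts, a prompt change, or unintended enablement.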
Implementation notes
Job logs are available via the Logs tab on the AI Assistant Job detail page. Each log entry can be expanded to view full details including error messages and processing context.
Figure: AI Assistant Job logs showing processing history.
When collecting information for MiaRec support, include:
- Conversation IDs (affected records)
- Job name and tenant
- Timestamp range of the issue
- Screenshots of error messages or log entries
- Any recent configuration changes
