Runbooks

This chapter provides detailed operational runbooks for common incidents. Each runbook draws on the following structure; shorter runbooks omit the sections that do not apply:

Symptom
Scope assessment
Immediate checks
Root cause isolation
Remediation steps
Validation
Escalation and what to collect
Prevention

Runbook 1: Transcription backlog growing

Symptom

The time from call ingestion to transcript availability is increasing, and the transcription backlog is growing.

Scope assessment

Affects: ☐ all tenants ☐ subset of tenants ☐ specific region/provider

Immediate checks

transcription worker health and queue depth
provider status/quota/rate limits
recent deployments/config changes
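
To judge whether the backlog is still growing, it helps to turn a few queue depth readings into a rate. The following is a minimal sketch, assuming queue depth can be sampled as (timestamp, depth) pairs from whatever metrics source is in use; the DepthSample type and the function name are hypothetical and not part of the product.

```python
from dataclasses import dataclass


@dataclass
class DepthSample:
    timestamp_s: float  # seconds since epoch (assumed unit)
    depth: int          # items waiting in the transcription queue


def backlog_growth_per_minute(samples: list[DepthSample]) -> float:
    """Average change in queue depth per minute between the oldest and newest sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples, key=lambda s: s.timestamp_s)
    elapsed_min = (ordered[-1].timestamp_s - ordered[0].timestamp_s) / 60.0
    if elapsed_min <= 0:
        raise ValueError("samples must span a positive time interval")
    return (ordered[-1].depth - ordered[0].depth) / elapsed_min
```

A positive result means the backlog is still growing; a negative result means the queue is draining.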

Remediation

scale transcription workers
switch to alternate transcription engine/provider (if supported)
temporarily throttle ingestion if the backlog threatens system stability
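
When scaling workers, a rough capacity estimate avoids guessing. The sketch below assumes a steady arrival rate and a roughly constant per-worker throughput, both of which are simplifications; all parameter names are illustrative. The workers must absorb new arrivals and also drain the existing backlog within the target time.

```python
import math


def workers_needed(backlog: int,
                   arrival_per_min: float,
                   per_worker_per_min: float,
                   target_minutes: float) -> int:
    """Workers required to clear `backlog` within `target_minutes` under steady load."""
    if per_worker_per_min <= 0 or target_minutes <= 0:
        raise ValueError("throughput and target time must be positive")
    # Required throughput = ongoing arrivals + backlog spread over the target window.
    required_rate = arrival_per_min + backlog / target_minutes
    return math.ceil(required_rate / per_worker_per_min)
```

For example, a backlog of 600 items, 20 arrivals per minute, and 5 items per worker per minute requires 6 workers to clear the backlog in 60 minutes.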

Validation

backlog decreases; latency returns to baseline
new transcripts appear for test tenant

Escalation / collect

provider error codes
sample conversation IDs
queue depth metrics
worker logs

Runbook 2: AI Assistant job stopped / not processing

Symptom

No new insights appear; the job heartbeat is missing; processing lag is increasing.

Immediate checks

job process health (service status)
queue depth and worker availability
credentials for AI engines are still valid
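
A missing heartbeat can be checked mechanically by comparing the last heartbeat time with the current time. This is a minimal sketch, assuming the last heartbeat timestamp can be obtained as a timezone-aware datetime; the 5-minute threshold is an illustrative default, not a documented value.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(last_heartbeat: datetime,
                       now: datetime,
                       max_age: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when the last heartbeat is older than `max_age`."""
    return now - last_heartbeat > max_age


# Usage: compare against the current UTC time.
current = datetime.now(timezone.utc)
stale = heartbeat_is_stale(current - timedelta(minutes=10), current)
```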

Remediation

restart job workers
roll back recent changes
temporarily disable expensive/unstable tasks

Validation

job heartbeat restored
test tenant conversation gets processed

Runbook 3: Spike in invalid JSON responses

Symptom

Task failure rate spikes; schema validation errors increase.

Immediate checks

identify which task(s) and engine(s)
check recent prompt/schema updates
examine a sample of raw outputs (about 10) from the failing tasks
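
When examining raw outputs, grouping them by failure type shows quickly whether the engine is wrapping JSON in prose or omitting fields. The sketch below uses only the Python standard library; the required key names are placeholders for whatever the affected task's schema actually requires.

```python
import json
from collections import Counter


def classify_output(raw: str, required_keys: set[str]) -> str:
    """Label one raw engine output by the kind of validation failure, if any."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "not_json"       # e.g. prose or markdown wrapped around the JSON
    if not isinstance(parsed, dict):
        return "not_object"     # valid JSON, but not a JSON object
    if required_keys - set(parsed):
        return "missing_keys"   # object lacks at least one required field
    return "ok"


def summarize(raw_outputs: list[str], required_keys: set[str]) -> Counter:
    """Count outputs per failure label."""
    return Counter(classify_output(r, required_keys) for r in raw_outputs)
```

A sample dominated by not_json suggests a prompt or engine change, while missing_keys points toward a schema mismatch.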

Remediation

roll back prompt/schema
switch engine/model for affected tasks
tighten prompt to force JSON-only output
reduce transcript length or enforce truncation
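
Truncation can be enforced before the transcript reaches the engine. The following sketch assumes the transcript is available as a list of utterance lines and uses a character budget as a stand-in for the engine's real token limit; both are assumptions. It keeps whole lines only, so no utterance is cut mid-sentence.

```python
def truncate_transcript(lines: list[str], max_chars: int) -> list[str]:
    """Keep leading transcript lines until the character budget is exhausted."""
    kept: list[str] = []
    used = 0
    for line in lines:
        cost = len(line) + 1  # +1 accounts for the newline that joins lines
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return kept
```

Keeping the start of the conversation is one possible policy; tasks that depend on how a call ended may need the final lines instead.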

Validation

failure rate returns to baseline
outputs pass schema validation

Runbook 4: Suspected cross-tenant data exposure (SEV-1)

Symptom

Any report of one tenant seeing another tenant’s data.

Immediate response

treat as SEV-1
freeze changes and restrict access
collect evidence and preserve logs (do not delete)

Investigation

confirm tenant_id on affected records
review access logs and query paths
isolate the exposure vector (UI/API/search index)
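
The investigation steps above can be partly automated by joining access log entries against record ownership. This is a sketch under stated assumptions: that access events and a record-to-tenant mapping can be exported from logs and storage. The AccessEvent type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessEvent:
    requester_tenant_id: str  # tenant that made the request
    record_id: str            # record that was returned


def audit_access(events: list[AccessEvent],
                 record_owner: dict[str, str]
                 ) -> tuple[list[AccessEvent], list[AccessEvent]]:
    """Split events into cross-tenant violations and events with unknown ownership."""
    violations: list[AccessEvent] = []
    unknown: list[AccessEvent] = []
    for event in events:
        owner = record_owner.get(event.record_id)
        if owner is None:
            unknown.append(event)      # ownership cannot be confirmed; review manually
        elif owner != event.requester_tenant_id:
            violations.append(event)   # record belongs to a different tenant
    return violations, unknown
```

Events with unknown ownership are reported separately rather than dropped, since a missing tenant_id may itself be part of the exposure.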

Remediation

hotfix enforcement at the boundary
invalidate caches/indexes if needed
communicate to affected parties per policy

Validation

confirm isolation restored with synthetic tests
run full regression suite on tenant access boundaries

Runbook 5: Cost anomaly / runaway spend

Symptom

Token usage or cost exceeds the configured threshold, often after tasks are enabled or prompts are edited.

Immediate checks

which tenant(s) and which task(s)
retry rate
transcript size distribution
any unintended enablement across many tenants
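
To find which tenants and tasks drive the spend, usage can be totalled per (tenant, task) pair and compared with a baseline. The sketch assumes usage is exportable as (tenant, task, tokens) rows and that a per-pair baseline is known; the row format and the factor of 3 are illustrative choices.

```python
from collections import defaultdict

Key = tuple[str, str]  # (tenant, task)


def flag_spenders(usage_rows: list[tuple[str, str, int]],
                  baseline: dict[Key, int],
                  factor: float = 3.0) -> list[tuple[Key, int]]:
    """Return (tenant, task) pairs whose total tokens exceed factor x baseline."""
    totals: dict[Key, int] = defaultdict(int)
    for tenant, task, tokens in usage_rows:
        totals[(tenant, task)] += tokens
    # A pair with no baseline is compared against 0, so any usage flags it.
    flagged = [(key, total) for key, total in totals.items()
               if total > factor * baseline.get(key, 0)]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Pairs that have no baseline at all are flagged on any usage, which surfaces tasks that were enabled unintentionally.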

Remediation

disable the task (globally or for impacted tenants)
tighten filters
throttle backfill
roll back prompt changes
if provider rate limiting is driving retries, temporarily switch to fail-fast behavior (stop retrying rate-limited calls)
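
Fail-fast behavior can be pictured as a retry wrapper with a switch that stops retrying rate-limited calls. This is a generic sketch, not the product's retry mechanism: RateLimitError and call_with_policy are hypothetical names, and a real implementation would also wait between retries.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised when the provider rejects a call for rate limiting."""


def call_with_policy(fn: Callable[[], T],
                     max_retries: int = 3,
                     fail_fast_on_rate_limit: bool = True) -> T:
    """Call `fn`, retrying rate-limited calls unless fail-fast is enabled."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn()
        except RateLimitError:
            # Fail fast: surface the error immediately instead of spending more calls.
            if fail_fast_on_rate_limit or attempts > max_retries:
                raise
```

With fail-fast enabled, each rate-limited call costs one attempt instead of up to four, which caps retry-driven spend while the underlying cause is investigated.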

Validation

usage returns to baseline within the expected time window

Implementation notes

Job logs are available via the Logs tab on the AI Assistant Job detail page. Each log entry can be expanded to view full details including error messages and processing context.

Figure: AI Assistant Job logs showing processing history.

When collecting information for MiaRec support, include:

Conversation IDs (affected records)
Job name and tenant
Timestamp range of the issue
Screenshots of error messages or log entries
Any recent configuration changes