Runbooks

This chapter contains detailed operational runbooks for common incidents. Each runbook draws on the same set of sections; shorter runbooks include only the sections that apply:

  • Symptom
  • Scope assessment
  • Immediate checks
  • Root cause isolation
  • Remediation steps
  • Validation
  • Escalation and what to collect
  • Prevention

Runbook 1: Transcription backlog growing

Symptom

  • The time from call ingestion to transcript availability is increasing, and the transcription backlog is growing.

Scope assessment

  • Affects: ☐ all tenants ☐ subset of tenants ☐ specific region/provider

Immediate checks

  • transcription worker health and queue depth
  • provider status/quota/rate limits
  • recent deployments/config changes

Remediation

  • scale transcription workers
  • switch to alternate transcription engine/provider (if supported)
  • throttle ingestion (temporary) if backlog threatens stability

Validation

  • backlog decreases; latency returns to baseline
  • new transcripts appear for test tenant
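
One way to confirm that the backlog is actually decreasing, rather than judging from a single reading, is to fit a trend to several queue depth samples. The following is a minimal sketch under stated assumptions: the function names, the sample format (seconds since the first sample, queue depth), and the `min_rate` parameter are all hypothetical and not part of the product.

```python
from statistics import mean


def backlog_trend(samples):
    """Least-squares slope of queue depth over time.

    samples: list of (seconds_since_start, queue_depth) pairs (hypothetical format).
    Returns the change in queue depth per second; a negative value means draining.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    times = [t for t, _ in samples]
    depths = [d for _, d in samples]
    t_mean, d_mean = mean(times), mean(depths)
    denom = sum((t - t_mean) ** 2 for t in times)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    return sum((t - t_mean) * (d - d_mean) for t, d in samples) / denom


def is_draining(samples, min_rate=0.0):
    """True when the backlog shrinks faster than min_rate items per second."""
    return backlog_trend(samples) < -min_rate
```

For example, `is_draining([(0, 1000), (60, 900), (120, 800)])` evaluates the slope over two minutes of samples. A trend is less sensitive to momentary spikes than comparing two individual readings.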

Escalation / collect

  • provider error codes
  • sample conversation IDs
  • queue depth metrics
  • worker logs

Runbook 2: AI Assistant job stopped / not processing

Symptom

  • No new insights appear, the job heartbeat is missing, and processing lag is increasing.

Immediate checks

  • job process health (service status)
  • queue depth and worker availability
  • credentials to AI engines still valid

Remediation

  • restart job workers
  • roll back recent changes
  • temporarily disable expensive/unstable tasks

Validation

  • job heartbeat restored
  • test tenant conversation gets processed
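
A heartbeat check like the one in the symptom and validation steps can be expressed as a simple staleness test. This is only a sketch: how the last heartbeat timestamp is obtained is not covered here, and the function name and the five-minute default are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(last_heartbeat, now=None, max_age=timedelta(minutes=5)):
    """Return True when the last heartbeat is older than max_age.

    last_heartbeat must be a timezone-aware datetime. The 5-minute default
    is an illustrative assumption, not a documented product setting.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_heartbeat > max_age
```

After restarting the workers, the heartbeat counts as restored once this check returns `False` for a fresh timestamp.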

Runbook 3: Spike in invalid JSON responses

Symptom

  • Task failure rate spikes; schema validation errors increase.

Immediate checks

  • identify which task(s) and engine(s)
  • check recent prompt/schema updates
  • examine a sample of about 10 raw outputs from the affected task(s)
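
When examining raw outputs, it can help to sort them into rough failure categories, since each category points to a different remediation (for example, extra text around the JSON suggests a prompt problem, while cut-off JSON suggests a length problem). The sketch below uses only Python's standard `json` module. The category names and function names are hypothetical, and it checks only whether the text parses as JSON, not whether it matches the task schema.

```python
import json
from collections import Counter

FENCE = "`" * 3  # a Markdown code fence marker


def classify_output(raw):
    """Assign one raw engine output to a rough failure category."""
    text = raw.strip()
    if not text:
        return "empty"
    try:
        json.loads(text)
        return "valid"
    except json.JSONDecodeError:
        pass
    if text.startswith(FENCE):
        return "code_fence"  # JSON wrapped in a Markdown code fence
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            json.loads(text[start:end + 1])
            return "surrounding_text"  # valid JSON object with prose around it
        except json.JSONDecodeError:
            pass
    return "malformed"  # truncated or otherwise unparseable


def summarize_outputs(raw_outputs):
    """Count how many sampled outputs fall into each category."""
    return Counter(classify_output(raw) for raw in raw_outputs)
```

Running `summarize_outputs` over the sampled outputs gives a quick count per category to attach to the incident notes.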

Remediation

  • roll back prompt/schema
  • switch engine/model for affected tasks
  • tighten prompt to force JSON-only output
  • reduce transcript length or enforce truncation

Validation

  • failure rate returns to baseline
  • outputs pass schema validation

Runbook 4: Suspected cross-tenant data exposure (SEV-1)

Symptom

  • Any report of one tenant seeing another tenant’s data.

Immediate response

  • treat as SEV-1
  • freeze changes and restrict access
  • collect evidence and preserve logs (do not delete)

Investigation

  • confirm tenant_id on affected records
  • review access logs and query paths
  • isolate the exposure vector (UI/API/search index)

Remediation

  • hotfix enforcement at the boundary
  • invalidate caches/indexes if needed
  • communicate to affected parties per policy

Validation

  • confirm isolation restored with synthetic tests
  • run full regression suite on tenant access boundaries
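
A synthetic isolation test can be as simple as querying each access path as a known test tenant and asserting that every returned record carries that tenant's ID. The sketch below assumes a hypothetical `fetch_records(tenant_id)` callable that wraps whichever path is being tested (UI backend, API, or search index) and returns records as dictionaries with a `tenant_id` field; neither is a documented product interface.

```python
def find_foreign_records(records, expected_tenant_id):
    """Return the records whose tenant_id differs from the expected tenant."""
    return [r for r in records if r.get("tenant_id") != expected_tenant_id]


def check_isolation(fetch_records, tenant_ids):
    """Query as each test tenant and collect any records that belong to another tenant.

    fetch_records is a hypothetical callable: fetch_records(tenant_id) -> list of dicts.
    Returns a dict mapping tenant_id to the foreign records it could see;
    an empty dict means no leak was observed on this path.
    """
    leaks = {}
    for tenant_id in tenant_ids:
        foreign = find_foreign_records(fetch_records(tenant_id), tenant_id)
        if foreign:
            leaks[tenant_id] = foreign
    return leaks
```

An empty result shows only that this path did not leak for these test tenants. It does not replace the full regression suite.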

Runbook 5: Cost anomaly / runaway spend

Symptom

  • Token usage or cost exceeds its threshold, often after tasks are enabled or prompts are edited.

Immediate checks

  • which tenant(s) and which task(s)
  • retry rate
  • transcript size distribution
  • any unintended enablement across many tenants
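
The first two checks (which tenants and tasks, and the retry rate) can be answered together by grouping usage events. This is a sketch over a hypothetical event format: each event is a dictionary with `tenant`, `task`, `tokens`, and an optional `is_retry` flag. The real usage data may be shaped differently.

```python
from collections import defaultdict


def summarize_usage(events):
    """Total tokens, call count, and retry rate per (tenant, task), highest spend first."""
    totals = defaultdict(lambda: {"tokens": 0, "calls": 0, "retries": 0})
    for event in events:
        entry = totals[(event["tenant"], event["task"])]
        entry["tokens"] += event["tokens"]
        entry["calls"] += 1
        if event.get("is_retry"):
            entry["retries"] += 1
    rows = [
        {
            "tenant": tenant,
            "task": task,
            "tokens": entry["tokens"],
            "calls": entry["calls"],
            "retry_rate": entry["retries"] / entry["calls"],
        }
        for (tenant, task), entry in totals.items()
    ]
    return sorted(rows, key=lambda row: row["tokens"], reverse=True)
```

The top few rows usually show whether the spend is concentrated in one tenant or task or spread across many tenants, which would suggest unintended enablement.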

Remediation

  • disable the task (globally or for impacted tenants)
  • tighten filters
  • throttle backfill
  • roll back prompt changes
  • if provider rate limiting is driving retries, temporarily fail fast instead of retrying
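
To illustrate what failing fast means in the last step: a retry wrapper normally backs off and tries again when the provider signals rate limiting, while fail-fast mode gives up on the first such error so that retries stop multiplying cost. The sketch below is generic and does not describe how the product implements retries. `RateLimitedError`, `call_provider`, and all parameters are hypothetical names.

```python
import time


class RateLimitedError(Exception):
    """Hypothetical error raised when the provider rejects a call due to rate limiting."""


def call_provider(request, max_attempts=4, fail_fast=False, base_delay=1.0, sleep=time.sleep):
    """Call request(), retrying rate-limited calls with exponential backoff.

    With fail_fast=True the first rate-limit error is re-raised immediately,
    so a throttled provider cannot trigger repeated (billable) retries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimitedError:
            if fail_fast or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ... by default
```

Fail-fast trades completeness for cost control: affected conversations are left unprocessed until the mode is switched off, so it is suited only as a temporary measure.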

Validation

  • usage returns to baseline within the expected time window

Implementation notes

Job logs are available via the Logs tab on the AI Assistant Job detail page. Each log entry can be expanded to view full details including error messages and processing context.

Figure: AI Assistant Job logs showing processing history.

When collecting information for MiaRec support, include:

  • Conversation IDs (affected records)
  • Job name and tenant
  • Timestamp range of the issue
  • Screenshots of error messages or log entries
  • Any recent configuration changes