
Monitoring and Alerting

This chapter defines what “healthy” looks like for Conversation Analytics and what operators should monitor and alert on.

Goal: detect issues before tenants notice (missing transcripts, stale dashboards, failed AI Tasks, provider outages).


1) Ingestion health

  • ingestion throughput (conversations/hour)
  • ingestion error rate
  • duplicate rate (if applicable)
  • latency from source → platform
  • per-tenant ingestion volume anomalies
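
A minimal sketch of how these signals could be exposed for scraping, assuming a Prometheus-based metrics stack; the metric names and the record_conversation / record_ingestion_error hooks are illustrative, not part of MiaRec:

```python
# Illustrative only: assumes a Prometheus metrics stack; metric and label names are made up.
import time

from prometheus_client import Counter, Histogram, start_http_server

INGESTED = Counter(
    "ca_conversations_ingested_total", "Conversations ingested", ["tenant", "source"]
)
INGEST_ERRORS = Counter(
    "ca_ingestion_errors_total", "Ingestion failures", ["tenant", "source", "reason"]
)
SOURCE_LATENCY = Histogram(
    "ca_source_to_platform_seconds",
    "Delay between call end at the source and ingestion into the platform",
    ["tenant"],
    buckets=(30, 60, 300, 900, 3600),
)


def record_conversation(tenant: str, source: str, call_ended_at: float) -> None:
    """Called by the ingestion pipeline after a conversation is stored."""
    INGESTED.labels(tenant, source).inc()
    SOURCE_LATENCY.labels(tenant).observe(time.time() - call_ended_at)


def record_ingestion_error(tenant: str, source: str, reason: str) -> None:
    INGEST_ERRORS.labels(tenant, source, reason).inc()


if __name__ == "__main__":
    start_http_server(9108)  # expose /metrics for the monitoring system to scrape
    # ... ingestion loop calls record_conversation / record_ingestion_error
```

Error rate, duplicate rate, and per-tenant volume anomalies can then be computed in the monitoring system from these counters over a time window.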

2) Transcription health (voice)

  • transcript completion latency (p50/p95)
  • transcript failure rate
  • percent of calls with missing transcript
  • language detection distribution and “unknown” rate
  • provider-specific error codes/quota events
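
As a sketch of how the completion-latency percentiles and the missing-transcript rate might be derived, assuming per-call records that carry a call end time and a transcript-ready time (the field names below are hypothetical, not a MiaRec schema):

```python
# Illustrative sketch: derive transcription health figures from per-call records.
# The fields call_ended_at / transcript_ready_at are assumptions, not a MiaRec schema.
from statistics import quantiles
from typing import Optional, Sequence


def transcription_health(records: Sequence[dict]) -> dict:
    latencies = [
        r["transcript_ready_at"] - r["call_ended_at"]
        for r in records
        if r.get("transcript_ready_at") is not None
    ]
    missing = sum(1 for r in records if r.get("transcript_ready_at") is None)

    p50: Optional[float] = None
    p95: Optional[float] = None
    if len(latencies) >= 2:
        cuts = quantiles(latencies, n=100)  # 99 cut points
        p50, p95 = cuts[49], cuts[94]

    return {
        "completion_latency_p50_s": p50,
        "completion_latency_p95_s": p95,
        "missing_transcript_pct": 100.0 * missing / max(len(records), 1),
    }
```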

3) AI Assistant job health

  • backlog/lag (time from transcript ready → tasks completed)
  • execution success rate
  • retry rate (high retries often indicate provider instability)
  • dead-letter/poison count
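
Backlog/lag can be reduced to a single number, sketched below under the assumption that the monitoring system can list conversations whose transcript is ready but whose AI tasks have not yet completed (this is not a MiaRec API):

```python
# Illustrative sketch: job lag = age of the oldest conversation still waiting
# for AI tasks; queue access and thresholds here are assumptions.
import time


def job_lag_seconds(pending_transcript_ready_times: list[float]) -> float:
    """pending = transcript ready, AI tasks not yet completed (Unix timestamps)."""
    if not pending_transcript_ready_times:
        return 0.0
    return time.time() - min(pending_transcript_ready_times)


def job_health_alerts(pending_times: list[float], retries_per_run: list[int],
                      dead_letter_count: int, lag_limit_s: int = 1800) -> list[str]:
    alerts = []
    if job_lag_seconds(pending_times) > lag_limit_s:
        alerts.append(f"AI Assistant job lag exceeds {lag_limit_s}s")
    if retries_per_run and sum(retries_per_run) / len(retries_per_run) > 1.0:
        alerts.append("average retries per run above 1 (possible provider instability)")
    if dead_letter_count > 0:
        alerts.append(f"{dead_letter_count} conversations in dead-letter state")
    return alerts
```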

4) AI Task health

  • task-level failure rate (per task, per tenant)
  • schema validation failures (invalid JSON or output that violates the task's schema; see the sketch after this list)
  • average tokens/request (cost proxy)
  • number of conversations skipped due to filters (sanity check)
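
One way to detect schema validation failures, sketched with the jsonschema library against a hypothetical sentiment-task schema (the schema and field names are illustrative, not MiaRec task definitions):

```python
# Illustrative sketch: flag AI Task outputs that are not valid JSON or that
# violate the expected schema. The schema is hypothetical.
import json

from jsonschema import Draft7Validator

SENTIMENT_SCHEMA = {
    "type": "object",
    "required": ["sentiment", "confidence"],
    "properties": {
        "sentiment": {"enum": ["positive", "neutral", "negative"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
}
_validator = Draft7Validator(SENTIMENT_SCHEMA)


def is_valid_task_output(raw: str) -> bool:
    """Return False for non-JSON or schema-violating task results."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return not list(_validator.iter_errors(payload))
```

The per-tenant failure ratio produced by a check like this feeds the "schema validation failures spike after task change" alert condition below.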

5) AI Engine health

  • API availability (errors, timeouts)
  • rate limiting/quota
  • latency
  • cost per tenant/task
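
Cost per tenant/task is usually derived from token usage; the sketch below assumes per-request usage events and example prices, both of which must be replaced with the provider's actual figures:

```python
# Illustrative sketch: aggregate LLM spend per (tenant, task) from usage events.
# Prices and event fields are assumptions, not provider or MiaRec values.
from collections import defaultdict
from typing import Iterable

PRICE_PER_1K_TOKENS = {"prompt": 0.0005, "completion": 0.0015}  # example rates only


def cost_by_tenant_and_task(usage_events: Iterable[dict]) -> dict:
    """usage_events: dicts with tenant, task, prompt_tokens, completion_tokens."""
    totals: dict = defaultdict(float)
    for e in usage_events:
        cost = (e["prompt_tokens"] / 1000) * PRICE_PER_1K_TOKENS["prompt"]
        cost += (e["completion_tokens"] / 1000) * PRICE_PER_1K_TOKENS["completion"]
        totals[(e["tenant"], e["task"])] += cost
    return dict(totals)
```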

6) User-facing indicators (synthetic checks)

  • can you open a transcript?
  • do dashboards load?
  • does search return results for known test filters?
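
These checks are easy to script; the sketch below uses placeholder URLs, a test conversation ID, and a service-account token, none of which are real MiaRec endpoints:

```python
# Illustrative synthetic check: exercise the user-facing paths listed above.
# Base URL, paths, IDs, and the token are placeholders to be replaced per deployment.
import requests

BASE = "https://analytics.example.com"
HEADERS = {"Authorization": "Bearer <service-account-token>"}

CHECKS = {
    "transcript_opens": f"{BASE}/transcripts/<known-test-call-id>",
    "dashboard_loads": f"{BASE}/dashboards/overview",
    "search_returns_results": f"{BASE}/search?q=<known-test-filter>",
}


def run_synthetic_checks() -> dict:
    results = {}
    for name, url in CHECKS.items():
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            results[name] = resp.ok
        except requests.RequestException:
            results[name] = False
    return results


if __name__ == "__main__":
    failed = [name for name, ok in run_synthetic_checks().items() if not ok]
    if failed:
        print("SYNTHETIC CHECK FAILED:", ", ".join(failed))
```

Run the script on a short schedule and alert on consecutive failures rather than a single failure to avoid noise.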

Alerting recommendations

Severity levels

  • SEV-1: platform-wide outage or major data loss (ingestion down, job stopped)
  • SEV-2: partial outage (one provider down, one region impacted)
  • SEV-3: degradation (latency high, retries spiking)
  • SEV-4: tenant-specific issues

Suggested alert conditions

  • ingestion backlog grows continuously for > X minutes
  • transcript missing rate exceeds Y%
  • AI Assistant job lag exceeds Z minutes
  • AI engine error rate > threshold
  • schema validation failures spike after task change
  • per-tenant usage spikes unexpectedly (cost guardrail)

Operators should tune X/Y/Z based on tenant expectations and workload.
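
As a sketch, the conditions above can be evaluated against tunable thresholds; the default X/Y/Z values below are placeholders, not recommendations:

```python
# Illustrative sketch: suggested alert conditions with operator-tunable thresholds.
# Defaults are placeholders standing in for X/Y/Z; tune per tenant expectations.
from dataclasses import dataclass


@dataclass
class Thresholds:
    backlog_growth_minutes: int = 15      # X: continuous backlog growth
    missing_transcript_pct: float = 2.0   # Y: missing transcript rate
    job_lag_minutes: int = 30             # Z: AI Assistant job lag
    engine_error_rate_pct: float = 5.0


def evaluate(metrics: dict, t: Thresholds) -> list[str]:
    """metrics keys mirror the threshold names; values come from the monitoring stack."""
    alerts = []
    if metrics.get("backlog_growth_minutes", 0) > t.backlog_growth_minutes:
        alerts.append("ingestion backlog growing continuously")
    if metrics.get("missing_transcript_pct", 0) > t.missing_transcript_pct:
        alerts.append("transcript missing rate above threshold")
    if metrics.get("job_lag_minutes", 0) > t.job_lag_minutes:
        alerts.append("AI Assistant job lag above threshold")
    if metrics.get("engine_error_rate_pct", 0) > t.engine_error_rate_pct:
        alerts.append("AI engine error rate above threshold")
    return alerts
```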


Dashboards

Maintain at least:

  • Pipeline health dashboard (ingestion → transcription → AI tasks)
  • Top failing tasks (by failures and by tenants impacted)
  • Usage dashboard (requests/tokens/cost by tenant and by task)
  • Overrides dashboard (tenants with prompt/filter overrides)


Incident response

When an alert fires:

  1. Identify scope: platform-wide vs tenant-specific
  2. Identify stage: ingestion vs transcription vs AI job vs engine
  3. Apply the runbook (see Troubleshooting → Runbooks)
  4. Communicate status to impacted tenants (template recommended)
  5. Post-incident:
     • root cause
     • remediation
     • prevention (guardrails, monitoring improvements)


Built-in monitoring in MiaRec

The AI Assistant job view provides built-in monitoring through several tabs:

Figure: AI Assistant Job - All Runs tab, showing execution history with a success/failure chart over time.

Job monitoring tabs

  • Latest run – Current execution status and progress
  • All runs – Historical execution chart showing patterns over time
  • Processing records – Individual conversation processing status and results
  • Logs – Detailed execution logs for troubleshooting

Figure: AI Assistant Job - Processing Records tab, showing per-conversation execution status.

AI Assistant menu structure

The AI Assistant section (Administration > Speech Analytics > AI Assistant) includes:

  • AI Tasks – Task configuration and tenant activation
  • Global Tasks – System-wide task definitions
  • Engines – LLM provider/model configuration
  • Jobs – Processing job configuration and monitoring