Usage, Limits, and Cost Controls
LLM-powered analytics introduces variable cost and variable performance. This chapter describes the controls operators should implement and document to keep the platform predictable.
Cost drivers (what matters most)
- Number of executions: conversations processed × enabled AI Tasks per tenant
- Prompt + transcript size: long transcripts increase token counts dramatically
- Retries: failures that trigger retries can multiply cost
- Model choice: higher-end models cost more per token and may have higher latency
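These drivers multiply, so a back-of-envelope model is useful when sizing a tenant. The sketch below is purely illustrative: the function name, token counts, per-token prices, and retry rate are placeholder assumptions, not MiaRec or provider figures.

```python
# Rough per-tenant monthly cost estimate combining the drivers above.
# All prices and rates are placeholder assumptions, not actual
# provider or MiaRec figures.

def estimate_monthly_cost(
    conversations: int,        # conversations processed per month
    tasks_enabled: int,        # enabled AI Tasks for this tenant
    avg_input_tokens: int,     # prompt + transcript tokens per execution
    avg_output_tokens: int,    # model output tokens per execution
    price_in_per_1k: float,    # $ per 1K input tokens (assumed)
    price_out_per_1k: float,   # $ per 1K output tokens (assumed)
    retry_rate: float = 0.05,  # fraction of executions retried (assumed)
) -> float:
    executions = conversations * tasks_enabled * (1 + retry_rate)
    cost_per_execution = (
        avg_input_tokens / 1000 * price_in_per_1k
        + avg_output_tokens / 1000 * price_out_per_1k
    )
    return executions * cost_per_execution

# Example: 10,000 calls/month, 3 enabled tasks, long transcripts.
print(f"${estimate_monthly_cost(10_000, 3, 6_000, 400, 0.005, 0.015):,.2f}")
```

Even with rough inputs, the model makes the leverage points obvious: halving enabled tasks or transcript tokens halves cost, while the retry rate is a silent multiplier.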
Primary cost controls (recommended)
1) Default tasks to Disabled
Publish global tasks but keep them disabled for new tenants until explicitly enabled.
2) Use task filters
Examples:
- duration > 15 seconds for voice calls
- minimum text length for chats/emails/tickets
- exclude internal/test queues
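A filter like this typically reduces to a simple eligibility predicate per conversation. As a rough illustration (the Conversation fields and thresholds here are hypothetical, not a MiaRec schema):

```python
# Illustrative eligibility filter mirroring the examples above.
# The Conversation fields are hypothetical, not a MiaRec schema.
from dataclasses import dataclass

@dataclass
class Conversation:
    channel: str           # "voice", "chat", "email", "ticket"
    duration_seconds: int  # voice call length
    text: str              # transcript or message body
    queue: str             # routing queue name

EXCLUDED_QUEUES = {"internal", "test"}
MIN_TEXT_LENGTH = 200  # characters; tune per channel

def is_eligible(conv: Conversation) -> bool:
    if conv.queue in EXCLUDED_QUEUES:
        return False
    if conv.channel == "voice":
        return conv.duration_seconds > 15
    return len(conv.text) >= MIN_TEXT_LENGTH
```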
3) Cap transcript/thread size (if supported)
Options:
- truncate to last N minutes
- summarize in stages (two-pass approach)
- drop low-value content (hold music segments)
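If the platform exposes transcripts as timed segments, the first and third options combine into a single pass. A minimal sketch, assuming a hypothetical segment shape rather than a MiaRec format:

```python
# Illustrative transcript trimming: keep the last N minutes and drop
# low-value segments. The Segment shape is assumed, not a MiaRec format.
from dataclasses import dataclass

@dataclass
class Segment:
    start_s: float  # offset from call start, in seconds
    end_s: float
    speaker: str    # "agent", "customer", "system"
    text: str

def trim_transcript(segments: list[Segment],
                    last_minutes: float = 10.0) -> list[Segment]:
    if not segments:
        return []
    cutoff = segments[-1].end_s - last_minutes * 60
    return [
        s for s in segments
        if s.end_s >= cutoff and s.speaker != "system"  # drop hold/IVR audio
    ]
```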
4) Standardize prompt templates
Avoid per-tenant prompt sprawl:
- provide recommended prompts
- allow overrides but track them (audit/monitor)
- consider approval workflows for high-cost tasks (if practical)
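One way to keep overrides visible is a registry that serves the recommended prompt by default and logs every tenant-level override. A minimal sketch (the storage, names, and fields are assumptions):

```python
# Minimal sketch of a prompt registry with audited per-tenant overrides.
# In-memory storage here; a real deployment would persist both maps.
import datetime

RECOMMENDED_PROMPTS = {
    "call_summary": "Summarize the call in 3 bullet points...",
    "sentiment": "Classify the customer's sentiment as ...",
}

overrides: dict[tuple[str, str], str] = {}  # (tenant, task) -> prompt
audit_log: list[dict] = []

def set_override(tenant: str, task: str, prompt: str, author: str) -> None:
    overrides[(tenant, task)] = prompt
    audit_log.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tenant": tenant, "task": task, "author": author,
        "prompt_chars": len(prompt),  # quick proxy for prompt growth
    })

def get_prompt(tenant: str, task: str) -> str:
    return overrides.get((tenant, task), RECOMMENDED_PROMPTS[task])
```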
5) Rate limiting / quotas (if supported)
- per-tenant execution limits per hour/day
- per-tenant token budget
- per-engine concurrency limits
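If the platform itself offers no quotas, an application-side gate in front of task submission can approximate them. The sketch below uses in-memory daily counters with arbitrary example limits; a real deployment would need a shared store (e.g., Redis) and the platform's actual submission hook:

```python
# Sketch of application-side daily quotas per tenant. Limits are
# arbitrary examples; counters are in-memory and reset by date key.
import datetime
from collections import defaultdict

DAILY_EXECUTION_LIMIT = 5_000
DAILY_TOKEN_BUDGET = 2_000_000

_execs: dict[tuple[str, str], int] = defaultdict(int)   # (tenant, day)
_tokens: dict[tuple[str, str], int] = defaultdict(int)

def allow_execution(tenant: str, est_tokens: int) -> bool:
    key = (tenant, datetime.date.today().isoformat())
    if _execs[key] >= DAILY_EXECUTION_LIMIT:
        return False
    if _tokens[key] + est_tokens > DAILY_TOKEN_BUDGET:
        return False
    _execs[key] += 1
    _tokens[key] += est_tokens
    return True
```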
Usage reporting (what to expose)
At minimum, track:
- executions per tenant per task
- tokens per tenant per task (input/output)
- average tokens/execution
- estimated cost (if provider pricing known)
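All of these fall out of a single aggregation over raw execution records. A sketch, assuming hypothetical record fields (tenant, task, input/output token counts):

```python
# Illustrative aggregation of the minimum metrics above from raw
# execution records. Record fields are assumed, not a MiaRec API.
from collections import defaultdict

def summarize(records: list[dict]) -> dict:
    agg = defaultdict(lambda: {"execs": 0, "in": 0, "out": 0})
    for r in records:  # r: {"tenant", "task", "input_tokens", "output_tokens"}
        k = (r["tenant"], r["task"])
        agg[k]["execs"] += 1
        agg[k]["in"] += r["input_tokens"]
        agg[k]["out"] += r["output_tokens"]
    for v in agg.values():
        v["avg_tokens_per_exec"] = (v["in"] + v["out"]) / v["execs"]
    return dict(agg)
```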
Recommended UI/report:
- top 10 tenants by cost
- top 10 tasks by cost
- anomaly detection (spikes after a prompt change)
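Spike detection does not need to be sophisticated to be useful: comparing today's token total against a recent baseline catches most prompt-change regressions. An illustrative check, with arbitrary thresholds:

```python
# Simple spike check for the anomaly report: flag a tenant/task whose
# daily token total jumps well above its recent average. The ratio and
# history window are illustrative defaults.
from statistics import mean

def spiked(daily_tokens: list[int], ratio: float = 2.0,
           min_history: int = 7) -> bool:
    """daily_tokens: oldest-first daily totals, with today last."""
    if len(daily_tokens) <= min_history:
        return False  # not enough history to judge
    baseline = mean(daily_tokens[:-1])
    return baseline > 0 and daily_tokens[-1] > ratio * baseline
```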
Performance and latency considerations
- Long transcripts increase latency.
- JSON schema validation failures may increase retries (if the platform retries on invalid JSON); see the sketch after this list.
- If near-real-time dashboards are needed, avoid expensive multi-output tasks in the hot path.
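For the invalid-JSON case, the key cost control is a hard cap on attempts, since every retry re-spends the full input tokens. A minimal sketch; call_model here is a stand-in for whatever LLM call your deployment actually makes:

```python
# Bounded retries on invalid JSON output. call_model is a stand-in
# for the deployment's actual LLM call, not a real MiaRec API.
import json

MAX_ATTEMPTS = 3

def run_task(call_model, prompt: str) -> dict | None:
    for _ in range(MAX_ATTEMPTS):
        raw = call_model(prompt)
        try:
            return json.loads(raw)  # schema validation could follow here
        except json.JSONDecodeError:
            continue                # each retry re-spends all input tokens
    return None  # surface as a failed execution rather than retrying forever
```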
Guardrails for prompt changes (recommended)
Because prompt changes can affect both cost and quality:
- run prompt changes through a test suite of representative transcripts
- measure token usage before/after
- stage rollouts (pilot tenants first)
- document rollback steps
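The before/after token measurement can be a small harness over the test suite. In the sketch below, call_model and count_tokens are stand-ins (use your provider's tokenizer in practice):

```python
# Before/after token comparison over a fixed suite of representative
# transcripts. call_model and count_tokens are stand-ins; use your
# provider's tokenizer and API in practice.
def compare_prompts(call_model, count_tokens, transcripts: list[str],
                    old_prompt: str, new_prompt: str) -> dict:
    def total(prompt: str) -> int:
        tokens = 0
        for t in transcripts:
            full = prompt + "\n" + t
            tokens += count_tokens(full) + count_tokens(call_model(full))
        return tokens
    before, after = total(old_prompt), total(new_prompt)
    # assumes before > 0, i.e. a non-empty test suite
    return {"before": before, "after": after,
            "change_pct": 100 * (after - before) / before}
```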
Implementation notes
- Track "tokens per execution" as the most actionable cost KPI
- Use task filters aggressively to control eligible conversations
- Monitor job processing records and logs for error patterns (a minimal log-scan sketch follows this list)
- If no built-in quotas exist, implement operational procedures: manual throttling and staged rollouts
- Contact MiaRec for specific rate limiting and usage reporting capabilities in your deployment
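For the log-monitoring note above, even a crude pattern count surfaces the dominant failure modes. The log line format in this sketch is an assumption, not a MiaRec format; adapt the parsing to your deployment's actual job records:

```python
# Count error patterns in job processing logs. The "ERROR task=... 
# reason=..." line format is an assumption; adapt to your real logs.
import re
from collections import Counter

ERROR_RE = re.compile(r"ERROR\s+task=(\S+)\s+reason=(\S+)")

def top_errors(log_lines: list[str], n: int = 10):
    counts = Counter()
    for line in log_lines:
        m = ERROR_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1  # (task, reason)
    return counts.most_common(n)
```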