Transcription System Setup

Transcription is a required foundation for AI insights on voice calls. If a call has no transcript (or a very low-quality transcript), AI Tasks cannot reliably produce insights.

This chapter covers transcription from a platform operator perspective: configuring transcription engines and the transcription pipeline for a multi-tenant environment.


What transcription provides (operator view)

  • Converts audio to text (transcripts) per call
  • Stores transcripts so they can be:
      • viewed by users
      • searched
      • analyzed by AI Tasks (CSAT, topics, summaries, etc.)

Typical transcription pipeline stages

  1. Audio ingestion (recordings + metadata)
  2. Transcription job creation (queueing)
  3. Speech-to-text execution (provider/model)
  4. Transcript post-processing (punctuation, diarization, redaction as applicable)
  5. Transcript persistence and indexing
  6. Availability in UI and downstream analytics
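
The stages above can be modeled as an ordered state that each call moves through. The following is a minimal sketch, not the platform's actual data model: the `Stage` names and both helper functions are hypothetical and exist only to illustrate the ordering.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Pipeline stages in the order listed above (names are illustrative)."""
    INGESTED = 1        # audio and metadata received
    QUEUED = 2          # transcription job created
    TRANSCRIBED = 3     # speech-to-text finished
    POST_PROCESSED = 4  # punctuation / diarization / redaction applied
    INDEXED = 5         # transcript persisted and indexed
    AVAILABLE = 6       # visible in UI and to downstream analytics


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None at the end."""
    if current is Stage.AVAILABLE:
        return None
    return Stage(current + 1)


def is_ready_for_ai_tasks(current: Stage) -> bool:
    """A call is analyzable only once its transcript is fully available."""
    return current >= Stage.AVAILABLE
```

Tracking the stage per call makes it straightforward to see where calls are stuck when the pipeline falls behind.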

Engine selection and language strategy

Engine selection (examples)

Operators may configure one or more transcription engines/providers. Consider documenting:

  • which providers are supported
  • how to choose engines per tenant, per language, or per region (if applicable)
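
One way to express per-tenant and per-language selection is a routing table with fallbacks from the most specific rule to a platform default. The sketch below is an assumption about how such routing could look; the tenant IDs, engine names, `ROUTES` table, and `select_engine` function are all hypothetical, and region-based routing is left out for brevity.

```python
from typing import Dict, Optional, Tuple

# Hypothetical routing table: (tenant_id, language) -> engine name.
# None acts as a wildcard for that position.
EngineKey = Tuple[Optional[str], Optional[str]]

ROUTES: Dict[EngineKey, str] = {
    ("acme", "de"): "engine-b",      # tenant + language override
    ("acme", None): "engine-a",      # tenant-wide default
    (None, "ja"): "engine-c",        # platform default for one language
    (None, None): "engine-default",  # platform-wide default
}


def select_engine(tenant_id: str, language: Optional[str],
                  routes: Dict[EngineKey, str] = ROUTES) -> str:
    """Pick the most specific matching route, falling back to broader ones."""
    candidates = [
        (tenant_id, language),
        (tenant_id, None),
        (None, language),
        (None, None),
    ]
    for key in candidates:
        if key in routes:
            return routes[key]
    raise LookupError("no transcription engine configured")
```

With this table, `select_engine("acme", "fr")` falls back to the tenant-wide default because no tenant and language pair matches.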

Language strategy

Common approaches:

  • Auto-detect language (simplest, may be less accurate for short calls)
  • Tenant-configured languages (more accurate, requires setup)
  • Per-conversation language hint (from ingestion metadata)
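
These approaches can be combined. The sketch below assumes one possible precedence (conversation hint first, then a single tenant-configured language, then auto-detect); that ordering, the `AUTO_DETECT` sentinel, and the `resolve_language` function are illustrative assumptions rather than documented platform behavior.

```python
from typing import Optional, Sequence

AUTO_DETECT = "auto"  # sentinel meaning "let the engine detect the language"


def resolve_language(conversation_hint: Optional[str],
                     tenant_languages: Sequence[str]) -> str:
    """Decide which language to request for a single call."""
    # 1. A per-conversation hint from ingestion metadata is the most specific.
    if conversation_hint:
        return conversation_hint.lower()
    # 2. A tenant with exactly one configured language needs no detection.
    if len(tenant_languages) == 1:
        return tenant_languages[0].lower()
    # 3. Otherwise fall back to engine-side auto-detection.
    return AUTO_DETECT
```

A tenant with several configured languages still ends up on auto-detect here; an engine that accepts a list of candidate languages could narrow detection to that list instead.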


Validation / smoke test (operator)

For a test tenant:

  1. Ingest a short call (30–60 seconds) with known audio clarity.
  2. Confirm transcript is produced within expected latency.
  3. Verify:
      • transcript is visible in call details
      • speaker separation (if supported) is reasonable
      • timestamps align with the audio
  4. Confirm the transcript is available to AI Tasks (see AI Assistant job smoke test).
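
Some of these checks can be automated once the transcript has been fetched. The following is a rough sketch under assumed data shapes: the `Segment` fields and the `smoke_check` function are hypothetical and do not correspond to a documented platform API. It covers latency, emptiness, and timestamp bounds only; speaker separation quality still needs a human listen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speaker: str
    start_s: float  # segment start, seconds from the beginning of the audio
    end_s: float    # segment end, seconds from the beginning of the audio
    text: str


def smoke_check(segments: List[Segment], audio_duration_s: float,
                latency_s: float, max_latency_s: float) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    if not any(s.text.strip() for s in segments):
        problems.append("transcript is missing or empty")
    if latency_s > max_latency_s:
        problems.append("transcript arrived later than expected")
    previous_start = 0.0
    for seg in segments:
        # Every segment must lie inside the audio and have start <= end.
        if not (0.0 <= seg.start_s <= seg.end_s <= audio_duration_s):
            problems.append(f"segment outside audio bounds: {seg}")
        # Segments are expected in chronological order.
        if seg.start_s < previous_start:
            problems.append(f"segment out of order: {seg}")
        previous_start = seg.start_s
    return problems
```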


Monitoring and alerting (transcription)

Track, at minimum:

  • transcription backlog/lag
  • job failure rate
  • provider timeouts/quotas
  • average transcription latency
  • percent of calls with missing/empty transcripts
  • language distribution and “unknown language” rate
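
Several of these metrics can be derived from job records alone. The sketch below assumes a simplified, hypothetical `JobRecord` shape and status values; provider timeouts and quota usage are omitted because they usually come from the provider's own API or logs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class JobRecord:
    status: str                 # "succeeded", "failed", or "pending"
    latency_s: Optional[float]  # None until the job finishes
    transcript_chars: int       # 0 means a missing or empty transcript
    language: Optional[str]     # None means the language is unknown


def summarize(jobs: List[JobRecord]) -> Dict[str, float]:
    """Compute basic transcription health metrics from job records."""
    finished = [j for j in jobs if j.status in ("succeeded", "failed")]
    succeeded = [j for j in jobs if j.status == "succeeded"]
    latencies = [j.latency_s for j in succeeded if j.latency_s is not None]

    def ratio(part: float, whole: int) -> float:
        # Avoid division by zero when there is no data yet.
        return part / whole if whole else 0.0

    return {
        "backlog": float(sum(j.status == "pending" for j in jobs)),
        "failure_rate": ratio(
            sum(j.status == "failed" for j in finished), len(finished)),
        "avg_latency_s": ratio(sum(latencies), len(latencies)),
        "empty_transcript_rate": ratio(
            sum(j.transcript_chars == 0 for j in succeeded), len(succeeded)),
        "unknown_language_rate": ratio(
            sum(j.language is None for j in succeeded), len(succeeded)),
    }
```

Computing these per tenant as well as platform-wide helps separate a single tenant's audio quality problem from a provider incident.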


Operational best practices

  • Provide a reprocessing mechanism (retranscribe) for:
      • provider incidents
      • improved models
      • bug fixes in diarization/punctuation
  • Define tenant-level retention for audio and transcripts (aligned with compliance requirements)
  • If using external providers, document:
      • credential rotation
      • regional/data residency constraints
      • rate limits and quotas
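
Before retranscribing after a provider incident, it helps to scope the affected calls and estimate the cost and time involved. The sketch below is illustrative only: the `Call` fields, both functions, and the cost and throughput figures supplied by the caller are assumptions, not values taken from any provider.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class Call:
    call_id: str
    transcribed_at: datetime
    audio_minutes: float
    engine: str


def select_for_reprocessing(calls: List[Call], engine: str,
                            start: datetime, end: datetime) -> List[Call]:
    """Calls transcribed by `engine` inside the window [start, end)."""
    return [c for c in calls
            if c.engine == engine and start <= c.transcribed_at < end]


def estimate_impact(calls: List[Call], cost_per_minute: float,
                    audio_minutes_per_hour: float) -> Tuple[float, float]:
    """Return (estimated cost, estimated hours) for retranscribing `calls`."""
    total_minutes = sum(c.audio_minutes for c in calls)
    cost = total_minutes * cost_per_minute
    hours = total_minutes / audio_minutes_per_hour
    return cost, hours
```

For example, 30 audio minutes at an assumed 0.5 per minute and a throughput of 60 audio minutes per hour gives an estimate of 15.0 in cost and 0.5 hours.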

Implementation notes

  • Transcription is typically configured via Administration > Speech Analytics > Transcription
  • Multiple transcription engines may be available with per-tenant or per-language selection
  • Features like speaker diarization and punctuation depend on the transcription engine
  • Document a baseline "supported audio quality" guideline and minimum call duration thresholds
  • Implement and document a "transcription health dashboard" for operators
  • Provide a "reprocess transcription" runbook and clearly describe expected impact (cost/time)