Add New Rule
To create a new rule, navigate to the Administration > Speech Analytics > Data Redaction page and click the Add button. To modify an existing rule, click the Edit button.
On the Edit Data Redaction Rule page, you can configure:
- Name of the rule
- Color, which helps to visually distinguish different rules in the redaction results, when multiple rules apply to the same recording
- Optional description
- The option to redact the transcript/audio of the matched side or both sides (see Note 1)
- One or multiple expressions to match sensitive data
Note 1. Redact both sides vs the matched side only
By default, the MiaRec application records audio in two-channel (stereo) format, where each side of a conversation is stored in a separate audio channel (left and right).
In most cases, we recommend redacting data in both audio channels because of a potential echo effect in the audio signal. Echo is a common flaw in telephony: words spoken on one side of a conversation are replayed on the other side with a slight delay and distortion.
Let's look at a common case of redacting credit card numbers in recordings. The target for redaction is the sequence of digits that a caller speaks after the agent says a trigger phrase like "What is your credit card number?" or "What is a three-digit security code?". Due to the echo effect, the spoken digits can be heard in both audio channels, possibly with slight distortion, for example, a clear "five" on one side and a distorted "fine" on the other side. If redaction is applied to one side only, sensitive data could potentially be recovered from the echoed signal.
Another common case is when an agent repeats each word after the caller as a form of confirmation.
Redacting both sides provides extra safety in removing sensitive data.
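The difference between the two options can be illustrated with a minimal Python sketch. It is not how MiaRec processes audio; the in-memory format (a list of (left, right) sample pairs) and the function name redact_span are assumptions made only for this illustration.

```python
def redact_span(samples, start_ms, end_ms, sample_rate, both_sides=True, matched_channel=0):
    """Return a copy of the stereo samples with the given time span silenced.

    samples is a list of (left, right) integer pairs (a hypothetical format).
    matched_channel is 0 for the left channel and 1 for the right channel.
    """
    first = max(0, start_ms * sample_rate // 1000)
    last = min(len(samples), end_ms * sample_rate // 1000)
    redacted = list(samples)
    for i in range(first, last):
        left, right = redacted[i]
        if both_sides:
            redacted[i] = (0, 0)
        elif matched_channel == 0:
            redacted[i] = (0, right)
        else:
            redacted[i] = (left, 0)
    return redacted


# Toy 1 kHz recording: the caller's voice on the left channel, a faint echo on the right.
audio = [(100, 10)] * 1000
one_side = redact_span(audio, 200, 400, 1000, both_sides=False)
both = redact_span(audio, 200, 400, 1000, both_sides=True)
print(one_side[300])  # (0, 10): the echo survives on the other channel
print(both[300])      # (0, 0)
```

With one-sided redaction, the echoed samples remain in the other channel, which is the risk described in this note.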
For each expression, you can configure the Replacement text that will substitute the original text in a transcript (for example, ****** or [redacted]), and the left and right padding (see Note 2).
Note 2. Left and right padding
The padding setting specifies how much data will be redacted to the left and to the right of the targeted data.
We recommend redacting at least 500 ms on each side of the detected sensitive data to compensate for the negative effects of echo and potential inaccuracy in transcription timestamps.
Transcription timestamps can deviate from the ground truth by 200-300 ms on average. Such a small deviation is hardly noticeable to a human, but it can expose a word or two in the audio if not redacted properly.
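The effect of padding on a redacted interval can be sketched in a few lines of Python. The function padded_span and its parameters are hypothetical names used for illustration only; the sketch assumes timestamps in milliseconds and clamps the result to the bounds of the recording.

```python
def padded_span(start_ms, end_ms, left_pad_ms=500, right_pad_ms=500, duration_ms=None):
    """Widen a detected span by the left and right padding, staying inside the recording."""
    start = max(0, start_ms - left_pad_ms)
    end = end_ms + right_pad_ms
    if duration_ms is not None:
        end = min(end, duration_ms)
    return start, end


# Digits detected between 12.3 s and 15.8 s, padded by 500 ms on each side.
print(padded_span(12_300, 15_800))               # (11800, 16300)
# A span near the edges of a 1.2 s recording is clamped to its bounds.
print(padded_span(200, 900, duration_ms=1_200))  # (0, 1200)
```

A 500 ms padding is wider than the 200-300 ms timestamp deviation mentioned above, so the spoken digits stay inside the redacted interval even when the timestamps are slightly off.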
Expression Examples
This chapter demonstrates a few common examples of using the expression syntax to redact sensitive data in call recordings. We will demonstrate the capabilities of the engine by gradually increasing the complexity of redaction rules.
Goal: redact a credit card number from call recordings.
To perform a redaction, we will use an MQL expression to search for sensitive data in a transcript.
Assuming a credit card number consists of 16 digits, we could specify a simple rule, like:
R"[0-9]{16}"
Such a rule searches the text for 16 consecutive digits, each in the range 0 to 9, as in the phrase "My credit card number is 1234567890123456".
In reality, a credit card number could be pronounced by a speaker in many different ways:
- each digit spoken individually, like in "My credit card number is 1 2 3 4 5 6 7 ..."
- digits are spoken in groups, like in "My credit card number is 1234 5678 ..."
- digits are separated by a dash or other punctuation symbol, like in "My credit card number is 1234-5678-..."
- a speaker uses filler words like in "My credit card number is 1234 um 5689 ..."
- the digits are not a credit card number at all, like in "A tracking number is 123456789"
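The limitation of the simple 16-digit rule can be checked with Python's re module, used here only as a stand-in for the redaction engine (the MQL regular expression dialect may differ in details). The sample phrases are made up for illustration.

```python
import re

# The same pattern as in the simple rule, without the MQL R"..." wrapper.
SIXTEEN_DIGITS = re.compile(r"[0-9]{16}")

samples = [
    "My credit card number is 1234567890123456",
    "My credit card number is 1234 5678 9012 3456",
    "My credit card number is 1234-5678-9012-3456",
]

results = [bool(SIXTEEN_DIGITS.search(s)) for s in samples]
print(results)  # [True, False, False]
```

Only the unbroken run of digits is found; the grouped and dash-separated forms are missed.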
To cover all those variations, we need a more sophisticated rule. Let's improve it.
Our next try will be the following rule:
R"[0-9][0-9\-\,. ]{2,}[0-9]"
This rule first matches a single digit 0 to 9 at the beginning of the matched text (defined by the first [0-9] expression), then two or more digits, spaces, or punctuation symbols such as a dash, comma, or period (defined by the middle [0-9\-\,. ]{2,} expression), and, finally, a single digit at the end (defined by the last [0-9] expression).
Such an expression matches the digits in all of the above-mentioned variations of the credit card number, but it also matches the tracking number, which we don't want to redact.
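The behaviour of this pattern can be tried out with Python's re module. This is a sketch under the assumption that the engine treats the pattern like a conventional regular expression; the helper name find_digit_runs and the sample phrases are illustrative.

```python
import re

# The same pattern as in the rule, without the MQL R"..." wrapper.
DIGIT_RUN = re.compile(r"[0-9][0-9\-\,. ]{2,}[0-9]")


def find_digit_runs(text):
    """Return every substring of text that the pattern matches."""
    return DIGIT_RUN.findall(text)


print(find_digit_runs("My credit card number is 1 2 3 4 5 6 7"))  # ['1 2 3 4 5 6 7']
print(find_digit_runs("My credit card number is 1234 5678"))      # ['1234 5678']
print(find_digit_runs("My credit card number is 1234-5678"))      # ['1234-5678']
print(find_digit_runs("My credit card number is 1234 um 5689"))   # ['1234', '5689']
print(find_digit_runs("A tracking number is 123456789"))          # ['123456789']
```

Note that in this sketch the filler word splits the number into two separate matches, and the tracking number is matched as well.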
To exclude the tracking number from a redaction, we can improve the expression further and add a trigger phrase to the search expression:
R"[0-9][0-9\-\,. ]{2,}[0-9]" AFTER:10 ("credit card" OR "card number")
In this new expression, we use the operator AFTER:10, which instructs the redaction engine to search only for digits that are spoken after the trigger phrases "credit card" or "card number". The value 10 in AFTER:10 is the maximum allowed distance between a trigger phrase and the searched digits (the distance is specified in words).
With such an expression, we can redact the digits that are related to credit card numbers but ignore moments in conversations when digits are spoken in other contexts.
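To make the idea of a trigger phrase with a word distance more concrete, here is a rough Python emulation. It is not the MQL implementation: the function digits_after_trigger is hypothetical, and the way the distance is counted (the number of whitespace-separated words between the end of the trigger phrase and the start of the digits) is an assumption that may differ from the engine's own convention.

```python
import re

DIGIT_RUN = re.compile(r"[0-9][0-9\-\,. ]{2,}[0-9]")
TRIGGER = re.compile(r"credit card|card number", re.IGNORECASE)


def digits_after_trigger(text, max_distance=10):
    """Return digit runs that start at most max_distance words after a trigger phrase."""
    hits = []
    for digits in DIGIT_RUN.finditer(text):
        for trigger in TRIGGER.finditer(text):
            if trigger.end() > digits.start():
                continue  # the trigger phrase must come before the digits
            # Assumed distance measure: words between the trigger and the digits.
            words_between = len(text[trigger.end():digits.start()].split())
            if words_between <= max_distance:
                hits.append(digits.group())
                break
    return hits


print(digits_after_trigger("My credit card number is 1234 5678 9012 3456"))
# ['1234 5678 9012 3456']
print(digits_after_trigger("A tracking number is 123456789"))
# []
```

In this emulation, the tracking number is ignored because no trigger phrase precedes it, while the credit card digits are still found.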
We recommend testing the expression rules on your real call recordings before enabling the rule in production.
To test the expression, click the Test Expressions button and follow the instructions in the Test Expressions section.