Free Text Redaction (Algorithm frameworks)

The Free Text Redaction Algorithm Framework removes sensitive information from unstructured text fields, such as "Notes" columns in databases. To use it, you configure the algorithm to recognize and mask sensitive data embedded within text, which requires some expertise.

To identify sensitive information, the algorithm relies on a list of predetermined lookup words associated with the type of data to be masked. For instance, to redact addresses, users might configure the algorithm to search for key indicators like “St,” “Cir,” or “Blvd.” The framework also supports pattern matching with regular expressions to flag potentially sensitive data. A common example is a pattern that matches values shaped like "123-45-6789" to detect Social Security Numbers.
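As a rough illustration of these two detection mechanisms, the Python sketch below checks a word against a set of lookup words and against a regular expression for SSN-shaped values. The names LOOKUP_WORDS, SSN_PATTERN, and is_sensitive are hypothetical, and the code is not the framework's implementation.

```python
import re

# Hypothetical lookup words: the address indicators mentioned above.
LOOKUP_WORDS = {"St", "Cir", "Blvd"}

# One possible pattern for values shaped like 123-45-6789.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def is_sensitive(word):
    """Return True if the word is a lookup word or is shaped like an SSN."""
    return word in LOOKUP_WORDS or SSN_PATTERN.fullmatch(word) is not None


print(is_sensitive("Blvd"))         # True
print(is_sensitive("123-45-6789"))  # True
print(is_sensitive("Street"))       # False
```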

Both lookup words and regular expressions are matched against entire words within the text. The algorithm never redacts only a segment of a word: when a word matches, the entire word is replaced, so partial data elements are not inadvertently exposed. As the examples below show, a regular expression that matches only part of a word leaves that word unredacted.

Example

The example below illustrates this behavior.

In this example, the value to be redacted (masked) is field_2 in a delimited file.

field_1,field_2,field_3
1,000123000,last
2,000123,last
3,123000,last
4,000 123 000,last
5,000 123,last
6,123 000,last
7,123,last

For readability, only field_2 is shown below.

Example1:

  • RegEx: '123'

  • Redacted with: 'xxx'

orig         redact         Comment
000123000    000123000      < Not redacted - 123 is part of the string.
000123       000123         < Not redacted - 123 is part of the string.
123000       123000         < Not redacted - 123 is part of the string.
000 123      000 xxx        < Redacts the 'word' 123 to 'xxx'
000 123 000  000 xxx 000    < Redacts the 'word' 123 to 'xxx'
123          xxx            < Redacts the 'word' 123 to 'xxx'
000 123.     000 xxx.       < Redacts the 'word' 123 to 'xxx'

Example2:

  • RegEx: '123.*'

  • Redacted with: 'xxx'

orig         redact         Comment
000123000    000123000      < Not redacted - 123 is part of the string.
000123       000123         < Not redacted - 123 is part of the string.
123000       xxx            < Redacts the 'word' 123000 to 'xxx'
000 123      000 xxx        < Redacts the 'word' 123 to 'xxx'
000 123 000  000 xxx 000    < Redacts the 'word' 123 to 'xxx'
123          xxx            < Redacts the 'word' 123 to 'xxx'
000 123.     000 xxx.       < Redacts the 'word' 123 to 'xxx'

Example3:

  • RegEx: '.*123'

  • Redacted with: 'xxx'

orig         redact         Comment
000123000    000123000      < Not redacted - 123 is part of the string.
000123       xxx            < Redacts the 'word' 000123 to 'xxx'
123000       123000         < Not redacted - 123 is part of the string.
000 123      000 xxx        < Redacts the 'word' 123 to 'xxx'
000 123 000  000 xxx 000    < Redacts the 'word' 123 to 'xxx'
123          xxx            < Redacts the 'word' 123 to 'xxx'
000 123.     000 xxx.       < Redacts the 'word' 123 to 'xxx'

As the examples show, when a regular expression matches a value ('word'), the complete 'word' is redacted.
Free Text Redaction cannot redact a segment of characters within a word.

To mask or redact a segment of a word, you need to use a plugin (from TS).
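The whole-word behavior shown in the three examples can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function name redact is hypothetical, a "word" is assumed to be a run of characters containing no whitespace or common sentence punctuation, and the product's actual tokenization and regular expression engine may differ.

```python
import re

# Assumption: a "word" is a run of characters that contains no whitespace
# and none of the punctuation marks . , ; : ! ?
WORD = re.compile(r"[^\s.,;:!?]+")


def redact(text, pattern, replacement="xxx"):
    """Replace every word that the pattern matches in full."""
    regex = re.compile(pattern)

    def swap(match):
        word = match.group(0)
        # fullmatch: the pattern must cover the entire word, not a segment.
        return replacement if regex.fullmatch(word) else word

    return WORD.sub(swap, text)


for value in ["000123000", "000123", "123000", "000 123", "000 123."]:
    print(repr(value), "->", repr(redact(value, "123.*")))
```

Swapping the pattern for '123' or '.*123' reproduces the other two tables under the same assumptions.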

You can use a Free Text Redaction Algorithm Framework to show or hide information by configuring either a “DenyList” or an “AllowList.”

DenyList – Designated material will be redacted (removed). For example, you can set a deny list to hide patient names and addresses. The deny list feature will match the data in the lookup file to the input.

AllowList – ONLY designated material will be visible. For example, if a drug company wants to assess how often a particular drug is being prescribed, you can use an allow list so that only the name of the drug will appear in the notes.
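The difference between the two redact types can be sketched in Python as follows. The function name apply_list, the sample note, and the word definition are all illustrative assumptions. In particular, replacing each non-allowed word with the redaction value is one plausible reading of "only designated material will be visible", not a confirmed description of the product's output.

```python
import re

# Assumption: a "word" is a run of characters that contains no whitespace
# and none of the punctuation marks . , ; : ! ?
WORD = re.compile(r"[^\s.,;:!?]+")


def apply_list(text, words, redact_type, replacement="XXXX"):
    """Redact listed words (DenyList) or everything else (AllowList)."""
    listed = set(words)

    def swap(match):
        word = match.group(0)
        if redact_type == "DenyList":
            hide = word in listed
        else:  # "AllowList"
            hide = word not in listed
        return replacement if hide else word

    return WORD.sub(swap, text)


note = "Patient Alice was prescribed Aspirin"  # made-up sample text
print(apply_list(note, ["Alice"], "DenyList"))
# Patient XXXX was prescribed Aspirin
print(apply_list(note, ["Aspirin"], "AllowList"))
# XXXX XXXX XXXX XXXX Aspirin
```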

Creating a free text redaction algorithm via UI

  1. At the top right of the Algorithms page, click + Algorithm.

  2. Enter an Algorithm Name.

    This MUST be unique.
  3. Enter a Description.

  4. Select Free Text Redaction as the Framework Name and click Next.

  5. Select a Redact Type: Deny List or Allow List.

  6. Select a Lookup File and enter a Redaction Value, and/or complete steps 7 and 8.

  7. Enter Regular Expressions by clicking the edit icon in the List Editor section. More information on the List Editor section can be found here.

  8. Enter a Redaction Value for the Regular Expressions.

  9. Click Next to verify details on the Summary step.

  10. Click Save.

Existing limitations:
  1. The maximum number of supported Regular Expressions is 50. Exceeding this number will lead to the Component Configuration exception.

  2. The maximum number of supported words in the Lookup File is 1000. Exceeding this number may affect the algorithm performance.

  3. The Lookup File format must be txt.

  4. Each entry in the Lookup File must be on its own line. Phrases are not supported. Matching is case-sensitive.

  5. The maximum length of an input text to mask is 32768. Exceeding this number will lead to the Non-Conformant data exception.
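
Before uploading a configuration, it can be handy to check it against these limits locally. The Python sketch below is a hypothetical helper, not part of the product: the names check_config and input_fits are invented for illustration, and the maximum input length is assumed to be counted in characters, which the list above does not state.

```python
# Limits taken from the list above.
MAX_REGEXES = 50
MAX_LOOKUP_WORDS = 1000
MAX_INPUT_LENGTH = 32768  # assumed here to be counted in characters


def check_config(lookup_lines, regexes):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    # One entry per line; blank lines are ignored in this sketch.
    entries = [line.strip() for line in lookup_lines if line.strip()]
    if len(regexes) > MAX_REGEXES:
        problems.append(f"too many regular expressions: {len(regexes)}")
    if len(entries) > MAX_LOOKUP_WORDS:
        problems.append(f"too many lookup words: {len(entries)}")
    for entry in entries:
        if len(entry.split()) > 1:
            problems.append(f"phrase not supported: {entry!r}")
    return problems


def input_fits(text):
    """True if the input text is within the maximum supported length."""
    return len(text) <= MAX_INPUT_LENGTH


print(check_config(["Bob", "Jones", "sales agreement"], ["[0-9]{3}"]))
print(input_fits("x" * 40000))
```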

For information on creating Free Text Redaction algorithms through the API, see API Calls for Creating Algorithms - Free Text Redaction.

Examples

Input:

The customer Bob Jones is satisfied with the terms of the sales
agreement. Please call to confirm at 718-223-7896.
Algorithm configuration:
  1. The Redact Type is DenyList

  2. Lookup File entries:

    Bob
    Jones
    agreement
  3. The Lookup File Redaction Value is XXXX

  4. Regular Expressions entry:

    [0-9]{3}-[0-9]{3}-[0-9]{4}
  5. The Regular Expression Redaction Value is YYYY

Masking result:
The customer XXXX XXXX is satisfied with the terms
of the sales XXXX. Please call to confirm at YYYY.
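
The result above can be reproduced with a short Python sketch that combines lookup words and regular expressions, each with its own redaction value. The function name mask and the word definition are assumptions made for illustration; this is not the framework's implementation.

```python
import re

# Assumption: a "word" is a run of characters that contains no whitespace
# and none of the punctuation marks . , ; : ! ?
WORD = re.compile(r"[^\s.,;:!?]+")


def mask(text, lookup_words, lookup_value, regexes, regex_value):
    """Deny-list masking with lookup words and whole-word regex matches."""
    lookup = set(lookup_words)
    compiled = [re.compile(p) for p in regexes]

    def swap(match):
        word = match.group(0)
        if word in lookup:  # case-sensitive comparison
            return lookup_value
        if any(r.fullmatch(word) for r in compiled):
            return regex_value
        return word

    return WORD.sub(swap, text)


text = ("The customer Bob Jones is satisfied with the terms of the sales "
        "agreement. Please call to confirm at 718-223-7896.")
print(mask(text, ["Bob", "Jones", "agreement"], "XXXX",
           [r"[0-9]{3}-[0-9]{3}-[0-9]{4}"], "YYYY"))
```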

"Bob", "Jones", "agreement" and the phone number are redacted.