PII Redactor Transform

This transform redacts Personally Identifiable Information (PII) from the input data.

The transform leverages the Microsoft Presidio SDK for PII detection and uses the Flair recognizer for entity recognition.

Contributors

Supported Entities

The transform detects the following PII entities by default:

  • PERSON: Names of individuals
  • EMAIL_ADDRESS: Email addresses
  • ORGANIZATION: Names of organizations
  • DATE_TIME: Dates and times
  • PHONE_NUMBER: Phone numbers
  • CREDIT_CARD: Credit card numbers

You can configure which entities to detect by passing them through the --pii_redactor_entities argument. To learn more about the supported entity types, see Entities.

Redaction Techniques

Two redaction techniques are supported:

  • replace: Replaces detected PII with a placeholder (default)
  • redact: Removes the detected PII from the text

You can choose the redaction technique by passing it as an argument parameter (--pii_redactor_operator).
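To make the difference between the two operators concrete, here is a minimal sketch in plain Python. It does not call Presidio: the `Span` type and the `apply_operator` helper are hypothetical names introduced for illustration, and the character offsets are supplied by hand rather than produced by a detector. It also assumes the detected spans do not overlap.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical detection result: character offsets plus an entity type."""
    start: int
    end: int
    entity_type: str


def apply_operator(text: str, spans: list[Span], operator: str = "replace") -> str:
    """Apply the 'replace' or 'redact' technique to non-overlapping spans."""
    if operator not in ("replace", "redact"):
        raise ValueError(f"unsupported operator: {operator}")
    # Work from the end of the string so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        # replace: substitute a placeholder; redact: drop the text entirely.
        filler = f"<{span.entity_type}>" if operator == "replace" else ""
        text = text[:span.start] + filler + text[span.end:]
    return text


text = "My name is John Doe"
spans = [Span(11, 19, "PERSON")]  # hand-written offsets for "John Doe"

replaced = apply_operator(text, spans, "replace")
redacted = apply_operator(text, spans, "redact")
print(replaced)        # My name is <PERSON>
print(repr(redacted))  # 'My name is '
```

Note that this sketch leaves the trailing space in place after redaction; whether the actual transform trims whitespace is not stated here.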

Input and Output

Input

The input data should be a PyArrow table (pa.Table) with a column containing the text to which PII detection and redaction will be applied. By default, this column is named contents.

Example Input Table Structure:

Table 1: Sample input to the PII redactor transform

| contents            | doc_id |
|---------------------|--------|
| My name is John Doe | doc001 |
| I work at apple     | doc002 |

Output

The output table includes the original columns plus two additional columns: new_contents, which holds the redacted text (the column name is configurable), and detected_pii, which lists the types of PII entities detected in that document.

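Example Output Table Structure for replace operator:

| contents            | doc_id | new_contents               | detected_pii   |
|---------------------|--------|----------------------------|----------------|
| My name is John Doe | doc001 | My name is <PERSON>        | [PERSON]       |
| I work at apple     | doc002 | I work at <ORGANIZATION>   | [ORGANIZATION] |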

When the redact operator is chosen, the detected PII is removed from the text instead of being replaced by a placeholder.

Example Output Table Structure for redact operator:

| contents            | doc_id | new_contents | detected_pii   |
|---------------------|--------|--------------|----------------|
| My name is John Doe | doc001 | My name is   | [PERSON]       |
| I work at apple     | doc002 | I work at    | [ORGANIZATION] |

Launched Command Line Options

The following command line arguments are available in addition to the options provided by the Python launcher.

  --pii_redactor_entities PII_ENTITIES
                        List of PII entities to be captured, for example: ["PERSON", "EMAIL_ADDRESS"]
  --pii_redactor_operator REDACTOR_OPERATOR
                        Redaction technique to apply: replace (default) or redact
  --pii_redactor_transformed_contents PII_TRANSFORMED_CONTENT_COLUMN_NAME
                        Name of the column to which the transformed contents will be added. This is a required argument.
  --pii_redactor_score_threshold SCORE_THRESHOLD
                        Minimum confidence score required for a detected entity to be considered a match.
                        Provide a value above 0.6
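
As a rough illustration of how the entity list and the score threshold narrow down what gets redacted, the sketch below filters a list of candidate detections in plain Python. The `Detection` class, the `filter_detections` helper, and all scores are made up for this example; it does not use the Presidio API, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical candidate produced by a PII detector."""
    entity_type: str
    text: str
    score: float  # confidence in [0, 1]; values below are invented


def filter_detections(detections, entities, score_threshold):
    """Keep detections of a requested entity type whose score meets the threshold."""
    wanted = set(entities)
    return [
        d for d in detections
        if d.entity_type in wanted and d.score >= score_threshold
    ]


candidates = [
    Detection("PERSON", "John Doe", 0.85),
    Detection("ORGANIZATION", "apple", 0.55),  # below the threshold, dropped
    Detection("DATE_TIME", "today", 0.90),     # entity type not requested, dropped
]

kept = filter_detections(candidates, ["PERSON", "ORGANIZATION"], 0.6)
print([d.entity_type for d in kept])  # ['PERSON']
```

A higher threshold reduces false positives at the cost of leaving more low-confidence PII in place, so the value is worth tuning against a sample of your own data.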

PII Redactor Ray Transform

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Summary

This project wraps the pii redactor transform with a Ray runtime.

Launched Command Line Options

In addition to the options available to the transform, as defined here, the Ray launcher options are also available.

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.