IN_PAN indian pan number predefined recognizer overwrites every 10 character alphanumeric string- fitting to regex - identifier. #1526

Open
sasever opened this issue Feb 14, 2025 · 1 comment

sasever commented Feb 14, 2025

Describe the bug
The IN_PAN (Indian PAN number) recognizer, added in "PR: New Predefined Recognizer: IN_PAN #1100" by @devopam, does not implement a checksum, although one is mentioned multiple times in the PR; it relies on the regex alone. As a result, it flags every 10-character alphanumeric identifier that fits the regex. Because it is predefined, it is also hard to disable: even after removing it with supported_entities.remove, it keeps marking unrelated words as IN_PAN.

To Reproduce
Below are 3 Faker-generated sample texts that should not be marked as IN_PAN. Words that are falsely marked as IN_PAN are highlighted in bold:

  1. "Near way game. Rather full alone some meeting medical. His yes whether. The user, EMP34050, reported an issue with their phone. The problem is described as Unable to receive messages, and the issue occurs at 771 Victor Land
    East Evan, IL 70246. Further details include that the phone is often showing no signal."
  2. "Actually break point world trade skin federal. Others head building message yeah. Team bad skill fly couple environment. The user, 188.22.80.89, reported an issue with their phone. The problem is described as Error connecting to network, and the issue occurs at Unit 1179 Box 9200
    DPO AP 77290. Further details include that the phone is often showing no signal."
  3. "Bar senior star effect commercial room able. Beat blood beyond whatever. Day special thing able.
    Weight bad that. Measure poor push role certain. The user, LIC8208114, reported an issue with their phone. The problem is described as Call drops intermittently, and the issue occurs at 99796 Kennedy Courts
    Zacharytown, VT 28689. Further details include that the phone is often showing no signal."

Running with no predefined recognizers removed, and adding custom recognizers:

from presidio_analyzer import Pattern, PatternRecognizer

emp_patterns = [Pattern(name="emp_id", regex=r'EMP\d{5}', score=0.85)]
emp_recognizer = PatternRecognizer(supported_entity="EMP_ID", patterns=emp_patterns)

and the anonymize function:

def anonymize_text(text: str) -> str:
    if text:
        results = analyzer.analyze(text=text, language='en')
        anonymized_results = anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators={
                "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<ANONYMIZED_EMAIL>"}),
                "PHONE_NUMBER": OperatorConfig("mask", {"type": "mask", "masking_char": "*", "chars_to_mask": 12, "from_end": True}),
                "LICENSE_NUMBER": OperatorConfig("replace", {"new_value": "<ANONYMIZED_LN>"}),
                "PERSON": OperatorConfig("replace", {"new_value": "<ANONYMIZED_PERSON>"}),
                # "FIRST_NAME": OperatorConfig("replace", {"new_value": "<ANONYMIZED_FIRST_NAME>"}),
                # "LAST_NAME": OperatorConfig("replace", {"new_value": "<ANONYMIZED_LAST_NAME>"}),
                "IP_ADDRESS": OperatorConfig("replace", {"new_value": "<ANONYMIZED_IP>"}),
                "EMP_ID": OperatorConfig("replace", {"new_value": "<SOME_EMP_ID>"}),
                "Location": OperatorConfig("replace", {"new_value": "<SOME_LOCATION>"}),
                "TITLE": OperatorConfig("redact", {})
            }
        )
        return anonymized_results.text
    return text

result:

  1. "Near way game. Rather full alone some meeting medical. His yes whether. The user, <<IN_PAN>>, reported an issue with their phone. The problem is described as Unable to receive messages, and the issue occurs at 771 Victor Land
    East Evan, IL 70246. Further details include that the phone is often showing no signal."
  2. "Actually break point world trade skin federal. Others head building message yeah. Team bad skill fly couple environment. The user, <<IN_PAN>>, reported an issue with their phone. The problem is described as Error <IN_PAN> to network, and the issue occurs at Unit 1179 Box 9200
    DPO AP 77290. Further details include that the phone is often showing no signal."
  3. "Bar senior star effect <IN_PAN> room able. Beat blood beyond whatever. Day special thing able.
    Weight bad that. Measure poor push role certain. The user, <IN_VOTER>, reported an issue with their phone. The problem is described as Call drops intermittently, and the issue occurs at <DATE_TIME> <ANONYMIZED_PERSON>
    , . Further details include that the phone is often showing no signal."

Removing IN_PAN and IN_VOTER as below and rerunning with the same anonymize function:

analyzer = AnalyzerEngine()
supported_entities = analyzer.get_supported_entities()
analyzer.registry.add_recognizer(license_recognizer)
supported_entities.remove('IN_PAN')
supported_entities.remove('IN_VOTER')
print("remaining supported entities")
print(supported_entities)
analyzer.registry.add_recognizer(ip_recognizer)
analyzer.registry.add_recognizer(emp_recognizer)

output of prints:

inital supported entities
['AU_ABN', 'US_ITIN', 'US_PASSPORT', 'URL', 'AU_ACN', 'CREDIT_CARD', 'UK_NHS', 'CRYPTO', 'DATE_TIME', 'IN_VOTER', 'US_BANK_NUMBER', 'IN_VEHICLE_REGISTRATION', 'NRP', 'IP_ADDRESS', 'AU_MEDICARE', 'LOCATION', 'US_SSN', 'SG_NRIC_FIN', 'PERSON', 'EMAIL_ADDRESS', 'IN_PASSPORT', 'IN_AADHAAR', 'IN_PAN', 'UK_NINO', 'US_DRIVER_LICENSE', 'IBAN_CODE', 'PHONE_NUMBER', 'AU_TFN', 'MEDICAL_LICENSE']
remaining supported entities
['AU_ABN', 'US_ITIN', 'US_PASSPORT', 'URL', 'AU_ACN', 'CREDIT_CARD', 'UK_NHS', 'CRYPTO', 'DATE_TIME', 'US_BANK_NUMBER', 'IN_VEHICLE_REGISTRATION', 'NRP', 'IP_ADDRESS', 'AU_MEDICARE', 'LOCATION', 'US_SSN', 'SG_NRIC_FIN', 'PERSON', 'EMAIL_ADDRESS', 'IN_PASSPORT', 'IN_AADHAAR', 'UK_NINO', 'US_DRIVER_LICENSE', 'IBAN_CODE', 'PHONE_NUMBER', 'AU_TFN', 'MEDICAL_LICENSE']

results:
Now there is no double redaction as before, but even though they were removed, IN_PAN (and IN_VOTER) still redact some text, including common words.

  1. "Near way game. Rather full alone some meeting medical. His yes whether. The user, <SOME_EMP_ID>, reported an issue with their phone. The problem is described as Unable to receive messages, and the issue occurs at 771 Victor Land
    East Evan, IL 70246. Further details include that the phone is often showing no signal."
  2. "Actually break point world trade skin federal. Others head building message yeah. Team bad skill fly couple environment. The user, <ANONYMIZED_IP>, reported an issue with their phone. The problem is described as Error <IN_PAN> to network, and the issue occurs at Unit 1179 Box 9200
    DPO AP 77290. Further details include that the phone is often showing no signal."
  3. "Bar senior star effect <IN_PAN> room able. Beat blood beyond whatever. Day special thing able.
    Weight bad that. Measure poor push role certain. The user, <IN_VOTER>, reported an issue with their phone. The problem is described as Call drops intermittently, and the issue occurs at <DATE_TIME> Kennedy Courts
    , VT 28689. Further details include that the phone is often showing no signal."

Expected behavior

  1. IN_PAN replaces only ID information that passes a proper checksum, not common words.
  2. IN_PAN does not double-redact something that is already redacted, as shown in the first output example.
  3. IN_PAN does not replace anything after it has been removed from the entities.

Additional context
A similar issue exists with IN_VOTER.

Contributor

omri374 commented Feb 16, 2025

Hi,
Thanks for the feedback, we'll look into improving the regex patterns (we'd also be happy to receive community contributions around this).
Presidio comes with a lot of predefined entities, but it doesn't mean one should use all of them. If the data isn't likely to contain IN_PAN or IN_VOTER entities, then the best way to remove those is to remove the recognizer (and not the supported entity).
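
A minimal sketch of that distinction, under a few stated assumptions. Editing the list returned by get_supported_entities() only changes a local list, so the registry still runs every recognizer. The helper names drop_recognizers and analyze_excluding are hypothetical; they rely on the registry's remove_recognizer method and on the entities argument of analyze. The recognizer names used in the usage note (InPanRecognizer, InVoterRecognizer) are assumed to be the default names of the predefined recognizers and should be checked against the installed version.

```python
from typing import Iterable


def drop_recognizers(registry, names: Iterable[str]) -> None:
    """Remove recognizers from a RecognizerRegistry by recognizer name."""
    for name in names:
        # remove_recognizer takes the recognizer's name, not the entity name.
        registry.remove_recognizer(name)


def analyze_excluding(analyzer, text: str, excluded: Iterable[str], language: str = "en") -> list:
    """Run the analyzer, requesting only the entities that are not excluded."""
    blocked = set(excluded)
    wanted = [
        entity
        for entity in analyzer.get_supported_entities(language=language)
        if entity not in blocked
    ]
    # Passing `entities` restricts detection for this call only;
    # the registry itself is left unchanged.
    return analyzer.analyze(text=text, language=language, entities=wanted)
```

With a real AnalyzerEngine, this would be called as drop_recognizers(analyzer.registry, ["InPanRecognizer", "InVoterRecognizer"]) once after construction, or as analyze_excluding(analyzer, text, ["IN_PAN", "IN_VOTER"]) per request when the registry should stay untouched.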

In many cases, it's actually advised to start with an empty list of recognizers and add only those that are needed, to avoid false positives.
See a more detailed example here where section 3.2 shows how to customize the recognizers.
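
As a rough sketch of the "start empty" approach (an illustration, not the example linked above): build a RecognizerRegistry by hand, register only the recognizers the data calls for, and pass it to AnalyzerEngine. The function name build_minimal_analyzer is hypothetical, the choice of recognizers is only an example, and the word boundaries added to the EMP regex are an assumption that goes beyond the pattern in the report. It also assumes that AnalyzerEngine loads the predefined recognizers only when the registry it receives is empty, which should be verified for the installed version.

```python
import re

# EMP pattern from the report, with \b added (an assumption) so it cannot
# match inside a longer token. Compiled here so it can be checked standalone.
EMP_ID_RE = re.compile(r"\bEMP\d{5}\b")


def build_minimal_analyzer():
    """Build an AnalyzerEngine whose registry holds only hand-picked recognizers."""
    # Imported lazily: needs presidio-analyzer and a spaCy model to be installed.
    from presidio_analyzer import (
        AnalyzerEngine,
        Pattern,
        PatternRecognizer,
        RecognizerRegistry,
    )
    from presidio_analyzer.predefined_recognizers import EmailRecognizer, IpRecognizer

    registry = RecognizerRegistry()
    registry.add_recognizer(EmailRecognizer())
    registry.add_recognizer(IpRecognizer())
    registry.add_recognizer(
        PatternRecognizer(
            supported_entity="EMP_ID",
            patterns=[Pattern(name="emp_id", regex=EMP_ID_RE.pattern, score=0.85)],
        )
    )
    # load_predefined_recognizers() is never called here, so IN_PAN and
    # IN_VOTER are not registered in this registry.
    return AnalyzerEngine(registry=registry)
```

The returned engine can then be used in place of the default AnalyzerEngine() in the anonymize function; NLP-based entities such as PERSON would need their recognizer added explicitly as well.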
