Describe the bug
The IN_PAN recognizer (Indian PAN number), added by @devopam in "PR: New Predefined Recognizer: IN_PAN #1100", does not implement a checksum, although one is mentioned multiple times in the PR; it relies on the regex alone. As a result, it overwrites every 10-character alphanumeric identifier that fits the regex. Because it is predefined, it is also hard to disable: even after removing it from supported_entities, unrelated words keep being marked as IN_PAN.
To Reproduce
Below are three Faker-generated sample texts that should not be marked as IN_PAN. Words that are falsely marked as IN_PAN are highlighted in bold:
"Near way game. Rather full alone some meeting medical. His yes whether. The user, EMP34050, reported an issue with their phone. The problem is described as Unable to receive messages, and the issue occurs at 771 Victor Land
East Evan, IL 70246. Further details include that the phone is often showing no signal."
"Actually break point world trade skin federal. Others head building message yeah. Team bad skill fly couple environment. The user, 188.22.80.89, reported an issue with their phone. The problem is described as Error connecting to network, and the issue occurs at Unit 1179 Box 9200
DPO AP 77290. Further details include that the phone is often showing no signal."
"Bar senior star effect commercial room able. Beat blood beyond whatever. Day special thing able.
Weight bad that. Measure poor push role certain. The user, LIC8208114, reported an issue with their phone. The problem is described as Call drops intermittently, and the issue occurs at 99796 Kennedy Courts
Zacharytown, VT 28689. Further details include that the phone is often showing no signal."
Running with none of the predefined recognizers removed, and with custom recognizers added:
"Near way game. Rather full alone some meeting medical. His yes whether. The user, <<IN_PAN>>, reported an issue with their phone. The problem is described as Unable to receive messages, and the issue occurs at 771 Victor Land
East Evan, IL 70246. Further details include that the phone is often showing no signal."
"Actually break point world trade skin federal. Others head building message yeah. Team bad skill fly couple environment. The user, <<IN_PAN>>, reported an issue with their phone. The problem is described as Error <IN_PAN> to network, and the issue occurs at Unit 1179 Box 9200
DPO AP 77290. Further details include that the phone is often showing no signal."
"Bar senior star effect <IN_PAN> room able. Beat blood beyond whatever. Day special thing able.
Weight bad that. Measure poor push role certain. The user, <IN_VOTER>, reported an issue with their phone. The problem is described as Call drops intermittently, and the issue occurs at <DATE_TIME> <ANONYMIZED_PERSON>
, . Further details include that the phone is often showing no signal."
Removing IN_PAN and IN_VOTER as below and rerunning with the same anonymize function:
Results:
There is no longer any double redaction as before, but even after removal some common words are still redacted as IN_PAN (and IN_VOTER).
"Near way game. Rather full alone some meeting medical. His yes whether. The user, <SOME_EMP_ID>, reported an issue with their phone. The problem is described as Unable to receive messages, and the issue occurs at 771 Victor Land
East Evan, IL 70246. Further details include that the phone is often showing no signal."
"Actually break point world trade skin federal. Others head building message yeah. Team bad skill fly couple environment. The user, <ANONYMIZED_IP>, reported an issue with their phone. The problem is described as Error <IN_PAN> to network, and the issue occurs at Unit 1179 Box 9200
DPO AP 77290. Further details include that the phone is often showing no signal."
"Bar senior star effect <IN_PAN> room able. Beat blood beyond whatever. Day special thing able.
Weight bad that. Measure poor push role certain. The user, <IN_VOTER>, reported an issue with their phone. The problem is described as Call drops intermittently, and the issue occurs at <DATE_TIME> Kennedy Courts
, VT 28689. Further details include that the phone is often showing no signal."
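One possible explanation for why ordinary words such as "connecting" and "commercial" get tagged: both are exactly 10 characters long. The sketch below is an illustration only. `LOOSE_PAN_LIKE` is an assumed pattern shape for a low-confidence, regex-only recognizer and is not confirmed to be the exact regex Presidio ships; `STRICT_PAN_LIKE` and `find_candidates` are hypothetical names. The only hard fact used is the PAN layout of five letters, four digits, and one letter.

```python
import re

# ASSUMPTION: a loose pattern of this shape, not necessarily Presidio's own.
# The two lookaheads are not limited to the 10-character token: they only
# require a letter and a 4-digit run somewhere later on the same line.
LOOSE_PAN_LIKE = re.compile(
    r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b"
)

# Stricter alternative based on the PAN layout: 5 letters, 4 digits, 1 letter.
STRICT_PAN_LIKE = re.compile(r"\b[A-Za-z]{5}[0-9]{4}[A-Za-z]\b")


def find_candidates(pattern, text):
    """Return every substring of `text` that `pattern` matches."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = (
        "The problem is described as Error connecting to network, "
        "and the issue occurs at Unit 1179 Box 9200"
    )
    # The digits in the address satisfy the lookahead, so a plain
    # 10-letter word is returned as a candidate.
    print(find_candidates(LOOSE_PAN_LIKE, sample))   # ['connecting']
    print(find_candidates(STRICT_PAN_LIKE, sample))  # []
    print(find_candidates(STRICT_PAN_LIKE, "ID ABCDE1234F on file"))  # ['ABCDE1234F']
```

With a pattern like the loose one, any 10-character word on a line that also contains a four-digit number becomes a candidate, which would be consistent with the false positives shown above.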
Expected behavior
IN_PAN should replace only ID information that passes a proper checksum, not common words.
IN_PAN should not double-redact something that is already redacted, as shown in the first output example.
After IN_PAN is removed from the entities, it should not replace anything.
Additional context
A similar issue exists with IN_VOTER.
Hi,
Thanks for the feedback, we'll look into improving the regex patterns (we'd also be happy to receive community contributions around this).
Presidio comes with a lot of predefined entities, but it doesn't mean one should use all of them. If the data isn't likely to contain IN_PAN or IN_VOTER entities, then the best way to remove those is to remove the recognizer (and not the supported entity).
In many cases, it's actually advised to start with an empty list of recognizers and add only those that are needed, to avoid false positives.
See a more detailed example here, where section 3.2 shows how to customize the recognizers.
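As a rough sketch of removing the recognizer rather than the supported entity: the snippet below loads the predefined recognizers into a registry, removes two by name, and passes the registry to the analyzer. The recognizer names `InPanRecognizer` and `InVoterRecognizer` are assumed to match the class names of the predefined recognizers, and `build_analyzer` is a hypothetical helper. It also assumes Presidio and a spaCy English model are installed.

```python
# Names are assumptions: predefined recognizers are expected to be
# registered under their class names.
UNWANTED_RECOGNIZERS = ("InPanRecognizer", "InVoterRecognizer")


def build_analyzer(unwanted=UNWANTED_RECOGNIZERS):
    """Build an AnalyzerEngine whose registry lacks the given recognizers."""
    # Imported here so the snippet can be loaded without Presidio installed.
    from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()

    # Remove the recognizers themselves, not just the entity names.
    for name in unwanted:
        registry.remove_recognizer(name)

    return AnalyzerEngine(registry=registry)


if __name__ == "__main__":
    analyzer = build_analyzer()
    text = "The problem is described as Error connecting to network at Unit 1179 Box 9200"
    for result in analyzer.analyze(text=text, language="en"):
        print(result.entity_type, text[result.start:result.end], result.score)
```

The same registry can instead be left empty and populated with `registry.add_recognizer(...)` for only the recognizers the data actually needs.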