Describe the bug
mail-parser fails to apply its regex to domains that end with ".id"
My scraper fails on them specifically
To Reproduce
Change the domain to end with .id
Proposed fix: in const.py:51, change the regex to use a negative lookbehind that rejects a preceding word character or dot.
Current:
(
    r"[^\w](?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*"
    r"[(]?envelope-sender|\s+from|\s+by|\s+with"
    r"(?! cipher)|\s+for|\s+via|;))"
)
Proposed:
(
    r"(?<![\w.])(?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*"
    r"[(]?envelope-sender|\s+from|\s+by|\s+with"
    r"(?! cipher)|\s+for|\s+via|;))"
)
Expected behavior
ID extracted correctly and only once
Environment:
OS: Linux
Docker: yes
mail-parser version 4.1.2
We've got the same issue/behavior (also on v4.1.2).
Parsing fails if any of the mail's "Received" headers contains a domain ending in .id
Here is a simple reproducible example:
The email containing a domain with .id
Received: from web.myhost.id
    by smtp.domain.com (Proxmox) with ESMTPS id SOMEIDHERE
    for <[email protected]>; Wed, 19 Feb 2025 15:00:00 +0700 (WIB)
From: "Someone" <[email protected]>
To: <[email protected]>
Subject: OK
Message-ID: <[email protected]>
Date: Wed, 19 Feb 2025 12:00:53 +0000
Content-Type: multipart/mixed; boundary="--_BOUND"
MIME-Version: 1.0

----_BOUND
Content-Type: text/plain; name="hello_world.txt"
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQK
----_BOUND--
The code
import pathlib
from mailparser import parse_from_string

with pathlib.Path('/path/to/file/test.txt').open() as file:
    parse_from_string(file.read())  # Gives an error log

# More than one match found for [^\w](?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*[(]?envelope-sender|\s+from|\s+by|\s+with(?! cipher)|\s+for|\s+via|;)) in from web.myhost.id by smtp.domain.com Proxmox with ESMTPS id SOMEIDHERE for <[email protected]>; Wed, 19 Feb 2025 15:00:00 +0700 WIB
# More than one match found for [^\w](?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*[(]?envelope-sender|\s+from|\s+by|\s+with(?! cipher)|\s+for|\s+via|;)) in from web.myhost.id by smtp.domain.com Proxmox with ESMTPS id SOMEIDHERE for <[email protected]>; Wed, 19 Feb 2025 15:00:00 +0700 WIB
Replacing the domain with one under a TLD other than .id removes the error.
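To see why the log reports more than one match, the pattern can be run on the cleaned Received value with plain `re`, outside mail-parser. This is a minimal sketch: the `CANDIDATE` pattern is only one possible lookbehind variant (an assumption, not necessarily what the library should adopt), the helper `find_ids` is hypothetical, and case-insensitive matching is assumed.

```python
import re

# Shared part of the pattern, copied from the error log above.
TAIL = (
    r"(?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*"
    r"[(]?envelope-sender|\s+from|\s+by|\s+with"
    r"(?! cipher)|\s+for|\s+via|;))"
)

CURRENT = r"[^\w]" + TAIL          # pattern quoted in the error log
CANDIDATE = r"(?<![\w.])" + TAIL   # assumption: "id" must not follow a word char or dot

# Received value as printed in the log message (address left as shown there).
RECEIVED = (
    "from web.myhost.id by smtp.domain.com Proxmox with ESMTPS id SOMEIDHERE "
    "for <[email protected]>; Wed, 19 Feb 2025 15:00:00 +0700 WIB"
)


def find_ids(pattern, text):
    """Return every value captured by the named group `id`."""
    # Assumption: header keywords are matched case-insensitively.
    return [m.group("id") for m in re.finditer(pattern, text, re.IGNORECASE)]


current_ids = find_ids(CURRENT, RECEIVED)
candidate_ids = find_ids(CANDIDATE, RECEIVED)

print(current_ids)    # ['by smtp.domain.com Proxmox', 'SOMEIDHERE']
print(candidate_ids)  # ['SOMEIDHERE']
```

With the current pattern, `[^\w]` accepts the dot in `myhost.id`, so the TLD is read as an `id` clause and produces a first, bogus match. The lookbehind variant skips it and leaves only the real ID.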