More than one match found for domains regex #129

Open
ItayBenAvi opened this issue Feb 5, 2025 · 1 comment

Describe the bug
mail-parser's ID regex fails on domains that end with ".id".
My scraper fails specifically on messages containing them.

To Reproduce
Change the domain in a Received header so that it ends with .id

Suggested fix: in const.py:51, replace the leading [^\w] with a negative lookbehind covering whitespace and dot.

Current pattern:
(
r"[^\w](?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*"
r"[(]?envelope-sender|\s+from|\s+by|\s+with"
r"(?! cipher)|\s+for|\s+via|;))"
)

Proposed pattern:
(
r"(?<![^\w\.])(?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*"
r"[(]?envelope-sender|\s+from|\s+by|\s+with"
r"(?! cipher)|\s+for|\s+via|;))"
)
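
To check the idea outside the library, here is a minimal sketch using only Python's `re` module. It runs the current pattern and a lookbehind variant against a Received value of the shape that triggers the error. Several things are assumptions: the `re.IGNORECASE | re.DOTALL` flags, the placeholder address `user@example.com`, and the helper name `id_matches`. The lookbehind is written as `(?<![\w.])`, one reading of "negative lookbehind with whitespace and dot". As quoted above, `(?<![^\w\.])` negates an already negated class, so it would accept the `id` in `.id` and reject the real ` id` clause.

```python
import re

# Assumed flags; adjust to match how const.py compiles its patterns.
FLAGS = re.IGNORECASE | re.DOTALL

# Shared remainder of the pattern, copied from the issue.
TAIL = (
    r"(?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*"
    r"[(]?envelope-sender|\s+from|\s+by|\s+with"
    r"(?! cipher)|\s+for|\s+via|;))"
)

# Current behaviour: any non-word character may precede "id", including ".".
CURRENT = re.compile(r"[^\w]" + TAIL, FLAGS)

# Hypothetical variant: "id" must not follow a word character or a dot.
VARIANT = re.compile(r"(?<![\w.])" + TAIL, FLAGS)

# Received value with a .id domain; the address is a placeholder.
RECEIVED = (
    "from web.myhost.id by smtp.domain.com Proxmox with ESMTPS "
    "id SOMEIDHERE for <user@example.com>; "
    "Wed, 19 Feb 2025 15:00:00 +0700 WIB"
)


def id_matches(pattern, text):
    """Return every value captured by the named group "id"."""
    return [m.group("id") for m in pattern.finditer(text)]


print(id_matches(CURRENT, RECEIVED))  # two captures, the first from ".id by"
print(id_matches(VARIANT, RECEIVED))  # one capture, the real ID
```

Unlike `[^\w]`, the lookbehind consumes no character, so the variant can also match an `id` clause at the very start of the string. Whether that matters for mail-parser is not something this sketch verifies.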

Expected behavior
The ID is extracted correctly, and only once.

Environment:

  • OS: Linux
  • Docker: yes
  • mail-parser version 4.1.2

vmeyet commented Feb 20, 2025

We've got the same issue/behavior (also on v4.1.2).

Parsing logs an error if any of the message's "Received" headers contains a domain ending in .id

Here is a simple reproducible example:

The email, with a .id domain in its Received header

Received: from web.myhost.id
	by smtp.domain.com (Proxmox) with ESMTPS id SOMEIDHERE
	for <[email protected]>; Wed, 19 Feb 2025 15:00:00 +0700 (WIB)
From: "Someone" <[email protected]>
To: <[email protected]>
Subject: OK
Message-ID: <[email protected]>
Date: Wed, 19 Feb 2025 12:00:53 +0000
Content-Type: multipart/mixed; boundary="--_BOUND"
MIME-Version: 1.0

----_BOUND
Content-Type: text/plain; name="hello_world.txt"
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQK

----_BOUND--

The code

import pathlib
from mailparser import parse_from_string

with pathlib.Path('/path/to/file/test.txt').open() as file:
    parse_from_string(file.read()) # Gives an error log

# More than one match found for [^\w](?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*[(]?envelope-sender|\s+from|\s+by|\s+with(?! cipher)|\s+for|\s+via|;)) in from web.myhost.id by smtp.domain.com Proxmox with ESMTPS id SOMEIDHERE for <[email protected]>; Wed, 19 Feb 2025 15:00:00 +0700 WIB
# More than one match found for [^\w](?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*[(]?envelope-sender|\s+from|\s+by|\s+with(?! cipher)|\s+for|\s+via|;)) in from web.myhost.id by smtp.domain.com Proxmox with ESMTPS id SOMEIDHERE for <[email protected]>; Wed, 19 Feb 2025 15:00:00 +0700 WIB

Replacing .id with a different TLD removes the error.
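
The TLD dependence can be seen without mail-parser by running the pattern from the log message over the same Received value with different TLDs. This is a rough sketch, not the library's code path: the compile flags, the placeholder address `user@example.com`, and the names `TEMPLATE` and `id_captures` are assumptions made for illustration.

```python
import re

# Pattern copied from the log message; the flags are an assumption.
ID_PATTERN = re.compile(
    r"[^\w](?:id\s+(?P<id>.+?)(?:\s*[(]?envelope-from|\s*"
    r"[(]?envelope-sender|\s+from|\s+by|\s+with"
    r"(?! cipher)|\s+for|\s+via|;))",
    re.IGNORECASE | re.DOTALL,
)

# Received value from the log, with the TLD left as a slot and a
# placeholder address.
TEMPLATE = (
    "from web.myhost.{tld} by smtp.domain.com Proxmox with ESMTPS "
    "id SOMEIDHERE for <user@example.com>; "
    "Wed, 19 Feb 2025 15:00:00 +0700 WIB"
)


def id_captures(tld):
    """Return everything the "id" group captures for the given TLD."""
    return ID_PATTERN.findall(TEMPLATE.format(tld=tld))


for tld in ("com", "org", "id"):
    captures = id_captures(tld)
    print(f".{tld}: {len(captures)} match(es) {captures}")
```

With `.id`, the leading `[^\w]` accepts the dot, so `id by ...` is read as an ID clause ahead of the real one. That accounts for the "More than one match found" message.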
