I'm unable to catch the error. The error breaks the code despite being in a try block. #333

Open
aditya005 opened this issue Feb 13, 2025 · 1 comment

Code:

            try:
                result = md.convert(str(pdf_file))
            except Exception as e:
                log.error(f"MarkItDown conversion failed for {pdf_file.name}: {e}")
                print(f"DEBUG: Exception caught in conversion - {e}")

Error:

Traceback (most recent call last):
  File "<python_environment>/Lib/site-packages/markitdown/_markitdown.py", line 1239, in _convert
    res = converter.convert(local_path, **_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<python_environment>/Lib/site-packages/markitdown/_markitdown.py", line 490, in convert
    text_content = pdfminer.high_level.extract_text(local_path),
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<python_environment>/Lib/site-packages/pdfminer/high_level.py", line 176, in extract_text
    interpreter.process_page(page)
  File "<python_environment>/Lib/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "<python_environment>/Lib/site-packages/pdfminer/pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "<python_environment>/Lib/site-packages/pdfminer/pdfinterp.py", line 387, in init_resources
    colorspace = get_colorspace(resolve1(spec))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<python_environment>/Lib/site-packages/pdfminer/pdfinterp.py", line 370, in get_colorspace
    return PDFColorSpace(name, stream_value(spec[1])["N"])
                               ~~~~~~~~~~~~~~~~~~~~~^^^^^
  File "<python_environment>/Lib/site-packages/pdfminer/pdftypes.py", line 263, in __getitem__
    return self.attrs[name]
           ~~~~~~~~~~^^^^^^
KeyError: 'N'

The PDF is corrupted, so it's fine that it throws an exception. But the exception is not being caught by the `except` block, so I can't handle it.
I'm using markitdown = "^0.0.1a3" on python = "^3.11".
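
A minimal sketch for narrowing this down, not a confirmed diagnosis. `KeyError` is a subclass of `Exception`, so an `except Exception` block should catch it if it propagates out of the call. The snippet below uses a hypothetical `FakeConverter` in place of the real converter and a hypothetical `safe_convert` helper; neither is part of markitdown. It returns the error alongside the result, so code after the `try` block never reads an unassigned `result` (in the snippet above, `result` is left unbound when the `except` branch runs, which would raise a `NameError` if later code uses it).

```python
import logging

log = logging.getLogger("convert_demo")


class FakeConverter:
    """Hypothetical stand-in for the real converter (assumption, not markitdown API)."""

    def convert(self, path):
        # Mimics the failure in the traceback: a missing "N" key.
        return {}["N"]  # raises KeyError: 'N'


def safe_convert(converter, path):
    """Return (result, error). error is None when the conversion succeeds."""
    try:
        return converter.convert(path), None
    except Exception as e:  # KeyError derives from Exception, so it lands here
        log.error("conversion failed for %s: %r", path, e)
        return None, e


result, error = safe_convert(FakeConverter(), "broken.pdf")
print("execution continued; caught:", type(error).__name__)
```

If the same pattern around `md.convert(...)` still prints a traceback but execution continues past the call, the traceback is probably coming from a log or print statement rather than from an uncaught exception.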

2niuhe commented Feb 14, 2025

Same issue.
