utf-8 codec decoding error #22

Open
sfmoreno opened this issue Feb 4, 2022 · 3 comments

@sfmoreno

sfmoreno commented Feb 4, 2022

Hello,

I'm running a simple test, trying to read a PDF417 barcode. I'm getting this error:

    ... zxing_init_.py", line 159, in parse
    raw = raw[:-1].decode()
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 171: invalid start byte

Here is my code:

    import zxing
    reader = zxing.BarCodeReader()
    barcode = reader.decode("pdf417_bin.png", possible_formats=['PDF_417'], try_harder=True)
    print(barcode)

And here is the image to decode:

image

I used the zxing online decoder https://zxing.org/w/decode.jspx and it works fine, so I want to know if I'm doing something wrong or it is a bug.

image

Thanks,

@dlenski
Owner

dlenski commented Feb 7, 2022

> I used the zxing online decoder https://zxing.org/w/decode.jspx and it works fine, so I want to know if I'm doing something wrong

It certainly doesn't look like the contents of your barcode are valid UTF-8: a lot of it appears to be horribly mangled.

> … or it is a bug.

It's mainly a bug in how the ZXing command-line runner (which this Python module is just a wrapper around!) handles its output: namely, it mangles bytes that aren't validly encoded… and the encoding it targets depends on the operating system being used. ☹️ 👎

On Windows, this mostly works because the default encoding can reversibly encode and decode arbitrary bytes. On Linux the default is UTF-8, where certain byte sequences are invalid, and the ZXing Java command-line runner mangles them to an extent that the Python wrapper can't recover them.
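A minimal sketch of that difference, in plain Python and without ZXing involved at all. The payload is made up for illustration, and Latin-1 stands in here for a single-byte default encoding (an assumption; the actual Windows default may differ). It only shows why a single-byte encoding survives a round trip while UTF-8 does not:

```python
# Illustration only: a made-up payload containing 0xb9, the byte from the traceback.
payload = b"ABC\xb9\x00\xff"

# Latin-1 maps every byte 0x00-0xff to a code point, so the round trip is lossless.
as_latin1 = payload.decode("latin-1")
roundtrip_latin1 = as_latin1.encode("latin-1")

# Strict UTF-8 rejects 0xb9 here: it is a continuation byte with no lead byte.
try:
    payload.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

# Lossy decoding replaces each invalid byte with U+FFFD; the original bytes are gone.
lossy = payload.decode("utf-8", errors="replace")
roundtrip_lossy = lossy.encode("utf-8")

print(roundtrip_latin1 == payload)  # single-byte round trip
print(utf8_error)                   # reason reported by strict UTF-8 decoding
print(roundtrip_lossy == payload)   # lossy UTF-8 round trip
```

Once invalid bytes have been replaced on the producing side, nothing the consumer does can bring them back, which is presumably why the fix has to happen in the command-line runner rather than in the wrapper.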

See #17 (comment), specifically points (2) and (3) for more thoughts:

  2. The ZXing command-line-runner mangles the output of raw bytes beyond recognition on some operating systems, if they can't be correctly interpreted as UTF-8. For example, the QR code you give as an example gets completely borked on Linux. (I'm guessing you're testing on Windows?) See aff3dde where I added your file as an example, along with some other possible changes:
  3. In order to improve this situation, the ZXing command-line runner would have to be improved to not mangle unknown bytes.

One possible fix for this is #19.

@dlenski
Owner

dlenski commented Feb 7, 2022

It's possible to use non-default encodings in PDF417 (and other 2D barcodes like QR or DataMatrix), but you're at the mercy of having encoding and decoding and wrapping software that understands how to do this exactly right… and most of it doesn't.

I'll make the immodest claim that I understand and care about this better than almost anyone in the world, and have contributed code to the Java ZXing library to improve the correctness of its handling of nonstandard character sets (see zxing/zxing#1328, zxing/zxing#1330), but I haven't had time to improve the CommandLineRunner's output.

@mborus

mborus commented Sep 1, 2023

I'm coming here because of the same problem: I'm trying to decode example barcodes from Deutsche Bahn (https://www.bahn.de/angebot/regio/barcode, there's a zip folder "Muster-Tickets nach UIC 918.9 (ZIP, 2 MB)"). The Aztec code contains no valid UTF-8; most of it is in zip format. So to be useful here, the library needs to return the raw content as bytes.

Patching the __init__.py file at line 189 to return the bytes as a hex string, like this, makes it possible to see the content, and it is consistent with barcode scanning apps. (I use similar hacks whenever I need to paste the Aztec code into web pages.)

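    try:
        # Normal case: both fields are valid UTF-8 text
        raw = raw[:-1].decode()
        parsed = parsed[:-1].decode()
    except UnicodeDecodeError:
        # Fallback: show the raw bytes as space-separated hex, skip parsing
        raw = ' '.join(f"{i:>02x}" for i in raw[:-1])
        parsed = ""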

It would be useful if an optional parameter switches the raw result to bytes and turns off any parsing attempts.
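A rough sketch of what such a switch could look like, written as a standalone function. Both the name decode_raw and the as_bytes parameter are hypothetical and are not part of the zxing module; the input is assumed to end in one trailing newline, as in the patch above:

```python
from typing import Union


def decode_raw(raw: bytes, as_bytes: bool = False) -> Union[bytes, str]:
    """Hypothetical helper, not part of the zxing module.

    `raw` is assumed to carry one trailing newline, as in the patch above.
    """
    payload = raw[:-1]  # drop the trailing newline
    if as_bytes:
        return payload  # caller wants the untouched bytes; no decoding attempted
    try:
        return payload.decode()  # UTF-8 is Python's default codec here
    except UnicodeDecodeError:
        # Same space-separated hex dump as the patch (bytes.hex(sep) needs Python 3.8+)
        return payload.hex(" ")


print(decode_raw(b"hello\n"))
print(decode_raw(b"\x01\xb9\n"))
print(decode_raw(b"\x01\xb9\n", as_bytes=True))
```

This only helps if the bytes reaching Python are still intact; it does not undo any mangling that happened before the wrapper saw the output.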
