utf-8 codec decoding error #22

sfmoreno · 2022-02-04T16:08:43Z

Hello,

I'm running a simple test, trying to read a PDF417 bar code. I'm getting this error:

... zxing_init_.py", line 159, in parse
raw = raw[:-1].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 171: invalid start byte

Here is my code:

import zxing
reader = zxing.BarCodeReader()
barcode = reader.decode("pdf417_bin.png", possible_formats=['PDF_417'], try_harder=True)
print(barcode)

And here is the image to decode:

I used the zxing online decoder https://zxing.org/w/decode.jspx and it works fine, so I want to know if I'm doing something wrong or it is a bug.

Thanks,


dlenski · 2022-02-07T03:32:54Z

> I used the zxing online decoder https://zxing.org/w/decode.jspx and it works fine, so I want to know if I'm doing something wrong

It certainly doesn't look like the contents of your barcode are valid UTF-8: a lot of it appears to be horribly mangled.

> … or it is a bug.

It's mainly a bug in how the ZXing command-line runner (which this Python module merely wraps!) handles its output: namely, it mangles bytes that aren't validly encoded… and the encoding it targets depends on the operating system being used. ☹️ 👎.

On Windows, this mostly works, because the default encoding can reversibly encode and decode unknown bytes. On Linux, however, the default is UTF-8, in which certain byte sequences are invalid; the ZXing Java command-line runner mangles those to an extent that the Python wrapper can't recover them.
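To illustrate the difference described above, here is a minimal Python sketch. It is only an analogy: it does not reproduce what the Java runner does internally. The byte `0xb9` comes from the traceback in this issue; the rest of `payload` is made up, and Latin-1 is used as a stand-in for a single-byte encoding that maps every byte value.

```python
# Byte 0xb9 is the one reported in the traceback; the surrounding
# bytes are invented for illustration.
payload = b"ID\xb9\x00\xff"

# Strict UTF-8 decoding fails, as in the reported error.
try:
    payload.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding substitutes U+FFFD for each invalid byte, so the
# original bytes cannot be recovered afterwards.
lossy = payload.decode("utf-8", errors="replace")
utf8_roundtrip = lossy.encode("utf-8")

# A single-byte encoding that maps all 256 byte values (Latin-1 here,
# as a stand-in) round-trips every byte unchanged.
latin1_roundtrip = payload.decode("latin-1").encode("latin-1")

print(strict_ok)                    # False
print(utf8_roundtrip == payload)    # False
print(latin1_roundtrip == payload)  # True
```

Once invalid bytes have been replaced during a UTF-8 conversion, no amount of post-processing on the Python side can tell which byte each replacement character originally stood for.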

See #17 (comment), specifically points (2) and (3) for more thoughts:

> The ZXing command-line-runner mangles the output of raw bytes beyond recognition on some operating systems, if they can't be correctly interpreted as UTF-8. For example, the QR code you give as an example gets completely borked on Linux. (I'm guessing you're testing on Windows?) See aff3dde where I added your file as an example, along with some other possible changes:
> …

In order to improve this situation, the ZXing command-line runner would have to be improved to not mangle unknown bytes.

One possible fix for this is #19.

dlenski · 2022-02-07T03:51:36Z

It's possible to use non-default encodings in PDF417 (and other 2D barcodes like QR or DataMatrix), but you're at the mercy of having encoding and decoding and wrapping software that understands how to do this exactly right… and most of it doesn't.
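As a small illustration of why every layer has to agree on the character set, the sketch below decodes the same single byte under two different single-byte encodings. The choice of `latin-1` and `cp437` is arbitrary and only for demonstration; it says nothing about which character sets a particular barcode actually declares.

```python
# The same raw byte yields different text depending on the assumed charset.
raw = b"\xb9"

as_latin1 = raw.decode("latin-1")  # U+00B9, superscript one
as_cp437 = raw.decode("cp437")     # a different character entirely

print(repr(as_latin1), repr(as_cp437))
print(as_latin1 == as_cp437)
```

If the encoder, the decoder, and the wrapper do not all use the same mapping, the text that comes out differs from what went in, even though no error is raised.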

I'll make the immodest claim that I understand and care about this better than almost anyone in the world, and have contributed code to the Java ZXing library to improve the correctness of its handling of nonstandard character sets (see zxing/zxing#1328, zxing/zxing#1330), but I haven't had time to improve the CommandLineRunner's output.

mborus · 2023-09-01T10:27:10Z

I'm coming here because of the same problem - trying to decode example barcodes from Deutsche Bahn (https://www.bahn.de/angebot/regio/barcode, there's a zip folder "Muster-Tickets nach UIC 918.9 (ZIP, 2 MB)"). The Aztec code contains no valid utf-8 - most of it is in zip format. So to process it, the library needs to return the bytes in the raw content to be useful here.

Patching the __init__.py file at line 189 to return the bytes as a string, like this, makes it possible to see the content, and the result is consistent with barcode scanning apps. (I'm using similar hacks whenever I need to paste the Aztec code into web pages.)

    try:
        raw = raw[:-1].decode()
        parsed = parsed[:-1].decode()
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a space-separated hex dump of the raw bytes
        raw = ' '.join(f"{i:>02x}" for i in raw[:-1])
        parsed = ""

It would be useful if an optional parameter switches the raw result to bytes and turns off any parsing attempts.
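A rough sketch of what such an option could look like, written as a standalone function. `decode_field` and its `as_bytes` parameter are hypothetical names chosen for illustration and are not part of the python-zxing API; the hex fallback mirrors the patch above.

```python
def decode_field(data: bytes, as_bytes: bool = False):
    """Hypothetical helper, not part of python-zxing.

    Returns the bytes untouched when as_bytes is True; otherwise tries
    UTF-8 and falls back to a space-separated hex dump.
    """
    if as_bytes:
        # Caller wants the payload as-is, with no decoding attempt.
        return data
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Same fallback as the patch above: one two-digit hex value per byte.
        return " ".join(f"{b:02x}" for b in data)


print(decode_field(b"hello"))                    # hello
print(decode_field(b"\xb9\x00"))                 # b9 00
print(decode_field(b"\xb9\x00", as_bytes=True))  # b'\xb9\x00'
```

Note that this only helps if the bytes reaching Python are still intact; it cannot undo mangling that has already happened in the command-line runner's output.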

dlenski mentioned this issue Sep 6, 2023

Parse raw bits if possible #26

Closed
