
Handle UTF-8 errors when reading PBOs #34

Open
BrettMayson opened this issue Apr 27, 2019 · 17 comments
Labels
bug Something isn't working

Comments

@BrettMayson
Contributor

BrettMayson commented Apr 27, 2019

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [82, 80, 95, 67, 111, 109, 109, 97, 110, 100, 115, 32, 151, 32, 107, 111, 112, 105, 97, 46, 104, 112, 112], error: Utf8Error { valid_up_to: 12, error_len: Some(1) } }', src/libcore/result.rs:997:5

Comes from read_cstring. One file I know that causes it is @rhsusaf/addons/rhsusf_c_radio.pbo, encountered while trying to use PBO::read or armake2 unpack.
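For context, the failing pattern looks roughly like this (a minimal sketch of a NUL-terminated string reader, not armake2's actual read_cstring; the unwrap on the UTF-8 conversion is what panics):

```rust
use std::io::{Error, Read};

// Sketch: read bytes up to the NUL terminator, then convert to String.
// String::from_utf8 fails on input such as the Windows-1252 em dash
// (byte 0x97 / 151) from the panic above, and unwrap() turns that
// failure into the reported panic.
fn read_cstring<R: Read>(reader: &mut R) -> Result<String, Error> {
    let mut bytes = Vec::new();
    for byte in reader.bytes() {
        let byte = byte?;
        if byte == 0 {
            break;
        }
        bytes.push(byte);
    }
    Ok(String::from_utf8(bytes).unwrap()) // panics on non-UTF-8 names
}
```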

@BrettMayson BrettMayson changed the title from "read_cstring doesn't support full UTF-8" to "read_cstring UTF-8 error" on Apr 27, 2019
@BrettMayson
Contributor Author

It's being caused by an em dash (—) in the filename RP_Commands — kopia.hpp

@BrettMayson
Contributor Author

BrettMayson commented May 3, 2019

I determined this was being caused by the PBO being encoded with Windows-1252 filenames. The em dash (—) is the single byte 0x97 in Windows-1252, which is not a valid UTF-8 sequence, and that is what triggers this bug.
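For the record, the exact bytes from the panic decode cleanly under Windows-1252, where 0x97 is the em dash. A quick check (using the encoding_rs crate purely for illustration; it is not necessarily what armake2 uses):

```rust
// Cargo.toml: encoding_rs = "0.8"
use encoding_rs::WINDOWS_1252;

fn main() {
    // The byte sequence from the FromUtf8Error in the original report.
    let bytes = [
        82, 80, 95, 67, 111, 109, 109, 97, 110, 100, 115, 32, 151, 32,
        107, 111, 112, 105, 97, 46, 104, 112, 112,
    ];
    let (decoded, _encoding, had_errors) = WINDOWS_1252.decode(&bytes);
    assert!(!had_errors);
    println!("{}", decoded); // RP_Commands — kopia.hpp
}
```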

@Krzmbrzl
Contributor

Krzmbrzl commented May 3, 2019

So is this even a real bug in armake then? 🤔

@BrettMayson
Contributor Author

Yes, it can be handled

@Krzmbrzl
Contributor

Krzmbrzl commented May 3, 2019

But only if the encoding has some kind of identifying header, right? Is this the case?

@BrettMayson
Contributor Author

The error can also be avoided by doing a lossy conversion to UTF-8; that is what I did as a workaround.
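In Rust that workaround is essentially a one-line change; invalid bytes become U+FFFD replacement characters (a sketch, not the exact patch):

```rust
// from_utf8_lossy replaces invalid sequences with U+FFFD (�), so the
// name "RP_Commands \x97 kopia.hpp" becomes "RP_Commands � kopia.hpp".
fn bytes_to_string_lossy(bytes: Vec<u8>) -> String {
    String::from_utf8_lossy(&bytes).into_owned()
}
```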

@Krzmbrzl
Contributor

Krzmbrzl commented May 3, 2019

By that I assume you mean decoding the input with UTF-8 and throwing away all invalid bits and pieces that come up during that conversion?

@BrettMayson
Contributor Author

Yes

@Krzmbrzl
Contributor

Krzmbrzl commented May 3, 2019

Not sure that is a good practice to apply in the general case. If this messes something up, the user might get really confused, because the error message would complain about character sequences that don't exist in the original.
Or even worse: armake doesn't detect the error at all, but some of the scripts end up corrupted. The error would then only surface in Arma, leading to some really awkward bugs.

I think it would be better to error out on such things, at least by default. We could add another option that can be set to force the lossy conversion in such cases.
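Something like the following shape, where strict is the default and lossy decoding is opt-in (all names here are made up for illustration; this is not armake2's API):

```rust
use std::string::FromUtf8Error;

// Hypothetical option controlling how non-UTF-8 filenames are handled.
pub enum FilenameDecoding {
    // Default: surface the FromUtf8Error to the caller instead of panicking.
    Strict,
    // Opt-in: replace invalid sequences with U+FFFD and keep going.
    Lossy,
}

fn decode_filename(bytes: Vec<u8>, mode: FilenameDecoding) -> Result<String, FromUtf8Error> {
    match mode {
        FilenameDecoding::Strict => String::from_utf8(bytes),
        FilenameDecoding::Lossy => Ok(String::from_utf8_lossy(&bytes).into_owned()),
    }
}
```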

@KoffeinFlummi
Owner

Handling this sloppily could lead to problems. If different tools handle this issue differently, they could calculate different hashes when signing or verifying PBOs, so at the very least a warning should be shown. Obviously it should be possible to handle it in some way, since otherwise it would be very simple to make a PBO that can't be unpacked.

@KoffeinFlummi KoffeinFlummi changed the title from "read_cstring UTF-8 error" to "Handle UTF-8 errors when reading PBOs" on May 13, 2019
@KoffeinFlummi KoffeinFlummi added the "bug" (Something isn't working) label on May 13, 2019
@Soldia1138
Contributor

Soldia1138 commented May 28, 2019

We could try reading the PBO in multiple encodings (try UTF-8; on error, try another).

As an alternative, we could do a lossy conversion when the normal read fails and somehow mark the file, for example by adding a fat comment at the top. A sketch of the fallback idea follows.
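Combining that with the warning suggested above (the choice of Windows-1252 as the fallback and the warning text are assumptions):

```rust
// Cargo.toml: encoding_rs = "0.8"
use encoding_rs::WINDOWS_1252;

fn decode_with_fallback(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        // Valid UTF-8: use it as-is.
        Ok(s) => s.to_owned(),
        Err(_) => {
            // Windows-1252 assigns a character to every byte, so this
            // decode always succeeds; the warning marks the file as suspect.
            let (decoded, _, _) = WINDOWS_1252.decode(bytes);
            eprintln!("warning: non-UTF-8 name, decoded as Windows-1252: {}", decoded);
            decoded.into_owned()
        }
    }
}
```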

@Krzmbrzl
Contributor

I think the problem is detecting whether decoding with the current charset has failed. It's not like it would throw exceptions when attempting to use the wrong charset.
You'd have to actually go through the decoded content and check whether it looks like proper text/code or whether it is rubbish...

@BrettMayson
Contributor Author

Using non-lossy UTF-8 will fail, at least with the RHS file from the original comment

@Krzmbrzl
Contributor

Krzmbrzl commented May 29, 2019

Maybe this could be handy here: https://github.com/lifthrasiir/rust-encoding
We could iterate over all charsets available in that library and see if the decoding works. If not (from the README I assume that this library also returns errors when decoding fails), try the next one, and so on.
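With that crate it could look roughly like this (the candidate list and its order are assumptions; Encoding::decode with DecoderTrap::Strict does return an Err on invalid input):

```rust
// Cargo.toml: encoding = "0.2" (the rust-encoding crate linked above)
use encoding::all::{ISO_8859_1, UTF_8, WINDOWS_1252};
use encoding::{DecoderTrap, Encoding, EncodingRef};

fn decode_first_match(bytes: &[u8]) -> Option<String> {
    // Order matters: UTF-8 is strict enough to reject mis-encoded input,
    // while the single-byte encodings accept almost anything, so they go
    // last. encoding::all::encodings() could supply the full list instead.
    let candidates: &[EncodingRef] = &[UTF_8, WINDOWS_1252, ISO_8859_1];
    for encoding in candidates {
        if let Ok(s) = encoding.decode(bytes, DecoderTrap::Strict) {
            return Some(s);
        }
    }
    None
}
```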

@Soldia1138
Contributor

That would be exactly what I meant. Good catch.

As I'm doing a mass signing of files, I can say that probably 3-5% of all files (roughly 70 GB of PBOs) are not UTF-8 encoded.
So it would be very beneficial if armake2 automatically switched through encodings when one fails.

I'm wondering how armake1 handles this. I've used it until now and it has never thrown any errors on this.

@Reidond

Reidond commented Jul 21, 2019

Another error, but with packing:
error: Failed to write PBO: Windows stdio in console mode does not support writing non-UTF-8 byte sequences
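For reference, that error comes from Rust's standard library rather than from the PBO parsing itself: on Windows, writing raw non-UTF-8 bytes to an attached console fails with exactly that message, while writing to a file or a redirected handle does not. A sketch of the distinction (the file name is illustrative):

```rust
use std::fs::File;
use std::io::{self, Write};

fn write_pbo_bytes(bytes: &[u8]) -> io::Result<()> {
    // io::stdout().write_all(bytes) fails on a Windows console when the
    // bytes are not valid UTF-8; a real file accepts arbitrary bytes.
    let mut out = File::create("output.pbo")?;
    out.write_all(bytes)
}
```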

@ArwynFr

ArwynFr commented May 3, 2021

The error also seems to occur when reading an LZSS-compressed PBO.
