Handle UTF-8 errors when reading PBOs #34
It's being caused by an em dash (—) in the filename.
I determined this was being caused by the PBO being encoded with Windows-1252 filenames. The em dash (—) character cannot be decoded in a UTF-8 context and causes this bug.
So is this even a real bug in armake then? 🤔
Yes, it can be handled.
But only if the encoding has some kind of identifying header, right? Is this the case?
It can also be dropped by doing a lossy conversion to UTF-8; that is what I did as a workaround.
By that I assume you mean decoding the input as UTF-8 and throwing away all invalid bits and pieces that come up during that conversion?
Yes.
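For reference, armake2 is written in Rust, where this kind of lossy conversion is what `String::from_utf8_lossy` does: invalid byte sequences are replaced with U+FFFD. A minimal sketch of the failure and the workaround, assuming the em dash is stored as the Windows-1252 byte 0x97:

```rust
fn main() {
    // Filename bytes as a Windows-1252 PBO header might store them:
    // "foo—bar", with the em dash encoded as the single byte 0x97.
    let raw: &[u8] = b"foo\x97bar";

    // Strict UTF-8 decoding rejects the 0x97 byte...
    assert!(std::str::from_utf8(raw).is_err());

    // ...while a lossy conversion replaces it with U+FFFD,
    // losing the original character but allowing reading to continue.
    let lossy = String::from_utf8_lossy(raw);
    assert_eq!(lossy, "foo\u{FFFD}bar");
    println!("{}", lossy);
}
```

Note that the replacement is one-way: the original byte cannot be recovered from the converted string.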
Not sure if that is a good practice that should be applied in the general case. If this messes something up, the user might get really confused because the error message complains about character sequences that don't exist in the original. I think it would be better to error out on such things, at least by default. We could add another option that can be set to force the lossy conversion in such cases.
Handling this sloppily could lead to problems: different tools handling this issue differently could calculate different hashes when signing or verifying PBOs, so a warning should be shown at least. Obviously it should be possible to handle it in some way, since otherwise it would be very simple to prevent a PBO from being unpacked at all.
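To make the signing concern concrete, here is an illustrative sketch (using the standard library's `DefaultHasher` purely for demonstration; it is not the actual PBO signature hash): once a tool has lossily replaced the em-dash byte, any hash computed over the filename bytes diverges from one computed over the original bytes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a byte sequence (a stand-in for whatever hash a signing tool
// actually uses over PBO contents).
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn main() {
    // Original Windows-1252 filename bytes vs. the lossily converted name:
    let original: &[u8] = b"foo\x97bar";
    let lossy = String::from_utf8_lossy(original); // "foo\u{FFFD}bar"

    // The byte sequences differ, so hashes over them differ too;
    // two tools that normalize differently would thus disagree when
    // signing or verifying the same PBO.
    assert_ne!(original, lossy.as_bytes());
    assert_ne!(hash_bytes(original), hash_bytes(lossy.as_bytes()));
}
```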
We could try reading the PBO with multiple encodings (try UTF-8; on error, try another). Alternatively, we could do a lossy conversion if the normal read fails and somehow mark the file, such as by adding a prominent comment at the top.
I think the problem is detecting whether decoding with the current charset has failed. It's not like it would throw exceptions when attempting to use the wrong charset.
Using non-lossy UTF-8 will fail, at least with the RHS file from the original comment.
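The multi-try approach above can be sketched as follows. This is a simplified fallback, not armake2's actual reader: strict UTF-8 is attempted first, and on failure the bytes are decoded as (a simplified) Windows-1252, with a flag indicating that a fallback was used so the caller can warn the user. A real implementation would use a full Windows-1252 table or a crate such as the one linked below.

```rust
/// Try strict UTF-8 first; on failure, fall back to a simplified
/// Windows-1252 decode. Returns the decoded name and whether the
/// fallback path was taken.
fn read_filename(raw: &[u8]) -> (String, bool) {
    match std::str::from_utf8(raw) {
        Ok(s) => (s.to_string(), false),
        Err(_) => {
            let decoded = raw
                .iter()
                .map(|&b| match b {
                    0x97 => '\u{2014}', // em dash in Windows-1252
                    // 0xA0..=0xFF happen to match Latin-1; the rest of
                    // 0x80..=0x9F would need the full table in real code.
                    b => b as char,
                })
                .collect();
            (decoded, true)
        }
    }
}

fn main() {
    let (name, fell_back) = read_filename(b"foo\x97bar");
    assert_eq!(name, "foo\u{2014}bar");
    assert!(fell_back);
    println!("{} (fallback used: {})", name, fell_back);
}
```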
Maybe this could be handy here: https://github.com/lifthrasiir/rust-encoding
That would be exactly what I meant. Good catch. As I'm doing a mass signing of files, I can say that probably 3-5% of all files (roughly 70 GB of PBOs) have non-UTF-8 encoding. I'm wondering how armake1 handles this; I used it until now and it never throws any errors on this.
Another error, but with packing.
The error seems also to occur when reading an LZSS compressed PBO |
Comes from `read_cstring`; one file I know that causes it is @rhsusaf/addons/rhsusf_c_radio.pbo, while trying to use `PBO::read` or `armake2 unpack`.