
more boms for testing #13

Open
jayvdb opened this issue Jul 26, 2019 · 8 comments

Comments

jayvdb commented Jul 26, 2019

https://github.com/jeff00seattle/pyfortified-requests/blob/master/pyfortified_requests/support/bom_encoding.py has a nice list of BOMs and their encoding names, which could be used to create more test data files and ensure they also work.
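A sketch of how such test data files could be generated, using only the BOMs Python's `codecs` module defines (the file-naming helper and sample text here are assumptions, not that project's code):

```python
import codecs

# BOMs Python's codecs module exposes, keyed by a codec that can encode the body.
# "utf-8-sig" writes its own BOM, so we encode the body as plain utf-8 below.
BOMS = {
    "utf-8-sig": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}

def make_test_file(path, encoding, text="hello"):
    """Write `text` in `encoding`, prefixed with that encoding's BOM."""
    bom = BOMS[encoding]
    body = text.encode(encoding.replace("-sig", ""))
    with open(path, "wb") as f:
        f.write(bom + body)
```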

jayvdb commented Jul 26, 2019

The cp* BOMs there are being detected as utf-8, but the BOM comes out scrambled, like hello

More good ones at https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

jayvdb commented Jul 29, 2019

> More good ones at https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

All of those (except the utf-8/16/32 ones) also detect as either ascii, iso-8859-1, or windows-1252.

The detection algorithm needs to be improved, as chardet doesn't seem to be working very well.
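One way to improve detection (a sketch, not this project's code): since a BOM is definitive when present, check for known BOMs explicitly before falling back to chardet. The table and function name here are assumptions:

```python
import codecs

# Longest BOMs first, so UTF-32 LE (FF FE 00 00) wins over its
# prefix UTF-16 LE (FF FE).
_BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOM_TABLE:
        if data.startswith(bom):
            return name
    return None  # caller falls back to statistical detection (chardet etc.)
```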

jayvdb commented Jul 29, 2019

https://github.com/PyYoshi/cChardet looks good.

jayvdb commented Jul 29, 2019

> PyYoshi/cChardet looks good.

Not so good after all, as PyYoshi/cChardet#26 is a regression compared to chardet, and curly quotes are unfortunately a really common problem.

Other boms not detected: https://github.com/PyYoshi/uchardet/issues/4

We can't use the highest confidence of the two either, as cchardet has a higher confidence for the curlies, but it is wrong.

curlies:

  • chardet: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
  • cchardet: {'encoding': 'MacCentralEurope', 'confidence': 0.8183978796005249}

gb-18030:

  • chardet: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
  • cchardet: {'encoding': 'GB18030', 'confidence': 0.9900000095367432}

We might be able to use a rule like "if cchardet's confidence is >90%, use its choice; otherwise use chardet's."
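That rule could be sketched as a combiner over the two libraries' result dicts (the 0.9 threshold and function name are assumptions; the two cases below are the results quoted above):

```python
def pick_encoding(chardet_result, cchardet_result, threshold=0.90):
    """Prefer cchardet only when it is very confident; otherwise trust chardet."""
    if cchardet_result.get("encoding") and cchardet_result.get("confidence", 0) > threshold:
        return cchardet_result["encoding"]
    return chardet_result["encoding"]

# The two cases from this thread:
curlies = pick_encoding(
    {"encoding": "Windows-1252", "confidence": 0.73},
    {"encoding": "MacCentralEurope", "confidence": 0.818},
)  # 0.818 <= 0.90, so chardet's Windows-1252 wins

gb = pick_encoding(
    {"encoding": "Windows-1252", "confidence": 0.73},
    {"encoding": "GB18030", "confidence": 0.99},
)  # 0.99 > 0.90, so cchardet's GB18030 wins
```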

jayvdb commented Jul 29, 2019

Also created Joungkyun/python-chardet#3 and thombashi/mbstrdecoder#2, because those two have good results, but all of them have one or more problems.

timrburnham (Owner) commented
> The cp* boms there are being detected as utf-8 , but the bom comes out and is scrambled, like hello

I would expect this. Non-Unicode encodings, such as Windows-1252, don't have BOMs. CP1252 characters represented within (let's say) UTF-8 don't affect the BOM that started the file. And multiple international code pages can be represented within the same Unicode file. Do you understand what jeff00seattle is trying to represent with those 6-byte sequences?

jayvdb commented Aug 1, 2019

Not sure. The Python codecs don't encode the Unicode BOM as that:

```
>>> '\ufeff'.encode('cp1250')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
```

I thought maybe they are some 'magic' heuristic, but the Unix `file` command sees those bytes as UTF-8 Unicode text.
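For reference, Python exposes the genuine BOM byte sequences as constants in `codecs`, and none of them matches those 6-byte cp* sequences:

```python
import codecs

# The real BOM byte sequences, straight from the stdlib.
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF32_BE)  # b'\x00\x00\xfe\xff'
```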

That part of the code comes from the initial commit, jeff00seattle/pyfortified-requests@3894b2d, so the commit history doesn't tell us much.

He also committed the same at hasdank/facebook-ads-worker@401da89#diff-89cb7719c9b5c9342ab6e6ba745c75b3

And 12 hours ago at https://github.com/jeff00seattle/facebook-api-scripts/blob/076b8ee7b09dc912f0adb2f20faa5c0ca007eab0/scripts/py_sources/utils.py

ping @jeff00seattle , maybe you can enlighten us where you encountered those.

jayvdb commented Aug 1, 2019

Can't make much sense out of these either:

```
>>> b'\xc4\x8f\xc2\xbb\xc5\xbc'.decode('cp1250')
''
>>> b'\xd0\xbf\xc2\xbb\xd1\x97'.decode('cp1251')
'п»ї'
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1252')
''
>>> b'\xce\xbf\xc2\xbb\xce\x8f'.decode('cp1253')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1253.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5: character maps to <undefined>
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1254')
''
>>> b'\xd7\x9f\xc2\xbb\xc2\xbf'.decode('cp1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9f in position 1: character maps to <undefined>
>>> b'\xc3\xaf\xc2\xbb\xd8\x9f'.decode('cp1256')
'أ¯آ»طں'
>>> b'\xc4\xbc\xc2\xbb\xc3\xa6'.decode('cp1257')
'ļĀ»Ć¦'
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1258')
'Ă¯Â»Â¿'
```
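One plausible explanation (an assumption, not confirmed by jeff00seattle): several of those 6-byte sequences are the UTF-8 BOM `b'\xef\xbb\xbf'` mis-decoded as the corresponding Windows code page and then re-encoded as UTF-8, i.e. classic double-encoding mojibake, which would also explain why `file` reports them as UTF-8 text:

```python
# Hypothesis: cp* "BOM" = UTF-8 BOM decoded as that code page, re-encoded as UTF-8.
utf8_bom = b"\xef\xbb\xbf"
for cp in ("cp1250", "cp1251", "cp1252"):
    print(cp, utf8_bom.decode(cp).encode("utf-8"))
# cp1250 b'\xc4\x8f\xc2\xbb\xc5\xbc'
# cp1251 b'\xd0\xbf\xc2\xbb\xd1\x97'
# cp1252 b'\xc3\xaf\xc2\xbb\xc2\xbf'
```

The three outputs match the cp1250, cp1251, and cp1252 byte strings decoded above exactly.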
