
more boms for testing #13

Open
jayvdb opened this issue Jul 26, 2019 · 8 comments

Comments

jayvdb commented Jul 26, 2019

https://github.com/jeff00seattle/pyfortified-requests/blob/master/pyfortified_requests/support/bom_encoding.py has a nice list of BOMs and their encoding names, which could be used to create more test data files and ensure they also work.
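A sketch of how such test data files could be generated, using only the BOMs Python's `codecs` module defines (the file-naming helper and sample text here are assumptions, not that project's code):

```python
import codecs

# BOMs Python's codecs module exposes, keyed by a codec that can encode the body.
# "utf-8-sig" writes its own BOM, so we encode the body as plain utf-8 below.
BOMS = {
    "utf-8-sig": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}

def make_test_file(path, encoding, text="hello"):
    """Write `text` in `encoding`, prefixed with that encoding's BOM."""
    bom = BOMS[encoding]
    body = text.encode(encoding.replace("-sig", ""))
    with open(path, "wb") as f:
        f.write(bom + body)
```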

jayvdb commented Jul 26, 2019

The cp* BOMs there are being detected as utf-8, but the BOM comes out scrambled, like hello

More good ones at https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

jayvdb commented Jul 29, 2019

> More good ones at https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

All of those (except the utf-8/16/32 ones) also detect as either ascii, iso-8859-1, or windows-1252.

The detection algorithm needs to be improved, as chardet doesn't seem to be working very well.
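One way to improve detection (a sketch, not this project's code): since a BOM is definitive when present, check for known BOMs explicitly before falling back to chardet. The table and function name here are assumptions:

```python
import codecs

# Longest BOMs first, so UTF-32 LE (FF FE 00 00) wins over its
# prefix UTF-16 LE (FF FE).
_BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOM_TABLE:
        if data.startswith(bom):
            return name
    return None  # caller falls back to statistical detection (chardet etc.)
```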

jayvdb commented Jul 29, 2019

https://github.com/PyYoshi/cChardet looks good.

jayvdb commented Jul 29, 2019

> PyYoshi/cChardet looks good.

Not so good after all, as PyYoshi/cChardet#26 is a regression compared to chardet, and curly quotes are unfortunately a really common problem.

Other boms not detected: https://github.com/PyYoshi/uchardet/issues/4

We can't use the highest confidence of the two either, as cchardet has a higher confidence for the curlies, but it is wrong.

curlies:

  • chardet: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
  • cchardet: {'encoding': 'MacCentralEurope', 'confidence': 0.8183978796005249}

gb-18030:

  • chardet: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
  • cchardet: {'encoding': 'GB18030', 'confidence': 0.9900000095367432}

We might be able to use a rule like "if cchardet's confidence is >90%, use its choice; otherwise use chardet's."
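That rule could be sketched as a combiner over the two libraries' result dicts (the 0.9 threshold and function name are assumptions; the two cases below are the results quoted above):

```python
def pick_encoding(chardet_result, cchardet_result, threshold=0.90):
    """Prefer cchardet only when it is very confident; otherwise trust chardet."""
    if cchardet_result.get("encoding") and cchardet_result.get("confidence", 0) > threshold:
        return cchardet_result["encoding"]
    return chardet_result["encoding"]

# The two cases from this thread:
curlies = pick_encoding(
    {"encoding": "Windows-1252", "confidence": 0.73},
    {"encoding": "MacCentralEurope", "confidence": 0.818},
)  # 0.818 <= 0.90, so chardet's Windows-1252 wins

gb = pick_encoding(
    {"encoding": "Windows-1252", "confidence": 0.73},
    {"encoding": "GB18030", "confidence": 0.99},
)  # 0.99 > 0.90, so cchardet's GB18030 wins
```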

jayvdb commented Jul 29, 2019

Also created Joungkyun/python-chardet#3 and thombashi/mbstrdecoder#2, because those two have good results, but all of them have one or more problems.

timrburnham (Owner) commented
> The cp* boms there are being detected as utf-8 , but the bom comes out and is scrambled, like hello

I would expect this. Non-Unicode encodings, such as Windows-1252, don't have BOMs. CP1252 characters represented within (let's say) UTF-8 don't affect the BOM that started the file. And multiple international code pages can be represented within the same Unicode file. Do you understand what jeff00seattle is trying to represent with those 6-byte sequences?

jayvdb commented Aug 1, 2019

Not sure. The Python codecs don't encode the Unicode BOM as that:

```
>>> '\ufeff'.encode('cp1250')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
```

I thought maybe they are some 'magic' heuristic, but the Unix `file` command sees those bytes as UTF-8 Unicode text.
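For reference, Python exposes the genuine BOM byte sequences as constants in `codecs`, and none of them matches those 6-byte cp* sequences:

```python
import codecs

# The real BOM byte sequences, straight from the stdlib.
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF32_BE)  # b'\x00\x00\xfe\xff'
```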

That part of the code comes from the initial commit, jeff00seattle/pyfortified-requests@3894b2d, so the commit history doesn't tell us much.

He also committed the same at hasdank/facebook-ads-worker@401da89#diff-89cb7719c9b5c9342ab6e6ba745c75b3

And 12 hours ago at https://github.com/jeff00seattle/facebook-api-scripts/blob/076b8ee7b09dc912f0adb2f20faa5c0ca007eab0/scripts/py_sources/utils.py

ping @jeff00seattle , maybe you can enlighten us where you encountered those.

jayvdb commented Aug 1, 2019

Can't make much sense out of these either:

```
>>> b'\xc4\x8f\xc2\xbb\xc5\xbc'.decode('cp1250')
''
>>> b'\xd0\xbf\xc2\xbb\xd1\x97'.decode('cp1251')
'п»ї'
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1252')
''
>>> b'\xce\xbf\xc2\xbb\xce\x8f'.decode('cp1253')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1253.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5: character maps to <undefined>
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1254')
''
>>> b'\xd7\x9f\xc2\xbb\xc2\xbf'.decode('cp1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9f in position 1: character maps to <undefined>
>>> b'\xc3\xaf\xc2\xbb\xd8\x9f'.decode('cp1256')
'أ¯آ»طں'
>>> b'\xc4\xbc\xc2\xbb\xc3\xa6'.decode('cp1257')
'ļĀ»Ć¦'
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1258')
'Ă¯Â»Â¿'
```
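One plausible explanation (an assumption, not confirmed by jeff00seattle): several of those 6-byte sequences are the UTF-8 BOM `b'\xef\xbb\xbf'` mis-decoded as the corresponding Windows code page and then re-encoded as UTF-8, i.e. classic double-encoding mojibake, which would also explain why `file` reports them as UTF-8 text:

```python
# Hypothesis: cp* "BOM" = UTF-8 BOM decoded as that code page, re-encoded as UTF-8.
utf8_bom = b"\xef\xbb\xbf"
for cp in ("cp1250", "cp1251", "cp1252"):
    print(cp, utf8_bom.decode(cp).encode("utf-8"))
# cp1250 b'\xc4\x8f\xc2\xbb\xc5\xbc'
# cp1251 b'\xd0\xbf\xc2\xbb\xd1\x97'
# cp1252 b'\xc3\xaf\xc2\xbb\xc2\xbf'
```

The three outputs match the cp1250, cp1251, and cp1252 byte strings decoded above exactly.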
