more boms for testing #13
Comments
The cp* BOMs there are being detected as utf-8, but the BOM comes out scrambled, like
More good ones are at https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding |
All of those (except utf-8/16/32) also detect as either ascii, iso-8859-1, or windows-1252. The detection algorithm needs to be improved, as chardet doesn't seem to be working very well. |
https://github.com/PyYoshi/cChardet looks good. |
Not so good, as PyYoshi/cChardet#26 is a regression when compared to
Other BOMs not detected: https://github.com/PyYoshi/uchardet/issues/4
We can't use the highest confidence of the two either, as cchardet has a higher confidence for curlies, but it is wrong.
curlies:
gb-18030:
We might be able to use a rule like "if cchardet confidence is >90% then use its choice, otherwise use chardet." |
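The rule proposed above can be sketched as a small pure function. This is only an illustration, assuming both detectors return a dict with `encoding` and `confidence` keys (as `chardet.detect` and `cchardet.detect` do). The name `pick_encoding` and the `threshold` parameter are hypothetical and not part of either library.

```python
def pick_encoding(cchardet_result, chardet_result, threshold=0.9):
    """Prefer cchardet's guess only when its confidence exceeds the threshold.

    Both arguments are expected to look like {"encoding": str | None,
    "confidence": float | None}. Otherwise fall back to chardet's guess.
    """
    # Either detector may report None for confidence; treat that as 0.
    confidence = cchardet_result.get("confidence") or 0.0
    if cchardet_result.get("encoding") and confidence > threshold:
        return cchardet_result["encoding"]
    return chardet_result.get("encoding")
```

In use this would be called as `pick_encoding(cchardet.detect(data), chardet.detect(data))`. Since cchardet was confidently wrong on the curlies sample, the threshold would still need to be checked against the test files.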
Also created Joungkyun/python-chardet#3 and thombashi/mbstrdecoder#2, because those two have good results, but all have one or more problems. |
I would expect this. Non-Unicode encodings, such as Windows-1252, don't have BOMs. CP1252 characters represented within (let's say) UTF-8 don't affect the BOM that started the file. And multiple international code pages can be represented within the same Unicode file. Do you understand what jeff00seattle is trying to represent with those 6-byte sequences? |
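To illustrate the point that only the Unicode encodings define a BOM, here is a minimal sketch of BOM sniffing built on the constants in Python's standard `codecs` module. The helper name `sniff_bom` is made up for this example. UTF-32 is checked before UTF-16 because the UTF-32-LE BOM begins with the UTF-16-LE BOM.

```python
import codecs

# Longest BOMs first: BOM_UTF32_LE starts with the same bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no Unicode BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Text in a legacy code page such as Windows-1252 falls through to `(None, 0)`, so a statistical detector is still needed for those files.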
Not sure. The Python codecs don't encode the Unicode BOM as that:
>>> '\ufeff'.encode('cp1250')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
I thought maybe they are some 'magic' heuristic, but unix command
That part of the code comes from the initial commit, jeff00seattle/pyfortified-requests@3894b2d, so the commit history doesn't tell us much. He also committed the same at hasdank/facebook-ads-worker@401da89#diff-89cb7719c9b5c9342ab6e6ba745c75b3 and 12 hours ago at https://github.com/jeff00seattle/facebook-api-scripts/blob/076b8ee7b09dc912f0adb2f20faa5c0ca007eab0/scripts/py_sources/utils.py
ping @jeff00seattle, maybe you can enlighten us where you encountered those. |
Can't make much sense out of these either:
>>> b'\xc4\x8f\xc2\xbb\xc5\xbc'.decode('cp1250')
'ÄŹÂ»ĹĽ'
>>> b'\xd0\xbf\xc2\xbb\xd1\x97'.decode('cp1251')
'РїВ»С—'
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1252')
'Ã¯Â»Â¿'
>>> b'\xce\xbf\xc2\xbb\xce\x8f'.decode('cp1253')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1253.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5: character maps to <undefined>
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1254')
'Ã¯Â»Â¿'
>>> b'\xd7\x9f\xc2\xbb\xc2\xbf'.decode('cp1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.7/encodings/cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9f in position 1: character maps to <undefined>
>>> b'\xc3\xaf\xc2\xbb\xd8\x9f'.decode('cp1256')
'أ¯آ»طں'
>>> b'\xc4\xbc\xc2\xbb\xc3\xa6'.decode('cp1257')
'Ä¼Ā»Ć¦'
>>> b'\xc3\xaf\xc2\xbb\xc2\xbf'.decode('cp1258')
'Ă¯Â»Â¿' |
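One possible reading of these sequences, offered as an observation and not confirmed by anyone in the thread: each one equals the UTF-8 BOM (`EF BB BF`) decoded with the named code page and then re-encoded as UTF-8, which is what a double-encoded file would start with. The sketch below checks that against the byte strings quoted above. The names `THREAD_SEQUENCES` and `double_encoded_bom` are made up for the example.

```python
import codecs

# Byte sequences quoted in this thread, keyed by the code page they were tried with.
THREAD_SEQUENCES = {
    "cp1250": b"\xc4\x8f\xc2\xbb\xc5\xbc",
    "cp1251": b"\xd0\xbf\xc2\xbb\xd1\x97",
    "cp1252": b"\xc3\xaf\xc2\xbb\xc2\xbf",
    "cp1253": b"\xce\xbf\xc2\xbb\xce\x8f",
    "cp1254": b"\xc3\xaf\xc2\xbb\xc2\xbf",
    "cp1255": b"\xd7\x9f\xc2\xbb\xc2\xbf",
    "cp1256": b"\xc3\xaf\xc2\xbb\xd8\x9f",
    "cp1257": b"\xc4\xbc\xc2\xbb\xc3\xa6",
    "cp1258": b"\xc3\xaf\xc2\xbb\xc2\xbf",
}


def double_encoded_bom(codepage):
    """UTF-8 BOM bytes misread as `codepage`, then written back out as UTF-8."""
    return codecs.BOM_UTF8.decode(codepage).encode("utf-8")


# True for a code page when the thread's sequence matches the double-encoded BOM.
matches = {cp: double_encoded_bom(cp) == seq for cp, seq in THREAD_SEQUENCES.items()}
```

If that reading is right, the sequences would decode cleanly as UTF-8, not as the code page itself, which would explain why the `.decode('cp125x')` calls above produce noise or errors.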
https://github.com/jeff00seattle/pyfortified-requests/blob/master/pyfortified_requests/support/bom_encoding.py has a nice list of boms and the encoding name, which could be used to create more test data files and ensure they also work.
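As a rough sketch of how such test data could be produced for the Unicode BOMs, the snippet below builds one byte sample per encoding from the standard `codecs` constants. It does not read the linked `bom_encoding.py`; the names `UNICODE_BOMS` and `make_bom_samples` are hypothetical, and the code-page entries from that list are deliberately left out.

```python
import codecs

# (label, BOM bytes, codec for the payload) for the Unicode encodings only.
UNICODE_BOMS = [
    ("utf-8", codecs.BOM_UTF8, "utf-8"),
    ("utf-16-le", codecs.BOM_UTF16_LE, "utf-16-le"),
    ("utf-16-be", codecs.BOM_UTF16_BE, "utf-16-be"),
    ("utf-32-le", codecs.BOM_UTF32_LE, "utf-32-le"),
    ("utf-32-be", codecs.BOM_UTF32_BE, "utf-32-be"),
]


def make_bom_samples(text):
    """Return {label: BOM + encoded text} for each Unicode encoding."""
    return {label: bom + text.encode(codec) for label, bom, codec in UNICODE_BOMS}
```

Each value can be written to a file in binary mode and fed to the detector under test, with the dict key serving as the expected answer.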