Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCTETS support for database connection #12

Open
savingorup opened this issue Jan 22, 2020 · 10 comments
Open

OCTETS support for database connection #12

savingorup opened this issue Jan 22, 2020 · 10 comments

Comments

@savingorup
Copy link

When connecting with fdb + python3 to legacy databases with mixed/non-proper encodings, you receive a lot of encoding/decoding errors.
If there is no option to actually correct this databases, Firebird offers OCTETS encoding to bypass encodings entierly.
However, calling fdb.connect(..., charset='OCTETS') produces multiple errors. In fact, this never worked as intended.

The first fix is to change in ibase.py None->'latin-1'
charset_map = {
'OCTETS' : 'latin-1', # Allow to pass through unchanged.
'latin-1' is identity encoding in python (as 'octets' is in Firebird).

The second fix is in fbcore.py in build_dpb():
...
if charset and charset!='OCTETS': # added ... and charset!='OCTETS'
dpb.add_string_parameter(isc_dpb_lc_ctype, charset.upper())
...
Firebird does not allow to pass OCTETS ad LC_TYPE parameter, so we skip it.

With this two changes, I can pass charset='OCTETS' and do selects/inserts, including BLOBS.
If you are interested to merge, I can provide patch against latest version. I am not used to GitHub, so I will probably need a little help with it.

@dyemanov
Copy link
Member

To bypass encoding entirely, Firebird supports NONE as a connection charset (rather than OCTETS). Specifying OCTETS at the connection level and then skipping it when passing to the engine (which actually means NONE charset is used) seems weird to me.

@savingorup
Copy link
Author

When passing 'NONE' as a connection charset, fdb uses as a python codec the one returned by getpreferredencoding() function (see charset_map definition). This depends on current system configuration and is never 'latin-1' (which maps all bytes 0-255 to the same characters, identity encoding). This in turn produces decoding errors when reading fields, which have string encoded in forms, which getpreferredencoding() can't decode.
I assumed that 'NONE' could be actually used by someone, so I opted to change 'OCTETS' which never worked before to keep things backward compatible.

@mrotteveel
Copy link
Member

If this is possible to do (and I don't know if it is), I would strongly recommend against using the name OCTETS in a non-standard manner, that would overload meaning of this.

I would rather recommend using a separate property to enable this behaviour. For example, in Jaybird, there are two properties controlling character sets, one using Firebird names and one using Java character set names. When only one is specified, either will configure the connection character set, but specifying both properties then uses one to specify the Firebird connection encoding, and the other will define the Java character set to read/write bytes from/to Firebird.

@savingorup
Copy link
Author

This issue is specific to fdb python interface. There is no standard manner of OCTETS use in fdb currently as using it in
db = fdb.connect(user='SYSDBA', password='secret', dsn='127.0.0.1:/tmp/test.fdb', charset='OCTETS')
simply does not work (throws exception), and AFAIK never did work.
Python3 enforces bytes-to-string conversion using known encoding, and python encoding must follow the encoding given in charset=... above.
Currently, fdb has no option to set python and FB encodings separatelly (looking at the code, it may be possible internally, but no such interface is exposed). It may be helpfull to see fdb contributors stance on the issue.
Anyway, it is not a big deal if such patch does not make it upstream. For us it works and it is surely not hard to maintain. I think it may be helpful for somebody else too.

@pcisar
Copy link
Contributor

pcisar commented Jan 24, 2020

The correct method (patch) should use the None/"NONE" connection charset in FDB with its redefinition in ibase.py/charset_map from 'getpreferredencoding()' to 'latin-1', instead hijacking OCTETS.

@savingorup
Copy link
Author

I totally agree. However, such a change would be backwards incompatible and could break existing programs, so I searched for alternatives.
I can create a patch or pull request with change proposed by pcisar (and test if it works) if there is interest to incorporate it in HEAD.

@pcisar
Copy link
Contributor

pcisar commented Jan 24, 2020

Well, I was not aware of 'latin-1' meaning at the time I was creating FDB, so using getpreferredencoding() was the best from bad solutions I was able to came with. So
you don't need to do a PR, I'll update the driver myself.

@mrotteveel
Copy link
Member

@savingorup This problem is wider than FDB, because NONE has a special meaning in Firebird, and OCTETS cannot be specified as a connection character set in general. Allowing FDB to overload this, makes the behaviour more surprising to people familiar with Firebird on other platforms.

@pcisar Redefining NONE to always map to latin-1 would break expectations of Firebird users that using connection character set none will use the default encoding of a system. Doesn't Python have an option to convert invalid byte combinations to either a ? or the unicode replacement character (, U+FFFD)?

Making this change will also break things for users unknowingly relying on the current encoding behaviour, receiving garbage output (or sending garbage as input to the database). Consider carefully if you really want to make this change. I'd sooner recommend defining a special character set name that maps to NONE + latin-1.

@savingorup
Copy link
Author

savingorup commented Jan 24, 2020 via email

@pcisar
Copy link
Contributor

pcisar commented Jan 24, 2020

@mrotteveel I will change the mapping for 'NONE' charset, not for None value (when connection charset is not specified at all). This should not break any application that does not use connection charset, and will fix the expected behavior of NONE for applications that explicitly define 'NONE'.

@savingorup Automatic conversion is not dependent on Python version, but actual type passed. It's applied for P3.str or P2.unicode, and do not for P3.bytes or P2.str.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants