When I tried to cache API responses, it sometimes failed with a UnicodeDecodeError.
Looking deeper, I found that this is happening because of this code in PythonObjectSerializer (full code in the referenced issue below):
out.write_string(cPickle.dumps(obj, 0).decode("utf-8"))
The issue here is that the default serialiser can't handle special characters, e.g. ä (\u00e4), and raises a UnicodeDecodeError inside PythonObjectSerializer. The workaround is to use json.dumps serialisation, which does handle them, but this is merely a workaround, the same as in the examples (df.to_json()).
I get that using pickle.dumps means you can pickle any Python object (one that may not be JSON serialisable); if that is the intention, please add support for non-English characters.
In general, using pickle for serialisation is a bad idea. For example, see the data protocols (here): Python versions need to match, and there are no guarantees of backwards compatibility. Moreover, as the pickle documentation says at the top, "The pickle module is not secure. Only unpickle data you trust." Clients can clearly deserialise data stored by other clients in Hazelcast, which makes this a path for malicious code execution.
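To make the security point concrete, here is a minimal sketch of the classic exploit (nothing Hazelcast-specific is assumed; any pickle.loads on untrusted bytes behaves this way):

```python
import os
import pickle


class Evil:
    # pickle calls __reduce__ to learn how to rebuild the object;
    # returning (callable, args) makes loads() invoke that callable.
    def __reduce__(self):
        return (os.system, ("echo arbitrary code execution",))


payload = pickle.dumps(Evil(), 0)
pickle.loads(payload)  # runs the shell command during deserialisation
```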
Good time of the day, team!
When I tried to cache API responses, it sometimes failed with a UnicodeDecodeError.
Looking deeper, I found that this is happening because of this code
out.write_string(cPickle.dumps(obj, 0).decode("utf-8"))
in PythonObjectSerializer.
Full Code:
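A sketch of the serializer, reconstructed from the line quoted above (the read side is an assumption, mirroring the write side):

```python
import pickle


class PythonObjectSerializer:
    # Fallback serializer: pickle with protocol 0, stored as a string.
    def write(self, out, obj):
        # Protocol-0 pickle bytes are not guaranteed to be valid UTF-8,
        # so this decode is where the UnicodeDecodeError originates.
        out.write_string(pickle.dumps(obj, 0).decode("utf-8"))

    def read(self, inp):
        return pickle.loads(inp.read_string().encode("utf-8"))
```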
Issue example:
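A minimal reproduction, assuming nothing beyond the decode step quoted above (protocol 0 emits ä as the raw byte 0xE4, which is not valid UTF-8 on its own):

```python
import pickle

payload = pickle.dumps("ä", 0)  # b'V\xe4\np0\n.' - contains a raw 0xE4 byte
payload.decode("utf-8")         # raises UnicodeDecodeError
```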
As a workaround, I created such a custom serializer:
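A sketch of what such a serializer can look like, assuming the 4.x-style client API (StreamSerializer and the global_serializer option; the type id is arbitrary but must match across all clients):

```python
import json

import hazelcast
from hazelcast.serialization.api import StreamSerializer


class JsonSerializer(StreamSerializer):
    # Store values as JSON text: handles non-ASCII characters,
    # but only accepts JSON-serialisable objects.
    def write(self, out, obj):
        out.write_string(json.dumps(obj))

    def read(self, inp):
        return json.loads(inp.read_string())

    def get_type_id(self):
        return 1

    def destroy(self):
        pass


# Register it as the global fallback, replacing the pickle serializer.
client = hazelcast.HazelcastClient(global_serializer=JsonSerializer)
```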
Is there any better solution?
Python version: 3.6