TypeError: can not serialize 'TextBlob' object #22

Open
AltfunsMA opened this issue Jun 13, 2022 · 7 comments

@AltfunsMA

Just trying this extension, but it doesn't seem like you can use nlp.pipe with more than one process, which makes it far less attractive.

The following yields the error in the title:

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe('spacytextblob')

l = ['This is great. But this is horrible', 'The answer to everything is 42. What did you believe?']

docs = nlp.pipe(l, n_process = 2)

for doc in docs:
    for s in doc.sents:
        s._.blob.polarity
        s._.blob.subjectivity
@SamEdwardes
Owner

SamEdwardes commented Jun 14, 2022

Can you share the error you get?

I did some testing. The details are below, but the short version is that I am not sure if it is a spacytextblob issue. I could not get n_process=2 to work for me even without spacytextblob.

My understanding is that even just using nlp.pipe(l, n_process=1) should give you some performance gains. I am sure you could get further gains with n_process > 1, but I do not have any ideas on how to get it to work. Before this issue I was not familiar with that parameter (https://spacy.io/api/language/#pipe).

Test 1 - running the code from AltfunsMA

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe('spacytextblob')

l = ['This is great. But this is horrible', 'The answer to everything is 42. What did you believe?']

docs = nlp.pipe(l, n_process = 2)

for doc in docs:
    for s in doc.sents:
        print(s._.blob.polarity)
        print(s._.blob.subjectivity)

For me, the program just hangs here and prints the following traceback:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/runpy.py", line 269, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/issue-22.py", line 12, in <module>
    for doc in docs:
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 1583, in pipe
    for doc in docs:
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 1649, in _multiprocessing_pipe
    proc.start()
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Test 2 - running the code without spacytextblob

import spacy

nlp = spacy.load("en_core_web_sm")

l = ['This is great. But this is horrible', 'The answer to everything is 42. What did you believe?']

docs = nlp.pipe(l, n_process = 2)

for doc in docs:
    print(doc)

I get the same error as above.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/runpy.py", line 269, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/issue-22a.py", line 10, in <module>
    for doc in docs:
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 1583, in pipe
    for doc in docs:
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 1649, in _multiprocessing_pipe
    proc.start()
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/Users/samedwardes/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

System info

OS

macOS Monterey
Apple M1 Pro

Python

$ python --version
Python 3.10.4

Packages

$ pip freeze
blis==0.7.7
catalogue==2.0.7
certifi==2022.5.18.1
charset-normalizer==2.0.12
click==8.1.3
cymem==2.0.6
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl
idna==3.3
Jinja2==3.1.2
joblib==1.1.0
langcodes==3.3.0
MarkupSafe==2.1.1
murmurhash==1.0.7
nltk==3.7
numpy==1.22.4
packaging==21.3
pathy==0.6.1
preshed==3.0.6
pydantic==1.8.2
pyparsing==3.0.9
regex==2022.6.2
requests==2.28.0
smart-open==5.2.1
spacy==3.3.1
spacy-legacy==3.0.9
spacy-loggers==1.0.2
spacytextblob==4.0.0
srsly==2.4.3
textblob==0.15.3
thinc==8.0.17
tqdm==4.64.0
typer==0.4.1
typing_extensions==4.2.0
urllib3==1.26.9
wasabi==0.9.1

@SamEdwardes SamEdwardes added the bug Something isn't working label Jun 14, 2022
@SamEdwardes
Owner

I did some more digging. I followed the issue here from the spaCy repo: explosion/spaCy#8654

Now here is what I get; this is probably the error you were referring to, based on your title.

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob


def main():
    nlp = spacy.load("en_core_web_sm")

    nlp.add_pipe('spacytextblob')

    l = ['This is great. But this is horrible', 'The answer to everything is 42. What did you believe?']

    docs = nlp.pipe(l, n_process=2)

    for doc in docs:
        for s in doc.sents:
            print(s._.blob.polarity)
            print(s._.blob.subjectivity)


if __name__ == '__main__':
    main()

Here is the output

Traceback (most recent call last):
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/issue-22.py", line 21, in <module>
    main()
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/issue-22.py", line 14, in main
    for doc in docs:
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 1583, in pipe
    for doc in docs:
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 1666, in _multiprocessing_pipe
    self.default_error_handler(
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/util.py", line 1630, in raise_error
    raise e
ValueError: [E871] Error encountered in nlp.pipe with multiprocessing:

Traceback (most recent call last):
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 2217, in _apply_pipes
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/language.py", line 2217, in <listcomp>
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "spacy/tokens/doc.pyx", line 1314, in spacy.tokens.doc.Doc.to_bytes
  File "spacy/tokens/doc.pyx", line 1373, in spacy.tokens.doc.Doc.to_dict
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/spacy/util.py", line 1272, in to_dict
    serialized[key] = getter()
  File "spacy/tokens/doc.pyx", line 1370, in spacy.tokens.doc.Doc.to_dict.lambda19
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/Users/samedwardes/git/samedwardes/spacytextblob/_tmp/venv/lib/python3.10/site-packages/srsly/msgpack/__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'TextBlob' object

@getorca

getorca commented Aug 15, 2022

Same issue as above. I haven't really dug into why, but it appears that spaCy's .pipe can't handle the class objects.

@SamEdwardes
Owner

SamEdwardes commented Aug 16, 2022

I think that is correct @getorca. spacy.pipe wants to serialize the data, and it is not possible to do that with a TextBlob object. I will leave this issue open because it would be helpful if we could find a way to make this work.
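Below is a minimal sketch (not from this thread, just an assumed reproduction) that triggers the same failure without nlp.pipe: serializing a Doc whose extension attribute holds a TextBlob object fails, because Doc.to_bytes() dumps the doc's user data with msgpack, which does not know how to handle a TextBlob instance.

import spacy
from spacy.tokens import Doc
from textblob import TextBlob

# Hypothetical extension name, for illustration only.
Doc.set_extension("blob", default=None, force=True)

nlp = spacy.blank("en")
doc = nlp("This is great.")
doc._.blob = TextBlob(doc.text)

# to_bytes() serializes the doc's user data (extensions included) with
# msgpack, which raises: TypeError: can not serialize 'TextBlob' object
doc.to_bytes()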

@getorca

getorca commented Aug 16, 2022

> I think that is correct @getorca. spacy.pipe wants to serialize the data, and it is not possible to do that with a TextBlob object. I will leave this issue open because it would be helpful if we could find a way to make this work.

The only thing I could think of when I was looking at it was actually returning the response from TextBlob on all the ents, sentences, etc. But that seems less than ideal because of the high overhead, especially on longer docs.

@SamEdwardes
Owner

One approach could be to return a dict or just attributes instead of the TextBlob object. I used to have spacytextblob do this. The downside is that it was very hard to work with TextBlob extensions that way, so I switched to returning the TextBlob object.

@getorca

getorca commented Nov 4, 2022

Reference

As mentioned above, I think that will add significant overhead, as well as eating a lot of memory. I've been experiencing a lot of overhead from serialisation recently in relation to multiprocessing. The best thing I can suggest is to recommend that users write a custom pipeline component that only returns the data they need from TextBlob.
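A rough sketch of that workaround follows, assuming a hypothetical component name textblob_scores and hypothetical extension names polarity and subjectivity: store only the plain floats you need on the Doc, so the extension data can be serialized when n_process > 1.

import spacy
from spacy.language import Language
from spacy.tokens import Doc
from textblob import TextBlob

# Hypothetical extension names, for illustration only.
Doc.set_extension("polarity", default=None, force=True)
Doc.set_extension("subjectivity", default=None, force=True)


@Language.component("textblob_scores")
def textblob_scores(doc):
    # Keep only plain floats on the Doc; msgpack can serialize these,
    # unlike a TextBlob object.
    blob = TextBlob(doc.text)
    doc._.polarity = blob.sentiment.polarity
    doc._.subjectivity = blob.sentiment.subjectivity
    return doc


if __name__ == "__main__":
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textblob_scores")
    texts = [
        "This is great. But this is horrible",
        "The answer to everything is 42. What did you believe?",
    ]
    for doc in nlp.pipe(texts, n_process=2):
        print(doc._.polarity, doc._.subjectivity)

The trade-off is the one noted above: you lose the full TextBlob API on the Doc, but the pipeline becomes safe to run across multiple processes.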

@SamEdwardes SamEdwardes self-assigned this Nov 4, 2022
@SamEdwardes SamEdwardes added enhancement New feature or request help wanted Extra attention is needed and removed bug Something isn't working labels Nov 4, 2022