Release Paragraph/chunker system, improved metadata extraction · exorde-labs/exorde-client

added a state-of-the-art multilingual model and system to split text into sentences: https://arxiv.org/pdf/2305.18893.pdf, wtpsplit https://github.com/bminixhofer/wtpsplit. Paragraphs are recomposed from the split sentences so that each one stays below the new maximum token count per item.
fixed the replacement of \n with spaces, which should improve the quality of some top keywords
the chunker system will fix the "tensor size" issues and therefore increase the data output (previously some batches were lost once in a while)
improved the pre_install procedure to include 2 more models in the docker base image
added the tiktoken library (OpenAI GPT-3 tokenizer) to count (and print) the number of tokens in each item, which helps decide whether the client has to split an item into several pieces (paragraphs)
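
The recomposition step described in the notes above (sentences packed back into paragraphs that stay under a token limit) can be illustrated with a minimal sketch. This is not the client's actual code: the function names, the greedy packing strategy, and the whitespace-based token counter are assumptions made for illustration. In the client, the sentences would presumably come from wtpsplit and the count from tiktoken, for example `len(encoding.encode(text))`.

```python
from typing import Callable, List


def recompose_paragraphs(
    sentences: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Greedily pack consecutive sentences into paragraphs.

    Hypothetical helper: a paragraph is closed as soon as adding the next
    sentence would push it above max_tokens. A single sentence that is
    already over the limit is kept as its own paragraph rather than cut.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            # Close the paragraph built so far and start a new one.
            paragraphs.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


if __name__ == "__main__":
    demo = ["a b c", "d e", "f g h i", "j"]
    for paragraph in recompose_paragraphs(demo, 5, whitespace_token_count):
        print(whitespace_token_count(paragraph), paragraph)
```

Passing the counter in as a parameter keeps the packing logic independent of the tokenizer, so the same function works with a real tokenizer once its model files are available.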
