Paragraph/chunker system, improved metadata extraction

@MathiasExorde MathiasExorde released this 21 Jul 13:34
· 82 commits to main since this release
306c071
  • Added a state-of-the-art multilingual model and system for splitting text into sentences: wtpsplit (https://github.com/bminixhofer/wtpsplit), described in https://arxiv.org/pdf/2305.18893.pdf. Paragraphs are recomposed from the split sentences so that each one stays below the new maximum token count per item.
  • Fixed the replacement of \n with spaces, which should improve the quality of some top keywords.
  • The chunker system will fix the "tensor size" issues and therefore increase data output, since batches will no longer be lost once in a while.
  • Improved the pre_install procedure so that the Docker base image includes 2 more models.
  • Added the tiktoken library (OpenAI GPT-3 tokenizer) to count and print the number of tokens in each item, which helps decide whether the client has to split an item into several pieces (paragraphs).
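
The paragraph recomposition described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this release: the function names `recompose_paragraphs` and `whitespace_token_count`, the greedy packing strategy, and the budget value are all assumptions. The token counter is pluggable; a whitespace word count stands in here so the sketch runs without extra dependencies, whereas a tiktoken-based counter would presumably be passed in its place, and wtpsplit would supply the sentence list.

```python
from typing import Callable, Iterable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words.

    A real setup would likely count tokens with a tokenizer such as tiktoken.
    """
    return len(text.split())


def recompose_paragraphs(
    sentences: Iterable[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily pack consecutive sentences into paragraphs under a token budget.

    Hypothetical helper: a sentence that alone exceeds max_tokens is still
    emitted as its own paragraph rather than being cut mid-sentence.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty splitter output
        candidate = " ".join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the paragraph.
            paragraphs.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    demo = ["one two three.", "four five.", "six seven eight nine.", "ten."]
    for paragraph in recompose_paragraphs(demo, max_tokens=5):
        print(paragraph)
```

Packing whole sentences keeps each paragraph readable and keeps every item within the budget whenever no single sentence exceeds it, which is the condition the "tensor size" fix relies on.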