I have the following image, from which I am trying to extract the text.
The expected output would be a few '\n' characters or blank spaces. The problem appears when I process it with the Recognize function, which should stop the process if it takes too long: it does not stop, and it runs for 3 hours until Recognize finally returns False.
To set the image on the API I am using setImageFile, which as far as I know can avoid some trouble when loading the image into the API (although I have also tried "setImage(Image.open('image2test.jpg'))"). I should also mention that this page is processed alongside the other pages extracted from the same PDF file. Of the 2 pages in this PDF, this is the only one causing problems: Tesseract OCR takes 3 hours to extract its text. Since the image contains no relevant information, processing should be stopped and the page skipped. The page cannot be removed before the OCR step, because many PDFs may be processed and this isolated case could come up again in the future, so I want to be able to catch it.
The code I am using is the following:
Here is more information about the API configuration and the Tesseract, Tesserocr and Pillow versions.
PSM: AUTO_OSD (1)
OEM: DEFAULT (3)
LANG: Spanish ('spa')
Tesseract: 4.1.1
Tesserocr: 2.5.1
Pillow: 8.4.0
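For readers who want to reproduce the setup, here is a minimal sketch of how the configuration listed above might be passed to tesserocr together with a Recognize timeout. This is not the original code from the post; the function name `recognize_page` and the 10-second default are assumptions made for illustration only.

```python
def recognize_page(image_path, timeout_ms=10_000):
    """Return the page text, or None if Recognize reports failure.

    Hypothetical helper, not the code from the original post.
    """
    # Imported inside the function so the sketch can be loaded even
    # where tesserocr is not installed.
    from tesserocr import PyTessBaseAPI, PSM, OEM

    # Settings taken from the list above: Spanish, AUTO_OSD, default engine.
    with PyTessBaseAPI(lang="spa", psm=PSM.AUTO_OSD, oem=OEM.DEFAULT) as api:
        api.SetImageFile(image_path)
        # Recognize takes a timeout in milliseconds (0 means no limit)
        # and returns False when recognition did not succeed.
        if not api.Recognize(timeout_ms):
            return None
        return api.GetUTF8Text()
```

With this shape, a caller could treat a `None` result as "skip this page" and move on to the next one, which is the behaviour the post is asking for when the timeout is honoured.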
Thanks for all the help 😉
It took me a long while, but I finally managed to make it work by forcing the OCR to run on one thread at a time.
It's enough to add the following at the beginning of your script:
import os
os.environ['OMP_THREAD_LIMIT'] = '1'
As far as I can tell, it also seems to work under multithreading, as long as the multithreading is not applied directly to the function in which you call Recognize (e.g. you run the multithreading on a bigger function, which at some point calls the function using the API).
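One possible reading of that arrangement is sketched below: the thread limit is set first, and threads are spread over whole documents rather than over individual OCR calls. The names `ocr_page`, `process_document` and `process_all` are hypothetical, and `ocr_page` is only a placeholder standing in for the real tesserocr call.

```python
import os

# Set before any OCR library is imported, as suggested above.
os.environ["OMP_THREAD_LIMIT"] = "1"

from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_path):
    # Hypothetical placeholder: the real version would call the tesserocr
    # API (set the image, call Recognize, read the text) for one page.
    return f"text of {page_path}"


def process_document(page_paths):
    # The "bigger function": it handles one whole document and calls
    # ocr_page sequentially for each of its pages.
    return [ocr_page(p) for p in page_paths]


def process_all(documents, max_workers=4):
    # Threads are distributed over documents, not over single OCR calls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_document, documents))


if __name__ == "__main__":
    docs = [["a_1.jpg", "a_2.jpg"], ["b_1.jpg"]]
    print(process_all(docs))
```

Whether this layout avoids the hang in every setup is not something the thread confirms; it only illustrates where the threading boundary sits relative to the Recognize call.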