This repository has been archived by the owner on Apr 23, 2018. It is now read-only.

Tesseract 3.03 RC1 hOCR format #8

Open
GoogleCodeExporter opened this issue Jun 2, 2015 · 0 comments

Hi, 

The hOCR generated by the latest Tesseract 3.03 RC1 contains some additions and
changes. Below is an example of the new hOCR output. Will jhocr be updated to
support it?

Thanks,
Don

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>
</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.03' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "20140312_132615.jpg"; bbox 0 0 2048 1152; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 910 68 2048 1152">
    <p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 910 68 2048 1152">
     <span class='ocr_line' id='line_1_1' title="bbox 1054 68 2048 694; baseline 0 0"><span class='ocrx_word' id='word_1_1' title='bbox 1054 68 2048 694; x_wconf 95' lang='eng' dir='ltr'>   </span> 
     </span>
     <span class='ocr_line' id='line_1_2' title="bbox 910 694 2048 1152; baseline 0 -377"><span class='ocrx_word' id='word_1_2' title='bbox 910 694 2048 1152; x_wconf 95' lang='eng' dir='ltr'>  </span> 
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_2' title="bbox 635 530 1056 597">
    <p class='ocr_par' dir='ltr' id='par_1_2' title="bbox 635 530 1056 597">
     <span class='ocr_line' id='line_1_3' title="bbox 635 530 1056 597; baseline 0 -1"><span class='ocrx_word' id='word_1_3' title='bbox 635 530 947 597; x_wconf 89' lang='eng' dir='ltr'>USER’S</span> <span class='ocrx_word' id='word_1_4' title='bbox 994 531 1056 596; x_wconf 95' lang='eng' dir='ltr'>M</span> 
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_3' title="bbox 831 714 1082 764">
    <p class='ocr_par' dir='ltr' id='par_1_3' title="bbox 831 714 1082 764">
     <span class='ocr_line' id='line_1_4' title="bbox 831 714 1082 764; baseline 0.004 -10"><span class='ocrx_word' id='word_1_5' title='bbox 831 714 1029 755; x_wconf 92' lang='eng' dir='ltr'>Revision</span> <span class='ocrx_word' id='word_1_6' title='bbox 1078 759 1082 764; x_wconf 14' lang='eng'><strong><em>1</em></strong></span> 
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
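
For anyone consuming this output, here is a minimal Python sketch of how the `title` properties seen in the sample above (`image`, `bbox`, `ppageno`, `baseline`, `x_wconf`) could be split into typed values. The function name `parse_hocr_title` is hypothetical and is not part of jhocr or Tesseract. The sketch assumes properties are separated by `;` and that quoted values (such as the image file name) contain no semicolon.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into a {property: value} dict.

    Assumption: properties are separated by ';' and a quoted value
    never contains a semicolon itself.
    """
    props = {}
    for chunk in title.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The first token is the property name, the rest is its value.
        name, _, rest = chunk.partition(" ")
        rest = rest.strip()
        if name == "bbox":
            # Four integers: x0 y0 x1 y1
            props[name] = tuple(int(v) for v in rest.split())
        elif name == "baseline":
            # Two numbers; the first may be fractional, as in "0.004 -10"
            props[name] = tuple(float(v) for v in rest.split())
        elif name in ("x_wconf", "ppageno"):
            props[name] = int(rest)
        else:
            # Unknown or string-valued property (e.g. image): keep as text
            props[name] = rest.strip('"')
    return props


# Usage with the title of word_1_6 from the sample
word = parse_hocr_title("bbox 1078 759 1082 764; x_wconf 14")
print(word["bbox"], word["x_wconf"])
```

Unrecognized properties are kept as plain strings rather than rejected, so a parser written this way would not fail if a later Tesseract release adds further properties.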

Original issue reported on code.google.com by [email protected] on 12 Mar 2014 at 5:41
