
Old book pages (with groundtruth), formerly used for OCR studies. The set comes in several versions that differ in resolution and binarization. Noised and denoised sets (produced by several methods) will eventually be uploaded.


PedroBarcha/old-books-dataset


A dataset of old scanned books with groundtruth. The groundtruth was built from Project Gutenberg ebooks. All the .tiff pages were converted from Internet Archive books (PDFs). The pages were selected from the following books:

-Betrayed Armenia, by Diana Agabeg Apcar

-The Boy Apprenticed to an Enchanter, by Padraic Colum

-The Child of the Moat, by Stoughton Holborn

-The Corset and the Crinoline, by W.B.L.

-Engraving of Lions, Tigers, Panthers, Leopards, Dogs, &C., by Thomas Landseer

-Half-Hours with Highwaymen, by Charles G. Harper

-Historical Sketches of Colonial Florida, by Richard L. Campbell

-Horton Genealogy, by Geo. F. Horton

-The Lusitania's Last Voyage, by Charles E. Lauriat

-Seat Weaving, by L. Day Perry

The dataset is provided in several resolutions: 300 dpi, 500 dpi, and 1000 dpi. There are also several 300 dpi sets binarized with different methods.
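
Since every page comes with groundtruth, one possible way to compare OCR results across the resolution and binarization variants is the character error rate (CER). The snippet below is only a minimal sketch, not part of this repository: the function names `levenshtein` and `character_error_rate` are hypothetical, and it assumes you have already loaded the OCR output and the matching groundtruth as plain strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # skip a character of a
                current[j - 1] + 1,      # skip a character of b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, groundtruth: str) -> float:
    """Edit distance divided by the groundtruth length (hypothetical helper)."""
    if not groundtruth:
        raise ValueError("groundtruth must not be empty")
    return levenshtein(ocr_text, groundtruth) / len(groundtruth)
```

Running the same OCR engine on the 300 dpi, 500 dpi, and 1000 dpi versions of a page and comparing the resulting CER values is one way to see how resolution or binarization affects recognition. Note that CER can exceed 1.0 when the OCR output is much longer than the groundtruth.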

Feel free to use and study the sets contained here :)
