Textual transcript via OCR #1

Open
leovinus2001 opened this issue Apr 18, 2024 · 19 comments

Comments

@leovinus2001

Nice images, but could someone elaborate on what the plan is here? Probably something like (1) make the images more readable, (2) convert the images to text, (3) run something on a simulator. Which directory first?
For (1), as an example, I've attached a cleaner version of the top part of
https://github.com/pdp11/camexec/blob/master/photos/ddt/20240331_091745.jpg

@larsbrinkhoff
Member

Yes, that would be the plan. I asked @rcornwell about OCR, so I'm curious to see what results he can get from these photos. Maybe a special purpose Tesseract model could be another option.

The camexec directory should probably be the first, since it's the operating system. Next ddt.

@leovinus2001
Author

> Yes, that would be the plan. I asked @rcornwell about OCR, so I'm curious to see what results he can get from these photos. Maybe a special purpose Tesseract model could be another option.

That makes sense; @rcornwell actually pointed this topic out to me, so here we are.

> The camexec directory should probably be the first, since it's the operating system. Next ddt.

Will have a look and keep you posted.

@larsbrinkhoff
Member

Oh, so you are Rich's OCR friend? Thanks so much for taking a look! I'm curious what you can come up with.

@leovinus2001
Author

While I am looking at five pages at random, I can see an issue straight away. The paper was sometimes folded up, and that leads to subtle but non-linear distortions which complicate the OCR. One of the five is out of focus as well. Is there any chance that the photos/scans could be taken again, but as proper "flat" paper scans? It would save work ;) In the meantime, I'll try a few things.

@larsbrinkhoff
Member

The pages with a crease can be typed in manually. Only the first page in each listing has been folded, so it should not be a big job. Those that were out of focus have been photographed again:
864de35

Please let me know if you see any other pages that are too blurry.

@leovinus2001
Author

Ok. The question had to be asked :) It is a fun challenge for a hobby. To set expectations: reasonable text can be extracted here, but it will not be perfect due to the quality of the inputs. In any case, I am working on those five test pages at the moment to gauge what transcription quality I can get out. That involves image enhancement, producing high-contrast PNGs for OCR, transcribing the ground truth, and retraining a model on this type of data. It is progressing but will take a few days. Then I will attach the OCR input PNGs, recognition results, and ground truth, and you can compare the PNGs against the recognition results and ground truth to see whether that matches expectations. After all, the goal here is to produce a compilable text file and to decide which process is the easiest way to achieve that.
PS: The top part of 20240331_085227 is blurry as well

@larsbrinkhoff
Member

Thanks!

20240331_085227 should be the same as camexec-003-1.jpg

@leovinus2001
Author

Attached is a tgz file with OCR on 8 different pages: seven from camex/ and one from ddt/ (the original example in post #1).

Please have a look and see whether you like the quality of the "recognized text" files. A side-by-side comparison of reference vs. recognized should show the effects.
apr23.tgz
I also included data/camex/error.pattern.apr23.png as one example of the error patterns.

As for the process: I took your JPG files, like 20240331_085227, split each page into two pages _1 and _2, did image enhancement via ImageMagick, and then ran an OCR tool. The zero characters here are quite different :) The low-contrast original JPGs also lead to image artifacts in the black/white/gray PNGs, but the OCR can handle most of it. Due to the non-linear "bendy" deformations I had to slice some pages into 2 or 3 vertical or horizontal slices, like _1_1 _1_2 _1_3, and then combine them again.
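
For reference, a rough sketch of such a split-and-enhance step, assuming ImageMagick's `convert` is on the PATH and that the two listing pages sit side by side in each photo; the exact flags here are illustrative, not necessarily the ones actually used:

```python
# Sketch only: split a photo into two page halves and enhance for OCR.
# Assumes ImageMagick's `convert` is installed; flags are illustrative.
import subprocess
from pathlib import Path

def split_and_enhance(jpg: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["convert", str(jpg),
         "-colorspace", "Gray",   # drop colour
         "-normalize",            # stretch the low contrast
         "-crop", "50%x100%",     # two vertical halves, one per listing page
         "+repage",
         str(out_dir / f"{jpg.stem}_%d.png")],  # writes <stem>_0.png, <stem>_1.png
        check=True,
    )

# split_and_enhance(Path("20240331_085227.jpg"), Path("data/camex/20240331_085227"))
```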

If you know of a better image enhancement approach, that would help as well :)

On the four pages of training data, the per-character error rate is less than ~1%. Most issues are with characters like ':' and 'W', which need more examples. Other pages have a somewhat higher error rate due to layout issues and unknown characters (like a blob for '#'). Accuracy can be improved with a 20-minute retraining. This is via my own tools, as Tesseract does not like this data.

Before doing any more work, I'd like to check with you whether this is useful for you.

It does require manual error checking for every page afterwards to get text like the "references/pdp11/camex/" files, which can ultimately be compiled. On the good side, sometimes the OCR is better than me and can tell me what is a '=' or '-'. On the bad side, proofreading all the text will require time from you.

If you like this, I'll convert another 5 or 10 pages next week to verify the process, and then we can take it from there. You have only 123 JPGs in camex, so it's not the end of the world.

PS: The tarball has, per image, something like:

  • data/camex/20240331_085227/20240331_085227_2.png (part 2 of your JPG)
  • data/camex/20240331_085227/recog.20240331_085227_2.txt (recognized text)
  • references/pdp11/camex/ref.20240331_085227_2.txt (manually checked text as a reference for the error count)
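
For the side-by-side check, a per-character error rate between a ref.*.txt/recog.*.txt pair can be estimated with plain edit distance; this is only a sketch, not the actual tooling used here:

```python
# Sketch only: per-character error rate via plain Levenshtein distance.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_error_rate(ref_path: str, recog_path: str) -> float:
    ref = open(ref_path, encoding="utf-8").read()
    recog = open(recog_path, encoding="utf-8").read()
    return edit_distance(ref, recog) / max(len(ref), 1)

# char_error_rate("references/pdp11/camex/ref.20240331_085227_2.txt",
#                 "data/camex/20240331_085227/recog.20240331_085227_2.txt")
```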

@larsbrinkhoff
Member

Thanks! I have downloaded the tarball and will take a look.

@larsbrinkhoff
Member

I checked, and this looks very useful indeed. Thanks a lot. Of course, some pages fare worse than others, but those can be fixed manually or typed in from scratch. The majority seem good and only need minor corrections.

If it's useful for training, I can take a set of pages, do the corrections and send them back to you.

@leovinus2001
Author

Yeah, it seems that a useful transcript can be produced with a mix of some manual work plus mostly automatic transcription.

A few more pages (5?) with reference transcriptions would be useful, especially if you select pages with more of the "rare" symbols such as >, <, ;, W, Y, which I can use to retrain and then apply to everything again.
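
One way to shortlist candidate pages is to count those rarer glyphs in whatever rough OCR output already exists; a quick sketch, with the glyph set and directory only as placeholders:

```python
# Sketch only: count rare glyphs per recognized-text file to shortlist pages.
from collections import Counter
from pathlib import Path

RARE = set("><;WY")  # placeholder glyph set

def rare_glyph_report(recog_root: str) -> None:
    for path in sorted(Path(recog_root).rglob("recog.*.txt")):
        counts = Counter(ch for ch in path.read_text(encoding="utf-8") if ch in RARE)
        if counts:
            print(path.name, dict(counts))

# rare_glyph_report("data/camex")
```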

@larsbrinkhoff
Member

Here are five more pages. I tried to select those that have the glyphs you requested.

samples.zip

@leovinus2001
Author

> Here are five more pages. I tried to select those that have the glyphs you requested.

Cool! Thanks for those. Will integrate this into training materials and then we can start to process the other pages.

Am in the middle of a move but I'll see what I can do.

@larsbrinkhoff
Member

Thanks! No urgency, take your time. For me this is more a long-term back-burner project.

@larsbrinkhoff
Member

This may be of interest to you: https://mzucker.github.io/2016/08/15/page-dewarping.html

@leovinus2001
Author

Just a quick note that I have not forgotten about this. Will have another go in a few weeks.

@larsbrinkhoff
Member

Thanks, sounds great!

@leovinus2001
Author

While this is a low-priority project, it is fair to say that life keeps pulling me in other directions. That means I have no idea when I will get around to doing the actual OCR, i.e. image to text. Maybe later this year. It is probably only a week or so of solid work to get a great result, but there are other priorities at the moment. You can see in the attached tmp_with_contours_3.png that character detection is fine, but some effort is still needed to fix the listing layout and accuracy.

In the meantime, I have attached a tar file with black/white preprocessed images for easier OCR compared to the original scans, something I made earlier this year. Better input, better output :) Parts of these are easy to process with commercial OCR software or even ChatGPT, Claude et al., though none of those give a great result. However, these images should be a better starting point for anyone who wants to pick up from here quickly. Note that each original scan has two pages of listing, and therefore there are two PNGs per scan.
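
For anyone picking this up, a minimal black/white preprocessing step along these lines can be done with Pillow; this is not the pipeline used for the attached files, and the threshold value is a guess:

```python
# Sketch only: grayscale, contrast-stretch, and threshold a scan for OCR.
from PIL import Image, ImageOps

def to_black_and_white(jpg_path: str, png_path: str, threshold: int = 140) -> None:
    gray = ImageOps.autocontrast(ImageOps.grayscale(Image.open(jpg_path)))
    bw = gray.point(lambda p: 255 if p > threshold else 0).convert("1")
    bw.save(png_path)

# to_black_and_white("20240331_085227.jpg", "20240331_085227_bw.png")
```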

Anyway, fun project and I hope to revisit this another time.

allfiles.sep6.tgz
tmp_with_contours_3.png

@larsbrinkhoff
Member

Thanks for the ping @leovinus2001. No worries about the time scale; it's a long-term project for me too. Thanks so much for your help so far!
