Textual transcript via OCR #1

leovinus2001 · 2024-04-18T20:58:36Z

Nice images but could someone elaborate what the plane is here? Probably something like (1) make images more readable (2) convert image to text (3) run something on a simulator. Which directory first?
For (1), as example, attached a cleaner version of the top part of
https://github.com/pdp11/camexec/blob/master/photos/ddt/20240331_091745.jpg

larsbrinkhoff · 2024-04-19T06:22:14Z

Yes, that would be the plan. I asked @rcornwell about OCR, so I'm curious to see what results he can get from these photos. Maybe a special purpose Tesseract model could be another option.

The camexec directory should probably be the first, since it's the operating system. Next ddt.

leovinus2001 · 2024-04-20T16:10:36Z

Yes, that would be the plan. I asked @rcornwell about OCR, so I'm curious to see what results he can get from these photos. Maybe a special purpose Tesseract model could be another option.

That makes sense as actually @rcornwell pointed this topic out to me so there we are.

The camexec directory should probably be the first, since it's the operating system. Next ddt.

Will have a look and will keep you posted

larsbrinkhoff · 2024-04-21T07:54:55Z

Oh, so you are Rich's OCR friend? Thanks so much for taking a look! I'm curious what you can come up with.

leovinus2001 · 2024-04-22T18:51:55Z

While I am looking at five pages at random, I can see an issue straight away. The paper was sometimes folded-up and that leads to subtle but non-linear distortions which complicate the OCR. One of the five is out-of-focus as well. Is there any chance that the photos/scans could be taken again but with proper "flat" paper scan? It would save work ;) In the meantime, I'll try a few things.

larsbrinkhoff · 2024-04-23T06:57:49Z

The pages with a crease can be typed in manually. Only the first page in each listing has been folded so it should not be a big job. Those that were out of focus have been photographed again:
864de35

Please let me know if you see any other pages that are too blurry.

leovinus2001 · 2024-04-23T12:56:43Z

Ok. The question had to be asked :) It is a fun challenge for a hobby. To set expectations - reasonable text can be extracted here but it will not be perfect due to the quality of the inputs. In any case, am working on those five testpages at the moment to gauge what transcription quality I can get out. That involves image enhancement, producing high contrast PNGs for OCR, transcribing the groundtruth, retraining a model with this type of data. It is progressing but will take a few days. Then, I will attach the OCR input PNGs , recognition result and groundtruth. Then, you can check the PNGs against the recognition results and groundtruth and check whether that matches expectations. After all, the goal here is to produce a compilable text file and decide which process is the easiest to achieve that.
PS: The top part of 20240331_085227 is blurry as well

larsbrinkhoff · 2024-04-23T14:44:54Z

Thanks!

20240331_085227 should be the same as camexec-003-1.jpg

leovinus2001 · 2024-04-23T17:47:13Z

Attached a tgz file with OCR on 8 different pages. Seven from camex/ and one from ddt/ ( the orginal example in post #1 )

Please have a look whether you like the quality of the "recognized text" files. A side-by-side comparison of reference vs recognized should show the effects.
apr23.tgz
I also included data/camex/error.pattern.apr23.png as one example of the error patterns.

As for the process, I took your JPG files like 20240331_085227. Then split the page in two pages _1 and _2, do image enhancement via ImageMagick and then run an OCR tool. The zero characters here are quite different :) The low contrast original JPG also leads to image artifacts in the black/white/gray PNGs but the OCR can handle most of it. Due to the non-linear "bendy" deformations I had to slice some pages into 2 or 3 vertical or horizontal slices like _1_1 _1_2 _1_3 and then combine again.

If you know a better image enhancement then that would help as well :)

On the four pages of training data, the error-rate per-character is less than ~1% or so. Most issues are with characters like ':' and 'W' which need more examples. On other pages somewhat higher errorrate due to layout issues and unknown characters (like a blob for '#'). Accuracy can be improved with a 20 minute re-training. This is via my own tools as Tesseract does not like this data.

Before doing any more work, I'd like to check with you that this is useful for you?

It does require later manual error checking for every page to get text like the "references/pdp11/camex/" files which can be compiled ultimately. On the good side, sometimes the OCR is better than me and can tell me what is a '=' or '-'. On the bad side, proof reading all text will require time by you.

If you like this, I'll see that I convert next week another 5 or 10 pages to verify the process and then we take it from there. You have only 123 JPG in camex so not the end of the world.

PS: The tar ball has "per image" something like

data/camex/20240331_085227/20240331_085227_2.png (part 2 or your JPG)
data/camex/20240331_085227/recog.20240331_085227_2.txt (recognized text)
references/pdp11/camex/ref.20240331_085227_2.txt ( manually checked text as reference for error count)

larsbrinkhoff · 2024-04-24T16:15:39Z

Thanks! I have downloaded the tarball and will take a look.

larsbrinkhoff · 2024-04-26T07:07:33Z

I checked, and this looks very useful indeed. Thanks a lot. Of course, some pages fare worse than others, but those can be fixed manually or typed in from scratch. The majority seem good and only need minor corrections.

If it's useful for training, I can take a set of pages, do the corrections and send them back to you.

leovinus2001 · 2024-04-26T11:02:57Z

Yeah, it seems that a useful transcript can be produced with a mix of some manual work plus mostly automatic transcription.

A few more pages (5?) with reference transcriptions would be useful. Especially if you select pages with more "rare" symbols such as >, <, ;, W, Y symbols which I can use that to retrain and apply to everything again.

larsbrinkhoff · 2024-05-05T08:51:09Z

Here are five more pages. I tried to select those that have the glyphs you requested.

samples.zip

leovinus2001 · 2024-05-05T10:20:40Z

Here are five more pages. I tried to select those that have the glyphs you requested.

Cool! Thanks for those. Will integrate this into training materials and then we can start to process the other pages.

Am in the middle of a move but I'll see what I can do.

larsbrinkhoff · 2024-05-05T17:55:20Z

Thanks! No urgency, take your time. For me this is more a long-term back-burner project.

larsbrinkhoff · 2024-05-13T06:09:56Z

This may be of interest to you: https://mzucker.github.io/2016/08/15/page-dewarping.html

leovinus2001 · 2024-06-24T19:51:22Z

Just a quick note that I have not forgotten about this. Will have another go in a few weeks.

larsbrinkhoff · 2024-06-26T16:21:05Z

Thanks, sounds great!

leovinus2001 · 2024-09-06T12:19:08Z

While this is a low priority project, it is fair to say that life keeps pulling me in other directions. That means that I have no idea when I will get around to do the actual OCR aka image to text. Maybe later this year. It probably is only a week or so of solid work to get a great result but other priorities atm. You can see in the attached tmp_with_contours_3.png that character detection is fine but there is some effort left to update the listing layout and accuracy.

In the meantime, I have attached a tar file with black/white preprocessed images for easier OCR compared to the original scans. Something I made earlier this year. Better input, better output :) Parts of these are easy to process with commercial OCR software or even ChatGPT, Claude et al. None of those give a great result though. However, these images should be a better starting point for anyone who want to pick up from here quickly. Note that each original scan has two pages of listing and therefore there are two PNGs per page.

Anyway, fun project and I hope to revisit this another time.

allfiles.sep6.tgz

larsbrinkhoff · 2024-09-09T06:04:23Z

Thanks for the ping @leovinus2001. No worries about the time scale; it's a long-term project for me too. Thanks so much for your help so far!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Textual transcript via OCR #1

Textual transcript via OCR #1

leovinus2001 commented Apr 18, 2024

larsbrinkhoff commented Apr 19, 2024

leovinus2001 commented Apr 20, 2024

larsbrinkhoff commented Apr 21, 2024

leovinus2001 commented Apr 22, 2024

larsbrinkhoff commented Apr 23, 2024

leovinus2001 commented Apr 23, 2024

larsbrinkhoff commented Apr 23, 2024

leovinus2001 commented Apr 23, 2024

larsbrinkhoff commented Apr 24, 2024

larsbrinkhoff commented Apr 26, 2024

leovinus2001 commented Apr 26, 2024

larsbrinkhoff commented May 5, 2024

leovinus2001 commented May 5, 2024

larsbrinkhoff commented May 5, 2024

larsbrinkhoff commented May 13, 2024

leovinus2001 commented Jun 24, 2024

larsbrinkhoff commented Jun 26, 2024

leovinus2001 commented Sep 6, 2024

larsbrinkhoff commented Sep 9, 2024

Textual transcript via OCR #1

Textual transcript via OCR #1

Comments

leovinus2001 commented Apr 18, 2024

larsbrinkhoff commented Apr 19, 2024

leovinus2001 commented Apr 20, 2024

larsbrinkhoff commented Apr 21, 2024

leovinus2001 commented Apr 22, 2024

larsbrinkhoff commented Apr 23, 2024

leovinus2001 commented Apr 23, 2024

larsbrinkhoff commented Apr 23, 2024

leovinus2001 commented Apr 23, 2024

larsbrinkhoff commented Apr 24, 2024

larsbrinkhoff commented Apr 26, 2024

leovinus2001 commented Apr 26, 2024

larsbrinkhoff commented May 5, 2024

leovinus2001 commented May 5, 2024

larsbrinkhoff commented May 5, 2024

larsbrinkhoff commented May 13, 2024

leovinus2001 commented Jun 24, 2024

larsbrinkhoff commented Jun 26, 2024

leovinus2001 commented Sep 6, 2024

larsbrinkhoff commented Sep 9, 2024