Improve usability of ocrd-dummy #803

kba · 2022-02-15T15:35:49Z

Current situation

ocrd-dummy is a builtin processor of OCR-D/core that serves as a minimalist processor for testing. It does nothing except copy from --input-filegrp to --output-filegrp, so ocrd-copy would be a better name for it.

While the intention was just for testing, it is actually useful in certain cases. For example, if one starts with a workspace that contains just a folder of images and needs the corresponding PAGE-XML, copying the files to a new fileGrp with ocrd-dummy accomplishes that.
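To make "the corresponding PAGE-XML" concrete, here is a minimal sketch of what a near-empty PAGE document for a single image could look like. It uses only the Python standard library. The helper name minimal_page_xml, the "example" creator string and the file path are made up for illustration and are not part of OCR-D/core, which has its own routines for this.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# PAGE content namespace (2019-07-15 version of the schema).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def minimal_page_xml(image_filename: str, width: int, height: int) -> str:
    """Return a near-empty PAGE-XML document describing one image.

    Hypothetical helper for illustration only; not an OCR-D/core function.
    """
    ET.register_namespace("pc", PAGE_NS)
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

    root = ET.Element(f"{{{PAGE_NS}}}PcGts")

    # Metadata block with creator and timestamps.
    metadata = ET.SubElement(root, f"{{{PAGE_NS}}}Metadata")
    ET.SubElement(metadata, f"{{{PAGE_NS}}}Creator").text = "example"
    ET.SubElement(metadata, f"{{{PAGE_NS}}}Created").text = now
    ET.SubElement(metadata, f"{{{PAGE_NS}}}LastChange").text = now

    # The Page element only references the image; it has no regions yet.
    ET.SubElement(root, f"{{{PAGE_NS}}}Page", {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(minimal_page_xml("OCR-D-IMG/page_0001.tif", 2000, 3000))
```

The image dimensions are passed in explicitly here so the sketch needs no imaging library; a real processor would read them from the image file.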

Pertinent Discussion in gitter

@stefanCCS: Is there an OCR-D processor which creates an initial, more-or-less empty PAGE-XML when I only have my source images (OCR-D-IMG) in mets.xml?

@kba: You could use the ocrd-dummy processor that is bundled with core. It copies from --input-filegrp to --output-filegrp and creates corresp. PAGE-XML as well. I am not sure whether there's a simpler way, but I could imagine @bertsky has some processor for that in his arsenal.

@stefanCCS: This sounds simple enough - I will try ...

@bertsky: Yes, ocrd-dummy would do, but ocrd-preprocess-image (with some no-op for the derived image, like cp) or ocrd-page-transform (with an identity XSLT) would also work.

@stefanCCS: Will "page-transform" work without having a Page-XML before?

@kba: No, I don't think so, AFAICS it will probably try to apply XSLT to the image, not the PAGE-XML that represents this image (which doesn't exist yet).

@bertsky: Yes, sorry, you are right, ocrd-page-transform will not work on pure image fileGrps currently. (We have no page_from_file in bashlib yet – the only bashlib processor which is able to compensate for this is ocrd-olena-binarize.)

@stefanCCS: I have just tried out ocrd-dummy -> works fine, as expected.
Just an idea: What about an extension (parameter) for ocrd-dummy, which just creates the PAGE (without copying the Images)?
(e.g. the OCR-engine processors also do not create new images...)

@kba: That would be possible. I am not sure whether that is advisable over removing the source filegroup and renaming the target filegroup afterwards. If we just skip writing out the image, the imageFilename attribute in the PAGE will point to the wrong image.

@kba: Maybe we should really have a dedicated, minimal processor ocrd-page-from-image that does just that and allows, different than all the processors we have, to read and write to the same filegroup.

@stefanCCS: How are the OCR engine processors doing this? (As far as I know, they also "only" create a new filegroup with new PAGE-XML files containing the OCR result.)
-> or maybe I am wrong?

@stefanCCS: ocrd-page-from-image -> would be very nice, of course :-)

@kba: As a general rule, we never change the imageFilename between steps, only add AlternativeImages to the pc:Page.
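The rule above can be sketched as follows: a processing step that derives a new image leaves the imageFilename attribute alone and records the derived image as an additional AlternativeImage child of the Page element. This is only an illustration with the standard library. The function name add_alternative_image, the sample document and the file paths are hypothetical; OCR-D processors would normally use the PAGE API from core instead of raw XML manipulation.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ALT_TAG = f"{{{PAGE_NS}}}AlternativeImage"


def add_alternative_image(page_xml: str, filename: str, comments: str) -> str:
    """Reference a derived image without touching Page/@imageFilename.

    Hypothetical helper for illustration only.
    """
    ET.register_namespace("pc", PAGE_NS)
    root = ET.fromstring(page_xml)
    page = root.find(f"{{{PAGE_NS}}}Page")
    if page is None:
        raise ValueError("no Page element found")

    # In the PAGE schema, AlternativeImage elements come first among the
    # children of Page, so the new one goes right after any existing ones.
    position = sum(1 for child in page if child.tag == ALT_TAG)
    alt = ET.Element(ALT_TAG, {"filename": filename, "comments": comments})
    page.insert(position, alt)
    return ET.tostring(root, encoding="unicode")


# A tiny sample document (made up for this example).
EXAMPLE = (
    f'<pc:PcGts xmlns:pc="{PAGE_NS}">'
    '<pc:Page imageFilename="OCR-D-IMG/page_0001.tif" '
    'imageWidth="2000" imageHeight="3000"/>'
    '</pc:PcGts>'
)

if __name__ == "__main__":
    print(add_alternative_image(EXAMPLE, "OCR-D-BIN/page_0001.png", "binarized"))
```

After the call, imageFilename still names the original image, and the binarized version is available as an AlternativeImage for later steps.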

@stefanCCS: Well, this means, in my opinion, that you have a choice of what to implement:
variant a): ocrd_page_from_image, which adds PAGE files for the existing images in the existing file group (which is "unusual" in comparison to all other processors, but still a nice feature, I think),
or variant b): act like an OCR engine, i.e. just take a given file group with images and create a new file group with just the (in this case very simple) PAGE files.
==> Both variants sound fine to me.

@bertsky: I think that ocrd-dummy should not copy non-PAGE content as it is now. The proper name for the current behaviour would be ocrd-copy. A truly neutral processor should just output PAGE with no additional information, i.e. a copy of the input PAGE or a newly created PAGE if the input was image-only.

@stefanCCS: sounds reasonable

@bertsky: BTW the question of whether writing to the same fileGrp is allowed is not tied to a single processor, but should be a general option IMHO.

How it should be

There should be a parameter disabling the copying of files, so that only the generated PAGE-XML is written out.
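As a rough sketch of the requested behaviour, the following function only plans which files would be written to the output file group, depending on a boolean switch. It is not the OCR-D/core implementation: the FileEntry class, the plan_outputs function and the flag name copy_files are assumptions made for this illustration, and the output naming scheme is invented.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List

# MIME type used for PAGE-XML files in OCR-D workspaces.
MIMETYPE_PAGE = "application/vnd.prima.page+xml"


@dataclass
class FileEntry:
    """Simplified stand-in for a METS file entry (hypothetical)."""
    file_grp: str
    path: str
    mimetype: str


def plan_outputs(inputs: List[FileEntry], output_grp: str,
                 copy_files: bool = False) -> List[FileEntry]:
    """List the files a copy/dummy step would write to output_grp."""
    outputs = []
    for entry in inputs:
        name = PurePosixPath(entry.path).name
        if entry.mimetype == MIMETYPE_PAGE:
            # An existing PAGE file is always passed on to the output group.
            outputs.append(FileEntry(output_grp, f"{output_grp}/{name}", MIMETYPE_PAGE))
            continue
        # Non-PAGE input (e.g. an image): copy it only if requested ...
        if copy_files:
            outputs.append(FileEntry(output_grp, f"{output_grp}/{name}", entry.mimetype))
        # ... but always emit a newly generated PAGE file describing it.
        stem = PurePosixPath(name).stem
        outputs.append(FileEntry(output_grp, f"{output_grp}/{stem}.xml", MIMETYPE_PAGE))
    return outputs


if __name__ == "__main__":
    images = [FileEntry("OCR-D-IMG", "OCR-D-IMG/page_0001.tif", "image/tiff")]
    for planned in plan_outputs(images, "OCR-D-PAGE", copy_files=False):
        print(planned.path, planned.mimetype)
```

With copying disabled, only the generated PAGE file appears in the output group; with copy_files=True, the image copy is listed as well.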

ocrd-dummy is a bad name, we should call it something more intuitive like ocrd-copy or provide the non-copying-PAGE-generating functionality in a separate processor ocrd-page-from-images.

We should specify whether, and if so under what circumstances, a processor may operate directly on the input filegrp. It's something we strongly discouraged in the past, because "processors never change the input data" is a reasonable contract to expect. For real-life requirements, this should be revisited and spelt out in the specs.


kba · 2022-12-13T16:43:07Z

There should be a parameter disabling the copying of files, so that only the generated PAGE-XML is written out.

This has been implemented in #814 and released in 2.45.0

ocrd-dummy is a bad name, we should call it something more intuitive like ocrd-copy or provide the non-copying-PAGE-generating functionality in a separate processor ocrd-page-from-images.

These are still open. Thinking some more about it, I think it would be best to "namespace" such builtin processors as ocrd-util-*. So in this case it would be ocrd-util-copy, and the post-processing/cleanup processor could be ocrd-util-cleanup.

kba added the Epic label Feb 15, 2022

kba added a commit that referenced this issue Mar 7, 2022

ocrd-dummy: make copying optional and disable by default, #803

4097791

krvoigt assigned kba Jul 18, 2022
