Keep local copies of files in a separate mets:FLocat #1079

Merged
merged 13 commits into from
Sep 11, 2023

kba
Member

@kba kba commented Aug 20, 2023

One early decision that has haunted us for years now is that we have been using a single mets:FLocat for both the original URL of a mets:file and the local copy in the workspace we use for processing.

This PR tries to solve #323 by changing OcrdFile and the download logic in Resolver and Workspace:

  • the original URL (OcrdFile.url) remains in mets:FLocat[@LOCTYPE="URL"]/xlink:href
  • the local filename (OcrdFile.local_filename) will now be written to an additional mets:FLocat[@LOCTYPE="OTHER"][@OTHERLOCTYPE="FILE"]/xlink:href
  • The contract of Workspace.download_file is that after calling it with OcrdFile f, f will have a local_filename attribute and that is what processors should use rather than the url.
  • The logic of when to download and when to copy files in Resolver.download_to_directory and Workspace.download_file has been adapted accordingly.
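To make the two-location layout concrete, here is a minimal sketch that parses a single mets:file entry carrying both mets:FLocat elements and reads each location back. It uses only the Python standard library and is not the OcrdFile implementation; the file ID, the example.org URL, and the helper name read_locations are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
HREF = "{%s}href" % NS["xlink"]

# Hypothetical entry: the ID, URL and path are illustrative only.
METS_FILE = """
<mets:file xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink"
           ID="FILE_0017" MIMETYPE="image/tiff">
  <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/INPUT_0017.tif"/>
  <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
</mets:file>
"""


def read_locations(file_el):
    """Return (url, local_filename) for one mets:file element.

    Either value is None if the corresponding mets:FLocat is absent.
    """
    url_el = file_el.find("mets:FLocat[@LOCTYPE='URL']", NS)
    local_el = file_el.find(
        "mets:FLocat[@LOCTYPE='OTHER'][@OTHERLOCTYPE='FILE']", NS
    )
    url = url_el.get(HREF) if url_el is not None else None
    local = local_el.get(HREF) if local_el is not None else None
    return url, local


file_el = ET.fromstring(METS_FILE)
url, local_filename = read_locations(file_el)
print(url)
print(local_filename)
```

Because the original URL and the local copy live in separate elements, writing one never overwrites the other.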

The goal here is to make the OCR-D processing non-invasive. Currently, once you do ocrd workspace find --download, the original URL will be gone. With this PR, ocrd workspace find --download will add an additional mets:FLocat which can then be removed after processing is finished (to be compliant with the DFG Viewer METS profile) with ocrd workspace find --undo-download.
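The effect of undoing a download on the METS can be sketched as follows: only the mets:FLocat with LOCTYPE="OTHER" and OTHERLOCTYPE="FILE" is dropped, and the URL location stays untouched. This is a standard-library illustration under assumed sample data, not the code behind ocrd workspace find --undo-download; the function name strip_local_flocats and the example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
HREF = "{%s}href" % NS["xlink"]
METS_FILE_TAG = "{%s}file" % NS["mets"]

# Hypothetical file group after a download: each file has both locations.
SAMPLE = """
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink" USE="OCR-D-IMG">
  <mets:file ID="FILE_0017">
    <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/INPUT_0017.tif"/>
    <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
  </mets:file>
  <mets:file ID="FILE_0020">
    <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/INPUT_0020.tif"/>
    <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
  </mets:file>
</mets:fileGrp>
"""


def strip_local_flocats(root):
    """Remove every local-file FLocat below root; return the removed paths."""
    removed = []
    # Materialize the iterator first so the tree is not mutated mid-traversal.
    for file_el in list(root.iter(METS_FILE_TAG)):
        local_els = file_el.findall(
            "mets:FLocat[@LOCTYPE='OTHER'][@OTHERLOCTYPE='FILE']", NS
        )
        for flocat in local_els:
            removed.append(flocat.get(HREF))
            file_el.remove(flocat)
    return removed


root = ET.fromstring(SAMPLE)
removed = strip_local_flocats(root)
remaining = root.findall(".//mets:FLocat", NS)
print(removed)
print([el.get("LOCTYPE") for el in remaining])
```

After the call, each mets:file is back to a single URL location, which is the state the DFG Viewer METS profile expects.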

@kba kba requested a review from bertsky August 20, 2023 13:43
@kba kba marked this pull request as ready for review September 8, 2023 10:19
@MehmedGIT
Contributor

When the files are downloaded, the output is one relative file path per line. When I undid the download, it printed just None on each line. Since the local path is removed, this makes sense, however, how should the general user interpret that output?

@kba
Member Author

kba commented Sep 11, 2023

> Since the local path is removed, this makes sense, however, how should the general user interpret that output

Good point, I hadn't thought about that. Should be fixed in 0f26809

For the kant_aufklaerung_1784 test asset:

ocrd workspace find --download
OCR-D-IMG/INPUT_0017.tif
OCR-D-IMG/INPUT_0020.tif
OCR-D-GT-PAGE/PAGE_0017_PAGE.xml
OCR-D-GT-PAGE/PAGE_0020_PAGE.xml
OCR-D-GT-ALTO/PAGE_0017_ALTO.xml
OCR-D-GT-ALTO/PAGE_0020_ALTO.xml

and the reverse:

ocrd workspace find --undo-download
Removed local_filename OCR-D-IMG/INPUT_0017.tif
Removed local_filename OCR-D-IMG/INPUT_0020.tif
Removed local_filename OCR-D-GT-PAGE/PAGE_0017_PAGE.xml
Removed local_filename OCR-D-GT-PAGE/PAGE_0020_PAGE.xml
Removed local_filename OCR-D-GT-ALTO/PAGE_0017_ALTO.xml
Removed local_filename OCR-D-GT-ALTO/PAGE_0020_ALTO.xml
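One plausible way to get from the None lines to the output above is to remember the local path before clearing it, and report that remembered value. The sketch below illustrates only that idea; it is not necessarily what commit 0f26809 does, and FileEntry and undo_download_report are hypothetical stand-ins rather than ocrd classes or functions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileEntry:
    """Hypothetical stand-in for a file record (not ocrd's OcrdFile)."""
    url: str
    local_filename: Optional[str] = None


def undo_download_report(entries: List[FileEntry]) -> List[str]:
    """Clear local_filename on each entry and describe what was removed."""
    lines = []
    for entry in entries:
        if entry.local_filename is None:
            continue  # nothing was downloaded for this entry
        removed = entry.local_filename  # capture before it is cleared
        entry.local_filename = None
        lines.append(f"Removed local_filename {removed}")
    return lines


entries = [
    FileEntry("https://example.org/INPUT_0017.tif", "OCR-D-IMG/INPUT_0017.tif"),
    FileEntry("https://example.org/INPUT_0020.tif", "OCR-D-IMG/INPUT_0020.tif"),
]
report = undo_download_report(entries)
for line in report:
    print(line)
```

Printing the attribute after it has been cleared would yield None, which matches the confusing output reported earlier in the thread.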

@MehmedGIT
Contributor

Yes, that output is more convenient. Great!

3 participants