Update derivative generation to use derivative rodeo to skip *TN.jpg and .READER.pdf files #431

Open
jeremyf opened this issue Mar 14, 2023 · 5 comments
Labels
blocked (other work must be completed first), needs discussion (has open questions or need for discussion)

Comments

@jeremyf
Contributor

jeremyf commented Mar 14, 2023

In the <2023-03-14 Tue> conversation with Katharine, we have the following situation:

  • We have archival PDFs for everything
  • Some works have secondary PDFs

We can and should skip derivative generation for PD for those secondary PDFs. All secondary PDFs have the suffix .READER.pdf (make sure to test in a case-insensitive manner). Example: 32000812.READER.pdf
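A minimal sketch of the case-insensitive suffix check described above. The constant and method names are hypothetical and are not taken from the project's code; where such a check would be called from is left open.

```ruby
# Hypothetical helper (names are assumptions, not the project's API).
# Matches file names ending in ".READER.pdf", ignoring case.
READER_PDF_PATTERN = /\.reader\.pdf\z/i

def reader_pdf?(filename)
  # Compare against the base name only, so directory names cannot match.
  READER_PDF_PATTERN.match?(File.basename(filename.to_s))
end

reader_pdf?("32000812.READER.pdf") # true
reader_pdf?("32000812.pdf")        # false
```

The `\z` anchor keeps the match at the very end of the name, so something like `32000812.READER.pdf.bak` is not treated as a reader PDF.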

For a reference implementation (albeit with different rules):

Related to:

We also do not want to create derivatives for TN.jpg files.
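As a rough sketch, thumbnails could be separated from the other attached files before any derivative work is queued. The pattern below assumes "TN.jpg files" means any file name ending in TN.jpg, and it assumes case should be ignored (the thread only states that for .READER.pdf). The method name is hypothetical.

```ruby
# Assumption: a thumbnail is any file whose name ends in "TN.jpg",
# matched case-insensitively. Adjust if the real naming rule is stricter.
THUMBNAIL_PATTERN = /tn\.jpg\z/i

# Hypothetical helper: returns [thumbnails_to_skip, files_to_process].
def partition_thumbnails(filenames)
  filenames.partition do |name|
    THUMBNAIL_PATTERN.match?(File.basename(name.to_s))
  end
end

skipped, to_process = partition_thumbnails(["32000812.pdf", "32000812.TN.jpg"])
# skipped    => ["32000812.TN.jpg"]
# to_process => ["32000812.pdf"]
```

If ordinary file names can legitimately end in "tn", a stricter pattern that requires a separator before TN may be the safer choice.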

Testing Instructions

  • In the UI, create a work
  • Attach a simple PDF with unique text (something you could search on) that has the file suffix .READER.pdf (e.g. my-file.READER.pdf)
  • Once the file is attached and processed, attempt to search for the text of that PDF. The created work should not be found.
@jeremyf jeremyf self-assigned this Mar 15, 2023
jeremyf referenced this issue in notch8/adventist-dl Mar 15, 2023
Prior to this commit, we were generating derivatives for all of the
PDFs.  This could be both an archival and access PDF.

And we did not need those duplicate derivatives.

With this commit, we're skipping derivative processing for any of the
non-archival PDFs.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/311

References:

- notch8/utk-hyku#353
jeremyf referenced this issue in notch8/adventist-dl Mar 23, 2023
Prior to this commit, we skipped generating derivatives on the
`.reader.pdf` (see [313]).  However, we also wanted to avoid splitting
the reader PDFs.

With this commit, we now have logic that avoids splitting `.reader.pdf`
files into constituent pages.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/286
- https://github.com/scientist-softserv/adventist-dl/issues/311

[313]: #313
@KatharineV
Collaborator

Team, can we exclude PDFs with .READER.pdf and also .pdf-r.pdf? I recently found a big batch of material our digitization center processed with the .pdf-r.pdf file naming convention at some point in the past. We'd like to exclude these files from the viewer, as they are Reader files (but just didn't receive the correct name).

@jeremyf
Contributor Author

jeremyf commented Mar 28, 2023

Absolutely going to add this bit of logic.
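A minimal sketch of what that logic might look like, covering the two suffixes named in this thread. The constant and method names are hypothetical; this is not the code that was committed.

```ruby
# Hypothetical list of reader-file suffixes, stored lower-case.
# Only the two suffixes mentioned in this thread are included.
READER_SUFFIXES = [".reader.pdf", ".pdf-r.pdf"].freeze

def reader_file?(filename)
  # Downcase the base name so the comparison ignores case.
  File.basename(filename.to_s).downcase.end_with?(*READER_SUFFIXES)
end

reader_file?("32000812.READER.pdf") # true
reader_file?("scan.pdf-r.pdf")      # true
reader_file?("scan.pdf")            # false
```

Keeping the suffixes in one list means a further naming convention can be handled by adding an entry rather than another branch.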

@jeremyf
Contributor Author

jeremyf commented Mar 28, 2023

I want to de-prioritize this as the derivative work that I’m doing this week should resolve/supersede the changes that I’ve made to attempt to address this ticket.

(The importer process I'm working through will allow for significant improvements but is a complete re-architecture of the approach)

Duplicated/replaced by:

@laritakr
Contributor

We also do not want to create derivatives for TN.jpg files.

@jillpe

jillpe commented Apr 17, 2023

dependent on derivative rodeo work

@jillpe jillpe changed the title from "Skip derivative generation for .READER.pdf files" to "Update derivative generation to use derivative rodeo to skip *TN.jpg and .READER.pdf files" Apr 17, 2023
@jillpe jillpe added the blocked (other work must be completed first) label Apr 17, 2023
@jillpe jillpe added the needs discussion (has open questions or need for discussion) label Aug 31, 2023
jeremyf referenced this issue in notch8/adventist-dl Nov 22, 2023
We're seeing jobs that are trying to find the HOCR of a thumbnail; we
don't need that HOCR and it's spawning 5 jobs that are unnecessary.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/311
@kirkkwang kirkkwang transferred this issue from notch8/adventist-dl May 10, 2024
@jeremyf jeremyf removed their assignment May 30, 2024