EPIC: Import Optimization with Out of Band Processing #421

jeremyf · 2023-03-24T17:50:46Z

Goals

The goal of this epic is to adjust the derivative “creation” process.

At present Hyrax creates a FileSet for each file added to a work either via the UI or via batch processing. We then process each original file of a FileSet creating derivative files that we attach to that FileSet. This is all done “in-band”, which can be non-performant for large imports.

To speed up the imports we can:

add more resources to the import system
perform “out of band” processing to generate derivatives

By introducing the idea of the “out of band” processing, we break the fundamental assumption of Hyrax’s derivative generation. Namely that it will take a FileSet and create all the necessary derivatives. Instead, with pre-processing, we are now saying “There may already exist a derivative for this FileSet, attach that instead.”

Instead of having one conceptual “Create Derivatives” function we are looking to have three:

Pre-processing: doing out of band work to create derivatives.
Locating: Finding the existing derivatives (or possibly indicating that we’d create new ones).
Applying: Taking the found derivative and adding it to the correct FileSet.

We have a “prior art” instance of Pre-processing in the SpaceStone Ruby gem. That gem is code that runs in AWS Lambdas to pull data from Internet Archive (per the client’s previous storage), split apart the PDFs, create OCR derivatives of each page, and create thumbnails.

Further, we have “prior art” for Locating and Applying .hocr files in NNP’s Tesseract module; responsible for first looking for a remote .hocr and failing that generating the .hocr file.

We have some logic for finding the Pre-processing files in NNP’s DistributedRemoteDownloader.

Those are specific implementations that demonstrate some of the necessary concepts. However, those are different from the immediate needs of Adventist and the general needs of other clients.

Scenarios

In terms of Pre-Processing we need the following:

Scenario: Pre-processing
Given an identifier for a FileSet
  And a URL to the original file
  And the file is a PDF
When I pass the identifier and URL to an AWS lambda
Then the Pre-processor Lambda will create one <JPG> per PDF page
  And the Pre-processor Lambda will create one HOCR per PDF page
  And the Pre-processor Lambda will create one Thumbnail per PDF page

In terms of Locating we need the following:

Scenario: Locating derivatives that were pre-processed
Given an identifier for a FileSet
  And the Pre-processor Lambda has processed the FileSet
When I attempt to locate the <JPGs, HOCRs, Thumbnails> for the split pages
Then I will get the location of the S3 files

Scenario: Locating derivatives that were not pre-processed
Given an identifier for a FileSet
  And the Pre-processor Lambda has not processed the FileSet
When I attempt to locate the <JPGs, HOCRs, Thumbnails> for the split pages
Then I will not get a location of the S3 files
  And I will get a “location” that will indicate to use the default Hyrax derivative generation

In terms of Applying we need the following:

Scenario: Applying derivatives that were pre-processed
Given an identifier for a FileSet
  And the Pre-processor Lambda has processed the FileSet
When I attach the located <JPGs, HOCRs, Thumbnails> for the split images
Then I will fetch those derivatives from S3
  And attach those derivatives to the FileSet

Scenario: Applying derivatives that were not pre-processed
Given an identifier for a FileSet
  And the Pre-processor Lambda has not processed the FileSet
When I attach the located <JPGs, HOCRs, Thumbnails> for the split images
Then I will generate those derivatives
  And attach those derivatives to the FileSet
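
The Locating and Applying scenarios above can be sketched as two small functions. This is a minimal illustration, not the IIIF Print or SpaceStone API: the names `locate`, `apply`, and `GENERATE_LOCALLY` are hypothetical, and a plain Hash stands in for the S3 bucket.

```ruby
# Hypothetical sentinel "location" meaning: nothing was pre-processed,
# so fall back to the default Hyrax derivative generation.
GENERATE_LOCALLY = :generate_locally

# Locating: return the key of the pre-processed derivative, or the sentinel.
# `store` is a Hash standing in for the S3 bucket (an assumption).
def locate(store, identifier:, derivative_type:)
  key = "#{identifier}/#{derivative_type}"
  store.key?(key) ? key : GENERATE_LOCALLY
end

# Applying: fetch the located derivative, or run the given generator block
# when the locator returned the sentinel. Returns the content to attach.
def apply(store, location, &generator)
  return generator.call if location == GENERATE_LOCALLY

  store.fetch(location)
end

store = { "fs-1/hocr" => "<pre-processed hocr>" }

found = locate(store, identifier: "fs-1", derivative_type: :hocr)
missing = locate(store, identifier: "fs-2", derivative_type: :hocr)

attached_from_store = apply(store, found) { "<generated hocr>" }
attached_generated = apply(store, missing) { "<generated hocr>" }
```

Returning a sentinel rather than nil keeps "not pre-processed" distinct from "lookup failed", which is one way to read the “location” wording in the second Locating scenario.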

One consideration is that Hyrax has implicit derivatives that are named concepts; for the above features we need to expose those named concepts. Namely, how do I locate and apply the thing called thumbnail?

Pre-Process Tasks

Adventist S3 Convention: there’s an AARK unique for each work. Write files there.

Ingest Task

Task Scratch Pad

Not all of these will be converted to tasks; they instead represent a current working understanding. Once we create a task from the checkmark, then we’re looking more at actionable tasks.

Spike: Establish the S3 bucket organization structure with consideration for: identifier, original file, and derivative type.
Merge Samvera::Derivatives configuration concerns with IiifPrint configuration
- The APIs are not part of production code and provide the named building blocks for this work.
Rename IiifPrint to Samvera::Derivatives
- IiifPrint seems to be a bit of a misnomer; we’re talking about how we want to make derivatives.
Space Stone:
- Run OCR on individual pages
  - Consideration: how is this logic the same/different from what we have in IIIF Print? What does a canonical function look like?
- Generate a thumbnail for each page
  - Consideration: how is this logic the same/different from what we have in Hyrax? What does a canonical function look like?
Iiif Print:
- Create general S3 locator function
  - This needs to account for the derivative type (e.g. hocr, thumbnail, splits, etc) and the identifier.
- Create general S3 applicator function
  - This needs to write the file to the location expected by Fedora. This is behavior that has prior art.
- Spike: Resolve whether S3 locator pulls file down or if that is the applicator’s purview
Read OAI feed, sending non-thumbnail (and non-reader.pdf) files to S3

Prior to this commit, I had a data structure that "built itself" from an internal process. With this refactor, I created a simple data structure and a method to extract the data structure from a given file. The benefit of this setup is that I don't need to run the pdfimages command against a file (well, I probably should run it at least once) every time I want to do something with, or test, PDF processing. Related to: - notch8/iiif_print#194 - samvera/bulkrax#760 - notch8/utk-hyku#343 - https://github.com/scientist-softserv/adventist-dl/issues/330

This is an almost direct copy of IIIF Print's existing code Related to: - notch8/iiif_print#194 - samvera/bulkrax#760 - notch8/utk-hyku#343 - https://github.com/scientist-softserv/adventist-dl/issues/330

As I'm thinking through the logic, I'm realizing that we have a splitter and a strategy; the splitter leverages the strategy. As implemented the splitter is the strategy is the splitter. Related to: - notch8/iiif_print#194 - samvera/bulkrax#760 - notch8/utk-hyku#343 - https://github.com/scientist-softserv/adventist-dl/issues/330

jeremyf · 2023-03-28T13:28:50Z

Sidebar: As I think about SPD, IIIF Print gem, and SpaceStone, I’m wondering if the better approach is to move them all into IIIF Print. Then I would declare what groups to use (akin to what we do with web and worker). This does create a weird space where there could be inadvertent bleed. For now, the extraction and separation of concerns feels like the correct exploratory exercise.

As I think a bit more on this, I don’t believe merging the three gems together is the right approach, given that one is envisioned as an Engine and the other as conceptual shell scripts. Conflating those three concepts seems like it would make it harder to implement towards a clean and crisp interface.

Yesterday, I brought over several classes/utilities from the IIIF Print gem. These were lower level functions that are used by processes within IIIF Print gem. Today, I need to bring over the remaining lower level classes and utilities. I am presently working on bringing over PageOCR.

I have locally set up my SPD development so that I run rubocop and rspec on each commit and each push to the remote repository. I have not set up continuous integration on the remote repository.

I will also be bringing over the Samvera::Derivatives interfaces; those belong in SPD. The Hyrax specific implementation does not. The crease I’m looking for is now to create locators for SPD and pre-processors for SPD.

The pre-processors will echo what was done in Newman Numismatic Portal. The pre-processor, in general will receive a derivative_type and an identifier.

One derivative_type is “original” (perhaps we should rename this). Other examples, albeit somewhat arbitrary, are thumbnail, text, and hocr.

We will have a SpaceStone “entry” that looks like the following:

identifier: abcd-1234-efgh-5678
original: https://path.to/some/file
thumbnail: https://path.to/some/other/file

Each project that uses SpaceStone will need to name (e.g. “thumbnail”) what derivatives it wants to generate and the function for generating that derivative from the original.

Below is a feature description:

Given an entry with the identifier
  And the original (with corresponding URL)
  And the "thumbnail" derivative type (with corresponding URL)
  And SpaceStone is configured to generate "text" derivatives
  And SpaceStone is configured to generate "thumbnail" derivatives
When the SpaceStone Lambda processes the given entry
Then SpaceStone will not generate the "thumbnail" derivative
  And will fetch the "thumbnail" derivative
  And store the "thumbnail" derivative
  And SpaceStone will generate the "text" derivative
  And SpaceStone will not fetch the "text" derivative
  And SpaceStone will store the "text" derivative
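
The feature above boils down to a per-type decision: fetch when the entry already names a URL for that derivative type, otherwise generate from the original. The sketch below assumes a hypothetical `process_entry` function, with lambdas standing in for the HTTP fetch and the configured generator functions and a Hash standing in for storage; none of these names come from SpaceStone itself.

```ruby
# For each configured derivative type, either fetch the derivative the entry
# already points at, or generate it from the original. Either way, store it.
# Returns a Hash of { type => :fetched | :generated } for inspection.
def process_entry(entry, generators:, fetcher:, storage:)
  generators.each_with_object({}) do |(type, generator), actions|
    storage_key = [entry.fetch(:identifier), type]

    if entry.key?(type)
      # The entry provides a URL for this type: fetch, do not generate.
      storage[storage_key] = fetcher.call(entry.fetch(type))
      actions[type] = :fetched
    else
      # No URL provided: generate from the original, do not fetch.
      storage[storage_key] = generator.call(entry.fetch(:original))
      actions[type] = :generated
    end
  end
end

entry = {
  identifier: "abcd-1234-efgh-5678",
  original: "https://path.to/some/file",
  thumbnail: "https://path.to/some/other/file"
}

# Stand-ins (assumptions) for the project's named generator functions.
generators = {
  thumbnail: ->(original) { "thumbnail generated from #{original}" },
  text: ->(original) { "text generated from #{original}" }
}
fetcher = ->(url) { "contents of #{url}" }
storage = {}

actions = process_entry(entry, generators: generators, fetcher: fetcher, storage: storage)
```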

Related to:

jeremyf · 2023-03-29T13:21:55Z

I have brought in the PageOCR logic from IIIF Print. One observed problem/challenge is that it has an interface that will need reworking; the exposed public methods are a bit confusing (so I need to perform more analysis). Another challenge is that it generates three or four different derivatives that go towards text extraction.

My plan for 2023-03-29 is to disentangle the different file creation processes so that we can use the named files provided instead of generating. There is some preliminary work.

Specific tasks are:

Create a SpaceStone::Derivatives::Configuration
- Need a tesseract additional command line options; used for specifying different trained data sets.
Bring in some of Samvera::Derivatives from IIIF Print (folding into the SpaceStone::Derivatives)
Disentangle the “hocr” file creation process to be able to use an existing “hocr” file.

A competing priority is deploying changes to the British Library and ensuring that is in a good state for their end of week priority.

jeremyf · 2023-03-30T03:04:38Z

With a bit of retrospective, the most critical decision I made early was to name each derivative (e.g. :hocr, :text, :monochrome); after all we have a named “file” for each of those. In doing so, I have a conceptual object in which to organize my code.

Yesterday, I also turned a major corner on this project. In the morning I sat down and started writing narrative descriptions in the README of SpaceStone::Derivatives.

This led to naming the concepts of the SpaceStone::Derivatives::Manifest and the SpaceStone::Derivatives::Repository. With those names, I had eliminated some of my mental barriers regarding the various layers of abstractions and mappings.

A second revelation was when I stepped away from the code and started verbally narrating. I had hit a minor mental block, fixating on a low-level detail and losing the thread to the larger feature requirement. The inspiration came when I named the SpaceStone::Derivatives.pre_process_derivatives_for method.

That named method gave me the clear mental map into the process steps.

Immediately, I knew I would need to resolve the dependency graph of derivatives. I first started with a validator function that could process a hash.

Then I thought about how I would perform the sequence, which introduced the idea of the SpaceStone::Derivatives::Chain. I moved the validator into that class and began working on a sequencer function; likewise the sequencer would process a hash. As I delved deeper, I found that the Validator and Sequencer were performing duplicate logic.

I had begun stenciling in the SpaceStone::Derivatives::Types in an effort to play with the conceptual idea.

I ripped out the validator and settled on SpaceStone::Derivatives::Chain::Sequencer; again something I could test with a Hash that had symbol keys and values that were arrays of symbols.
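
Sequencing a Hash of symbol keys to arrays of symbols is a topological sort, and resolving the order also validates the graph, which may be why a separate validator duplicated the logic. Below is a minimal sketch using Ruby's standard `TSort` module. The class name `ExampleSequencer` and the dependency relationships shown are illustrative assumptions, not the actual SpaceStone::Derivatives::Chain::Sequencer.

```ruby
require "tsort"

# Keys are derivative names; values are the derivatives each one depends on.
class ExampleSequencer
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # Returns the derivative names ordered so that dependencies come first.
  # Raises TSort::Cyclic on a circular dependency and KeyError when a
  # dependency is named but never declared as a key.
  def sequence
    tsort
  end

  # TSort hook: yield every node in the graph.
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # TSort hook: yield the nodes that must be processed before `node`.
  def tsort_each_child(node, &block)
    @dependencies.fetch(node).each(&block)
  end
end

# Assumed relationships, for illustration only.
chain = { hocr: [:monochrome], text: [:hocr], monochrome: [] }
ordered = ExampleSequencer.new(chain).sequence
# ordered is [:monochrome, :hocr, :text]
```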

With the sequence of derivative generation resolved, I set about the conceptual SpaceStone::Derivative::Processor. It began its life named “PreProcessor” but as I was writing the documentation, I wrote “send the pre_process! message to each of the types in the chain.” With the word “message”, I realized I could use a dispatching strategy (e.g. send(message, repository: repository)).
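
A minimal sketch of that dispatching strategy follows. `ExampleType` and `process` are hypothetical stand-ins, an Array stands in for the repository, and `public_send` is used here instead of `send` so that only public methods can be dispatched; that substitution is a choice made for this sketch, not something the comment specifies.

```ruby
# A stand-in derivative type that responds to the pre_process! message.
class ExampleType
  attr_reader :name

  def initialize(name)
    @name = name
  end

  def pre_process!(repository:)
    repository << "pre-processed #{name}"
  end
end

# Send the same message, with the repository, to each type in the chain.
def process(chain, message:, repository:)
  chain.each { |type| type.public_send(message, repository: repository) }
  repository
end

chain = [ExampleType.new(:monochrome), ExampleType.new(:hocr)]
result = process(chain, message: :pre_process!, repository: [])
```

Because the message is a parameter, the same processor could later dispatch a different message to the same chain without a second class.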

A key design consideration has been quick and easy testing. And during the day’s development, I refactored the method signatures a few times. Each time spending a few minutes changing and running tests.

For 2023-03-29 I plan to look into the following:

The Repository needs to consider its file storage strategy. I have some ideas and will likely be implementing an adapter pattern.
I need to switch the current ported over derivatives to the new SpaceStone::Derivatives::Types.
I feel close to being able to wire this back into IIIF Print; however I think that is a lower priority.
I wrote Manifest::FileLocationSet as a named parameter. I’ll need to play with that a bit.

More important is getting the entire pre-processing ready to run via SpaceStone proper (e.g. AWS Lambda).

As an added benefit, I believe that SpaceStone::Derivatives is almost certainly vendor agnostic (managed instead by the yet to be made repository file storage strategy).

I continue to rely on local tests, both style guides and rspec. These run each time I commit code and each time I push code to GitHub.

jeremyf · 2023-03-31T12:53:21Z

Inspired by LeaAnn’s “Project and Task Breakdown” presentation on Thursday, I wanted to write up the task breakdown/algorithm:

For the pre-processing in AWS:

Check if the file exists in the expected AWS location. If it does, return a “handle” to it.
Else, if it doesn’t and the manifest says it has a remote URL, attempt to GET it.
- On a 404, log a warning and return “nil”
- On a 2xx, copy it into the expected location, and return the “handle”
- On any other status, log an error and raise an exception.
Else, if it can’t be remotely fetched, attempt to Generate it.
- On a failure to generate, log an error and raise an exception.
- On a success but there’s no file, log an error and raise an exception.
- On a success with a file, move the file to the expected location and return the “handle”.
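
The breakdown above can be sketched as a single function. This is an illustration under stated assumptions: `ensure_derivative`, `FetchError`, and `GenerationError` are hypothetical names, a Hash stands in for the expected AWS location, `fetch` is any callable returning `[status, body]`, and the returned “handle” is simply the storage key. It follows the list literally, so a 404 returns nil rather than falling through to generation.

```ruby
require "logger"

class FetchError < StandardError; end
class GenerationError < StandardError; end

def ensure_derivative(key, storage:, remote_url: nil, fetch: nil, generate: nil, logger: Logger.new(nil))
  # 1. Already in the expected location: return the handle.
  return key if storage.key?(key)

  # 2. The manifest names a remote URL: attempt to GET it.
  if remote_url
    status, body = fetch.call(remote_url)
    case status
    when 200..299
      storage[key] = body
      return key
    when 404
      logger.warn("#{remote_url} responded 404")
      return nil
    else
      logger.error("#{remote_url} responded #{status}")
      raise FetchError, "unexpected status #{status} for #{remote_url}"
    end
  end

  # 3. Nothing to fetch: attempt to generate.
  begin
    body = generate.call
  rescue StandardError => e
    logger.error("generation failed for #{key}: #{e.message}")
    raise GenerationError, "generation failed for #{key}"
  end

  if body.nil?
    logger.error("generation produced no file for #{key}")
    raise GenerationError, "no file generated for #{key}"
  end

  storage[key] = body
  key
end

storage = { "fs-1/hocr" => "existing hocr" }
existing_handle = ensure_derivative("fs-1/hocr", storage: storage)
fetched_handle = ensure_derivative(
  "fs-1/thumbnail",
  storage: storage,
  remote_url: "https://example.org/thumbnail.jpg",
  fetch: ->(_url) { [200, "remote thumbnail"] }
)
generated_handle = ensure_derivative("fs-1/text", storage: storage, generate: -> { "generated text" })
```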

What is the file “handle”? Perhaps the path name.

I also say “AWS” but this is really the pre-processing environment; a “loading dock” if you will.

In the above case, we only want to verify that we have a “handle”. If the “handle” does not exist, that is an error in processing the manifest. Put another way, once we’ve processed the manifest, we need to audit its integrity.

jeremyf · 2023-04-05T14:41:30Z

Thoughts from 2023-04-05:

I have a working proof of concept for monochrome and hocr. Now I need to look into PDF splitting. I start with an original file that is a PDF.

I want to make a thumbnail of the PDF. I also want to split the PDF. When I split the PDF, I’m probably going to create a manifest for each of the pages. And then feed those manifests to the processor.

I likely want the original PDF and the split files to be in a similar location (for easier finding). What would that look like?

/path/to/:parent_id/:original_file/<original>
/path/to/:parent_id/:original_file/pdf_split/:index/<image>
/path/to/:parent_id/:original_file/pdf_split/:index/<monochrome>
/path/to/:parent_id/:original_file/pdf_split/:index/<hocr>
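
One way to keep that layout predictable is to build every path through a single function. The sketch below assumes a hypothetical `derivative_path` helper and illustrative file names; it is not an existing SpaceStone method.

```ruby
# Build a derivative path that follows the layout above. When `index` is
# given, the file is placed under the pdf_split/:index sub-directory.
def derivative_path(root:, parent_id:, original_file:, name:, index: nil)
  parts = [root, parent_id, original_file]
  parts += ["pdf_split", index.to_s] unless index.nil?
  File.join(*parts, name)
end

original = derivative_path(
  root: "/path/to", parent_id: "work-1", original_file: "paper.pdf", name: "paper.pdf"
)
page_hocr = derivative_path(
  root: "/path/to", parent_id: "work-1", original_file: "paper.pdf", name: "page.hocr", index: 0
)
```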

As written, I do not have a consistent predictable temporary directory creation process. The input for derivatives is:

Parent ID
Original Filename
Original URL
URL
Working directory…if it exists, use that, otherwise, create a new one and assign.

jeremyf · 2023-04-05T15:08:38Z

This morning I realized that I need to lean into the Chain concept. Namely, because of the async nature of AWS, I need to process the chain as follows:

Given a chain of <A>, <B>, <C>, and <D>
When I “schedule” <A>, I need to provide the chain.
Then as part of completing <A>, it “schedules” <B>.

In the above example, let’s say that <B> is the :split_pdf. It is responsible for launching the “sub-processes” of :ocr. An assumption is that the given Chain creates the files that later links are dependent on. In other words, the sub-processes of :ocr are not dependents of the above <C> nor <D>. And the sibling processes of split pages are not dependent on each other.

Ideally, we don’t need to notify the parent <B> that all children are done. Due to convention, <B> might want to write its manifest of indices that it wants to generate.

Given a parent process <B>
And the child chain <Ba>, <Bb>, <Bc>
And the children <0>, <1>, <2>
When I “schedule” <0>, I need to provide the chain.
Then as part of completing <0>’s <Ba>, it “schedules” <Bb> for <0>.

Critical in this is once I start processing an “original” manifest, I need to preserve the storage “handles” for both the “local” and the “remote”. Those handles, along with the chain, help the processing locate either pre-existing files or fetch from a common remote location.

I also need the “processing queue” to provide an in-line option or to send things to AWS’s SQS.
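
The chain hand-off described above can be sketched with an in-line queue, where each scheduled job carries the rest of its chain and completing one link schedules the next. `InlineQueue` and the handlers are hypothetical. An SQS-backed queue is not shown; it would presumably enqueue a message carrying the same chain and context instead of calling the handler directly.

```ruby
# An in-line "processing queue": scheduling a chain runs its first link,
# then schedules the remainder of the chain for the same context.
class InlineQueue
  attr_reader :completed

  def initialize(handlers)
    @handlers = handlers
    @completed = []
  end

  def schedule(chain, context)
    link, *rest = chain
    return if link.nil?

    @handlers.fetch(link).call(context, self)
    @completed << [link, context]
    # As part of completing this link, schedule the next one.
    schedule(rest, context)
  end
end

no_op = ->(_context, _queue) {}

handlers = {
  thumbnail: no_op,
  monochrome: no_op,
  hocr: no_op,
  # The split link launches an independent child chain for each page;
  # the page count of 2 is hard-coded purely for illustration.
  split_pdf: lambda do |context, queue|
    2.times do |index|
      queue.schedule(%i[monochrome hocr], "#{context}/pdf_split/#{index}")
    end
  end
}

queue = InlineQueue.new(handlers)
queue.schedule(%i[thumbnail split_pdf], "work-1/paper.pdf")
completed_links = queue.completed.map(&:first)
```

Each page's child chain only ever sees its own context, which matches the assumption that sibling pages do not depend on each other.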

Why as a development dependency? Because the DerivativeRodeo introduces a dependency on Faraday >= 1, and the Valkyrie and ActiveFedora versions which Hyrax 2 and 3 depend on have a Faraday dependency of < 1.

I am pushing this up so that I can begin development on the ingest aspect of the Derivative Rodeo, and also to see how this resolves in our CI setup and to see the impact, if any, on downstream implementations of IIIF Print (e.g. Adventist, British Library, ATLA, PALNI/PALCI, UTK, and others).

The plan is to determine if we want to have this Faraday conflict setup or if we want to swap out something else in the underlying DerivativeRodeo.

Related to:
- https://github.com/scientist-softserv/adventist-dl/issues/330
- #219
- #220

Why change `< 4.0` to `< 4`? I was encountering issues where bundler was setting the dependency to `hyrax v4.0.0.rc3`. This change remedied that.

Why remove `~> 3.1` from `rspec-rails` dependency? I was getting issues with what version of Rails was required. By removing this hard dependency, we're able to let these gems sort things out.

Related to:
- https://github.com/scientist-softserv/adventist-dl/issues/330

Prior to this commit, if we'd already pre-processed a PDF split, we would again re-process that split (as there was no check for existing pages). With this commit, we check for those pre-processed pages. One critical bit of conversation is that one work might have multiple PDFs uploaded. Therefore, it is important to have those PDFs' pages write to different "sub-directories". I'm putting this here so we can account for that in a test audit of some kind. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - notch8/iiif_print#220 Co-authored-by: Rob Kaufman <[email protected]> Co-authored-by: Kirk Wang <[email protected]>

Updating a bit of documentation and reworking the filename to account for a work having multiple PDFs. - https://github.com/scientist-softserv/adventist-dl/issues/330 - notch8/iiif_print#220

Prior to this commit, we didn't have a spec for the S3 behavior. We now have a test for an S3 Faux Bucket. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - notch8/iiif_print#220

With this commit, I'm introducing the most barebones and naive functionality for allowing for configurability by type. This gets us unblocked for scientist-softserv/adventist-dl#330; there's certainly improvements we can make. But we won't know those until we work through additional DerivativeRodeo client needs. Closes: #30 Related to: - #30 - https://github.com/scientist-softserv/adventist-dl/issues/330

Prior to this we were using a naive extension sniffer to determine a "mime_type" (which we used to decide on thumbnail dimensions). With this commit, we leverage Marcel (a Rails gem) that can detect mime based on the magic strings at the beginning of files or based on extensions and some lower "cost" processing (compared to `fits.sh`). Related to: - #30 - https://github.com/scientist-softserv/adventist-dl/issues/330

jeremyf · 2023-06-06T20:49:59Z

June 5 to June 9 Sprint Adventist Tasks

OAI -> CSV

We need to ensure that our conversation with Katharine is “encoded” in the CSV logic. We have PDF and we have Image as the original file. We can also ignore the Periodical images et al. (because those images are the representative image which Hyku does not account for).

Our plan is to add the archival, reader, and txt as three FileSets on a work. For the archival, we will pre-process derivative generation. For the reader, we only want thumbnails. And for the text, we can use Hyrax::FileSetDerivativesService as written.

We need to somehow communicate that the reader fileset does not split nor have much of any other derivatives.

CSV -> SpaceStone

This was pushed up by Rob and merged by Jeremy; he’s working through the logic and how we’ll map accordingly. We can likely re-use the thumbnail for the reader; this might mean copying the original thumbnail twice.

SpaceStone does its thing

We need to deploy the latest SpaceStone. Jeremy needs to be able to do this, as does Kirk. First, we try Kirk as he has successfully deployed once; and we likely won’t need more deploys of SpaceStone.

We want to review the SpaceStone S3 buckets to ensure they are structured as intended.

IIIF Print pulls from SpaceStone

With the latest changes, which are pending review but will be merged by EoD Tuesday, we are prepared to update Adventist to use the Derivative Rodeo enabled IIIF Print.

Configure Adventist’s DerivativeRodeo

We need to update Adventist’s IIIF Print gem to get the Derivative Rodeo; this will help inform how we post to SpaceStone (as there’s a configuration assumption which is marked as a TODO)

jeremyf referenced this issue in notch8/derivative-rodeo Mar 27, 2023

Extracting PNG splitter strategy from IIIF Print

cafb399

Related to: - notch8/iiif_print#194 - samvera/bulkrax#760 - notch8/utk-hyku#343 - https://github.com/scientist-softserv/adventist-dl/issues/330

jeremyf mentioned this issue May 10, 2024

Extract PDF splitting logic into SpaceStone #419

Closed

jeremyf mentioned this issue Mar 28, 2023

Update derivative generation to use derivative rodeo to skip *TN.jpg and .READER.pdf files #431

Open

jeremyf self-assigned this Mar 29, 2023

jeremyf mentioned this issue May 18, 2023

📚 Document the SpaceStone::Serverless workflow for Adventist notch8/space_stone-serverless#2

Open

jeremyf mentioned this issue May 24, 2023

⚙️ Adding derivative_rodeo as dev dependency notch8/iiif_print#243

Closed

jeremyf mentioned this issue May 25, 2023

⚙️ Adding derivative-rodeo gem 🤠 notch8/iiif_print#247

Merged

jeremyf mentioned this issue May 25, 2023

🎁 Adding PDF Split Page Checks notch8/derivative_rodeo#36

Merged

jeremyf mentioned this issue May 31, 2023

🎁 Adding naive configurability for thumbnail dimensions notch8/derivative_rodeo#41

Merged

jeremyf mentioned this issue May 31, 2023

♻️ Leveraging Marcel to better derive "mime_type" notch8/derivative_rodeo#44

Merged

kirkkwang transferred this issue from notch8/adventist-dl May 10, 2024

jeremyf removed their assignment May 30, 2024
