EPIC: Import Optimization with Out of Band Processing #421
Comments
Prior to this commit, I had a data structure that "built itself" from an internal process. With this refactor, I created a simple data structure and a method to extract it from a given file. The benefit of this setup is that I don't need to run the `pdfimages` command against a file (well, I probably should run it at least once) every time I want to test PDF processing.

Related to:
- notch8/iiif_print#194
- samvera/bulkrax#760
- notch8/utk-hyku#343
- https://github.com/scientist-softserv/adventist-dl/issues/330
This is an almost direct copy of IIIF Print's existing code.

Related to:
- notch8/iiif_print#194
- samvera/bulkrax#760
- notch8/utk-hyku#343
- https://github.com/scientist-softserv/adventist-dl/issues/330
As I'm thinking through the logic, I'm realizing that we have a splitter and a strategy; the splitter leverages the strategy. As implemented, though, the two are conflated: the splitter is the strategy and the strategy is the splitter.

Related to:
- notch8/iiif_print#194
- samvera/bulkrax#760
- notch8/utk-hyku#343
- https://github.com/scientist-softserv/adventist-dl/issues/330
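One way to tease the two apart, sketched below with hypothetical names (not the gem's actual classes): the splitter owns the orchestration and delegates the actual page extraction to an injected strategy.

```ruby
# Hypothetical sketch of separating the splitter from its strategy;
# class and method names are illustrative, not IIIF Print's API.
class PdfSplitter
  def initialize(strategy:)
    @strategy = strategy
  end

  # Delegates the actual page extraction to the strategy object.
  def split(path)
    @strategy.call(path)
  end
end

# A strategy is anything that responds to #call with a path; a real one
# might shell out to pdfimages, but a lambda suffices for testing.
fake_strategy = ->(path) { ["#{path}-page-1", "#{path}-page-2"] }
splitter = PdfSplitter.new(strategy: fake_strategy)
PAGES = splitter.split("doc.pdf")
```

Injecting the strategy keeps the splitter testable without running `pdfimages` against a real file.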
Yesterday, I brought over several classes/utilities from the IIIF Print gem. These were lower-level functions used by processes within the IIIF Print gem. Today, I need to bring over the remaining lower-level classes and utilities. I am presently working on bringing over PageOCR. I have locally set up my SPD development so that I run … I will also be bringing over the … The pre-processors will echo what was done in Newman Numismatic Portal. The pre-processor, in general, will receive a … We will have a SpaceStone "entry" that looks like the following:
Each project that uses SpaceStone will need to name (e.g. "thumbnail") the derivatives it wants to generate and the function for generating each derivative from the … Below is a feature description:

Given an entry with the identifier
And the original (with corresponding URL)
And the "thumbnail" derivative type (with corresponding URL)
And SpaceStone is configured to generate "text" derivatives
And SpaceStone is configured to generate "thumbnail" derivatives
When the SpaceStone Lambda processes the given entry
Then SpaceStone will not generate the "thumbnail" derivative
And will fetch the "thumbnail" derivative
And store the "thumbnail" derivative
And SpaceStone will generate the "text" derivative
And SpaceStone will not fetch the "text" derivative
And SpaceStone will store the "text" derivative Related to:
I have brought in the PageOCR logic from IIIF Print. One observed problem/challenge is that it has an interface that will need reworking; the exposed public methods are a bit confusing (so I need to perform more analysis). Another challenge is that it generates three or four different derivatives that go toward text extraction. My plan for 2023-03-29 is to disentangle the different file-creation processes so that we can use the named files provided instead of generating them. There is some preliminary work. Specific tasks are:
A competing priority is deploying changes to the British Library and ensuring that it is in a good state for their end-of-week priority.
With a bit of retrospective, the most critical decision I made early was to name each derivative (e.g. …).

Yesterday, I also turned a major corner on this project. In the morning I sat down and started writing narrative descriptions in the README of SpaceStone::Derivatives. This led to naming the concepts of the SpaceStone::Derivatives::Manifest and the SpaceStone::Derivatives::Repository. With those names, I had eliminated some of my mental barriers regarding the various layers of abstractions and mappings.

A second revelation came when I stepped away from the code and started verbally narrating. I had hit a minor mental block, fixating on a low-level detail, and had lost the thread to the larger feature requirement. The inspiration came when I named the … That named method gave me a clear mental map of the process steps.

Immediately, I knew I would need to resolve the dependency graph of derivatives. I first started with a validator function that could process a hash. Then I thought about how I would perform the sequence, which introduced the idea of the SpaceStone::Derivatives::Chain. I moved the validator into that class and began working on a sequencer function; likewise, the sequencer would process a hash. As I delved deeper, the Validator and Sequencer were performing duplicate logic. I had begun stenciling in the SpaceStone::Derivatives::Types in an effort to play with the conceptual idea. I ripped out the validator and settled on SpaceStone::Derivatives::Chain::Sequencer; again, something I could test with a Hash that had symbol keys and values that were arrays of symbols.

With the sequence of derivative generation resolved, I set about the conceptual SpaceStone::Derivatives::Processor. It began its life named "PreProcessor", but as I was writing the documentation, I wrote "send the pre_process! message to each of the types in the chain." With the word "message", I realized I could use a dispatching strategy (e.g. …)
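The Sequencer idea described above — resolving an order from a Hash with symbol keys and array-of-symbol values — is essentially a topological sort. A minimal sketch (not SpaceStone's actual Sequencer, just the underlying technique):

```ruby
# Given a Hash whose keys are derivative types and whose values are the
# types each depends on, resolve an order where every derivative appears
# after its dependencies. A plain depth-first topological sort.
def sequence(dependencies)
  sequenced = []
  visiting = []
  visit = lambda do |type|
    next if sequenced.include?(type)
    raise "cycle detected at #{type}" if visiting.include?(type)
    visiting << type
    dependencies.fetch(type, []).each { |dep| visit.call(dep) }
    visiting.delete(type)
    sequenced << type
  end
  dependencies.each_key { |type| visit.call(type) }
  sequenced
end

ORDER = sequence(hocr: [:monochrome], monochrome: [:original],
                 original: [], thumbnail: [:original])
```

Testing with plain Hashes like this is exactly what made the Sequencer easy to exercise without touching real files.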
A key design consideration has been quick and easy testing. During the day's development, I refactored the method signatures a few times, each time spending a few minutes changing and running tests. For 2023-03-29, I plan to look into the following:
More important is getting the entire pre-processing ready to run via SpaceStone proper (e.g. AWS Lambda). As an added benefit, I continue to rely on local tests, both style guides and RSpec. These run each time I commit code and each time I push code to GitHub.
Inspired by LeaAnn's "Project and Task Breakdown" presentation on Thursday, I wanted to write up the task breakdown/algorithm for the pre-processing in AWS:
What is the file "handle"? Perhaps the path name. I also say "AWS", but this is really the pre-processing environment; a "loading dock", if you will. In the above case, we only want to verify that we have a "handle". If the "handle" does not exist, that is an error in processing the manifest. Put another way, once we've processed the manifest, we need to audit its integrity.
Thoughts from 2023-04-05: I have a working proof of concept for monochrome and hocr. Now I need to look into PDF splitting. I start with an original file that is a PDF. I want to make a thumbnail of the PDF, and I also want to split the PDF. When I split the PDF, I'm probably going to create a manifest for each of the pages and then feed those manifests to the processor. I likely want the original PDF and the split files to be in a similar location (for easier finding). What would that look like?
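One possible shape for this, sketched with assumed names (the real Manifest has more to it): splitting yields one manifest per page, with each page path rooted in the same directory as the original PDF so related files stay together.

```ruby
require "pathname"

# Minimal stand-in for a manifest; the real SpaceStone::Derivatives::Manifest
# presumably carries more than this.
Manifest = Struct.new(:identifier, :original, :derivatives, keyword_init: true)

# Build one manifest per split page, keeping pages beside the original PDF.
# The page-naming convention here is an assumption for illustration.
def manifests_for_split(parent_identifier, original_pdf_path, page_count)
  dir = Pathname(original_pdf_path).dirname
  (1..page_count).map do |page|
    Manifest.new(
      identifier: "#{parent_identifier}--page-#{page}",
      original: dir.join("page-#{page}.tiff").to_s,
      derivatives: [:monochrome, :hocr, :thumbnail]
    )
  end
end

MANIFESTS = manifests_for_split("aark-123", "/data/aark-123/original.pdf", 2)
```

Each page manifest can then be fed to the processor exactly as a top-level manifest would be.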
As written, I do not have a consistent, predictable temporary-directory creation process. The input for derivatives is:
This morning I realized that I need to lean into the …

Given a chain of <A>, <B>, <C>, and <D>
When I "schedule" <A>, I need to provide the chain.
Then as part of completing <A>, it "schedules" <B>.

In the above example, let's say that … Ideally, we don't need to notify the parent …

Given a parent process <B>
And the child chain <Ba>, <Bb>, <Bc>
And the children <0>, <1>, <2>
When I "schedule" <0>, I need to provide the chain.
Then as part of completing <0>'s <Ba>, it "schedules" <Bb> for <0>.

Critical in this is that once I start processing an "original" manifest, I need to preserve the storage "handles" for both the "local" and the "remote". Those handles, along with the chain, help the processing locate either pre-existing files or fetch from a common remote location. I also need the "processing queue" to provide an in-line option or to send things to AWS's SQS.
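The scheduling idea above can be sketched like this. The names are assumptions: the point is that completing one link schedules the next, and the queue is an adapter, so an in-line implementation (below) and an SQS-backed one could satisfy the same interface.

```ruby
# An in-line queue adapter: processing a step records it, then immediately
# schedules the remainder of the chain. An SQS-backed adapter would instead
# enqueue a message carrying the step and the remaining chain.
class InlineQueue
  def initialize
    @log = []
  end
  attr_reader :log

  def schedule(step, remaining_chain)
    @log << step
    next_step, *rest = remaining_chain
    schedule(next_step, rest) if next_step
  end
end

queue = InlineQueue.new
first, *rest = [:monochrome, :hocr, :word_coordinates]
queue.schedule(first, rest)
PROCESSED = queue.log
```

Because each scheduled step carries the rest of the chain with it, no step needs to notify a parent; the chain itself drives the sequence.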
Why as a development dependency? Because the DerivativeRodeo introduces a dependency on Faraday >= 1, and the Valkyrie and ActiveFedora versions on which Hyrax 2 and 3 depend have a Faraday dependency of < 1. I am pushing this up so that I can begin development on the ingest aspect of the Derivative Rodeo, and also to see how this resolves in our CI setup and what the impact, if any, is on downstream implementations of IIIF Print (e.g. Adventist, British Library, ATLA, PALNI/PALCI, UTK, and others). The plan is to determine whether we want to keep this Faraday conflict or swap out something else in the underlying DerivativeRodeo.

Related to:
- https://github.com/scientist-softserv/adventist-dl/issues/330
- #219
- #220
Why change `< 4.0` to `< 4`? I was encountering issues where Bundler was resolving the dependency to `hyrax v4.0.0.rc3`. This change remedied that. Why remove `~> 3.1` from the `rspec-rails` dependency? I was getting issues with which version of Rails was required. By removing this hard dependency, we're able to let these gems sort things out.

Related to:
- https://github.com/scientist-softserv/adventist-dl/issues/330
Prior to this commit, if we'd already pre-processed a PDF split, we would re-process that split again (as there was no check for existing pages). With this commit, we check for those pre-processed pages. One critical bit of conversation is that one work might have multiple PDFs uploaded. Therefore, it is important to have those PDFs' pages write to different "sub-directories". I'm putting this here so we can account for that in a test audit of some kind.

Related to:
- https://github.com/scientist-softserv/adventist-dl/issues/330
- notch8/iiif_print#220

Co-authored-by: Rob Kaufman <[email protected]>
Co-authored-by: Kirk Wang <[email protected]>
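The two concerns above — per-PDF sub-directories and skipping already-split PDFs — could look roughly like this. Paths and naming are assumptions for illustration, not the gem's actual convention:

```ruby
require "tmpdir"
require "fileutils"

# Namespace each PDF's pages by the PDF's basename, so two PDFs uploaded
# to the same work write to different sub-directories.
def split_pages_dir(work_dir, pdf_name)
  File.join(work_dir, File.basename(pdf_name, ".pdf"), "pages")
end

# Existing pages mean the split was already pre-processed; skip re-splitting.
def existing_pages(work_dir, pdf_name)
  Dir.glob(File.join(split_pages_dir(work_dir, pdf_name), "*.tiff")).sort
end

Dir.mktmpdir do |work_dir|
  # Simulate one already-processed PDF on the work.
  dir = split_pages_dir(work_dir, "issue-1.pdf")
  FileUtils.mkdir_p(dir)
  FileUtils.touch(File.join(dir, "page-1.tiff"))

  FOUND = existing_pages(work_dir, "issue-1.pdf").length
  # A second PDF on the same work has no pre-processed pages yet.
  MISSING = existing_pages(work_dir, "issue-2.pdf").length
end
```

Basing the sub-directory on the PDF filename keeps the check cheap (a glob) while preventing two PDFs' pages from colliding.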
Updating a bit of documentation and reworking the filename to account for a work having multiple PDFs.

Related to:
- https://github.com/scientist-softserv/adventist-dl/issues/330
- notch8/iiif_print#220
Prior to this commit, we didn't have a spec for the S3 behavior. We now have a test for an S3 Faux Bucket.

Related to:
- https://github.com/scientist-softserv/adventist-dl/issues/330
- notch8/iiif_print#220
* 🎁 Adding PDF Split Page Checks
* ☑️ Verifying pdf splitter finds pre-existing files
* ☑️ Refining globbed_tail_locations for S3

Co-authored-by: Rob Kaufman <[email protected]>
Co-authored-by: Kirk Wang <[email protected]>
With this commit, I'm introducing the most barebones and naive functionality for allowing configurability by type. This gets us unblocked for scientist-softserv/adventist-dl#330; there are certainly improvements we can make, but we won't know those until we work through additional DerivativeRodeo client needs.

Closes: #30

Related to:
- #30
- https://github.com/scientist-softserv/adventist-dl/issues/330
Prior to this, we were using a naive extension sniffer to determine a "mime_type" (which we used to decide on thumbnail dimensions). With this commit, we leverage Marcel (a Rails gem) that can detect the mime type based on the magic strings at the beginning of files, or based on extensions, with lower-"cost" processing (compared to `fits.sh`).

Related to:
- #30
- https://github.com/scientist-softserv/adventist-dl/issues/330
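For context on what "magic strings" buys us over extension sniffing — this is not Marcel itself, just an illustration of the idea: the first bytes of a file identify its type even when the extension lies.

```ruby
# A toy magic-byte table; Marcel's real table is far more complete.
MAGIC = {
  "%PDF".b        => "application/pdf",
  "\x89PNG".b     => "image/png",
  "\xFF\xD8\xFF".b => "image/jpeg"
}

def mime_from_magic(bytes, fallback: "application/octet-stream")
  MAGIC.each do |magic, mime|
    return mime if bytes.b.start_with?(magic)
  end
  fallback
end

# A PDF renamed to .jpg still sniffs as a PDF, which an extension-based
# sniffer would get wrong.
SNIFFED = mime_from_magic("%PDF-1.7\n...".b)
```

This is also why magic-byte detection is cheap relative to `fits.sh`: it reads only the head of the file rather than running a battery of characterization tools.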
June 5 to June 9 Sprint

Adventist Tasks

OAI -> CSV

We need to ensure that our conversation with Katharine is "encoded" in the CSV logic. We have PDF and we have Image as the original file. We can also ignore the Periodical images et al. (because those images are the representative image, which Hyku does not account for). Our plan is to add the archival, reader, and txt as three FileSets on a work. For the archival, we will pre-process derivative generation. For the reader, we only want thumbnails. And for the text, we can use Hyrax::FileSetDerivativesService as written. We need to somehow communicate that the reader fileset does not split nor have many other derivatives.

CSV -> SpaceStone

This was pushed up by Rob and merged by Jeremy; he's working through the logic and how we'll map accordingly. We can likely re-use the thumbnail for the reader; this might mean copying the original thumbnail twice.

SpaceStone does its thing

We need to deploy the latest SpaceStone. Jeremy needs to be able to do this, as does Kirk. First, we try Kirk, as he has successfully deployed once; and we likely won't need more deploys of SpaceStone. We want to review the SpaceStone S3 buckets to ensure they are structured as intended.

IIIF Print pulls from SpaceStone

With the latest changes, which are pending review but will be merged by EoD Tuesday, we are prepared to update Adventist to use the Derivative Rodeo-enabled IIIF Print.

Configure Adventist's DerivativeRodeo

We need to update Adventist's IIIF Print gem to get the Derivative Rodeo; this will help inform how we post to SpaceStone (as there's a configuration assumption which is marked as a TODO).
Goals
The goal of this epic is to adjust the derivative “creation” process.
At present Hyrax creates a `FileSet` for each file added to a work, either via the UI or via batch processing. We then process each original file of a `FileSet`, creating derivative files that we attach to that `FileSet`. This is all done "in-band", which can be non-performant for large imports. To speed up the imports we can:
By introducing the idea of "out of band" processing, we break the fundamental assumption of Hyrax's derivative generation: namely, that it will take a `FileSet` and create all the necessary derivatives. Instead, with pre-processing, we are now saying "there may already exist a derivative for this `FileSet`; attach that instead."

Instead of having one conceptual "Create Derivatives" function, we are looking to have three:
We have a “prior art” instance of Pre-processing in the SpaceStone Ruby gem. That gem is code that runs in AWS Lambdas to pull data from Internet Archive (per the client’s previous storage), split apart the PDFs, create OCR derivatives of each page, and create thumbnails.
Further, we have "prior art" for Locating and Applying `.hocr` files in NNP's Tesseract module, responsible for first looking for a remote `.hocr` and, failing that, generating the `.hocr` file.

We have some logic for finding the Pre-processing files in NNP's DistributedRemoteDownloader.
Those are specific implementations that demonstrate some of the necessary concepts. However, those are different from the immediate needs of Adventist and the general needs of other clients.
Scenarios
In terms of Pre-Processing we need the following:
In terms of Locating we need the following:
In terms of Applying we need the following:
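A conceptual sketch of how Locating and Applying might relate (Pre-Processing having already happened out of band). Class and method names here are assumptions for illustration, not Hyrax's or SpaceStone's API:

```ruby
# Locate: prefer a pre-processed derivative from a remote location;
# fall back to in-band generation. Apply: attach the located file.
class DerivativeWorkflow
  def initialize(remote:, generator:)
    @remote = remote       # e.g. a lookup of pre-processed S3 locations
    @generator = generator # fallback in-band generation
  end

  def locate(file_set_id, type)
    # Hash#fetch's block only runs when no pre-processed derivative exists.
    @remote.fetch([file_set_id, type]) { @generator.call(file_set_id, type) }
  end

  # "Attaching" is stubbed as returning a record of what would be attached.
  def apply(file_set_id, type)
    { file_set: file_set_id, type: type, path: locate(file_set_id, type) }
  end
end

remote = { ["fs-1", :thumbnail] => "s3://bucket/fs-1/thumbnail.jpg" }
workflow = DerivativeWorkflow.new(
  remote: remote,
  generator: ->(id, type) { "generated/#{id}/#{type}" }
)
APPLIED   = workflow.apply("fs-1", :thumbnail) # pre-processed, so fetched
GENERATED = workflow.apply("fs-1", :hocr)      # not pre-processed, so generated
```

The key property is that Apply never cares whether Locate fetched or generated; the pre-processing assumption stays isolated in one place.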
One consideration is that Hyrax has implicit derivatives that are named concepts; for the above features we need to expose those named concepts. Namely, how do I locate and apply the thing called `thumbnail`?

Pre-Process Tasks
Adventist S3 Convention: there’s an AARK unique for each work. Write files there.
DerivativeRodeo::Generators::WordCoordinateGenerator derivative_rodeo#5

Ingest Task
Task Scratch Pad
Not all of these will be converted to tasks; they instead represent a current working understanding. Once we create a task from a checkmark, then we're looking at more actionable tasks.

… (`hocr`, `thumbnail`, `splits`, etc.) and the identifier.

Read OAI feed, sending non-thumbnail (and non-reader.pdf) files to S3