Elements -> cRT2 -> eSchol speed ideas #74

Open
DevinSmithWork opened this issue Jan 23, 2025 · 3 comments

@DevinSmithWork

DevinSmithWork commented Jan 23, 2025

What

During the last two years, the increasing traffic between Elements and eScholarship during Elements' diff sync process has necessitated occasional modifications to programs involved in the syncing process.

This card will be used to collect various ideas on improving the diff sync runtime, should we require this. Presently, the diff syncing is running at an acceptable speed -- and frankly, the less we monkey around with this system, the better.

Background

Modifications anywhere in this chain of programs can have DRAMATIC effects on the diff sync process's runtime. This is especially true of the Elements Relevance Scheme and Crosswalks, which are the first steps in determining whether a pub should proceed through the diff sync at all.

This syncing process is very complex, and its runtime is affected by the Elements Relevance Scheme and Crosswalk files; connectRT2's transform steps (for both input and output); the eScholarship API; and the prodigious & ever-increasing scholarly output the UC system produces.

Historically, much of the complexity comes from layering new systems atop existing systems. For example (working from the "outside-in"):

  • The back-end of the eSchol GraphQL API generates MySQL queries, and works with the XTF-based "UC Ingest" format, which is used by eScholarship's legacy systems.
  • connectRT2's...
    • front-end is designed to mimic the input format of dspace, which was selected as the most reasonable output format from Elements' Repository Tools 2 output list.
    • and its back-end translates this dspace-mimicking input to GraphQL output for the eSchol API.
    • It also translates eSchol's GraphQL output back into dspace-mimic for ingestion by Elements.
  • Because connectRT2 is only mimicking dspace, we have direct control over its input/output translations, and we've implemented some nonstandard metadata formats (particularly for nested data), which are interfaced with using the Elements xwalk transforms.

The Syncing Process in Detail

Image

The programs involved in this process are as follows, roughly in the order they're triggered by the syncing process:

  • Elements: The Relevance Scheme: Calculates a hash value which determines whether the pub has had modifications requiring it to be diff synced.
  • Elements: The Deposit Crosswalk: The pub is then transformed through this xwalk into the metadata format ingested by connectRT2. This xwalk output will be compared against the metadata returned by eScholarship (via RT2).
  • connectRT2: The transform.rb script includes a function mimicDSpaceOutput, which queries the eScholarship GraphQL API, and transforms that metadata into the same form as the Elements Deposit xwalk output.
  • Elements: The two metadata are compared. If they are identical, the sync stops. If they are different, the sync proceeds.
  • Elements: The pub's Deposit xwalk output is then sent to eScholarship via connectRT2.
  • connectRT2: The metadata is transformed in transform.rb.
  • connectRT2: Performs a final comparison step, where the old and new set of metadata are merged together.
    • If this merged metadata contains no differences from the existing metadata, the sync is stopped.
    • Otherwise, the newly-merged metadata is sent to the eSchol GraphQL API
  • Elements: After receiving a deposit confirmation, Elements sends a final GET to eScholarship via connectRT2. (I'm not sure why.)
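
The two comparison "gates" in the steps above can be sketched roughly as follows. This is only an illustration of the control flow: the method names `elements_sync_needed?` and `merged_update` are hypothetical, and the shallow `Hash#merge` is an assumption standing in for whatever merge logic connectRT2 really uses.

```ruby
# Gate 1 (Elements side): compare the Deposit xwalk output with the
# metadata connectRT2 returns in dspace-mimic form.
# Hypothetical helper; the real comparison happens inside Elements.
def elements_sync_needed?(xwalk_output, mimic_output)
  xwalk_output != mimic_output
end

# Gate 2 (connectRT2 side): merge the old and new metadata, and return
# the merged result only if it differs from what already exists.
# Returns nil when there is nothing to send to the API.
# Assumption: a shallow merge is enough for this illustration.
def merged_update(existing, incoming)
  merged = existing.merge(incoming)
  merged == existing ? nil : merged
end

existing = { "title" => "A paper", "volume" => "389" }

# Same volume: the merge changes nothing, so the sync would stop here.
p merged_update(existing, { "volume" => "389" })

# New volume: the merged metadata would be sent to the eSchol GraphQL API.
p merged_update(existing, { "volume" => "390" })
```

Any speed fix that makes a sync stop at gate 1 saves the most time, since everything after it (the transform, the merge, the API write, the final GET) is skipped.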

For more information, see this google doc.

@DevinSmithWork

DevinSmithWork commented Jan 23, 2025

Low-hanging fruit: Stop printing the full author diffs

  • Hyperauthored papers are the main culprit for unnecessary syncing churn. (If one of a paper's 3,000 authors adds a middle initial, the entire authorship metadata will be resynced.)
  • As part of its normal operation, RT2 outputs the "final diff" metadata, which includes the authorship, like so:
Anticipated diff:
[{"op"=>"add", "path"=>"/authors/0/nameParts/fname", "value"=>""},
 {"op"=>"add",
  "path"=>"/authors/500",
  "value"=>{"nameParts"=>{"fname"=>"JJ", "lname"=>"Chwastowski"}}},
 {"op"=>"add",
  "path"=>"/authors/501",
  "value"=>{"nameParts"=>{"fname"=>"L", "lname"=>"Chytka"}}},
 {"op"=>"add",
  "path"=>"/authors/502",
  "value"=>{"nameParts"=>{"fname"=>"D", "lname"=>"Cinca"}}},
 {"op"=>"add",
  • These author lists can run into the thousands, and these lists are printed hundreds or thousands of times a day. Simply preventing the authorship list from printing may save us 1-3 seconds per syncing operation.
  • This change won't affect the syncing process itself, so it's minimally invasive.
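
A minimal sketch of what this could look like, assuming the diff is an array of hashes with "op", "path", and "value" keys as in the log excerpt above. The method names `author_path?` and `printable_diff` are hypothetical and don't exist in connectRT2. Only the printed copy is changed; the diff used for the sync is left alone.

```ruby
require "pp"

# True for ops that touch the author list or anything nested under it.
def author_path?(path)
  path == "/authors" || path.start_with?("/authors/")
end

# Returns a copy of the diff for logging, with all author ops collapsed
# into a single note. The original diff array is not modified.
def printable_diff(diff)
  author_ops, other_ops = diff.partition { |op| author_path?(op["path"]) }
  return other_ops if author_ops.empty?

  other_ops + [{ "note" => "#{author_ops.size} author ops not shown" }]
end

diff = [
  { "op" => "add", "path" => "/authors/0/nameParts/fname", "value" => "" },
  { "op" => "add", "path" => "/authors/500",
    "value" => { "nameParts" => { "fname" => "JJ", "lname" => "Chwastowski" } } },
  { "op" => "add", "path" => "/volume", "value" => "390" }
]

puts "Anticipated diff:"
pp printable_diff(diff)
```

Keeping a count in the log means we can still tell from the output that the authorship changed, without paying for thousands of printed lines.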

@DevinSmithWork

DevinSmithWork commented Jan 23, 2025

eSchol API's item access query 500-author limit

Description

  • eSchol API's access query for Items limits the number of authors it returns to the first 500. This 500 limit is actually added by connectRT2, here.
  • It's unclear why 500 was chosen as the cutoff. This limit is applied in a few other spots in the accessQuery.
  • This limitation does not affect writes to the eScholarship API: many publications in eScholarship have > 500 associated authors in the db.

Effect this has

  • Each time an item is diff synced, its entire author list is sent from Elements to connectRT2.
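  • However, during the metadata comparison steps, this full list is compared against the author list returned by the access query, which has been cut off at the first 500 authors...
  • Meaning that a diff, and therefore a sync, is guaranteed for all pubs with >500 authors, whether or not it's actually required.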
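
To illustrate why the cutoff guarantees a sync, here's a small sketch with made-up author data. The names `fake_authors`, `truncated_view`, and `sync_triggered?` are hypothetical, and a plain equality check stands in for the real comparison steps.

```ruby
AUTHOR_LIMIT = 500 # the cutoff applied in the access query

# Builds a placeholder author list in the nameParts shape seen in the logs.
def fake_authors(count)
  (1..count).map { |i| { "nameParts" => { "fname" => "A", "lname" => "Author#{i}" } } }
end

# What the access query hands back: only the first AUTHOR_LIMIT authors.
def truncated_view(authors, limit = AUTHOR_LIMIT)
  authors.first(limit)
end

# The full list from Elements is compared with the truncated list from
# eScholarship, even when both systems hold exactly the same authors.
def sync_triggered?(elements_authors)
  elements_authors != truncated_view(elements_authors)
end

puts sync_triggered?(fake_authors(500))
puts sync_triggered?(fake_authors(501))
```

In this sketch nothing about the authors has changed between the two systems, yet the 501-author pub still registers as different.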

Possible fixes

  • Create a separate “diff sync” access query that returns the entire author list.
  • Investigate whether we can grab the entire author list with a different GraphQL query, and
    • split the diff sync query into two separate steps, grabbing non-author metadata, then author metadata,
    • combine the two prior to the comparison step.
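
The two-step idea could look something like the sketch below. Everything here is an assumption: `item_for_comparison`, `fetch_metadata`, and `fetch_authors` are hypothetical stand-ins for the two GraphQL queries, and whether the eSchol API can return the full author list at all is exactly what needs investigating.

```ruby
# Runs the two (hypothetical) queries and combines their results into a
# single metadata hash, ready for the comparison step.
def item_for_comparison(item_id, fetch_metadata:, fetch_authors:)
  metadata = fetch_metadata.call(item_id) # everything except authors
  authors  = fetch_authors.call(item_id)  # the full, untruncated list
  metadata.merge("authors" => authors)
end

# Stubbed queries for illustration; real ones would call the eSchol API.
stub_metadata = ->(_id) { { "title" => "A hyperauthored paper", "volume" => "390" } }
stub_authors  = lambda do |_id|
  (1..3).map { |i| { "nameParts" => { "fname" => "A", "lname" => "Author#{i}" } } }
end

item = item_for_comparison("example-id",
                           fetch_metadata: stub_metadata,
                           fetch_authors: stub_authors)
puts item["authors"].size
```

Passing the queries in as callables keeps the combining step separate from the API calls, so it could be tested without touching the eSchol API.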

Invasiveness level

This change would be somewhat involved, in that we'd have to modify the eSchol API, and possibly connectRT2 -- However, it affects mainly the later, post-Elements stages of the syncing process, meaning it shouldn't (e.g.) trigger a full resync of every publication from the Elements side.

@DevinSmithWork

DevinSmithWork commented Jan 23, 2025

Journal-specific metadata fields (fpage, lpage, etc)?

During the prod migration, while monitoring the RT2 output and comparing against the resulting eScholarship metadata, I noticed that the updates for fpage, lpage, journal, and a few others didn't seem to be registering in eScholarship.

Example (connectRT2 logs):

Found: pubID="1383534"
Anticipated diff:
[{"op"=>"add", "path"=>"/fpage", "value"=>"14"},
 {"op"=>"add", "path"=>"/issn", "value"=>"0022-5193"},
 {"op"=>"add", "path"=>"/journal", "value"=>"Journal of Theoretical Biology"},
 {"op"=>"add", "path"=>"/lpage", "value"=>"22"},
 {"op"=>"add", "path"=>"/volume", "value"=>"390"}]
  • Will differences in these fields between the two systems result in a metadata update from the Relevance scheme? If so, this may result in a TBD amount of unnecessary churn in the diff sync.
  • Are there differences between how eScholarship native journals and "external journals" store this data? If so, do we have a way to either:
    • Store the data in the appropriate metadata fields (if the fields are indeed different)? Or,
    • Use the transform to return two different field sets back to Elements, ergo preventing an update?

Scope of the problem

During December 2024, each day there were between 200 and 1,500 of these journal-related metadata updates. It's unclear how many of these are actually updating anything.
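
One way to get a better handle on the scope would be to tally the diffs that touch nothing but these journal fields. A rough sketch, assuming the diffs are available as arrays of op hashes like the log example above; the `JOURNAL_PATHS` list only covers the fields seen in that example, and `journal_only_diff?` is a hypothetical name.

```ruby
# Paths taken from the log example above; the "few others" would need adding.
JOURNAL_PATHS = %w[/fpage /lpage /journal /issn /volume].freeze

# True when a diff is non-empty and every op targets a journal field.
def journal_only_diff?(diff)
  !diff.empty? && diff.all? { |op| JOURNAL_PATHS.include?(op["path"]) }
end

diffs = [
  [{ "op" => "add", "path" => "/fpage", "value" => "14" },
   { "op" => "add", "path" => "/lpage", "value" => "22" }],
  [{ "op" => "add", "path" => "/volume", "value" => "390" },
   { "op" => "add", "path" => "/authors/500", "value" => {} }]
]

journal_only = diffs.count { |diff| journal_only_diff?(diff) }
puts "#{journal_only} of #{diffs.size} diffs touch only journal fields"
```

Diffs counted this way are the ones most likely to be pure churn, if it turns out these fields never register in eScholarship.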

@DevinSmithWork DevinSmithWork changed the title Elements -> RT2 speed ideas Elements -> connectRT2 -> eScholAPI speed ideas Jan 23, 2025
@DevinSmithWork DevinSmithWork changed the title Elements -> connectRT2 -> eScholAPI speed ideas Elements -> cRT2 -> eSchol speed ideas Jan 23, 2025