"Have you tried SHROOM? It's magic!"
SHROOM is a proof-of-concept Rails application to explore:
- The use of the GROBID machine learning library to extract bibliographic metadata from scholarly articles to minimize manual entry by users.
- The use of SDR / DSA for the persistence of Cocina model digital objects to avoid synchronization problems between the deposit application and the repository.
To start a local database, as well as required SDR applications: docker compose up
.
LibTorch is used for vector lookups of research organizations.
See https://github.com/ankane/torch.rb?tab=readme-ov-file#installation
Seed a collection: bin/rake development:seed_collection
Completing accessioning for a work: bin/rake "development:accession[druid:ft277ns6842]"
Cleaning up orphaned work files: bin/rails runner "WorkFile.where(work:nil).where('created_at < ?', 1.week.ago).destroy_all"
Enabled caching of Gemini responses and affiliation searches: bin/rails dev:cache
By default the Grobid container is configured to use the faster, less accurate Wapiti CRF models. See compose.yaml
for how to switch to the DeLFT deep learning models.
cap poc deploy
Note that SHROOM doesn't currently use shared configs or Vault. Instead settings are in config/settings/production.yml
and config/credentials/production.yml.enc
.
docker run -d --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:latest-crf
or
docker run -d --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:latest-full
The ROR dataset can be downloaded and unzipped from https://zenodo.org/records/13357234.
bin/rails runner "RorEmbeddings::Loader.call(json_filepath: 'ror-data/v1.51-2024-08-21-ror-data_schema_v2.json')
If it crashes, loading can be continued with the resume flag.
bin/rails runner "RorEmbeddings::Loader.call(json_filepath: 'ror-data/v1.51-2024-08-21-ror-data_schema_v2.json', resume: true)
The Geonames dataset can be downloaded and unzipped from https://download.geonames.org/export/dump/allCountries.zip.
bin/rails runner "Geonames::Loader.call(path: 'allCountries.txt')"
Export the metadata for a collection to CSV: bin/rake "export:csv[druid:jk956kb4381]"
Export the metadata for a collection to line-oriented JSON: bin/rake "export:json[druid:jk956kb4381]"
{
"title": STRING,
"authors": [
{
"first_name": STRING (REQUIRED - includes middle name, initials, etc.),
"last_name": STRING (REQUIRED),
"affiliations": [
{
"organization": STRING (REQUIRED)
"ror_id": STRING
}
]
}
],
"abstract": STRING,
"keywords": [
{
value: STRING
}
],
"related_resource_citation": STRING,
"related_resource_doi": STRING (for example, 10.5860/lrts.48n4.8259),
"published": BOOLEAN,
"collection_druid": STRING (for example, druid:jk956kb4381)
}
To compare groundtruth metadata from a JSONL file against metadata produced by a metadata extraction service (e.g., MetadataExtractionService::Grobid):
bin/rails r "EvaluationRunner.call(limit: 2)"
PASS: druid:hc954ws1639
FAIL: druid:nv906wk0020 (6970.pdf)
Authors do not match:
expected: Victor Lee, Christine Bywater, Robert Wachtel Pronovost, Kaifeng Cheng, Daniel Guimaraes
actual: Victor R Lee, Christine Bywater, Robert Wachtel, Kaifeng Cheng, Daniel Guimaraes
Note:
- The first time this is run, all of the PDFs will be downloaded from Google Cloud Storage.
- Entries that are not articles and published before 2019 are filtered.
brew install cmake
brew install automake
brew install wget
git clone https://github.com/kermitt2/pdfalto.git
cd pdfalto
git submodule update --init --recursive
mkdir libs/freetype/mac/arm64
mkdir libs/icu/mac/arm64
mkdir libs/libxml/mac/arm64
mkdir libs/image/png/mac/arm64
mkdir libs/image/zlib/mac/arm64
./install_deps.sh
export C_INCLUDE_PATH=/opt/homebrew/include
export CPLUS_INCLUDE_PATH=/opt/homebrew/include
cmake .
make
Available from https://storage.cloud.google.com/cloud-ai-platform-e215f7f7-a526-4a66-902d-eb69384ef0c4/preprints/metadata.jsonl
bin/rails runner "pp Dataset::Analyzer.summarize"
{"title"=>{:total=>10782, :matches=>7338, :matches_articles=>7329},
"author"=>{:total=>125923, :matches=>97017, :matches_articles=>10188},
"all_authors"=>{:total=>10782, :matches=>6850, :matches_articles=>6841},
"affiliation"=>{:total=>82581, :matches=>31161, :matches_articles=>5272},
"all_affiliations_for_author"=>{:total=>125923, :matches=>47202, :matches_articles=>4967},
"all_affiliations"=>{:total=>10782, :matches=>1818, :matches_articles=>1815},
"abstract"=>{:total=>10782, :matches=>512, :matches_articles=>512}}
bin/rails runner "Dataset::Analyzer.question_dataset(output_filepath: 'question_dataset.jsonl')"
head -1 question_dataset.jsonl | jq
{
"filename": "www.biorxiv.org/W3164692211.pdf",
"question": "What is the title?",
"answer": "Integrated analysis of multimodal single-cell data",
"field": "title"
}