The pipeline is designed to extract features from scholarly works. Given a scholarly work, it extracts various features and writes them out as a CSV file.
However, it cannot process PDFs directly; instead, it takes as input PDF files that have been preprocessed with GROBID and pdf2text. This preprocessing can be done either by the pipeline itself or by running GROBID and pdf2text separately before feature extraction; in both cases, a working GROBID and pdf2text installation is required.
When preprocessing with GROBID, the PDFs must be converted in full-text mode, i.e., /api/processFulltextDocument; please refer to the GROBID documentation for more details.
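To verify that full-text mode is reachable, a single PDF can be converted directly against a running GROBID instance, for example (assuming GROBID's default port 8070):
curl --form input=@paper.pdf http://localhost:8070/api/processFulltextDocument > paper.tei.xml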
Once GROBID is installed and running, and pdf2text is installed, the pipeline can preprocess the PDF files with the following command:
python process_docs.py --mode process-pdfs --pdf_input DIR_TO_PDFs -out OUTPUT_DIR
Alternatively, the PDFs can be preprocessed separately, without using the pipeline.
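As a rough illustration of what the separate route involves, the sketch below posts each PDF to a locally running GROBID instance and extracts plain text with Poppler's pdftotext as a stand-in for pdf2text; the host/port, directory layout, and the pdftotext substitution are assumptions, not part of the pipeline code.

```python
import subprocess
from pathlib import Path

import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # assumed default GROBID port


def preprocess(pdf_dir: str, tei_dir: str, txt_dir: str) -> None:
    """Convert every PDF in pdf_dir to TEI XML (GROBID) and plain text."""
    Path(tei_dir).mkdir(parents=True, exist_ok=True)
    Path(txt_dir).mkdir(parents=True, exist_ok=True)
    for pdf in Path(pdf_dir).glob("*.pdf"):
        # Full-text conversion via GROBID's REST API (multipart field name is "input").
        with pdf.open("rb") as fh:
            resp = requests.post(GROBID_URL, files={"input": fh}, timeout=120)
        resp.raise_for_status()
        (Path(tei_dir) / f"{pdf.stem}.tei.xml").write_text(resp.text, encoding="utf-8")
        # Plain-text extraction; Poppler's pdftotext used here as a stand-in for pdf2text.
        subprocess.run(["pdftotext", str(pdf), str(Path(txt_dir) / f"{pdf.stem}.txt")], check=True)


if __name__ == "__main__":
    preprocess("pdfs", "tei", "txt")
```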
Once the PDFs have been processed with GROBID and pdf2text, run the feature-extraction pipeline with the following command:
python process_docs.py -out PROCESSED_GROBID_FILES -in TEXT_FILES -m generate-train -csv OUTPUT_DIR
- `-out`: path to the preprocessed PDF files in TEI XML format (GROBID output)
- `-in`: path to the preprocessed PDF files in txt format (pdf2text output)
- `-m`: mode of operation; `generate-train` generates the feature set
- `-csv`: directory for the CSV output
For more details, refer to the process_docs.py file.
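For orientation, here is a minimal sketch of how such a two-mode CLI could be wired up with argparse; the actual process_docs.py may differ, and the two handler functions are placeholders, not the pipeline's real implementation:

```python
import argparse


def preprocess_pdfs(pdf_dir, tei_dir):
    print(f"would preprocess PDFs in {pdf_dir} into {tei_dir}")  # placeholder


def generate_feature_csv(tei_dir, txt_dir, csv_dir):
    print(f"would extract features from {tei_dir} + {txt_dir} into {csv_dir}")  # placeholder


def main():
    parser = argparse.ArgumentParser(description="Preprocess PDFs or generate the feature CSV.")
    parser.add_argument("-m", "--mode", choices=["process-pdfs", "generate-train"], required=True)
    parser.add_argument("--pdf_input", help="directory of raw PDFs (process-pdfs mode)")
    parser.add_argument("-in", dest="txt_dir", help="directory of pdf2text .txt output")
    parser.add_argument("-out", dest="tei_dir", help="directory of GROBID .tei.xml output")
    parser.add_argument("-csv", dest="csv_dir", help="directory for the CSV output")
    args = parser.parse_args()

    if args.mode == "process-pdfs":
        preprocess_pdfs(args.pdf_input, args.tei_dir)
    else:
        generate_feature_csv(args.tei_dir, args.txt_dir, args.csv_dir)


if __name__ == "__main__":
    main()
```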
Important files for reference:
File | Description |
---|---|
process_docs.py | Code execution starts here; there are two main modes: (1) preprocess PDFs, (2) generate the feature set |
extractor.py | Breaks the GROBID output down into various features; the extracted information is used to call the Elsevier/Crossref/Semantic Scholar APIs |
elsevier.py | Parses and returns the output from the Elsevier/Crossref/Semantic Scholar APIs |
XIN.py | Processes the acknowledgement section to identify funding information |
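To give a sense of what the metadata lookups in extractor.py and elsevier.py involve, here is a minimal, self-contained sketch that fetches a paper's record by DOI from the public Crossref and Semantic Scholar APIs; it is illustrative only and not taken from the pipeline code:

```python
import requests


def crossref_metadata(doi: str) -> dict:
    """Fetch work metadata from the public Crossref REST API."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]


def semantic_scholar_metadata(doi: str) -> dict:
    """Fetch selected fields from the Semantic Scholar Graph API."""
    resp = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}",
        params={"fields": "title,abstract,citationCount"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    print(crossref_metadata("10.1038/nature12373")["title"])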
NOTE: The Elsevier API key may expire after a certain number of requests. For batch processing, it is better to refresh the API key details from the Elsevier developer portal. The same applies to the Semantic Scholar API key.
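One way to make key rotation painless is to read the keys from environment variables instead of hard-coding them; a minimal sketch follows (the variable names are assumptions; X-ELS-APIKey and x-api-key are the standard Elsevier and Semantic Scholar key headers):

```python
import os


def elsevier_headers() -> dict:
    """Build Elsevier request headers from an environment variable (assumed name)."""
    key = os.environ["ELSEVIER_API_KEY"]  # refresh via the Elsevier developer portal
    return {"X-ELS-APIKey": key, "Accept": "application/json"}


def semantic_scholar_headers() -> dict:
    """Semantic Scholar accepts an optional key via the x-api-key header."""
    key = os.environ.get("S2_API_KEY")
    return {"x-api-key": key} if key else {}
```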
NOTE: Place the citation sentiment model, downloaded from the link, at pipeline/tamu_features/rec_model/pytorch_model.bin.
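Assuming pytorch_model.bin is a standard PyTorch checkpoint, its placement can be sanity-checked with a minimal load; the path comes from the note above, everything else is an assumption:

```python
import torch

# Load on CPU just to verify the file is in place and readable.
state = torch.load(
    "pipeline/tamu_features/rec_model/pytorch_model.bin",
    map_location="cpu",
)
print(type(state))  # typically a dict of parameter tensors (a state_dict)
```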