papercheck

The goal of papercheck is to automatically check scientific papers for best practices.

Installation

You can install the development version of papercheck from GitHub with:

# install.packages("devtools")
devtools::install_github("scienceverse/papercheck")

You can launch an interactive Shiny app version of the code below with:

papercheck_app()

Example

library(papercheck)

Convert a PDF to GROBID XML format, then read it in as a paper object.

pdf <- demopdf() # use the path of your own PDF
grobid <- pdf2grobid(pdf)
paper <- read_grobid(grobid)

Search Text

Search the returned text. The regex pattern below searches for text that looks like statistical values (e.g., N=313 or p = 0.17).

pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.-]*\\d"
text <- search_text(paper, pattern, 
                    return = "match", 
                    perl = TRUE)
text            | section | header  | div | p | s | id
M = 9.12        | results | Results | 3   | 1 | 2 | to_err_is_human.xml
M = 10.9        | results | Results | 3   | 1 | 2 | to_err_is_human.xml
t(97.7) = 2.9   | results | Results | 3   | 1 | 2 | to_err_is_human.xml
p = 0.005       | results | Results | 3   | 1 | 2 | to_err_is_human.xml
M = 5.06        | results | Results | 3   | 2 | 1 | to_err_is_human.xml
M = 4.5         | results | Results | 3   | 2 | 1 | to_err_is_human.xml
t(97.2) = -1.96 | results | Results | 3   | 2 | 1 | to_err_is_human.xml
p = 0.152       | results | Results | 3   | 2 | 1 | to_err_is_human.xml
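
To see what the pattern does and does not catch, here is a minimal base R sketch that applies the same regular expression to two made-up sentences. The sentences and the helper name extract_stats() are illustrative assumptions, not part of papercheck or the demo paper. It uses regmatches() and gregexpr() rather than search_text(), so it runs without a paper object.

```r
# Same pattern as in the search_text() call
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.-]*\\d"

# Made-up sentences, for illustration only
reported <- "We found a difference (N=313, t(97.7) = 2.9, p = 0.005)."
greater  <- "The effect was not significant, p > .05."

# Hypothetical helper: return every match found in a single string
extract_stats <- function(x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

stats_reported <- extract_stats(reported)
stats_greater  <- extract_stats(greater)

print(stats_reported)         # the three statistics in the first sentence
print(length(stats_greater))  # 0: the pattern does not accept ">"
```

The second sentence yields no matches because the pattern only allows = or < as the comparator. If you also want values such as p > .05, one option is to widen the group to (=|<|>).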

Large Language Models

You can query the extracted text of papers with LLMs using groq.

Use search_text() first to narrow the text down to what you want to query. Below, we return the full introduction sections of the first two papers (the papers object is created in the Batch Processing section), then ask an LLM “What is the hypothesis of this study?”.

hypotheses <- search_text(papers[1:2], 
                          section = "intro", 
                          return = "section")
query <- "What is the hypothesis of this study? Answer as briefly as possible."
llm_hypo <- llm(hypotheses, query)
id           | answer
eyecolor.xml | The hypothesis of this study is that humans exhibit positive sexual imprinting, where individuals choose partners with physical characteristics similar to those of their opposite-sex parent.
incest.xml   | The hypothesis is that moral opposition to third-party sibling incest is greater among individuals with other-sex siblings than among individuals with same-sex siblings.
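
Because llm() calls an external service, it can help to check for credentials before sending a query. The sketch below assumes the API key is supplied through an environment variable named GROQ_API_KEY. That variable name and the helper has_llm_key() are assumptions for illustration, so check the package documentation for the actual setup.

```r
# Hypothetical helper (not part of papercheck): TRUE if the named
# environment variable holds a non-empty value.
# The default name GROQ_API_KEY is an assumption.
has_llm_key <- function(var = "GROQ_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (has_llm_key()) {
  message("API key found; LLM queries can be sent.")
} else {
  message("No API key found; skipping the LLM query.")
}
```

Guarding the call this way lets a script that mixes text searches and LLM queries still run its offline parts when no key is configured.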

Batch Processing

The functions pdf2grobid() and read_grobid() also work on a folder of files, returning a list of XML file paths or paper objects, respectively. The functions search_text(), expand_text() and llm() also work on a list of paper objects.

# read in all the XML files in the demo directory
grobid_dir <- demodir()
papers <- read_grobid(grobid_dir)

# select sentences in the intros containing the text "previous"
previous <- search_text(papers, "previous", 
                        section = "intro", 
                        return = "sentence")
text | section | header | div | p | s | id
Royzman et al’s non-replication potentially calls into question the reliability of previously reported links between having an other-sex sibling and moral opposition to third-party sibling incest. | intro | Introduction | 1 | 3 | 3 | incest.xml
Previous research has shown that making cost-benefit analyses of using statistical approaches explicit can influence researchers’ attitudes. | intro | [div-01] | 1 | 8 | 5 | prereg.xml
When exploring difference in responses between previous experience with pre-registration, we see a clear trend where reasearchers who have pre-registered studies in their own research indicate pre-registration is more beneficial, and indicate higher a higher likelihood of pre-registering studies in the future, and higher percentage of studies for which they would consider pre-registering (see Table 2). | intro | Attitude | 3 | 7 | 1 | prereg.xml
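
Once results from several papers are in one table, ordinary data frame operations can summarise them. The sketch below assumes search_text() returns a data frame with the columns shown above. To stay self-contained it builds a small stand-in with the same ids and shortened sentences instead of calling papercheck.

```r
# Stand-in for the result above: same ids, sentences shortened.
# (Assumes search_text() returns a data frame with these column names.)
previous <- data.frame(
  text = c("Royzman et al's non-replication ...",
           "Previous research has shown ...",
           "When exploring difference in responses ..."),
  section = "intro",
  id = c("incest.xml", "prereg.xml", "prereg.xml"),
  stringsAsFactors = FALSE
)

# Number of matching sentences per paper
counts <- table(previous$id)
print(counts)

# Keep only the rows from one paper
prereg_only <- previous[previous$id == "prereg.xml", ]
print(nrow(prereg_only))
```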

Modules

Papercheck has a modular design, so you can add modules that check for anything. It comes with a set of built-in modules, and we hope people will share more.

You can see the list of built-in modules with the function below.

module_list()
  • all-p-values: List all p-values in the text, returning the matched text (e.g., ‘p = 0.04’) and document location in a table.
  • all-urls: List all the URLs in the main text
  • imprecise-p: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.)
  • llm-summarise: Generate a 1-sentence summary for each section
  • marginal: List all sentences that describe an effect as ‘marginally significant’.
  • osf-check: List all OSF links and whether they are open, closed, or do not exist.
  • ref-consistency: Check if all references are cited and all citations are referenced
  • retractionwatch: Flag any cited papers in the RetractionWatch database
  • sample-size-ml: [DEMO] Classify each sentence for whether it contains sample-size information, returning only sentences with probable sample-size info.
  • statcheck: Check consistency of p-values and test statistics

To run a built-in module on a paper, you can reference it by name.

p <- module_run(paper, "all-p-values")
text      | section | header  | div | p | s | id
p = 0.005 | results | Results | 3   | 1 | 2 | to_err_is_human.xml
p = 0.152 | results | Results | 3   | 2 | 1 | to_err_is_human.xml
p > .05   | results | Results | 3   | 2 | 2 | to_err_is_human.xml
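
To give a feel for the kind of check a module performs, here is a rough base R sketch in the spirit of the imprecise-p description. It flags values reported as a comparison against .05 or as "n.s.". The regular expression is a simplified assumption for illustration, not the rule the built-in module actually uses.

```r
# The three p-values listed above, plus one made-up "n.s." example
p_values <- c("p = 0.005", "p = 0.152", "p > .05", "p = n.s.")

# Simplified heuristic (an assumption, not the module's implementation):
# flag "p < .05" / "p > .05" style comparisons and "n.s."
imprecise_pattern <- "p\\s*[<>]\\s*0?\\.05\\b|n\\.s\\."
is_imprecise <- grepl(imprecise_pattern, p_values, perl = TRUE)

print(p_values[is_imprecise])
```

Exact values such as p = 0.005 pass through unflagged, while the comparison and the "n.s." entry are picked out.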

Reports

You can generate a report from any set of modules. The default set is c("imprecise-p", "marginal", "osf-check", "retractionwatch", "ref-consistency").

paper_path <- report(paper, output_format = "html")