The goal of papercheck is to automatically check scientific papers for best practices.
You can install the development version of papercheck from GitHub with:
# install.packages("devtools")
devtools::install_github("scienceverse/papercheck")
You can launch an interactive Shiny app version of the code below with:
papercheck::papercheck_app()
library(papercheck)
Convert a PDF to GROBID XML format, then read it in as a paper object.
pdf <- demopdf() # use the path of your own PDF
grobid <- pdf2grobid(pdf)
paper <- read_grobid(grobid)
Search the returned text. The regex pattern below searches for text that
looks like statistical values (e.g., N=313 or p = 0.17).
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.-]*\\d"
text <- search_text(paper, pattern,
                    return = "match",
                    perl = TRUE)
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
M = 9.12 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
M = 10.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
t(97.7) = 2.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
M = 5.06 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
M = 4.5 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
t(97.2) = -1.96 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
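To see what the pattern picks up without a paper object, it can be tried on a plain string with base R. This is only a minimal sketch: the sentence is made up for illustration, and search_text() may handle matches differently from gregexpr().

```r
# Same pattern as above, applied with base R instead of search_text()
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.-]*\\d"

# Illustrative sentence (not taken from any paper)
sentence <- "The mean was M = 9.12 and t(97.2) = -1.96, p = 0.152, N=313."

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(sentence, gregexpr(pattern, sentence, perl = TRUE))[[1]]
print(matches)
```

Because the pattern only allows = or < as the comparator, a value reported as "p > .05" would not be matched by it.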
You can query the extracted text of papers with LLMs using Groq.
Use search_text() first to narrow the text down to the part you want to
query. Below, we select the full introduction sections of the first two
papers, then ask an LLM “What is the hypothesis of this study?”.
# papers is a list of paper objects, e.g., papers <- read_grobid(demodir())
hypotheses <- search_text(papers[1:2],
                          section = "intro",
                          return = "section")
query <- "What is the hypothesis of this study? Answer as briefly as possible."
llm_hypo <- llm(hypotheses, query)
id | answer |
---|---|
eyecolor.xml | The hypothesis of this study is that humans exhibit positive sexual imprinting, where individuals choose partners with physical characteristics similar to those of their opposite-sex parent. |
incest.xml | The hypothesis is that moral opposition to third-party sibling incest is greater among individuals with other-sex siblings than among individuals with same-sex siblings. |
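Calls to an LLM service generally need an API key, so it can help to check for one before calling llm(). The sketch below assumes the key is stored in an environment variable named GROQ_API_KEY; that name and the helper llm_key_set() are assumptions for illustration, so check the package documentation for the variable papercheck actually reads.

```r
# Hypothetical helper: TRUE if the named environment variable is non-empty.
# Assumption: the key lives in an environment variable called GROQ_API_KEY.
llm_key_set <- function(var = "GROQ_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (llm_key_set()) {
  message("API key found; llm() calls can be attempted.")
} else {
  message("No API key found; set one before calling llm().")
}
```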
The functions pdf2grobid() and read_grobid() also work on a folder
of files, returning a list of XML file paths or paper objects,
respectively. The functions search_text(), expand_text() and llm()
also work on a list of paper objects.
# read in all the XML files in the demo directory
grobid_dir <- demodir()
papers <- read_grobid(grobid_dir)
# select sentences in the intros containing the text "previous"
previous <- search_text(papers, "previous",
                        section = "intro",
                        return = "sentence")
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
Royzman et al’s non-replication potentially calls into question the reliability of previously reported links between having an other-sex sibling and moral opposition to third-party sibling incest. | intro | Introduction | 1 | 3 | 3 | incest.xml |
Previous research has shown that making cost-benefit analyses of using statistical approaches explicit can influence researchers’ attitudes. | intro | [div-01] | 1 | 8 | 5 | prereg.xml |
When exploring difference in responses between previous experience with pre-registration, we see a clear trend where reasearchers who have pre-registered studies in their own research indicate pre-registration is more beneficial, and indicate higher a higher likelihood of pre-registering studies in the future, and higher percentage of studies for which they would consider pre-registering (see Table 2). | intro | Attitude | 3 | 7 | 1 | prereg.xml |
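Since the result is a table with an id column, tallying matches per paper is ordinary data-frame work. The sketch below uses a small hand-built data frame with the same columns as the table above (text shortened) as a stand-in; with real data you would presumably use the object returned by search_text() instead.

```r
# Mock result with the same columns as the table above (text shortened)
previous <- data.frame(
  text    = c("Royzman et al's non-replication ...",
              "Previous research has shown ...",
              "When exploring difference in responses ..."),
  section = "intro",
  header  = c("Introduction", "[div-01]", "Attitude"),
  div     = c(1, 1, 3),
  p       = c(3, 8, 7),
  s       = c(3, 5, 1),
  id      = c("incest.xml", "prereg.xml", "prereg.xml"),
  stringsAsFactors = FALSE
)

# Count matching sentences per paper
counts <- table(previous$id)
print(counts)
```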
Papercheck is designed modularly, so you can add modules to check for anything. It comes with a set of pre-defined modules, and we hope people will share more modules.
You can see the list of built-in modules with the function below.
module_list()
- all-p-values: List all p-values in the text, returning the matched text (e.g., ‘p = 0.04’) and document location in a table.
- all-urls: List all the URLs in the main text
- imprecise-p: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.)
- llm-summarise: Generate a 1-sentence summary for each section
- marginal: List all sentences that describe an effect as ‘marginally significant’.
- osf-check: List all OSF links and whether they are open, closed, or do not exist.
- ref-consistency: Check if all references are cited and all citations are referenced
- retractionwatch: Flag any cited papers in the RetractionWatch database
- sample-size-ml: [DEMO] Classify each sentence for whether it contains sample-size information, returning only sentences with probable sample-size info.
- statcheck: Check consistency of p-values and test statistics
To run a built-in module on a paper, you can reference it by name.
p <- module_run(paper, "all-p-values")
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
p > .05 | results | Results | 3 | 2 | 2 | to_err_is_human.xml |
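The text column holds plain strings, so follow-up filtering can be done in base R. As a minimal sketch, the three p-value strings from the table above are copied into a vector, split into comparator and numeric value, and the exactly reported values below .05 are kept. The threshold and variable names are illustrative and not part of papercheck.

```r
# p-value strings copied from the table above
p_text <- c("p = 0.005", "p = 0.152", "p > .05")

# Comparator: the first =, < or > in each string
comparator <- regmatches(p_text, regexpr("[=<>]", p_text))

# Numeric value: drop everything up to and including the comparator
p_value <- as.numeric(sub("^p\\s*[=<>]\\s*", "", p_text))

p_table <- data.frame(text = p_text, comparator, p_value)

# Exactly reported p-values below an illustrative threshold of .05
print(p_table[p_table$comparator == "=" & p_table$p_value < 0.05, ])
```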
You can generate a report from any set of modules. The default set is
c("imprecise-p", "marginal", "osf-check", "retractionwatch", "ref-consistency").
paper_path <- report(paper, output_format = "html")