---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, eval=FALSE, echo=FALSE}
# Run interactively
devtools::build_readme()
pkgdown::build_site()
```
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# biorecap <a href='https://github.com/stephenturner/biorecap'><img src='man/figures/logo.png' align="right" height="250" /></a>
<!-- badges: start -->
[![R-CMD-check](https://github.com/stephenturner/biorecap/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/stephenturner/biorecap/actions/workflows/R-CMD-check.yaml)
[![arXiv](https://img.shields.io/badge/DOI-10.48550/arXiv.2408.11707-AD1429)](https://doi.org/10.48550/arXiv.2408.11707)
[![biorecap-r-universe](https://stephenturner.r-universe.dev/badges/biorecap)](https://stephenturner.r-universe.dev/biorecap)
<!-- badges: end -->
Retrieve and summarize [bioRxiv](https://www.biorxiv.org/) and [medRxiv](https://www.medrxiv.org/) preprints using a local LLM with [Ollama](https://ollama.com/) via [ollamar](https://cran.r-project.org/package=ollamar).
Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. _arXiv_, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.
## Installation
Install biorecap from GitHub (keep `dependencies=TRUE` to get Suggests packages needed to create the HTML report):
```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("stephenturner/biorecap", dependencies=TRUE)
```
## Usage
### Quick start
First, load the biorecap library.
```{r}
library(biorecap)
```
Let's make sure Ollama is running and that we can talk to it through R:
```{r, eval=FALSE}
test_connection()
```
```
#> Ollama local server running
#> <httr2_response>
#> GET http://localhost:11434/
#> Status: 200 OK
#> Content-Type: text/plain
#> Body: In memory (17 bytes)
```
Next we can list our available models:
```{r, eval=FALSE}
list_models()
```
```
             name   size parameter_size quantization_level            modified
1   gemma2:latest 5.4 GB           9.2B               Q4_0 2024-08-07T07:35:15
3    llama3.1:70b  40 GB          70.6B               Q4_0 2024-07-24T10:57:08
4 llama3.1:latest 4.7 GB           8.0B               Q4_0 2024-07-31T09:38:38
5 llama3.2:latest   2 GB           3.2B             Q4_K_M 2024-09-25T14:54:23
6     phi3:latest 2.2 GB           3.8B               Q4_0 2024-08-28T04:37:58
Write an HTML report containing summaries of recent preprints in selected subject areas to the current working directory. You can mix bioRxiv and medRxiv subjects; biorecap determines which RSS feed to use for each.
```{r, eval=FALSE}
biorecap_report(output_dir=".",
                subject=c("bioinformatics", "infectious_diseases"),
                model="llama3.2")
```
Example HTML report generated from the bioinformatics (bioRxiv) and infectious diseases (medRxiv) subjects on September 25, 2024:
```{r, echo=FALSE}
knitr::include_graphics(here::here("man/figures/report_screenshot.jpg"))
```
### Details
The `get_preprints()` function retrieves preprints from the bioRxiv or medRxiv RSS feed, depending on the subjects you provide. Pass one or more subjects to the `subject` argument.
```{r, eval=FALSE}
pp <- get_preprints(subject=c("bioinformatics",
                              "infectious_diseases"))
head(pp)
tail(pp)
```
```{r, echo=FALSE}
pp <- example_preprints
pp |> dplyr::select(-prompt, -summary) |> head()
pp |> dplyr::select(-prompt, -summary) |> tail()
```
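
As a rough illustration of how a subject could be routed to the right server, here is a minimal base R sketch. The helper `feed_url()`, the two short subject vectors, and the feed URL pattern are assumptions made for this example; they are not taken from biorecap's source, and the package's own `subjects` object holds the full lists.

```r
# Illustrative subsets only; biorecap's built-in `subjects` list is the
# authoritative source of valid subject names.
biorxiv_subjects <- c("bioinformatics", "genomics")
medrxiv_subjects <- c("infectious_diseases", "oncology")

# Hypothetical helper: pick the server for a subject and build a feed URL.
# The URL pattern below is an assumption, not confirmed by this README.
feed_url <- function(subject) {
  if (subject %in% biorxiv_subjects) {
    server <- "biorxiv"
  } else if (subject %in% medrxiv_subjects) {
    server <- "medrxiv"
  } else {
    stop("Unknown subject: ", subject)
  }
  sprintf("https://connect.%s.org/%s_xml.php?subject=%s", server, server, subject)
}

# One URL per requested subject, named by subject.
urls <- vapply(c("bioinformatics", "infectious_diseases"), feed_url, character(1))
urls
```

Failing early on an unknown subject keeps a typo from silently producing an empty feed.
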
The `add_prompt()` function adds a prompt to each preprint; this is the text that will be sent to the model.
```{r, eval=FALSE}
pp <- pp |> add_prompt()
pp
```
```{r, echo=FALSE}
pp |> dplyr::select(-summary)
```
Let's take a look at one of these prompts:
> I am giving you a paper’s title and abstract. Summarize the paper in as many sentences as I instruct. Do not include any preamble text. Just give me the summary.
>
> Number of sentences in summary: 2
>
> Title: SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework
>
> Abstract: Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but the rapid expansion of analytical tools has proven to be both a blessing and a curse, presenting researchers with significant challenges. Here, we present SeuratExtend, a comprehensive R package built upon the widely adopted Seurat framework, which streamlines scRNA-seq data analysis by integrating essential tools and databases. SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface. SeuratExtend enhances data visualization with optimized plotting functions and carefully curated color schemes, ensuring both aesthetic appeal and scientific rigor. We demonstrate SeuratExtend’s performance through case studies investigating tumor-associated high-endothelial venules and autoinflammatory diseases, and showcase its novel applications in pathway-level analysis and cluster annotation. SeuratExtend empowers researchers to harness the full potential of scRNA-seq data, making complex analyses accessible to a wider audience. The package, along with comprehensive documentation and tutorials, is freely available at GitHub, providing a valuable resource for the single-cell genomics community.
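
A prompt with the layout shown above could be assembled in base R roughly as follows. This is only a sketch: `build_prompt()` and its `nsentences` argument are hypothetical names used for illustration, and `add_prompt()` may construct its prompts differently.

```r
# Hypothetical helper that mirrors the prompt layout shown above:
# instructions, sentence count, title, and abstract, separated by blank lines.
build_prompt <- function(title, abstract, nsentences = 2L) {
  instructions <- paste(
    "I am giving you a paper's title and abstract.",
    "Summarize the paper in as many sentences as I instruct.",
    "Do not include any preamble text.",
    "Just give me the summary."
  )
  paste0(
    instructions, "\n\n",
    "Number of sentences in summary: ", nsentences, "\n\n",
    "Title: ", title, "\n\n",
    "Abstract: ", abstract
  )
}

# Print an example prompt built from placeholder text.
prompt <- build_prompt(title = "An example title", abstract = "An example abstract.")
cat(prompt, "\n")
```

Keeping the sentence count as an argument makes it easy to ask for shorter or longer summaries without touching the rest of the prompt.
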
The `add_summary()` function uses a locally running LLM available through Ollama to summarize each preprint. Let's add the summaries. Notice that we can do this all in a single pipeline. This can take a few minutes!
```{r, eval=FALSE}
pp <-
  get_preprints(subject=c("bioinformatics", "infectious_diseases")) |>
  add_prompt() |>
  add_summary(model="llama3.2")
```
Let's take a look at the results:
```{r}
pp
```
Let's look at one of those summaries. Here's the summary for the SeuratExtend paper (abstract above):
> SeuratExtend is an R package that integrates essential tools and databases for single-cell RNA sequencing (scRNA-seq) data analysis, streamlining the process through a user-friendly interface. The package offers various analyses, including functional enrichment and gene regulatory network reconstruction, and seamlessly integrates multiple databases and popular Python tools.
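
Conceptually, the summarization step sends each prompt to the model and stores the reply alongside it. The sketch below illustrates that idea with a stand-in generator so it runs without an Ollama server; `summarize_prompts()`, `fake_generate()`, and `pp_sketch` are hypothetical names for this example and do not describe how `add_summary()` is implemented.

```r
# Hypothetical helper: apply a text-generation function to each prompt and
# return one trimmed summary string per prompt.
summarize_prompts <- function(prompts, model, generate_fn) {
  vapply(
    prompts,
    function(p) trimws(generate_fn(model = model, prompt = p)),
    character(1),
    USE.NAMES = FALSE
  )
}

# Stand-in for a real model call, so the sketch runs offline.
fake_generate <- function(model, prompt) {
  sprintf(" [%s] summary of %d characters ", model, nchar(prompt))
}

# A toy data frame with a prompt column, similar in spirit to `pp`.
pp_sketch <- data.frame(prompt = c("Title: A", "Title: BB"))
pp_sketch$summary <- summarize_prompts(pp_sketch$prompt, "llama3.2", fake_generate)
pp_sketch
```

With a running Ollama server, the stand-in could be swapped for a thin wrapper around `ollamar::generate()`, for example `function(model, prompt) ollamar::generate(model, prompt, output = "text")`, assuming the installed ollamar version supports the `output = "text"` option.
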
The `biorecap_report()` function runs this code in an RMarkdown template and writes the resulting HTML report, plus a CSV file of the results, to the directory given by `output_dir` (here, the current working directory).
```{r, eval=FALSE}
biorecap_report(output_dir=".",
                subject=c("bioinformatics", "infectious_diseases"),
                model="llama3.2")
```
The built-in `subjects` object is a list of vectors containing all available bioRxiv and medRxiv subjects.
```{r}
subjects$biorxiv
subjects$medrxiv
```
You could create a report for _all_ subjects like this (note that this could take some time):
```{r, eval=FALSE}
biorecap_report(output_dir=".",
                subject=c(subjects$biorxiv, subjects$medrxiv),
                model="llama3.2")
```