README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, eval=FALSE, echo=FALSE}
# Run interactively
devtools::build_readme()
pkgdown::build_site()
```


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# biorecap <a href='https://github.com/stephenturner/biorecap'><img src='man/figures/logo.png' align="right" height="250" /></a>

<!-- badges: start -->
[![R-CMD-check](https://github.com/stephenturner/biorecap/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/stephenturner/biorecap/actions/workflows/R-CMD-check.yaml)
[![arXiv](https://img.shields.io/badge/DOI-10.48550/arXiv.2408.11707-AD1429)](https://doi.org/10.48550/arXiv.2408.11707)
[![biorecap-r-universe](https://stephenturner.r-universe.dev/badges/biorecap)](https://stephenturner.r-universe.dev/biorecap)
<!-- badges: end -->

Retrieve and summarize [bioRxiv](https://www.biorxiv.org/) and [medRxiv](https://www.medrxiv.org/) preprints using a local LLM with [Ollama](https://ollama.com/) via [ollamar](https://cran.r-project.org/package=ollamar). 

Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. _arXiv_, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707. 

## Installation

Install biorecap from GitHub (keep `dependencies=TRUE` to get Suggests packages needed to create the HTML report):

```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("stephenturner/biorecap", dependencies=TRUE)
```

## Usage

### Quick start

First, load the biorecap library.

```{r}
library(biorecap)
```

Let's make sure Ollama is running and that we can talk to it through R:

```{r, eval=FALSE}
test_connection()
```

```
#> Ollama local server running
#> <httr2_response>
#> GET http://localhost:11434/
#> Status: 200 OK
#> Content-Type: text/plain
#> Body: In memory (17 bytes)
```

Next we can list our available models:

```{r, eval=FALSE}
list_models()
```

```
             name   size parameter_size quantization_level            modified
1   gemma2:latest 5.4 GB           9.2B               Q4_0 2024-08-07T07:35:15
3    llama3.1:70b  40 GB          70.6B               Q4_0 2024-07-24T10:57:08
4 llama3.1:latest 4.7 GB           8.0B               Q4_0 2024-07-31T09:38:38
5 llama3.2:latest   2 GB           3.2B             Q4_K_M 2024-09-25T14:54:23
6     phi3:latest 2.2 GB           3.8B               Q4_0 2024-08-28T04:37:58      
```

Write an HTML report containing summaries of recent preprints in select subject areas to the current working directory. You can include both bioRxiv and medRxiv subjects, and biorecap will know which RSS feed to use.

```{r, eval=FALSE}
biorecap_report(output_dir=".", 
                subject=c("bioinformatics", "infectious_diseases"), 
                model="llama3.2")
```

Example HTML report generated from bioRxiv (bioinformatics) and infectious diseases (medRxiv) subjects on September 25, 2024:

```{r, echo=FALSE}
knitr::include_graphics(here::here("man/figures/report_screenshot.jpg"))
```


### Details

The `get_preprints()` function retrieves preprints from the RSS feed of either bioRxiv or medRxiv, based on the subject you provided. You pass one or more subjects to the `subject` argument. 

```{r, eval=FALSE}
pp <- get_preprints(subject=c("bioinformatics", 
                              "infectious_diseases"))
head(pp)
tail(pp)
```

```{r, echo=FALSE}
pp <- example_preprints
pp |> dplyr::select(-prompt, -summary) |> head()
pp |> dplyr::select(-prompt, -summary) |> tail()
```

The `add_prompt()` function adds a prompt to each preprint that will be used to prompt the model.

```{r, eval=FALSE}
pp <- pp |> add_prompt()
pp
```

```{r, echo=FALSE}
pp |> dplyr::select(-summary)
```

Let's take a look at one of these prompts:

> I am giving you a paper’s title and abstract. Summarize the paper in as many sentences as I instruct. Do not include any preamble text. Just give me the summary. 
> 
> Number of sentences in summary: 2 
> 
> Title: SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework 
> 
> Abstract: Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but the rapid expansion of analytical tools has proven to be both a blessing and a curse, presenting researchers with significant challenges. Here, we present SeuratExtend, a comprehensive R package built upon the widely adopted Seurat framework, which streamlines scRNA-seq data analysis by integrating essential tools and databases. SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising. The package seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface. SeuratExtend enhances data visualization with optimized plotting functions and carefully curated color schemes, ensuring both aesthetic appeal and scientific rigor. We demonstrate SeuratExtend’s performance through case studies investigating tumor-associated high-endothelial venules and autoinflammatory diseases, and showcase its novel applications in pathway-Level analysis and cluster annotation. SeuratExtend empowers researchers to harness the full potential of scRNA-seq data, making complex analyses accessible to a wider audience. The package, along with comprehensive documentation and tutorials, is freely available at GitHub, providing a valuable resource for the single-cell genomics community.

The `add_summary()` function uses a locally running LLM available through Ollama to summarize the preprint. Let's add the summary. Notice that we can do this all in a single pipeline. This takes a few minutes!

```{r, eval=FALSE}
pp <- 
  get_preprints(subject=c("bioinformatics", "infectious_diseases")) |> 
  add_prompt() |> 
  add_summary(model="llama3.2")
```

Let's take a look at the results:

```{r}
pp
```

Let's look at one of those summaries. Here's the summary for the SeuratExtend paper (abstract above):

> SeuratExtend is an R package that integrates essential tools and databases for single-cell RNA sequencing (scRNA-seq) data analysis, streamlining the process through a user-friendly interface. The package offers various analyses, including functional enrichment and gene regulatory network reconstruction, and seamlessly integrates multiple databases and popular Python tools.

The `biorecap_report()` function runs this code in an RMarkdown template, writing the resulting HTML and CSV file with results to the current working directory.

```{r, eval=FALSE}
biorecap_report(output_dir=".", 
                subject=c("bioinformatics", "infectious_diseases"), 
                model="llama3.2")
```

The built-in `subjects` is a list with vectors containing all the available bioRxiv and medRxiv subjects.

```{r}
subjects$biorxiv
subjects$medrxiv
```

You could create a report for _all_ subjects like this (note, this could take some time):

```{r, eval=FALSE}
biorecap_report(output_dir=".", 
                subject=c(subjects$biorxiv, subjects$medrxiv)
                model="llama3.2")
```