README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# repeatR

Read and analyse RepeatMasker output in R. 

Very early in development!

## install

```r
library(devtools)
install_github("dwinter/repeatR")
```

# A basic usage

The package comes with a small example dataset, including the repeats from one
scaffold in the [kākāpō assembly](https://www.ncbi.nlm.nih.gov/assembly/GCF_004027225.2/).
We can read this file in memory using `read_rm`

```{r}
library(repeatR)
# create a file path relative to the installed package, this step is not
# necessary for normal usage
rm_file <- system.file("extdata", "kakapo.out", package="repeatR")
kakapo <- read_rm(rm_file)
kakapo
```

As you can see, the function reads tdata and returns a `data.frame` with the
alignment information from RepeatMasker.We can now quickly look at the 
composition of the repeats alignments on this scaffold:

```{r}
library(ggplot2)
ggplot(kakapo, aes(tclass)) + 
    geom_bar() + 
    coord_flip()  + 
    theme_bw(base_size=14) 
```

It is important to note, however, that the alignment between a reference genome
and a given repeat element might be broken up over multiple rows in RepeatMakser
output. This occurs when elements are nested within each other (a pattern that
is very common for some elements in some species). `repeatR` provides a the
function `summarise_rm_ID` to produce a new table with one row per unique
element in the genome.

```{r}
kakapo_aggregated <- summarise_rm_ID(kakapo)
head(kakapo_aggregated)
```

With this data, we can start to analyse the total amount of the scaffold covered
by elements of different classes

```{r}
ggplot(kakapo_aggregated, aes(qlen, tclass)) +
    geom_col() +
    theme_bw(base_size=14) +
    scale_x_continuous(labels=Mb_lab) 
```

Quite often, you will want to remove some fo the sequences that are included in
the output file. For instance, simple repeats and low complexity regions. The
function `filter_by_tclass` will remove thise sequences along with functional
RNAs and `ARTEFACT` sequences.

```{r}
kakapo_just_TEs <-  filter_by_tclass(kakapo_aggregated)
table(kakapo_just_TEs$tclass)
```


Or the distrbution of the `p_sub` statistic (the proportion of bases that
different from the consensus element). The function `make_TE_pallete` includes a
pre-defined pallete for the `tclass` column.

```{r}
ggplot(kakapo_just_TEs, aes(p_sub, fill=tclass)) +
    geom_histogram(colour="black") +
    scale_fill_manual(values=make_TE_pallete(kakapo_aggregated)) +
    theme_bw(base_size=14) 
```