Mike TODO: Write up example of using enhancer-promoter overlaps to compute correlation #5

Open
mikelove opened this issue Jul 21, 2023 · 3 comments
Labels: documentation (Improvements or additions to documentation)

@mikelove
Member

mikelove commented Jul 21, 2023

a note to myself

Write up a chapter showing how to compute these correlations in a way that doesn't involve copying/modifying the SE data.

https://gist.github.com/mikelove/2e899346d92908e6cbe3448705e4b5de
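
Roughly the shape of what I have in mind (a sketch only, with hypothetical se_enh / se_prom SummarizedExperiments standing in for the objects in the gist, not the gist's actual code):

library(SummarizedExperiment)
library(plyranges)

# keep a row index on the ranges rather than copying or subsetting the SEs
enh <- rowRanges(se_enh)
enh$id.x <- seq_along(enh)
prom <- rowRanges(se_prom)
prom$id.y <- seq_along(prom)

# the overlap join carries both row indices along
ov <- join_overlap_inner(enh, prom, maxgap = 100)

# pull the assay matrices once and index them by the overlap's row ids,
# so the correlations never require modifying the SE objects themselves
dat_x <- assay(se_enh)
dat_y <- assay(se_prom)
ov$rho <- vapply(seq_along(ov), function(i) {
  cor(dat_x[ov$id.x[i], ], dat_y[ov$id.y[i], ])
}, numeric(1))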

@mikelove mikelove added the documentation Improvements or additions to documentation label Jul 21, 2023
@mikelove mikelove self-assigned this Jul 26, 2023
@FedericoVann

Hello!
I tried to speed up the computation.

By using vapply():

# compute rho for each overlapping pair, using vapply() in place of map2_dbl()
x_overlaps <- x_overlaps %>%
  mutate(rho = vapply(seq_along(id.x), function(i) {
    cor(dat_x[id.x[i], ], dat_y[id.y[i], ])
  }, numeric(1)))

Regardless of the method, one idea could be to convert the input into a data.table and then convert it back into a GRanges object afterwards (if that step is required).

This seems to speed everything up drastically.
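
For example, the round trip itself is just a few lines (a minimal sketch reusing x_overlaps, dat_x, and dat_y from above, with gUtils providing gr2dt()/dt2gr()):

library(data.table)
library(gUtils) # gr2dt() / dt2gr() for the GRanges <-> data.table round trip

ov_dt <- gr2dt(x_overlaps)  # GRanges -> data.table
ov_dt[, rho := vapply(seq_len(.N), function(i) {
  cor(dat_x[id.x[i], ], dat_y[id.y[i], ])
}, numeric(1))]             # compute rho row by row
x_overlaps <- dt2gr(ov_dt, key = NULL, seqlengths = NULL,
                    seqinfo = Seqinfo())  # back to GRanges, if needed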

I microbenchmarked the map2_dbl() and vapply() methods with the input in both formats:

library(plyranges) # join_overlap_inner(), filter(), select(), mutate() on GRanges
library(purrr) # map2_dbl()
library(microbenchmark) # to benchmark the different methods
library(data.table) # to convert GRanges into data.tables
library(gUtils) # dt2gr() to convert data.tables back into the GRanges class
library(ggplot2) # to make the autoplot

# input as GRanges object
x_overlaps_GRanges <- x %>% 
  join_overlap_inner(y, maxgap=100) %>%
  filter(tile_id.x == tile_id.y) %>%
  select(tile_id = tile_id.x, id.x, id.y) 

# input as data.table object
x_overlaps_data_table <- x %>% 
  join_overlap_inner(y, maxgap=100) %>%
  filter(tile_id.x == tile_id.y) %>%
  select(tile_id = tile_id.x, id.x, id.y) %>%
  as.data.table()

methods_performance <- microbenchmark(
  setup = set.seed(12),

  # Input as GRanges formal class
  # Correlations using map2_dbl() and mutate()
  x_overlaps_map2_dbl_GR = x_overlaps_GRanges %>%
    mutate(rho = map2_dbl(id.x, id.y, function(.x, .y) {
      cor(dat_x[.x, ], dat_y[.y, ])
    })),

  # Correlations using vapply() and mutate()
  x_overlaps_vapply_GR = x_overlaps_GRanges %>%
    mutate(rho = vapply(seq_along(id.x), function(i) {
      cor(dat_x[id.x[i], ], dat_y[id.y[i], ])
    }, numeric(1))),

  # Input converted into data.table format
  # Correlations using map2_dbl() and mutate()
  x_overlaps_map2_dbl_DT = x_overlaps_data_table %>%
    mutate(rho = map2_dbl(id.x, id.y, function(.x, .y) {
      cor(dat_x[.x, ], dat_y[.y, ])
    })) %>%
    dt2gr(key = NULL, seqlengths = NULL, seqinfo = Seqinfo()),

  # Correlations using vapply() and mutate()
  x_overlaps_vapply_DT = x_overlaps_data_table %>%
    mutate(rho = vapply(seq_along(id.x), function(i) {
      cor(dat_x[id.x[i], ], dat_y[id.y[i], ])
    }, numeric(1))) %>%
    dt2gr(key = NULL, seqlengths = NULL, seqinfo = Seqinfo()),

  times = 100
)

# look at the benchmark results
methods_performance

# plot the timings
autoplot(methods_performance) + theme_bw()

An additional tip could be to parallelize the code with mclapply():

library(parallel)

# input as data.table object
x_overlaps <- x %>% 
  join_overlap_inner(y, maxgap=100) %>%
  filter(tile_id.x == tile_id.y) %>%
  select(tile_id = tile_id.x, id.x, id.y) %>%
  as.data.table()

# Calculate correlations in parallel with mclapply()
# (note: forking is not available on Windows, where mc.cores must be 1)
x_overlaps$rho <- unlist(mclapply(seq_along(x_overlaps$id.x), function(i) {
  cor(dat_x[x_overlaps$id.x[i], ], dat_y[x_overlaps$id.y[i], ])
}, mc.cores = detectCores()))

All of the code above has been tested.

I hope all this has been helpful.

Cheers!

@FedericoVann

Performance_plot.pdf

@mikelove
Member Author

mikelove commented Jul 30, 2023

Sorry, I should have explained this one better.

Yes, we can speed up many operations by converting to data.table.

I meant that I have already derived a faster solution to a previous problem that doesn’t involve modifying the original S4 objects, but I haven’t written it up, so I assigned this to myself as a todo.

But I will take a look at your report, thank you!

@mikelove mikelove changed the title Write up example of using enhancer-promoter overlaps to compute correlation Mike TODO: Write up example of using enhancer-promoter overlaps to compute correlation Jul 30, 2023