Speedup (~20x) of scanpy.pp.regress_out function using the Linear Least Squares method. #3110
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #3110      +/-   ##
==========================================
+ Coverage   76.31%   76.39%   +0.08%
==========================================
  Files         109      109
  Lines       12513    12526      +13
==========================================
+ Hits         9549     9569      +20
+ Misses       2964     2957       -7
Thanks for keeping them coming in! It would be helpful if you added comments documenting why your newly introduced branch is faster.
Benchmark changes
Comparison: https://github.com/scverse/scanpy/compare/ad657edfb52e9957b9a93b3a16fc8a87852f3f09..3e3ca9dcb2e77e72b75095ff895cb55aeb7f98bc
More details: https://github.com/scverse/scanpy/pull/3110/checks?check_run_id=26405245612
ooh, this time the benchmark shows really nicely how much faster it is!
Looks like preprocessing_log.time_regress_out('pbmc68k_reduced') regresses out variables that are not in the dataset. It should regress out ['n_counts', 'percent_mito'] instead of ["total_counts", "pct_counts_mt"]. Both commits fail on this, so they report the same time.
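A minimal sketch of the suggested fix (assuming, per the comment above, that pbmc68k_reduced carries 'n_counts' and 'percent_mito' in .obs):

import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()
# regress out covariates that actually exist in adata.obs for this dataset
sc.pp.regress_out(adata, keys=["n_counts", "percent_mito"])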
looks good from my side, do you have opinions, @Intron7?
It looks really good. I would love to have a closer look next week.
@flying-sheep, can you look at this benchmark test?
Co-authored-by: Philipp A. <[email protected]>
Hi, everything is OK with the benchmarks. The reason these are named differently is here: scanpy/benchmarks/benchmarks/_utils.py, lines 27 to 31 in ad657ed.
I did that to be able to run the benchmarks on multiple data sets with the same code.
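For context, a hedged sketch of how a parametrised ASV benchmark can reuse the same code across datasets; the class name, dataset list, and key mapping are illustrative, not the actual helpers in _utils.py:

import scanpy as sc


class RegressOutSuite:
    params = ["pbmc68k_reduced", "pbmc3k_processed"]
    param_names = ["dataset"]

    def setup(self, dataset):
        # assumed .obs keys; adjust to whatever each dataset actually provides
        loaders = {
            "pbmc68k_reduced": (sc.datasets.pbmc68k_reduced, ["n_counts", "percent_mito"]),
            "pbmc3k_processed": (sc.datasets.pbmc3k_processed, ["n_counts", "percent_mito"]),
        }
        load, self.keys = loaders[dataset]
        self.adata = load()

    def time_regress_out(self, dataset):
        sc.pp.regress_out(self.adata, keys=self.keys)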
This will work for a CSC matrix:

from typing import TypeVar

import numba
import numpy as np
from numpy.typing import NDArray

DT = TypeVar("DT", bound=np.generic)


@numba.njit(cache=True, parallel=True)
def to_dense_csc(
    shape: tuple[int, int],
    indptr: NDArray[np.integer],
    indices: NDArray[np.integer],
    data: NDArray[DT],
) -> NDArray[DT]:
    """\
    Numba kernel densifying a CSC matrix (equivalent of scipy's .toarray()).
    """
    X = np.empty(shape, dtype=data.dtype)
    # one column per parallel iteration: zero it, then scatter the stored values
    for c in numba.prange(shape[1]):
        X[:, c] = 0
        for i in range(indptr[c], indptr[c + 1]):
            X[indices[i], c] = data[i]
    return X
That kernel technically works, but it is much slower than the default scipy operation due to an inefficient memory access pattern (it writes one column at a time into a C-contiguous array, so every write is strided).
@numba.njit(cache=True, parallel=True)
def get_resid(
    data: np.ndarray,
    regressor: np.ndarray,
    coeff: np.ndarray,
) -> np.ndarray:
Is the DT not important here as above?
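A hedged sketch of what reusing the DT TypeVar could look like here; the body (data - regressor @ coeff) is an illustrative residual computation, not necessarily the PR's exact kernel:

from typing import TypeVar

import numpy as np
from numpy.typing import NDArray

DT = TypeVar("DT", bound=np.generic)


def get_resid(
    data: NDArray[DT],
    regressor: NDArray[DT],
    coeff: NDArray[DT],
) -> NDArray[DT]:
    """Residuals left after removing the regressors' fitted contribution."""
    return data - regressor @ coeff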
regres = regressors
tasks.append(tuple((data_chunk, regres, variable_is_categorical)))

from joblib import Parallel, delayed

res = None
if not variable_is_categorical:
    A = regres.to_numpy()
I think the code before this should be refactored as well to include these new methods in the condition (i.e. at https://github.com/scverse/scanpy/pull/3110/files#diff-f9e0bdcdfc04622c421f0a5788bbba6ee0303750580d2915caba3239d799322fR759). We can just check regressors.to_numpy() for a non-zero determinant before iterating through the chunk list. From what I can tell, we don't need to build the list at all if not variable_is_categorical and the determinant is non-zero. This will also make things cleaner because we will need fewer checks. We can then use a boolean to decide whether to move into the next clause (which will include chunk creation) instead of checking whether res is None. A sketch of this check follows below.
Does this make sense?
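A hypothetical helper illustrating the suggestion (the helper name and the det(A.T @ A) reading of "non-zero determinant" are mine, since regressors is generally non-square):

import numpy as np


def _use_lstsq_path(regressors: np.ndarray, variable_is_categorical: bool) -> bool:
    """True if the fast least-squares branch applies and no task list is needed."""
    if variable_is_categorical:
        return False
    # a non-singular normal matrix means the least-squares fit is well posed
    return np.linalg.det(regressors.T @ regressors) != 0

The caller could then branch on _use_lstsq_path(regressors.to_numpy(), variable_is_categorical) before any chunk creation and only build tasks in the fallback branch.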
Hi,
We are submitting this PR to speed up the regress_out function. Here we find the coefficients using linear regression (linear least squares) rather than a GLM for non-categorical data.
Experiment setup: AWS r7i.24xlarge
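A minimal sketch of that idea on synthetic data (not the PR's implementation): fit ordinary least squares for all genes at once with np.linalg.lstsq and keep the residuals, instead of fitting a GLM per gene.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 200))        # cells x genes expression matrix
A = np.column_stack([
    np.ones(500),                 # intercept
    rng.random(500),              # e.g. n_counts
    rng.random(500),              # e.g. percent_mito
])

coeff, *_ = np.linalg.lstsq(A, X, rcond=None)   # solve A @ coeff ~= X
residuals = X - A @ coeff                       # regressed-out expression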