A simplified API to Intel(R) oneAPI Data Analytics Library that enables fast usage of the framework, suited for data scientists and machine learning users. It provides an abstraction over Intel(R) oneAPI Data Analytics Library for direct use or for integration into your own framework, and goes beyond that by providing drop-in patching for scikit-learn.
- Documentation
- scikit-learn API and patching
- Source Code
- Building from Sources
- About Intel(R) oneAPI Data Analytics Library
- Running the full scikit-learn test suite with daal4py's optimization patches
daal4py can be installed from conda-forge (recommended):

```bash
conda install daal4py -c conda-forge
```

or from the Intel channel:

```bash
conda install daal4py -c intel
```
You can build daal4py from sources as well.
The core functionality of daal4py is in-place scikit-learn patching: same code, same behavior, but faster execution.
Intel CPU optimizations patching
```python
from daal4py.sklearn import patch_sklearn
patch_sklearn()

# import scikit-learn estimators only after patching
from sklearn.svm import SVC
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

clf = SVC().fit(X, y)
res = clf.predict(X)
```
Intel CPU/GPU optimizations patching
```python
from daal4py.sklearn import patch_sklearn
from daal4py.oneapi import sycl_context
patch_sklearn()

from sklearn.svm import SVC
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# run both training and prediction on the GPU
with sycl_context("gpu"):
    clf = SVC().fit(X, y)
    res = clf.predict(X)
```
The daal4py API allows you to use a wider set of Intel(R) oneAPI Data Analytics Library algorithms in just one line:
```python
import daal4py as d4p

# `data` is any 2D array-like; compute initial centroids for 10 clusters
init = d4p.kmeans_init(10, t_method="plusPlusDense")
result = init.compute(data)
```
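As a possible follow-up (a sketch assuming daal4py's `kmeans` batch algorithm and the result object's `.centroids` attribute), the computed centroids can be fed straight into the clustering step:

```python
# cluster the data starting from the centroids computed above;
# daal4py result objects expose named fields such as .centroids
clusters = d4p.kmeans(10, 100).compute(data, result.centroids)
```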
You can even run this on a cluster by making simple code changes:
```python
import daal4py as d4p

d4p.daalinit()  # initialize the distributed execution environment
init = d4p.kmeans_init(10, t_method="plusPlusDense", distributed=True)
# each process passes its local chunk of the data
# (e.g. selected by process id via d4p.my_procid())
result = init.compute(data)
d4p.daalfini()  # shut down the distributed runtime
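Such a script is then started with an MPI launcher so that each process works on its own chunk of the data, for example `mpirun -n 4 python ./kmeans.py` (the script name here is illustrative).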
daal4py patching affects the performance of the specific scikit-learn functionality listed below. When unsupported parameters are used, daal4py falls back to stock scikit-learn; these limitations are described below. If the patching does not cover your scenarios, submit an issue on GitHub.
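For instance (a minimal sketch based on the SVC limitations listed in the table below), requesting an unsupported parameter transparently dispatches the call to stock scikit-learn:

```python
from daal4py.sklearn import patch_sklearn
patch_sklearn()

from sklearn.svm import SVC
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
# kernel='poly' is not accelerated (see the table below),
# so this call falls back to the stock scikit-learn solver
clf = SVC(kernel="poly").fit(X, y)
```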
Scenarios that are already available in the 2020.3 release:

| Task | Functionality | Parameters support | Data support |
|---|---|---|---|
| Classification | SVC | All parameters except `kernel` = 'poly' and 'sigmoid'. | No limitations. |
| | RandomForestClassifier | All parameters except `warm_start` = True, `ccp_alpha` != 0, `criterion` != 'gini'. | Multi-output and sparse data are not supported. |
| | KNeighborsClassifier | All parameters except `metric` != 'euclidean' or 'minkowski' with `p` = 2. | Multi-output and sparse data are not supported. |
| | LogisticRegression / LogisticRegressionCV | All parameters except `solver` != 'lbfgs' or 'newton-cg', `class_weight` != None, `sample_weight` != None. | Only dense data is supported. |
| Regression | RandomForestRegressor | All parameters except `warm_start` = True, `ccp_alpha` != 0, `criterion` != 'mse'. | Multi-output and sparse data are not supported. |
| | LinearRegression | All parameters except `normalize` != False and `sample_weight` != None. | Only dense data is supported; #observations should be >= #features. |
| | Ridge | All parameters except `normalize` != False, `solver` != 'auto' and `sample_weight` != None. | Only dense data is supported; #observations should be >= #features. |
| | ElasticNet | All parameters except `sample_weight` != None. | Multi-output and sparse data are not supported; #observations should be >= #features. |
| | Lasso | All parameters except `sample_weight` != None. | Multi-output and sparse data are not supported; #observations should be >= #features. |
| Clustering | KMeans | All parameters except `precompute_distances` and `sample_weight` != None. | No limitations. |
| | DBSCAN | All parameters except `metric` != 'euclidean' or 'minkowski' with `p` = 2. | Only dense data is supported. |
| Dimensionality reduction | PCA | All parameters except `svd_solver` != 'full'. | No limitations. |
| Other | train_test_split | All parameters are supported. | Only dense data is supported. |
| | assert_all_finite | All parameters are supported. | Only dense data is supported. |
| | pairwise_distances | Only with `metric` = 'cosine' or 'correlation'. | Only dense data is supported. |
Scenarios that are only available in the master branch (not released yet):

| Task | Functionality | Parameters support | Data support |
|---|---|---|---|
| Regression | KNeighborsRegressor | All parameters except `metric` != 'euclidean' or 'minkowski' with `p` = 2. | Sparse data is not supported. |
| Unsupervised | NearestNeighbors | All parameters except `metric` != 'euclidean' or 'minkowski' with `p` = 2. | Sparse data is not supported. |
| Dimensionality reduction | TSNE | All parameters except `metric` != 'euclidean' or 'minkowski' with `p` = 2. | Sparse data is not supported. |
| Other | roc_auc_score | Parameters `average`, `sample_weight`, `max_fpr` and `multi_class` are not supported. | No limitations. |
To find out which implementation of the algorithm is currently used (daal4py or stock Scikit-learn), set the environment variable:
- On Linux and macOS: `export IDP_SKLEARN_VERBOSE=INFO`
- On Windows: `set IDP_SKLEARN_VERBOSE=INFO`
For example, for DBSCAN you get one of these print statements depending on which implementation is used:
```
INFO: sklearn.cluster.DBSCAN.fit: uses Intel(R) oneAPI Data Analytics Library solver
INFO: sklearn.cluster.DBSCAN.fit: uses original Scikit-learn solver
```
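The variable can also be set from Python before the patched estimators are used (a minimal sketch; the estimator and data here are illustrative):

```python
import os
os.environ["IDP_SKLEARN_VERBOSE"] = "INFO"  # enable verbose dispatch logging

from daal4py.sklearn import patch_sklearn
patch_sklearn()

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)
DBSCAN(eps=0.3).fit(X)  # prints which solver handled the call
```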
Read more in the documentation.
See Building from Sources for details.