
ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976

Open
wants to merge 6 commits into
base: main

Conversation

david-cortes-intel
Contributor

@david-cortes-intel david-cortes-intel commented Nov 14, 2024

Description

As a follow-up to #2930.

This PR adds a GPU version of the fallback for linear regression based on spectral decomposition, which mirrors the CPU version. It reuses the same tests, but this time executing them with fp32 too.
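For intuition, the solve this fallback performs can be sketched on the host in plain C++ (an illustrative sketch with hypothetical names, not the PR's GPU code): given an eigendecomposition A = Q · diag(l) · Qᵀ, eigenvalues below a threshold are discarded and the system is solved through the pseudoinverse.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative host-side sketch (hypothetical names): solve A*x = b where
// A = Q * diag(l) * Q^T, discarding eigenvalues at or below `tol`.
// Since Q is orthogonal: x = Q * diag(1/l_i or 0) * Q^T * b.
template <std::size_t N>
std::array<double, N> spectral_solve(const std::array<std::array<double, N>, N>& Q,
                                     const std::array<double, N>& l,
                                     const std::array<double, N>& b,
                                     double tol) {
    // t = Q^T * b
    std::array<double, N> t{};
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k)
            t[i] += Q[k][i] * b[k];
    // scale by 1/l_i, zeroing components whose eigenvalues are too small
    for (std::size_t i = 0; i < N; ++i)
        t[i] = (l[i] > tol) ? t[i] / l[i] : 0.0;
    // x = Q * t
    std::array<double, N> x{};
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k)
            x[i] += Q[i][k] * t[k];
    return x;
}
```

Zeroing the small-eigenvalue components is what makes this robust to the rank-deficient inputs that would have made a Cholesky-based solve fail.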

A few things I wasn't sure about:

  • Here I'm taking the minimum of the diagonal of the Cholesky factorization using a SYCL reduction, but I'm not sure if that's the usual way of doing this in oneDAL.
  • I am also not 100% sure whether this operation (minimum of the diagonal), with an at_device call at the end, is guaranteed to wait until the operation has completed before returning the value.
  • There's an operation here for which oneMKL doesn't provide an equivalent LAPACK routine: "column-major matrix divided by a broadcasted row vector" (i.e. each row of the matrix divided elementwise by the vector). I implemented it as a 2D for-loop with a multiplication by the inverse (which is faster but less precise), but I don't know whether this is better than a 1D loop that would process a whole row at a time, as rows are contiguous in memory.
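A minimal host-side sketch of that broadcast-divide step, assuming a layout where each row is contiguous and is divided elementwise by the vector (names are hypothetical; the actual kernel in the PR runs on the GPU):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the "matrix divided by a broadcasted row vector"
// step: M is m x n with contiguous rows (element (i, j) at M[i*n + j]);
// each row of M is divided elementwise by v. Multiplying by precomputed
// inverses trades a little precision for doing only one division per
// vector element instead of one per matrix element.
void divide_by_row_vector(std::vector<double>& M,
                          std::size_t m, std::size_t n,
                          const std::vector<double>& v) {
    std::vector<double> inv(n);
    for (std::size_t j = 0; j < n; ++j)
        inv[j] = 1.0 / v[j]; // precompute the inverses once
    for (std::size_t i = 0; i < m; ++i) {
        double* row = M.data() + i * n; // rows are contiguous in this layout
        for (std::size_t j = 0; j < n; ++j)
            row[j] *= inv[j];
    }
}
```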

The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are checked.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require the performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

Not applicable (addition is a fallback for cases that would have failed previously).

@david-cortes-intel david-cortes-intel added the dpc++ Issue/PR related to DPC++ functionality label Nov 14, 2024
@david-cortes-intel david-cortes-intel marked this pull request as draft November 14, 2024 10:30
@david-cortes-intel
Contributor Author

/intelci: run

/* Decompose: A = Q * diag(l) * t(Q) */
/* Note: for NRHS>1, this will overallocate in order to reuse the memory as buffer later on */
auto eigenvalues = array<Float>::empty(queue, dim_A * nrhs, alloc);
auto eigenvalues_view = ndview<Float, 1>::wrap(eigenvalues);
Contributor

Does it make sense to replace ndview and array with ndarray?

Contributor Author

The signature of the function wrapper for syevd uses ndview, so it'd require modifying things elsewhere:

sycl::event syevd(sycl::queue& queue,

Contributor

There is an implicit conversion that works:
https://github.com/oneapi-src/oneDAL/blob/main/cpp/oneapi/dal/algo/pca/backend/gpu/misc.hpp#L77
here an ndarray has been passed to syevd.

Contributor Author

Thanks. And what would be the difference between array and ndarray? Looks like they use the same allocator.

Contributor

In general there would be no difference, but you can avoid the extra allocation of the ndview.

Contributor Author

Changed to ndarray.

/* Discard too small singular values */
std::int64_t num_discarded;
{
/* This is placed inside a block because the array created here is not used any further */
Contributor

Why is it done on CPU? Are there any blockers to using a GPU kernel?

Contributor Author

It requires finding the index of the first value in a sorted array that is above a certain threshold. I couldn't think of any fast way of doing it on GPU which wouldn't involve just transferring the data to CPU, either at once or one-by-one.

Do you have a suggestion for some better way of doing it?
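For reference, once the sorted eigenvalues are on the host, finding the cutoff is a plain binary search; a sketch (illustrative, not the PR's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Count how many leading entries of an ascending-sorted array are at or
// below `threshold` (i.e. how many eigenvalues to discard). std::upper_bound
// makes this O(log n) once the data has been transferred to the host.
std::int64_t count_discarded(const std::vector<double>& sorted_vals,
                             double threshold) {
    auto it = std::upper_bound(sorted_vals.begin(), sorted_vals.end(), threshold);
    return static_cast<std::int64_t>(it - sorted_vals.begin());
}
```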

Contributor

no, sounds good, thanks


opt_array<Float> dummy{};
auto potrf_event = potrf_factorization<uplo>(queue, nxtx, dummy, { xtx_event });
auto potrs_event = potrs_solution<uplo>(queue, nxtx, nxty, dummy, { potrf_event, xty_event });
try {
Contributor

As I remember, we use try/catch only in tests/examples. Does it make sense to avoid it here?

Contributor Author

It's meant to catch an exception thrown by oneMKL in the call to potrf, over which we have no control AFAIK. An alternative would be to modify the wrapper, but it's also used elsewhere, in places where the exception is not caught (so a user-facing error is thrown).

@david-cortes-intel
Contributor Author

david-cortes-intel commented Nov 14, 2024

Oddly, the first version of this function (minimum of the diagonal) that I wrote seems to succeed in some tests but fail in others (particularly when it is called repeatedly, as in incremental linear regression):

template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto diag_min_holder = array<Float>::empty(queue, 1, alloc);
    sycl::event diag_min_event = queue.submit([&](sycl::handler& h) {
        auto min_reduction = sycl::reduction(diag_min_holder.get_mutable_data(), sycl::minimum<>());
        h.depends_on(event_Matrix);
        h.parallel_for(dim_matrix, min_reduction, [=](const auto& i, auto& min_obj) {
            min_obj.combine(Matrix[i * (dim_matrix + 1)]);
        });
    });
    return ndview<Float, 1>::wrap(diag_min_holder).at_device(queue, 0, { diag_min_event });
}

(I also tried adding event_Matrix to the dependencies of the return line, but it made no difference)

But this slower version doesn't seem to have any issue:

template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto idx_min_holder = array<std::int64_t>::empty(queue, 1, alloc);
    sycl::event diag_min_event = mkl::blas::column_major::iamin(
        queue,
        dim_matrix,
        Matrix,
        dim_matrix + 1,
        idx_min_holder.get_mutable_data(),
        mkl::index_base::zero,
        { event_Matrix }
    );
    const std::int64_t idx_min = ndview<std::int64_t, 1>::wrap(idx_min_holder).at_device(queue, 0, { diag_min_event });
    return ndview<Float, 1>::wrap(Matrix, dim_matrix * dim_matrix).at_device(queue, idx_min * (dim_matrix + 1), { event_Matrix });
}

Any hints on what might be happening?

@david-cortes-intel
Contributor Author

/intelci: run

@david-cortes-intel
Contributor Author

david-cortes-intel commented Nov 15, 2024

Another thing I'm now not very sure about: should calls to oneMKL LAPACK be followed by a manual queue.wait_and_throw() in order to catch runtime exceptions from them?

This is what the oneMKL docs mention:

oneMKL error handling relies on the mechanism of C++ exceptions. Should error occur, it will be propagated at the point of a function call where it is caught using standard C++ error handling mechanism.

... which sounds like it would wait until the exception can be thrown at the call site, but then it leaves me wondering how it handles queueing.

@david-cortes-intel
Contributor Author

After some investigation regarding the diagonal_minimum function, it turns out the issue was that the reduction variable wasn't being initialized to the identity, despite what the docs mention:
https://github.khronos.org/SYCL_Reference/iface/reduction-variables.html#known-identities

Adding a manual initialization did the trick.
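A host-side illustration of the root cause (not the PR's code): a minimum reduction must start from the operation's identity. In SYCL, unless the output memory is pre-filled with the identity or the sycl::property::reduction::initialize_to_identity property is passed, the stale contents of the output buffer participate in the reduction, which would explain the intermittent failures above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// A minimum reduction seeded with the identity of the operation
// (numeric max, playing the role of +infinity) rather than whatever
// happens to be in the output memory. The SYCL version has the same
// requirement: initialize the result holder, or request
// sycl::property::reduction::initialize_to_identity.
double diagonal_minimum_host(const std::vector<double>& M, std::size_t dim) {
    double result = std::numeric_limits<double>::max(); // explicit identity
    for (std::size_t i = 0; i < dim; ++i)
        result = std::min(result, M[i * (dim + 1)]); // stride dim+1 walks the diagonal
    return result;
}
```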

@david-cortes-intel
Contributor Author

/intelci: run

@david-cortes-intel
Contributor Author

/intelci: run

@david-cortes-intel david-cortes-intel marked this pull request as ready for review November 15, 2024 12:10
@david-cortes-intel
Contributor Author

/intelci: run
