Kmeans|| #650
base: master
Conversation
All tests for KmMatrix should pass.
The fact that this PR also contains your other PR makes it harder to review in isolation. In future, it's best to keep them separate, although this is not always practical. In this case we will have to review and merge your build-system PR, then rebase this one, review again, and then hopefully merge.
Also try not to commit formatting changes, as they make review difficult.
@@ -0,0 +1,95 @@
#include <thrust/random.h>
I am not sure if you need all of this code for generating random numbers. Consider using thrust:
https://gist.github.com/ashwin/7245048
Thrust is included by default with the cuda toolkit and will allow you to do a lot of things much more easily in a C++ 11 style with high performance.
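The thrust approach from the linked gist looks roughly like the following sketch (names are illustrative, not from this PR): a counting iterator drives a functor that seeds a per-index engine, so no separate RNG state array is needed.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/random.h>

// Functor producing one uniform float per index. Seeding a fresh engine
// and discarding `idx` values keeps the sequence reproducible per element.
struct uniform_gen {
  __host__ __device__ float operator()(int idx) const {
    thrust::default_random_engine rng;
    thrust::uniform_real_distribution<float> dist(0.0f, 1.0f);
    rng.discard(idx);
    return dist(rng);
  }
};

int main() {
  thrust::device_vector<float> vec(1024);
  thrust::transform(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>(1024),
                    vec.begin(), uniform_gen());
  return 0;
}
```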
It just occurred to me that curand might help gain a little performance and is easy enough to use. But your example is much cleaner. I will try to integrate it and do some benchmarks on a larger dataset. If the gain is trivial, I will just remove the curand dependency.
src/gpu/kmeans/KmMatrix/KmConfig.h
Outdated
#define M_DEV __device__
#define M_DEVINLINE __device__ __forceinline__
#define M_HOSTDEVINLINE __host__ __device__ __forceinline__
This cuda check macro is duplicated here:
h2o4gpu/src/gpu/kmeans/kmeans_general.h
Line 13 in 2f849e2
// TODO(pseudotensor): Avoid throw for python exception handling. Need to avoid all exit's and return exit code all the way back.
And here:
h2o4gpu/src/gpu/utils/utils.cuh
Line 36 in 3eb4715
#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)
I suggest you use the one in utils and delete the one in kmeans. My personal preference is the variant of error checking that throws an exception. This gives you more control over how you want the program to exit. We should use the same method throughout H2O4GPU.
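A minimal sketch of the throwing variant behind that macro (the `throw_on_cuda_error` body here is a plausible reconstruction, not copied from utils.cuh):

```cuda
#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

// Throwing variant of CUDA error checking: a top-level handler decides
// how the program exits instead of an immediate abort inside the macro.
inline cudaError_t throw_on_cuda_error(cudaError_t code,
                                       const char* file, int line) {
  if (code != cudaSuccess) {
    std::ostringstream msg;
    msg << file << "(" << line << "): " << cudaGetErrorString(code);
    throw std::runtime_error(msg.str());
  }
  return code;
}

#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)

// Usage: safe_cuda(cudaMalloc(&ptr, bytes));
```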
I intend to remove all this utils code in KmMatrix later; sorry for not stating that. It's just that I don't want to pollute the rest of the repo at this stage. And the namespace should be consistent with the rest of the project, though I haven't figured out how to do this yet.
src/gpu/kmeans/KmMatrix/KmMatrix.cpp
Outdated
@@ -0,0 +1,311 @@
/*!
I can see you have implemented KmMatrix using the pimpl design pattern. We did this in xgboost for HostDeviceVector only because it had to compile for CPU only as well as with GPU. This seems unnecessarily complicated for this project. I will check with the others if we need to maintain this requirement of compiling for CPU only.
After looking through your code your matrix class seems very similar in function to https://github.com/h2oai/h2o4gpu/blob/master/src/gpu/data/matrix.cuh
Please explain why you can't use this.
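For reference, the pimpl pattern under discussion looks roughly like this sketch (class and member names are illustrative, not the actual KmMatrix API): the public class holds only an opaque pointer, so the backend can live in a separately compiled .cu file.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Public class: no implementation details leak into its header.
class Matrix {
 public:
  Matrix(std::size_t rows, std::size_t cols);
  ~Matrix();                      // must be defined where Impl is complete
  std::size_t size() const;
 private:
  struct Impl;                    // forward declaration only
  std::unique_ptr<Impl> impl_;
};

// In a real pimpl this definition lives in the .cpp/.cu file.
struct Matrix::Impl {
  std::vector<double> data;
};

Matrix::Matrix(std::size_t rows, std::size_t cols)
    : impl_(new Impl{std::vector<double>(rows * cols)}) {}
Matrix::~Matrix() = default;
std::size_t Matrix::size() const { return impl_->data.size(); }
```

The cost is one pointer indirection per call and the boilerplate of forwarding every method, which is the "unnecessarily complicated" part when CPU-only compilation is not required.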
For starters, the matrix in data doesn't deal with the CPU, and expanding it to support the CPU would amount to a rewrite. But now that dropping CPU support completely is possible, this seems feasible.
Device context is an explicit parameter for every function, but not all functions use it. That's fine if we just want a matrix that works for the current implementation. But if we do want a generic data structure that encapsulates the complexity, the matrix in data might also need a big change.
Of course, given all the time I spent writing KmMatrix and the fact that there are only a few hundred lines of code in that matrix, the above can be done. The real difficulty is that there are no C++ unit tests; I would need to rely on the Python tests, which are more like integration tests. I tried to run the complete Python test suite from the master branch on my machine and only a fraction of the tests passed. After that I dropped the idea of relying on existing tests. This is also part of why I re-implemented those utils.
I hope we can discuss this at more length.
I don't actually mind which direction you come at it from but in the end these two matrix representations must be unified. I think we still want to support matrices with CPU memory, but it is no longer a requirement that the code must compile without cuda.
Device context is an explicit parameter for every function, but not all functions use it.
You could possibly remove this and rely on your Singleton class to get things like the cublas handle inside these functions.
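A minimal sketch of that idea (the class name echoes the PR's GpuInfo, but the body here is hypothetical): a Meyers singleton lets any function fetch shared resources such as a cublas handle on demand, instead of threading a device-context parameter through every signature.

```cpp
#include <cstdint>

// Hypothetical singleton holding per-process GPU resources.
class GpuInfo {
 public:
  static GpuInfo& instance() {      // constructed on first use, thread-safe
    static GpuInfo info;            // since C++11
    return info;
  }
  std::uintptr_t handle() const { return handle_; }
 private:
  GpuInfo() : handle_(0xC0FFEE) {}  // stand-in for e.g. cublasCreate(...)
  GpuInfo(const GpuInfo&) = delete;
  GpuInfo& operator=(const GpuInfo&) = delete;
  std::uintptr_t handle_;
};
```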
The real difficulty is that there are no C++ unit tests; I would need to rely on the Python tests, which are more like integration tests. I tried to run the complete Python test suite from the master branch on my machine and only a fraction of the tests passed. After that I dropped the idea of relying on existing tests.
This is worrying. Perhaps it's a difference in your environment. We will see when we sort out continuous integration.
in the end these two matrix representations must be unified
That's the plan, a unified matrix.
src/gpu/kmeans/KmMatrix/utils.cuh
Outdated
namespace H2O4GPU {
namespace KMeans {
These thread index helpers are not specific to kmeans and so should go in a common utilities file shared by all algorithms.
src/gpu/kmeans/KmMatrix/utils.cuh
Outdated
// Work around for shared memory
// https://stackoverflow.com/questions/20497209/getting-cuda-error-declaration-is-incompatible-with-previous-variable-name
This should be named KmSharedMem instead of KmShardMem as the word "shard" can have a different meaning in the context of GPU programming. Although we do not seem to have an explicit code style for this project, Google style recommends not using abbreviated words in variable names. So this would then be: KmSharedMemory.
As above I also don't think this code is specific to kmeans so should go in a shared utilities file.
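The workaround from the linked Stack Overflow answer can be sketched as follows, using the suggested KmSharedMemory name (the surrounding kernel is illustrative): backing the dynamic shared memory with a single `char` array sidesteps the "declaration is incompatible" error that arises when multiple template instantiations declare same-named `extern __shared__` arrays of different types.

```cuda
#include <cuda_runtime.h>

// Wrapper for templated dynamic shared memory. Every instantiation
// declares the same extern char array, then reinterprets it as T.
template <typename T>
struct KmSharedMemory {
  __device__ T* ptr() {
    extern __shared__ char smem[];
    return reinterpret_cast<T*>(smem);
  }
};

// Illustrative kernel: stage values through shared memory.
template <typename T>
__global__ void scale(T* out, T factor, int n) {
  KmSharedMemory<T> shared;
  T* buf = shared.ptr();
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  buf[threadIdx.x] = out[idx] * factor;
  out[idx] = buf[threadIdx.x];
}
```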
Will be changed.
src/gpu/kmeans/KmMatrix/GpuInfo.cuh
Outdated
@@ -0,0 +1,76 @@
/*!
This GPU info class is not specific to kmeans. Check how the other algorithms are getting this information then consider moving into a common location and making the other algorithms also use this class. Ideally we want things done consistently across the code base. This makes it much easier to maintain.
src/gpu/kmeans/kmeans_h2o4gpu.cu
Outdated
@@ -1,37 +1,37 @@
/*!
This file seems to be all formatting changes. I can't tell whether you have actually changed anything. It is better to commit only actual changes and deal with formatting issues in a separate commit or PR.
I will take another look at this if you revert the formatting changes.
There is no change in it; I kept my code isolated from the rest of the project. And my hope is to remove this file. Sorry for the mess.
I will at least get the CMake part of the build system merged. I will work on it as soon as I can see the Jenkins report.
The build is performed by CMake. Add the missing <numeric> header and use cblas/cblas.h instead of cblas.h.
* Add test target to CMakeLists.txt. * Fix include paths in .cu files in tests/cpp/gpu/kmeans. * Make SWIG an optional dependency.
Compare 61be357 to 2822872
Using Eigen could make KMeans work. But when it comes to expanding the clusters, the memory copy is expensive. Also, it's hard to launch separate kernels without memory copies with Eigen. So, proceed with a home-brewed matrix.
* Split proxy implementation into cpp file to save some compile time. * Fix some memory error. * Add two tests.
* Checking code is factored into KmConfig. * GpuInfo now takes care of cublas_handle. * MatrixDim for later use.
Builds basic kmeans|| framework on top of KmMatrix. The algorithm is not working yet.
* Fix copy_if prob_ptr index. * Add document for the algorithm object interface. * Move Generator into a member variable and complete its implementation. * Removes dead code.
* Create Generatorbase class for interface. * Move generator related code into KmMatrix. * Move shared memory related code into KmMatrix. * Create mock object based on Generatorbase for later test.
* Re-clustering is now a policy class, for easier testing and flexibility.
* During the extraction, blas operations now convert int to float.
Revert b9d71f for kmeans_h2o4gpu.cu.
* Put KmMatrix under gpu/matrix * Use shared utils for both kmeans and KmMatrix. * Rename the namespaces.
@RAMitchell Ready for another review. \O/ This time I rebased on a build-system branch that contains only CMake file changes, hopefully better than the last time. The example about using thrust to generate random numbers is not implemented yet. The PR is getting way too heavy now.
Looks very good in general; there is a clear structure to your code and very good commenting. See comments for a few issues. I am not going to build/test myself as I don't have time, and continuous integration should ideally be enforcing that this works on the required platforms.
@@ -0,0 +1,54 @@
/*!
This seems like a strange file name.
That is the original utils.h from h2o4gpu with some functions moved from GPU code.
As for the *.h.in, it's for CMake configuration, specifically the USE_CUDA flag. I know compiling without CUDA is not supported, but I tried to retain the possibility anyway. If that's a burden, I can remove it in no time.
__global__ void self_row_argmin_sequential(kParam<int> _res, kParam<T> _val) {
  size_t idx = global_thread_idx();
  if (idx < _val.rows) {
Consider doing this: if (idx >= _val.rows) return;
I find it a bit cleaner but it's a matter of personal preference.
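The guard-clause form looks like this in a generic kernel (the body here is illustrative, not the PR's code): out-of-range threads bail out first, so the main logic stays unindented.

```cuda
#include <cuda_runtime.h>

// Guard clause: threads past the end return immediately.
template <typename T>
__global__ void per_row_op(int* res, const T* val, size_t rows) {
  size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= rows) return;   // early exit instead of wrapping the body
  // ... per-row work for row `idx` here ...
  res[idx] = static_cast<int>(idx);
}
```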
// ArgMin operation that excludes 0. Used when dealing with distance_pairs
// in recluster where distance between points with itself is calculated,
// hence the name.
// FIXME: Maybe generalize it to selection algorithm.
These two kernels look very similar. Consider generalising them. Less code is better code. Also have a look at this: https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html
Cub is another really great cuda library.
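A sketch of how the row-wise argmin could be expressed with cub's segmented reduce, each row being one segment (function and buffer names are illustrative; note that cub reports the winning index relative to the segment start, i.e. the column):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Row-wise argmin via cub::DeviceSegmentedReduce::ArgMin.
// `d_offsets` has rows + 1 entries: row i spans [offsets[i], offsets[i+1]).
void row_argmin(const float* d_vals, int rows, const int* d_offsets,
                cub::KeyValuePair<int, float>* d_out) {
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  // First call with a null temp pointer only computes temp_bytes.
  cub::DeviceSegmentedReduce::ArgMin(d_temp, temp_bytes, d_vals, d_out,
                                     rows, d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::ArgMin(d_temp, temp_bytes, d_vals, d_out,
                                     rows, d_offsets, d_offsets + 1);
  cudaFree(d_temp);
}
```

The zero-excluding variant would still need either pre-masking of the diagonal or a custom reduction operator, so this replaces only the generic kernels.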
I will look into it. Thanks for the suggestion. Indeed, those extra kernels are quite annoying.
void *d_temp_storage = NULL;

// determine the temp_storage_bytes
safe_cuda(cub::DeviceHistogram::HistogramEven(
Nice to see you using cub for this.
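For readers unfamiliar with cub, the snippet above follows its standard two-phase call pattern, sketched in full here (wrapper name and parameters are illustrative): the first call with a null temp pointer reports the scratch bytes needed, and the second call does the actual work.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Even-binned histogram over num_samples floats in [lower, upper).
// num_bins bins require num_bins + 1 level boundaries.
void histogram(const float* d_samples, int num_samples,
               int* d_histogram, int num_bins,
               float lower, float upper) {
  void* d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // Phase 1: query required temporary storage size.
  cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                      d_samples, d_histogram,
                                      num_bins + 1, lower, upper,
                                      num_samples);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Phase 2: compute the histogram.
  cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                      d_samples, d_histogram,
                                      num_bins + 1, lower, upper,
                                      num_samples);
  cudaFree(d_temp_storage);
}
```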
// Sort the indices by weights in ascending order, then use those at front
// as result.
thrust::sort_by_key(thrust::device,
Be aware that this function makes a very large temporary memory allocation. Sometimes it can be better to call cub directly because of this. Another way of dealing with this is to create a custom allocator for thrust algorithms that reuses the same memory buffer.
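The custom-allocator route can be sketched like this (a minimal version of thrust's custom temporary-allocation example; error handling omitted): the allocator keeps one scratch buffer alive across calls, and thrust uses it via the execution policy.

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>

// Caching allocator: grows its buffer on demand and reuses it, so
// repeated thrust calls avoid a cudaMalloc/cudaFree pair each time.
struct CachedAllocator {
  typedef char value_type;
  char* buffer = nullptr;
  size_t bytes = 0;
  char* allocate(std::ptrdiff_t n) {
    if (static_cast<size_t>(n) > bytes) {     // grow only when needed
      if (buffer) cudaFree(buffer);
      cudaMalloc(reinterpret_cast<void**>(&buffer), n);
      bytes = static_cast<size_t>(n);
    }
    return buffer;
  }
  void deallocate(char*, size_t) {}           // keep the buffer for reuse
  ~CachedAllocator() { if (buffer) cudaFree(buffer); }
};

// Usage:
//   CachedAllocator alloc;
//   thrust::sort_by_key(thrust::cuda::par(alloc), keys, keys + n, vals);
```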
I have been thinking about a custom memory-management facility, but that might not be necessary for this function alone, since it sorts the sampled centroids as one of the final steps. The sampled centroids should be far fewer than the data points. If in practice this assumption doesn't hold, then the kmeans|| algorithm might be a bust.
I will look into the cub solution.
return code;
}

extern cudaStream_t cuda_stream[MAX_NGPUS];
I don't like this global variable. Consider making this a member of the GPUInfo singleton or removing if it's no longer used.
That's not part of kmeans||. If I change that, it might be in another PR that deals with kmeans in general. Is that acceptable to you?
src/gpu/matrix/KmMatrix/Arith.cu
Outdated
 * @param _val The input matrix with shape m x n
 */
template <typename T>
__global__ void row_min_sequential(kParam<T> _res, kParam<T> _val) {
These two kernels seem to be duplicated in src/gpu/kmeans/kmeans_init.cu. Did you move some code around and forget to remove it?
Not exactly. These two deal with min and argmin; they were my original attempt. But then the zero-distance bug surfaced, so I created two similar kernels specifically dealing with 0 values in kmeans_init.cu. The former pair is not used in kmeans; the latter pair is not generic enough to be put in matrix. I will look into the cub function you suggested; hopefully we can remove the latter two.
}

template <typename T>
struct UniformGenerator : public GeneratorBase<T> {
A few things:
- I think this should be named UniformRandomGenerator. Generator by itself does not imply that it generates random numbers.
- We should definitely remove the curand dependency if possible. Extra dependencies heavily impact maintainability.
- I think src/gpu/matrix/KmMatrix/Generator.hpp is unnecessary now that we have established we don't need to compile without cuda. Perhaps just put its contents in this file?
I think this should be named UniformRandomGenerator. Generator by itself does not imply that it generates random numbers.
No problem, it will change.
We should definitely remove the curand dependency if possible. Extra dependencies heavily impact maintainability.
Makes sense, it will change.
I think src/gpu/matrix/KmMatrix/Generator.hpp is unnecessary now that we have established we don't need to compile without cuda. Perhaps just put its contents in this file?
If we need to run it on the CPU, the header is still useful: we could create a CPU generator and put this file in src/common. That's something for the future; is that acceptable to you?
template <typename T>
std::ostream& operator<<(std::ostream& os, KmMatrix<T>& m);

template <typename T>
What is this proxy class used for?
Originally, I used it to represent an extracted row. Indeed, it now seems redundant since copying is inevitable. You are right, I should remove it later, but preferably not in this PR.
@@ -0,0 +1,358 @@
/*!
Again I don't think we need to have hpp/cuh and cpp/cu files. We just need the cuda files.
This cpp file is for KmMatrix, not any specific KmMatrixImpl. We bridge KmMatrix to KmMatrixImpl (CudaKmMatrixImpl specifically) in this file. It's a generic frontend.
Until the continuous-integration problem gets resolved, we can switch to doing some work on xgboost after you have dealt with the review issues.
@trivialfis @RAMitchell Just had a quick glance; the changes look really good! Will do a proper review over the weekend. @RAMitchell, by CI issues do you mean @trivialfis not being able to see the logs? Would a log emailed when the build fails be sufficient? I remember adding that a long time ago, but maybe someone changed it and I need to add it back in.
@mdymczyk Thanks for spending extra time on this. I haven't received any mail from the CI yet.
This is still WIP, for early review.
TODOs:
The above, plus tuning the algorithm described in #642.
close #642