#######################################
# Thrust v1.6.0 #
#######################################
Summary
Thrust v1.6.0 provides an interface for customization and extension and a new
backend system based on the Threading Building Blocks library. With this
new interface, programmers may customize the behavior of specific algorithms
as well as control the allocation of temporary storage or invent entirely new
backends. These enhancements also allow multiple different backend systems
such as CUDA and OpenMP to coexist within a single program. Support for TBB
allows Thrust programs to integrate more naturally into applications which
may already employ the TBB task scheduler.
Breaking API Changes
The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
thrust::host_space_tag has been renamed thrust::host_system_tag
thrust::device_space_tag has been renamed thrust::device_system_tag
thrust::any_space_tag has been renamed thrust::any_system_tag
thrust::iterator_space has been renamed thrust::iterator_system
New Features
Backend Systems
Threading Building Blocks (TBB) is now supported
Functions
for_each_n
raw_reference_cast
Types
pointer
reference
New Examples
cuda/custom_temporary_allocation
cuda/fallback_allocator
device_ptr
expand
minimal_custom_backend
raw_reference_cast
set_operations
Other Enhancements
thrust::for_each now returns the end of the input range similar to most other algorithms
thrust::pair and thrust::tuple have swap functionality
all CUDA algorithms now support large data types
iterators may be dereferenced in user __device__ or __global__ functions
the safe use of different backend systems is now possible within a single binary
Bug Fixes
#469 min_element and max_element algorithms no longer require a const comparison operator
Known Issues
cudafe++.exe may crash when parsing TBB headers on Windows.
#######################################
# Thrust v1.5.2 #
#######################################
Summary
Small bug fixes
Bug Fixes
Fixed warning about C-style initialization of structures
#######################################
# Thrust v1.5.1 #
#######################################
Summary
Small bug fixes
Bug Fixes
Sorting data referenced by permutation_iterators on CUDA no longer produces invalid results
#######################################
# Thrust v1.5.0 #
#######################################
Summary
Thrust v1.5.0 introduces new programmer productivity and performance
enhancements. New functionality for creating anonymous "lambda" functions has
been added. A faster host sort provides 2-10x faster performance for sorting
arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides
2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting
arithmetic types with the OpenMP backend the combined performance improvement
is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x
(8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster
performance.
Breaking API Changes
device_ptr<void> no longer unsafely converts to device_ptr<T> without an
explicit cast. Use the expression
device_pointer_cast(static_cast<int*>(void_ptr.get()))
to convert, for example, device_ptr<void> to device_ptr<int>.
New Features
Functions
stencil-less transform_if
Types
lambda placeholders
New Examples
lambda
Other Enhancements
host sort is 2-10x faster for arithmetic types
OMP sort provides speedup over host sort
reduce_by_key is 2-3x faster
reduce_by_key no longer requires O(N) temporary storage
CUDA scan algorithms are 10-40% faster
host_vector and device_vector are now documented
out-of-memory exceptions now provide detailed information from CUDART
improved histogram example
device_reference now has a specialized swap
reduce_by_key and scan algorithms are compatible with discard_iterator
Removed Functionality
Bug Fixes
#44 allow host_vector to compile when value_type uses __align__
#198 allow adjacent_difference to permit safe in-situ operation
#303 make thrust thread-safe
#313 avoid race conditions in device_vector::insert
#314 avoid unintended ADL invocation when dispatching copy
#365 fix merge and set operation failures
Known Issues
None
Acknowledgments
Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
Thanks to Jean-Francois Bastien for suggesting a fix for issue 303.
#######################################
# Thrust v1.4.0 #
#######################################
Summary
Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature
and performance improvements. New set theoretic algorithms operating on
sorted sequences have been added. Additionally, a new fancy iterator
allows discarding redundant or otherwise unnecessary output from
algorithms, conserving memory storage and bandwidth.
Breaking API Changes
Eliminations
thrust/is_sorted.h
thrust/utility.h
thrust/set_intersection.h
thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
thrust::deprecated::copy_when
thrust::deprecated::absolute_value
New Features
Functions
copy_n
merge
set_difference
set_symmetric_difference
set_union
Types
discard_iterator
Device support
Compute Capability 2.1 GPUs
New Examples
run_length_decoding
Other Enhancements
Compilation warnings are substantially reduced in various contexts.
The compilation time of thrust::sort, thrust::stable_sort, thrust::sort_by_key,
and thrust::stable_sort_by_key is substantially reduced.
A fast sort implementation is used when sorting primitive types with thrust::greater.
The performance of thrust::set_intersection is improved.
The performance of thrust::fill is improved on SM 1.x devices.
A code example is now provided in each algorithm's documentation.
thrust::reverse now operates in-place
Removed Functionality
thrust::deprecated::copy_when
thrust::deprecated::absolute_value
thrust::experimental::cuda::ogl_interop_allocator
thrust::gather and thrust::scatter from host to device and vice versa are no longer supported.
Operations which modify the elements of a thrust::device_vector are no longer
available from source code compiled without nvcc when the device backend is CUDA.
Instead, use the idiom from the cpp_interop example.
Bug Fixes
#212 set_intersection works correctly for large input sizes.
#275 counting_iterator and constant_iterator work correctly with OpenMP as the
backend when compiling with optimization
#256 min and max correctly return their first argument as a tie-breaker
#248 NDEBUG is interpreted correctly
Known Issues
nvcc may generate code containing warnings when compiling some Thrust algorithms.
When compiling with -arch=sm_1x, some Thrust algorithms may cause nvcc to issue
benign pointer advisories.
When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly.
thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key,
and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.
Acknowledgments
Thanks to David Tarjan for improving the performance of set_intersection.
Thanks to Duane Merrill for continued help with sort.
Thanks to Nathan Whitehead for help with CUDA Toolkit integration.
#######################################
# Thrust v1.3.0 #
#######################################
Summary
Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature
and performance enhancements.
Performance of the sort and sort_by_key algorithms is improved by as much
as 3x in certain situations. The performance of stream compaction algorithms,
such as copy_if, is improved by as much as 2x. Reduction performance is
also improved, particularly for small input sizes.
CUDA errors are now converted to runtime exceptions using the system_error
interface. Combined with a debug mode, also new in v1.3, runtime errors
can be located with greater precision.
Lastly, a few header files have been consolidated or renamed for clarity.
See the deprecations section below for additional details.
Breaking API Changes
Promotions
thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
thrust::next::gather has been renamed thrust::gather
thrust::next::gather_if has been renamed thrust::gather_if
thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
Deprecations
thrust::copy_when has been renamed thrust::deprecated::copy_when
thrust::absolute_value has been renamed thrust::deprecated::absolute_value
The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
The header thrust/utility.h is now deprecated; use thrust/swap.h instead
The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
Eliminations
thrust::deprecated::gather
thrust::deprecated::gather_if
thrust/experimental/arch.h and the functions therein
thrust/sorting/merge_sort.h
thrust/sorting/radix_sort.h
New Features
Functions
exclusive_scan_by_key
find
find_if
find_if_not
inclusive_scan_by_key
is_partitioned
is_sorted_until
mismatch
partition_point
reverse
reverse_copy
stable_partition_copy
Types
system_error and related types
experimental::cuda::ogl_interop_allocator
bit_and, bit_or, and bit_xor
Device support
gf104-based GPUs
New Examples
opengl_interop.cu
repeated_range.cu
simple_moving_average.cu
sparse_vector.cu
strided_range.cu
Other Enhancements
Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types
Performance of thrust::copy_if is substantially improved
Performance of thrust::reduce and related reductions is improved
THRUST_DEBUG mode added
Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error
The number of compiler warnings generated by Thrust has been substantially reduced
Comparison sort now works correctly for input sizes > 32M
min & max usage no longer collides with <windows.h> definitions
Compiling against the OpenMP backend no longer requires nvcc
Performance of device_vector initialized in .cpp files is substantially improved in common cases
Performance of thrust::sort_by_key on the host is substantially improved
Removed Functionality
nvcc 2.3 is no longer supported
Bug Fixes
Debug device code now compiles correctly
thrust::uninitialized_copy and thrust::uninitialized_fill now dispatch constructors on the device rather than the host
Known Issues
#212 set_intersection is known to fail for large input sizes
partition_point is known to fail for 64-bit types with nvcc 3.2
Acknowledgments
Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation
Thanks to Erich Elsen for contributing an implementation of find_if
Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc
Thanks to Andrew Corrigan, Cliff Woolley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
Thanks to Cliff Woolley for help with testing
#######################################
# Thrust v1.2.1 #
#######################################
Summary
Small fixes for compatibility with CUDA 3.1
Known Issues
inclusive_scan & exclusive_scan may fail with very large types
the Microsoft compiler may fail to compile code using both sort and binary search algorithms
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
default_random_engine::discard is not accelerated with nvcc 2.3
nvcc 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 & thrust::ranlux48.
#######################################
# Thrust v1.2.0 #
#######################################
Summary
Thrust v1.2 introduces support for compilation to multicore CPUs
and the Ocelot virtual machine, and several new facilities for
pseudo-random number generation. New algorithms such as set
intersection and segmented reduction have also been added. Lastly,
improvements to the robustness of the CUDA backend ensure
correctness across a broad set of (uncommon) use cases.
Breaking API Changes
thrust::gather's interface was incorrect and has been removed.
The old interface is deprecated but will be preserved for Thrust
version 1.2 at thrust::deprecated::gather &
thrust::deprecated::gather_if. The new interface is provided at
thrust::next::gather & thrust::next::gather_if. The new interface
will be promoted to thrust:: in Thrust version 1.3. For more details,
please refer to this thread:
http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd
The thrust::sorting namespace has been deprecated in favor of the
top-level sorting functions, such as thrust::sort() and
thrust::sort_by_key().
New Features
Functions
reduce_by_key
set_intersection
tie
unique_copy
unique_by_key
unique_copy_by_key
Types
Random Number Generation
discard_block_engine
default_random_engine
linear_congruential_engine
linear_feedback_shift_engine
minstd_rand
minstd_rand0
normal_distribution (experimental)
ranlux24
ranlux48
ranlux24_base
ranlux48_base
subtract_with_carry_engine
taus88
uniform_int_distribution
uniform_real_distribution
xor_combine_engine
Functionals
project1st
project2nd
Fancy Iterators
permutation_iterator
reverse_iterator
Device support
Add support for multicore CPUs via OpenMP
Add support for Fermi-class GPUs
Add support for Ocelot virtual machine
New Examples
cpp_integration
histogram
mode
monte_carlo
monte_carlo_disjoint_sequences
padded_grid_reduction
permutation_iterator
row_sum
run_length_encoding
segmented_scan
stream_compaction
summary_statistics
transform_iterator
word_count
Other Enhancements
vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit
integer sorting performance is improved when max is large but (max - min) is small and when min is negative
performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types
support for nvcc 3.0
Removed Functionality
removed support for equal between host & device sequences
removed support for gather() and scatter() between host & device sequences
Bug Fixes
# 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
# 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
# 46 gather & scatter handle any space iterators correctly
# 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
# 52 avoid collisions with common user macros such as BLOCK_SIZE
# 62 provide better documentation for device_reference
# 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
# 102 eliminated a race condition in device_vector::erase
various compilation warnings eliminated
Known Issues
inclusive_scan & exclusive_scan may fail with very large types
the Microsoft compiler may fail to compile code using both sort and binary search algorithms
uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
# 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
default_random_engine::discard is not accelerated with nvcc 2.3
Acknowledgments
Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
Thanks to Tom Bradley for contributing an implementation of normal_distribution
Thanks to Joseph Rhoads for contributing the example summary_statistics
#######################################
# Thrust v1.1.1 #
#######################################
Summary
Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard.
#######################################
# Thrust v1.1.0 #
#######################################
Summary
Thrust v1.1 introduces fancy iterators, binary search functions, and
several specialized reduction functions. Experimental support for
segmented scan has also been added.
Breaking API Changes
counting_iterator has been moved into the thrust namespace (previously thrust::experimental)
New Features
Functions
copy_if
lower_bound
upper_bound
vectorized lower_bound
vectorized upper_bound
equal_range
binary_search
vectorized binary_search
all_of
any_of
none_of
minmax_element
advance
inclusive_segmented_scan (experimental)
exclusive_segmented_scan (experimental)
Types
pair
tuple
device_malloc_allocator
Fancy Iterators
constant_iterator
counting_iterator
transform_iterator
zip_iterator
New Examples
computing the maximum absolute difference between vectors
computing the bounding box of a two-dimensional point set
sorting multiple arrays together (lexicographical sorting)
constructing a summed area table
using zip_iterator to mimic an array of structs
using constant_iterator to increment array values
Other Enhancements
added pinned memory allocator (experimental)
added more methods to host_vector & device_vector (issue #4)
added variant of remove_if with a stencil argument (issue #29)
scan and reduce use cudaFuncGetAttributes to determine grid size
exceptions are reported when temporary device arrays cannot be allocated
Bug Fixes
#5 make vector work for larger data types
#9 stable_partition_copy doesn't respect OutputIterator concept semantics
#10 scans should return OutputIterator
#16 make algorithms work for larger data types
#27 dispatch radix_sort even when comp=less<T> is explicitly provided
Known Issues
Using functors with Thrust entry points may not compile on Mac OSX with gcc-4.0.1
uninitialized_copy & uninitialized_fill dispatch constructors on the host rather than the device.
inclusive_scan, inclusive_scan_by_key, exclusive_scan, and exclusive_scan_by_key may fail when used with large types with the CUDA 3.1 driver
#######################################
# Thrust v1.0.0 #
#######################################
Breaking API Changes
Rename top level namespace komrade to thrust.
Move partition_copy() & stable_partition_copy() into thrust::experimental namespace until we can easily provide the standard interface.
Rename range() to sequence() to avoid collision with Boost.Range.
Rename copy_if() to copy_when() due to semantic differences with C++0x copy_if().
New Features
Add C++0x style cbegin() & cend() methods to host_vector & device_vector.
Add transform_if function.
Add stencil versions of replace_if() & replace_copy_if().
Allow counting_iterator to work with for_each().
Allow types with constructors in comparison sort & reduce.
Other Enhancements
merge_sort and stable_merge_sort are now 2 to 5x faster when executed on the parallel device.
Bug Fixes
Work around an issue where an incremented iterator causes nvcc to crash. (Komrade issue #6)
Fix an issue where const_iterators could not be passed to transform. (Komrade issue #7)