Implement parallel ItemBlock processing via backup_controller goroutines #8659

Open · sseago wants to merge 1 commit into main from parallel-itemblocks

Conversation

@sseago (Collaborator) commented Jan 29, 2025

Thank you for contributing to Velero!

Please add a summary of your change

Parallel ItemBlock processing via backup_controller goroutines as described in phase 2 of https://github.com/vmware-tanzu/velero/blob/main/design/backup-performance-improvements.md

Does your change fix a particular issue?

Fixes #8334


@sseago sseago marked this pull request as draft January 29, 2025 21:22
@sseago (Collaborator, Author) commented Jan 29, 2025

Currently in draft, as I have not yet tested the changes in a cluster env.

codecov bot commented Jan 29, 2025

Codecov Report

Attention: Patch coverage is 91.78082% with 12 lines in your changes missing coverage. Please review.

Project coverage is 59.48%. Comparing base (f56698e) to head (91d7301).

Files with missing lines              Patch %   Lines
pkg/backup/backup.go                  87.17%    9 Missing and 1 partial ⚠️
pkg/backup/item_collector.go          66.66%    1 Missing ⚠️
pkg/controller/backup_controller.go   50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8659      +/-   ##
==========================================
+ Coverage   59.39%   59.48%   +0.08%     
==========================================
  Files         370      371       +1     
  Lines       39989    40107     +118     
==========================================
+ Hits        23752    23857     +105     
- Misses      14745    14756      +11     
- Partials     1492     1494       +2     


@shawn-hurley (Contributor) left a comment

I think I have just one concern; otherwise nothing stands out!

pkg/backup/backup.go (review thread resolved)
@kaovilai (Member) left a comment

Some typos; other than that, no objections at the moment.

Note: I am not as familiar with select-case channel statements.

pkg/backup/backup.go (review thread resolved, outdated)
pkg/backup/backup.go (review thread resolved)
pkg/backup/backup.go (review thread resolved, outdated)
@sseago sseago force-pushed the parallel-itemblocks branch from 5cee460 to d08bb75 on February 5, 2025 00:36
@sseago (Collaborator, Author) commented Feb 5, 2025

Updated in response to PR comments. Will test in cluster tomorrow.

@sseago sseago force-pushed the parallel-itemblocks branch 2 times, most recently from dad8770 to 097941e on February 6, 2025 00:29
@sseago sseago marked this pull request as ready for review February 6, 2025 00:30
@sseago (Collaborator, Author) commented Feb 6, 2025

I've tested this in an AWS cluster: two namespaces, 4 pods, 2 volumes, and additional related resources. Using vmware-tanzu/velero-plugin-example#75 to force each item backup to take 5 seconds allowed me to verify parallel processing. Going from 1 configured worker up to 16, each increase in worker count dropped the total backup time, and I was able to verify from the logs that the 5-second BIA actions were running in parallel in the expected numbers.

A summary of backup times/results is below. Note that the differing item counts across the backups are mainly due to the inclusion of events; if those had been excluded, the time drop from one backup to the next would be more significant. Also note that with a regular real-world backup of this size, the time differences would be much smaller, since the per-item delay here is artificial.

The real-world use case where the benefit of parallel backup will be most pronounced is a backup with a large number of small PVCs using CSI snapshots or the data mover.

mysql-persistent in two namespaces. 5 second BIA delay per item

1 worker: mysql-1worker-01
Started:    2025-02-05 15:08:57 +0000 UTC
Completed:  2025-02-05 15:15:58 +0000 UTC
Elapsed: 07:01
Items backed up:              77

2 workers: mmysql-2worker-01
Started:    2025-02-05 16:04:21 +0000 UTC
Completed:  2025-02-05 16:08:32 +0000 UTC
Elapsed:  04:01
Items backed up:              83

4 workers: mmysql-4worker-01
Started:    2025-02-05 16:33:26 +0000 UTC
Completed:  2025-02-05 16:36:01 +0000 UTC
Elapsed: 02:35
Items backed up:              89

8 workers: mmysql-8worker-01
Started:    2025-02-05 16:48:15 +0000 UTC
Completed:  2025-02-05 16:50:01 +0000 UTC
Elapsed: 01:46
Total items to be backed up:  96

16 workers: mmysql-16worker-01
Started:    2025-02-05 16:58:00 +0000 UTC
Completed:  2025-02-05 16:59:37 +0000 UTC
Elapsed:  01:37
Items backed up:              102
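
As a rough back-of-envelope check on the scaling (assuming every backed-up item incurs the full 5-second delay and ignoring other per-item work): 77 items at 5 s each gives a serial lower bound of about 385 s, or roughly 6.4 minutes, which lines up with the 07:01 single-worker run; with 4 workers the bound is roughly 89 / 4 × 5 s ≈ 112 s, or about 1.9 minutes, against the observed 02:35; and with 16 workers it is roughly 102 / 16 × 5 s ≈ 32 s, where the remaining gap to the observed 01:37 reflects item collection and other work that the artificial delay does not model.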

@sseago sseago force-pushed the parallel-itemblocks branch 3 times, most recently from e1ab7c0 to fd2ee86 on February 7, 2025 05:54
@sseago (Collaborator, Author) commented Feb 7, 2025

rebased after #8664 merged

@@ -391,7 +393,7 @@ func (kb *kubernetesBackupper) BackupWithResolvers(

 	// the main backup process will send on this channel once
 	// for every item it processes.
-	update := make(chan progressUpdate)
+	update := make(chan progressUpdate, 10)
@Lyndon-Li (Contributor) commented Feb 7, 2025:

Once we set the limit on inputChannel, is it still necessary to have a limit here?
If not, let's not set it, so as to reduce the effort of future enhancement, performance tuning, and troubleshooting.

@sseago (Collaborator, Author) replied:

This isn't a limit -- this is the buffer size. Without it, the channel is unbuffered. It's not something that we expect to have any need to tweak via performance tuning, though. Having a buffer allows the goroutine that sends messages to the channel to return without blocking if the receiver doesn't immediately pick it up. In general, we want a non-zero buffer size, but it doesn't need to be large. Whether it's 5 or 10 or something else is probably not important, but we do want a buffer. The input channel is a completely separate channel with its own buffer.
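
To illustrate the distinction (a minimal generic Go sketch, not code from this PR): a send on an unbuffered channel blocks until a receiver is ready, while sends on a buffered channel return immediately as long as the buffer has free slots.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered: the send blocks until a receiver takes the value.
	unbuffered := make(chan int)
	go func() {
		time.Sleep(time.Second) // simulate a receiver that is busy for a while
		<-unbuffered
	}()
	start := time.Now()
	unbuffered <- 1 // blocks for roughly one second, until the receive happens
	fmt.Println("unbuffered send took", time.Since(start))

	// Buffered: up to 10 sends complete without any receiver at all,
	// so the sending goroutine is decoupled from the receiver's pace.
	buffered := make(chan int, 10)
	start = time.Now()
	for i := 0; i < 10; i++ {
		buffered <- i
	}
	fmt.Println("ten buffered sends took", time.Since(start))
}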


	}

	if len(response.itemBlock.Items) > 0 {
		log.WithFields(map[string]interface{}{
A Contributor commented:

Do we need to output the progress as an Info log? It will flood the log for a large backup.

@sseago (Collaborator, Author) replied:

This is the way progress reporting has always worked; the current PR just moves it to a different location. It doesn't currently flood the logs, since there are usually several other log lines for each progress log. I think we can change the log frequency in a separate PR, since this PR shouldn't change the log frequency from Velero 1.15 or earlier.

"name": items[i].name,
}).Infof("Backed up %d items out of an estimated total of %d (estimate will change throughout the backup)", backedUpItems, totalItems)
// Wait for all the ItemBlocks to be processed
select {
@Lyndon-Li (Contributor) commented Feb 7, 2025:

Looks like we don't have a way to break the itemblock processing halfway, so code branches like case <-responseCtx.Done() will never be reached, right?

So is this responseCtx for future enhancement, e.g., backup cancel?

@sseago (Collaborator, Author) replied:

I don't think this particular Done() will be hit -- although when we do implement backup cancel, we may end up using it. @shawn-hurley This is similar to the analyzer engine code we had looked at which did have this ctx.Done() there. Do you think it makes sense to leave it, or is it better to remove?
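
For reference, the shape of the pattern being discussed (a generic sketch with illustrative names, not the PR's actual types; assumes the standard context package is imported): the receive loop selects on both the response channel and ctx.Done(), so the Done branch only matters if the context can actually be canceled mid-run.

// collectResponses drains up to expected results, returning early if ctx is canceled.
// Generic sketch; the names are illustrative and not taken from the PR.
func collectResponses(ctx context.Context, responses <-chan string, expected int) []string {
	out := make([]string, 0, expected)
	for i := 0; i < expected; i++ {
		select {
		case r := <-responses:
			out = append(out, r)
		case <-ctx.Done():
			// Reached only if something cancels the context mid-run,
			// e.g. a future backup-cancel feature.
			return out
		}
	}
	return out
}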

func (p *ItemBlockWorkerPool) Stop() {
	p.cancelFunc()
	p.logger.Info("ItemBlock worker stopping")
	p.wg.Wait()
A Contributor commented:

Add another log to indicate the worker pool has stopped, including the wait time.
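
Something along these lines would cover that suggestion (a hedged sketch built on the quoted method, assuming a time import; the waitTime field name is illustrative):

func (p *ItemBlockWorkerPool) Stop() {
	p.cancelFunc()
	p.logger.Info("ItemBlock worker stopping")
	start := time.Now()
	p.wg.Wait()
	// Log that the pool has fully stopped, and how long the wait took.
	p.logger.WithField("waitTime", time.Since(start)).Info("ItemBlock worker pool stopped")
}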


func StartItemBlockWorkerPool(ctx context.Context, workers int, log logrus.FieldLogger) *ItemBlockWorkerPool {
	// Buffer will hold up to 10 ItemBlocks waiting for processing
	inputChannel := make(chan ItemBlockInput, 10)
A Contributor commented:

Let's add a variable for the input limit (10).

@sseago (Collaborator, Author) replied:

I'm not sure we need a second configuration value here. This is just a buffer which sets how many inputs can be queued without blocking before a worker takes one. I can add it if we really want two configuration values here, but in most cases users won't really want to change this -- i.e. the same value that works with 1 worker (the default number of workers) is probably fine when they have 10 or 20 workers. So we can add it, but I don't expect tweaking it to change performance much, although setting it to a very small value (1 or 0) might negatively affect performance.


func StartItemBlockWorkerPool(ctx context.Context, workers int, log logrus.FieldLogger) *ItemBlockWorkerPool {
	// Buffer will hold up to 10 ItemBlocks waiting for processing
	inputChannel := make(chan ItemBlockInput, 10)
A Contributor commented:

Why do we pick 10 as the hardcoded value, instead of the number of CPU cores?

@sseago (Collaborator, Author) replied:

CPU cores aren't relevant here. This is just the buffer: the number of not-yet-taken input values that can be added to the channel before the adding goroutine blocks. The default value for the number of goroutines processing inputs will be one.
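
To make the distinction concrete (a generic Go sketch of the pattern, not the PR's actual StartItemBlockWorkerPool; the constant, type, and function names are illustrative): the buffer size is a property of the shared input channel and is independent of how many worker goroutines consume from it.

package main

import (
	"context"
	"fmt"
	"sync"
)

// inputBufferSize is how many queued inputs the producer can hand off before it
// blocks waiting for a worker; it is unrelated to the worker count or CPU count.
const inputBufferSize = 10

// startWorkerPool starts `workers` goroutines all reading from one buffered channel.
func startWorkerPool(ctx context.Context, workers int, process func(int)) (chan<- int, *sync.WaitGroup) {
	input := make(chan int, inputBufferSize)
	wg := &sync.WaitGroup{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case item, ok := <-input:
					if !ok {
						return // channel closed and drained
					}
					process(item)
				case <-ctx.Done():
					return
				}
			}
		}()
	}
	return input, wg
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	input, wg := startWorkerPool(ctx, 4, func(i int) { fmt.Println("processed item", i) })
	for i := 0; i < 20; i++ {
		input <- i // blocks only if the buffer already holds inputBufferSize unconsumed items
	}
	close(input)
	wg.Wait()
}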


func StartItemBlockWorkerPool(ctx context.Context, workers int, log logrus.FieldLogger) *ItemBlockWorkerPool {
	// Buffer will hold up to 10 ItemBlocks waiting for processing
	inputChannel := make(chan ItemBlockInput, 10)
A Contributor commented:

Are we going to make this configurable, and when? What is the consideration?

@sseago (Collaborator, Author) replied:

See the above comment. I don't really see a big need to make it configurable -- the concern is that users might get confused with both a worker pool size configuration (how many worker threads, this config is already in place) and a channel buffer size configuration, which really won't affect performance much at all. I can certainly make this configurable, though, if there is a use case for it.

@sseago sseago force-pushed the parallel-itemblocks branch from fd2ee86 to 91d7301 on February 7, 2025 16:42

Successfully merging this pull request may close these issues.

Phase 2 implementation for backup performance improvements
5 participants