Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port recent GPU stroke expansion changes in flatten to CPU version #415

Merged
merged 6 commits into from
Dec 6, 2023

Conversation

armansito
Copy link
Collaborator

This replicates the GPU-side stroke expansion changes that were recently made to the flatten stage to its CPU version.

Note on buffer sizes:

The longpathdash (round caps) and mmark scenes assert (due to out-of-bounds buffer access) if zoomed in (the former crashes immediately). The asserts can occur in the path_count and flatten stages. The root cause is that the segments, seg_counts, and lines buffers are not large enough to accommodate the required number of line segments. When run on the GPU, this manifests itself as artifacts rather than a crash.

This PR does not adjust the bump buffer sizes in order to avoid hiding the problem. I would like to fix this as part of #366. Unfortunately, even an accurate buffer size estimate will require more memory than what's permitted by the WebGPU standard (the default limit for maxStorageBufferBindingSize is 128 MiB and implementations aren't guaranteed to support a higher value). We have been considering culling lines that fall outside the viewport which can alleviate the memory usage in zoomed-in scenarios.

@armansito armansito requested review from raphlinus and dfrg November 22, 2023 21:46
This is needed to unpack the miter limit parameter in the CPU test
shader version of GPU stroke flattening.
Ported the recent GPU-side stroke expansion changes to the CPU version
of the `flatten` stage.
fn test_f16_to_f32() {
const EPS: f32 = 0.001;
let input: f32 = std::f32::consts::PI;
assert!((input - f16_to_f32(f32_to_f16(input))).abs() < EPS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have:

  1. an exported bit pattern of a 16 bit float, which we check gets the right number (i.e. test that both directions aren't wrong in the same way)
  2. test that values in the extreme ranges (i.e. values of the order 60_000 and 1/60_000 ± a factor of 2, if required - and 0) are appropriately roundtripped
  3. special values (i.e. NaN, infinity) roundtrip as expected. This is less important, as we don't really want to see them - maybe we should warn! in those cases?

Copy link
Collaborator Author

@armansito armansito Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I added tests for max and min value roundtrip, nan/inf roundtrip, and an explicit conversion. Perhaps test_f16_to_f32_roundtrip should become a proptest but I think this is good enough for now.

Copy link
Member

@DJMcNab DJMcNab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran this locally, and it worked great, thanks.

I do unfortunately see different behaviour in tricky_strokes with use-cpu enabled and disabled - the "bottom right" one doesn't change when zooming out when using the CPU version.

A few questions, which I believe are unlikely to be defects.

I can only really do a cursory scan of the code with my current understanding, but that hasn't brought up anything blocking

crates/encoding/src/math.rs Outdated Show resolved Hide resolved
@@ -580,7 +580,6 @@ fn output_two_lines_with_transform(

struct NeighboringSegment {
do_join: bool,
p0: vec2f,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could get warned about this being never read, but that's wishful thinking.

Although...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

**suspense**

let n_lines = if theta <= ROBUST_EPSILON {
MAX_LINES
} else {
MAX_LINES.min((std::f32::consts::TAU / theta).ceil() as u32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it likely that most of this maximum will be used on paths which are off-screen? (or has some culling already occured?)
Having done some (probably incorrect) translation of this code, it seems to me that the radius has to be much greater than 10_000 pixels1.

Footnotes

  1. I tried this in two different ways, and got two different numbers, so I don't fully trust my maths here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it likely that most of this maximum will be used on paths which are off-screen?

Yes, the really high line segment counts occur at really high zoom levels (for example if you zoom in directly to a stroke's cap). There is actually currently no culling and it's highly likely to wastefully generate off-screen line segments at high zoom levels. longpathdash with round caps is a good example of this, where zooming in actually exceeds the memory budget for LineSoup.

We've been talking about implementing at the very least viewport culling next year to handle the zoomed-in cases.

MAX_LINES.min((std::f32::consts::TAU / theta).ceil() as u32)
};

let th = angle / (n_lines as f32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the number of lines depend only on the radius, rather than including the angle being swept over?

I would naïvely expect th to be the same for all circles with the same radius, as that should make equal length lines.
It feels quite likely that I'm missing something obvious here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have this concern, and wonder if it might have something to do with the high observed memory usage.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, there is a bug here and thank you for taking a close look and catching that. n_lines is the number for a full circle. This is directly transcribed from the GPU flatten code where I was getting the count for a full circle and fitting that into any sweep angle. I also noticed that the derivation of theta is also not quite right. Given a radius $r$ and the length $l$ of the line segment from the center of the circle to the middle of the chord (i.e. a flattened line segment) given by $l = r - tol$, the derivation should look like:

$$ tol = r - r \times cos(\frac{\theta}{2}) $$

$$ \theta = 2 \times arccos(1 - \frac{tol}{r}) $$

And n_lines = (angle / theta).ceil(). We still need to guard against a division by zero / over-subdividing. I went ahead and picked an arbitrary minimum angle limit of 0.0001 radians but this could probably get fine tuned.

Copy link
Contributor

@raphlinus raphlinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concerns about the number of subdivisions in the arc flattening logic. I didn't wrap my head around it fully - perhaps it's right and there's something I'm missing, in which case I'd happily accept an argument.

Other than that, this looks like a good foundation for the Euler spiral flattening, and I'm looking forward to getting that fully wired.

let tol: f32 = 0.5;
let radius = tol.max((p0 - transform.apply(center)).length());
let x = 1. - tol / radius;
let theta = (2. * x * x - 1.).clamp(-1., 1.).acos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as sqrt(8 tol / radius) for small tol. I wonder if the extra math is worth it. I haven't investigated in detail.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new expression should be correct now and is simpler.

MAX_LINES.min((std::f32::consts::TAU / theta).ceil() as u32)
};

let th = angle / (n_lines as f32);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have this concern, and wonder if it might have something to do with the high observed memory usage.

) {
let mut p0 = transform.apply(begin);
let mut r = begin - center;
let tol: f32 = 0.5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little loose to me, I've generally been using 0.25. I'm also concerned that it doesn't match the flattening tolerance for curves in general.

This might be a good time for a heads up: I am not confident that the tolerance value in the existing code is accurately calibrated against the actual Fréchet distance between flattened and source curve; I may have let a constant factor of error enter. But the current work I'm doing does calibrate that, so when I land that I will be systematic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new derivation, 0.25 leads to visible line segments. 0.1 looks decent enough to me. I'm worried that anything lower will blow up the line count but you should probably take a look and make a judgement as to whether this tolerance is visually satisfactory.

@armansito
Copy link
Collaborator Author

PTAL. I think the arc flattening logic should be correct now barring any adjustments to MIN_THETA and the error tolerance. The current values still exceed the LineSoup and SegmentCount buffer sizes with the mmark and longpathdash (round) scenes.

Copy link
Contributor

@raphlinus raphlinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this makes more sense. I expect to take a closer look (hopefully visualizing the flattening output) as I adapt the Euler stuff, so I didn't go through with a fine-tooth comb, but this definitely gets us to a better place. Thanks!

let th = angle / (n_lines as f32);
let c = th.cos();
let s = th.sin();
let c = theta.cos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write let (s, c) = theta.sin_cos() but I'm not sure it's a significant improvement, and in any case the function is not available in WGSL.

@@ -126,6 +126,12 @@ fn write_line(
bbox: &mut IntBbox,
lines: &mut [LineSoup],
) {
assert!(
!p0.is_nan() && !p1.is_nan(),
"wrote line segment with NaN: p0: {:?}, p1: {:?}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can say {p0:?} here, as of Rust 1.58.

@armansito
Copy link
Collaborator Author

I do unfortunately see different behaviour in tricky_strokes with use-cpu enabled and disabled - the "bottom right" one doesn't change when zooming out when using the CPU version.

@DJMcNab I'm not sure why there is a discrepancy but I'll investigate this further. That said the bottom rows of strokes are currently not at all handled correctly - that's awaiting @raphlinus 's Euler Spiral work (early prototype started here: https://github.com/linebender/vello/tree/euler-spiral-strokes).

@armansito armansito merged commit 3f24d8c into main Dec 6, 2023
4 checks passed
@armansito armansito deleted the stroke-flatten-cpu branch December 6, 2023 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants