
Boolean Operations take a long time #68

Open
MattFerraro opened this issue May 29, 2024 · 5 comments

Comments

@MattFerraro

MattFerraro commented May 29, 2024

Here is an example script:

use truck_meshalgo::tessellation::{MeshableShape, MeshedShape};
use truck_modeling::builder::{rsweep, try_attach_plane, tsweep, vertex};
use truck_modeling::{Point3, Vector3};
use truck_polymesh::{obj, Rad};
use truck_shapeops::{and, or};

fn main() {
    // Make a cube with side length 100
    let origin = vertex(Point3::new(0.0, 0.0, 0.0));
    let x_axis = tsweep(&origin, Vector3::new(100.0, 0.0, 0.0));
    let xy_square = tsweep(&x_axis, Vector3::new(0.0, 100.0, 0.0));
    let cube = tsweep(&xy_square, Vector3::new(0.0, 0.0, 100.0));

    // Save it as an obj file
    let mesh = cube.triangulation(0.01).to_polygon();
    let file = std::fs::File::create("test_cube.obj").unwrap();
    obj::write(&mesh, file).unwrap();

    // Make a cylinder that is centered at (50, 50) so it will interfere with the cube
    let point = vertex(Point3::new(35.0, 50.0, -20.0));
    let circle = rsweep(
        &point,
        Point3::new(50.0, 50.0, -20.0),
        Vector3::new(0.0, 0.0, 1.0),
        Rad(7.0),
    );
    let disk = try_attach_plane(&[circle]).unwrap();
    let cylinder = tsweep(&disk, Vector3::new(0.0, 0.0, 140.0));

    // save the cylinder to a file
    let mesh = cylinder.triangulation(0.01).to_polygon();
    let file = std::fs::File::create("test_cylinder.obj").unwrap();
    obj::write(&mesh, file).unwrap();

    // Now let's do the boolean operations!

    let and_result = and(&cube, &cylinder, 1.0);
    let mesh = and_result.unwrap().triangulation(0.01).to_polygon();
    let file = std::fs::File::create("test_AND.obj").unwrap();
    obj::write(&mesh, file).unwrap();
    // This results in the cylinder, but truncated. This is the region where the cylinder intersects the cube
    // Aka the region of space which is both inside the cube AND inside the cylinder

    let or_result = or(&cube, &cylinder, 1.0);
    let mesh = or_result.unwrap().triangulation(0.01).to_polygon();
    let file = std::fs::File::create("test_OR.obj").unwrap();
    obj::write(&mesh, file).unwrap();
    // This results in a cube on a stick, aka the union of the cube and the cylinder
    // Aka the region of space which is inside the cube OR inside the cylinder

    let mut not_cylinder = cylinder.clone();
    not_cylinder.not();
    // not_cylinder is a weird thing...it's the entire universe _except_ the cylinder
    let and_not_result = and(&cube, &not_cylinder, 1.0);
    let mesh = and_not_result.unwrap().triangulation(0.01).to_polygon();
    let file = std::fs::File::create("test_AND_NOT.obj").unwrap();
    obj::write(&mesh, file).unwrap();
    // This results in a cube with a hole in it, aka the cube with the cylinder subtracted from it
    // Aka the region of space which is inside the cube AND NOT inside the cylinder
}

This takes 13 seconds of compute time on my M1 MacBook Air. Is it possible to cut the compute time down by a factor of 10, or even 100?

For anyone who might want to dive in and test things, here is a simpler example focusing just on OR; it takes 15 seconds to run on my laptop:

...

fn main() {
    let origin = vertex(Point3::new(0.0, 0.0, 0.0));
    let x_axis = tsweep(&origin, Vector3::new(100.0, 0.0, 0.0));
    let xy_square = tsweep(&x_axis, Vector3::new(0.0, 100.0, 0.0));
    let cube = tsweep(&xy_square, Vector3::new(0.0, 0.0, 100.0));

    let point = vertex(Point3::new(104.0, 50.0, -20.0));
    let circle = rsweep(
        &point,
        Point3::new(80.0, 50.0, -20.0),
        Vector3::new(0.0, 0.0, 1.0),
        Rad(7.0),
    );
    let disk = try_attach_plane(&[circle]).unwrap();
    let cylinder = tsweep(&disk, Vector3::new(0.0, 0.0, 140.0));

    let or_result = or(&cube, &cylinder, 0.2);
    let mesh = or_result.unwrap().triangulation(0.01).to_polygon();
    let file = std::fs::File::create("test_OR.obj").unwrap();
    obj::write(&mesh, file).unwrap();
    // This results in a cube on a stick, aka the union of the cube and the cylinder
    // Aka the region of space which is inside the cube OR inside the cylinder
}

This example is slower because the third parameter to or() is the tolerance to mesh to, and the run time depends strongly on that tolerance. If I use a tolerance of 0.9 it runs in just 4 seconds, but if I use a tolerance of 1.0 it fails to solve and we get a panic instead.

@tringenbach

I wouldn't say I've really started looking at it yet, but I did the easy thing and turned on --release, which got me a roughly 10x performance boost:

cargo run --example performance-issue  29.26s user 0.20s system 286% cpu 10.295 total
cargo run --profile release --example performance-issue  2.60s user 0.17s system 211% cpu 1.311 total

@twitchyliquid64


A quick look through a profiler suggests that truck_geometry::nurbs::knot_vec::KnotVec::try_bspline_basis_functions accounts for the majority of the CPU time, and most of that is wasted allocating its vec here:

let mut res = vec![0.0; n];

I imagine some refactoring could introduce a variant of try_bspline_basis_functions that lets you pass in an existing vec to reuse when it's called in a hot loop like that. We'd also get a big speedup from cache locality.

@twitchyliquid64

twitchyliquid64 commented Oct 31, 2024

The pattern seems to look like this:

  1. Boolean ops require computing the intersection, so the algorithm needs to find the (u, v) of the intersection(s).
  2. That calls into truck_geotrait::algo::surface::presearch.
  3. presearch needs to test parameter values, so it calls subs a LOT. Really, really hot function.
  4. But deep in the guts of the underlying curve implementation, subs needs to compute and evaluate the basis functions, and it computes them into a newly allocated vec on each call.
  5. Adding up the cost of allocation and free, this accounts for roughly 50% of the time spent in presearch.

So an easy win for at least a 2x speedup (probably more, thanks to cache locality) would be to refactor the presearch/subs call path to avoid allocating, or to allocate only once-ish.

@twitchyliquid64

I'm fairly sure we could figure out how to turn this:

pub fn try_bspline_basis_functions(&self, degree: usize, t: f64) -> Result<Vec<f64>> {
    let n = self.len() - 1;
    if self[0].near(&self[n]) {
        return Err(Error::ZeroRange);
    } else if n < degree {
        return Err(Error::TooLargeDegree(n + 1, degree));
    }
    let idx = {
        let idx = self
            .floor(t)
            .unwrap_or_else(|| self.floor(self[0]).unwrap());
        if idx == n {
            n - self.multiplicity(n)
        } else {
            idx
        }
    };
    let mut res = vec![0.0; n];
    res[idx] = 1.0;
    for k in 1..=degree {
        let base = if idx < k { 0 } else { idx - k };
        let delta = self[base + k] - self[base];
        let max = if idx + k < n { idx } else { n - k - 1 };
        let mut a = inv_or_zero(delta) * (t - self[base]);
        for i in base..=max {
            let delta = self[i + k + 1] - self[i + 1];
            let b = inv_or_zero(delta) * (self[i + k + 1] - t);
            res[i] = a * res[i] + b * res[i + 1];
            a = 1.0 - b;
        }
    }
    res.truncate(n - degree);
    Ok(res)
}

Into a method that returns an iterator over the computed basis functions rather than a Vec.

@LtdJorge

LtdJorge commented Nov 8, 2024

@twitchyliquid64 I more or less came to the same conclusions as you. I created this repo to track my findings. It contains perf runs, flamegraphs, hyperfine results, dhat profiles, etc.

The optimization where I got the most benefit was from adding SmallVec as a dependency (any Vec with stack allocation would work) and turning this:

pub struct KnotVec(Vec<f64>);

Into this:

pub type KnotVecInner = SmallVec<[f64; 16]>;

pub struct KnotVec(KnotVecInner);

This was just a quick and dirty test, choosing 16 f64s (two cache lines) because that gave the lowest runtime. I have just pushed these changes to my repo, which uses the modified truck version as a patched dependency. The example code to benchmark is the one @MattFerraro provided. I also enabled full optimizations, including hand-picked target-features for my specific CPU, to get a before/after that the compiler cannot optimize any further, so if anyone wants to try, those might need to be changed.

I also saw that the tessellation code was very hot, so I added a micro-optimization to spade: creating vectors with Vec::with_capacity(16) instead of Vec::new(). An empty vector has to reallocate a couple of times to grow to 16 items, while the pre-allocated one needs only a single initial allocation. However, the number 16 here is completely arbitrary; tracking the average number of objects such vectors hold would yield a better number.

I want to test this with PGO to see if it makes a sizeable difference for those super-hot call sites. I'll set up something whenever I have more time for it.

AFAIK, the CADmium use case would be running on WASM, so compiler optimizations and some of these other techniques might not be applicable there.

Lastly, if anyone wants to check out my repo, be sure to set up Git LFS if you want my perf.data files, as those are too big for a normal GitHub upload (at least the ones from running in debug mode).
