-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize distance calculations: consider replacing sf::st_distance #164
Comments
Thanks for opening the issue! I agree that function can be quite the pain. I'm not super willing to adopt RFast as a dependency, because that winds up causing all sorts of knock-on effects with how we handle units in arguments throughout the package. I'm going to leave this open though, because it'd be worth exploring C++ based alternatives to this distance function. Depending on what specific function you're looking to use, you might be interested in nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
nc_dist <- as.numeric(sf::st_distance(nc)) # or use RFast here, or other functions
rsample::clustering_cv(
sf::st_drop_geometry(nc),
AREA, # this is a dummy value to force the function to run
distance_function = \(x) x,
cluster_function = \(dists, v, ...) kmeans(dists, v)$cluster,
x = nc_dist
)
#> # 10-cluster cross-validation
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [88/12]> Fold01
#> 2 <split [91/9]> Fold02
#> 3 <split [96/4]> Fold03
#> 4 <split [92/8]> Fold04
#> 5 <split [88/12]> Fold05
#> 6 <split [89/11]> Fold06
#> 7 <split [84/16]> Fold07
#> 8 <split [84/16]> Fold08
#> 9 <split [92/8]> Fold09
#> 10 <split [96/4]> Fold10 Created on 2024-11-12 with reprex v2.1.1 |
Totally understand your take on the However, it only covers Euclidean distance, but it produces identical results to P.S. Thanks for the tip on how to use custom distance function!! Appreciate it library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(Rcpp)
library(units)
#> udunits database from C:/Users/TsyplenkovA/AppData/Local/R/win-library/4.4/units/share/udunits/udunits2.xml
cppFunction("NumericMatrix distance_matrix_cpp(NumericMatrix points) {
int n = points.nrow();
NumericMatrix distances(n, n);
for(int i = 0; i < n; i++) {
distances(i,i) = 0;
for(int j = i+1; j < n; j++) {
double dx = points(i,0) - points(j,0);
double dy = points(i,1) - points(j,1);
double dist = sqrt(dx*dx + dy*dy);
distances(i,j) = dist;
distances(j,i) = dist;
}
}
return distances;
}")
# Wrapper function to add units
cpp_distance_units <- function(pts) {
crs <- sf::st_crs(pts)
coords <- sf::st_coordinates(pts)
dist_matrix <- distance_matrix_cpp(coords)
ids <- as.character(seq_along(coords[, 1]))
dimnames(dist_matrix) <- list(ids, ids)
units(dist_matrix) <- crs$units
dist_matrix
}
# Function from the previous message
pts <- create_points(5000)
bench::mark(
sf = sf::st_distance(pts, which = "Euclidean"),
cpp = cpp_distance_units(pts),
rfast = {
coords <- sf::st_coordinates(pts)
Rfast::Dist(coords, method = "euclidean")
},
iterations = 10,
relative = TRUE,
check = FALSE
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 sf 2.51 2.61 1 2.00 2.34
#> 2 cpp 1.08 1.11 2.35 1 1
#> 3 rfast 1 1 2.61 1.01 1.67
waldo::compare(
sf::st_distance(pts, which = "Euclidean"),
cpp_distance_units(pts)
)
#> ✔ No differences Created on 2024-11-13 with reprex v2.1.0 |
Hi there!
Just a short note on the distance calculations. From my experience, it's one of the major bottlenecks in the package. For example, I can't even apply the
spatialsample
package for my datasets with 40k points in it.spatialsample/R/buffer.R
Line 27 in ded1691
What if we replace
sf::st_distance
with a slightly more robust function? For example, there is some evidence thatRfast::Dist
works two to three times faster thansf::st_distance()
. Perhaps it would be worth adding one more package to the dependency list in the name of a speed boost?Unfortunately, both algorithms seem to have O(n²) time complexity, which is not good, and
Rfast
is not a silver bullet. Additionally, in the case of longlat coordinates,sf::st_distance()
may still be preferable as it computes Great Circle distance.I can prepare a PR if you like the approach.
See below some benchmarking
Created on 2024-11-13 with reprex v2.1.0
The text was updated successfully, but these errors were encountered: