try to add r lib path
szcf-weiya committed Jan 28, 2025
1 parent 31fa8fc commit 756b19d
Showing 2 changed files with 6 additions and 1 deletion.
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -45,5 +45,6 @@ jobs:

- uses: julia-actions/julia-docdeploy@v1
  env:
    LD_LIBRARY_PATH: /opt/R/${{ matrix.r-version }}/lib/R/lib
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # If authenticating with GitHub Actions token
    DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # If authenticating with SSH deploy key
6 changes: 5 additions & 1 deletion docs/src/index.md
@@ -1,3 +1,7 @@
# Kmeans Benchmarks

Kmeans is a popular clustering algorithm.
Clustering is a cornerstone of unsupervised machine learning, with the k-means algorithm standing as one of the most widely used methods for partitioning data into coherent groups. Its simplicity, interpretability, and adaptability have made it a staple in fields ranging from customer segmentation to bioinformatics. However, the performance and results of k-means can vary significantly depending on the implementation choices made by practitioners, including the software ecosystem (e.g., R, Julia, Python), the algorithmic variants employed (e.g., Lloyd’s algorithm, Hartigan-Wong, or scalable approximations like Mini-Batch k-means), and the initialization strategies (e.g., random seeding, k-means++, or density-based initialization). These choices impact not only computational efficiency but also the quality and stability of the resulting clusters.

This project seeks to systematically benchmark and compare k-means implementations across different frameworks, focusing on R and Julia as representative languages for statistical computing and high-performance numerical analysis, respectively, while also evaluating the interplay between initialization methods and algorithmic variants. R, with its rich ecosystem of packages (e.g., `stats`, `ClusterR`), offers user-friendly tools geared toward statistical work, whereas Julia (particularly the `Clustering` package), with its just-in-time (JIT) compilation and parallel computing capabilities, promises faster execution for large datasets. Beyond software comparisons, the study will assess how initialization techniques (e.g., naive random centroids vs. more sophisticated seeding such as k-means++) influence convergence rates, cluster quality metrics (e.g., silhouette score, within-cluster sum of squares), and sensitivity to local optima.
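
To make the role of initialization concrete, here is a minimal, dependency-free Python sketch of Lloyd's algorithm with two seeding strategies (uniform random and k-means++). It is an illustration only, not the benchmark code of this project and not the API of `stats`, `ClusterR`, or `Clustering`; the function names `init_random`, `init_kmeanspp`, and `lloyd` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_random(points, k, rng):
    """Naive seeding: pick k distinct data points uniformly at random."""
    return [list(p) for p in rng.sample(points, k)]


def init_kmeanspp(points, k, rng):
    """k-means++ seeding: each further center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with a center already
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and mean update until the
    centers stop changing. Returns (centers, labels, wcss)."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, wcss


# Four points forming two tight pairs far apart along the x-axis.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]

# A good start (one seed per pair) reaches the optimal partition.
print(lloyd(pts, [[0, 0], [10, 0]])[2])  # 1.0

# A bad start (both seeds in the same pair) gets stuck in a local optimum.
print(lloyd(pts, [[0, 0], [0, 1]])[2])   # 100.0

# Seeded runs with either strategy (results depend on the seed).
rng = random.Random(0)
print(lloyd(pts, init_random(pts, 2, rng))[2])
print(lloyd(pts, init_kmeanspp(pts, 2, rng))[2])
```

The two hand-picked starts show why the within-cluster sum of squares can differ by orders of magnitude between runs on the same data: Lloyd's algorithm only descends to the nearest local optimum, so the seeding decides which optimum is reached.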

By quantifying trade-offs between computational speed, scalability, and cluster accuracy, this work aims to provide actionable insights for researchers and practitioners in selecting optimal k-means configurations tailored to their data size, dimensionality, and domain requirements. The findings will contribute to a deeper understanding of how algorithmic choices and software ecosystems shape the practical utility of this foundational clustering method.
