Divide Clustering
A necessary step toward advancing large-scale, heterogeneous hydrologic forecasting is the classification of regions with shared hydrological features. This is particularly critical with advances like the Next Generation Water Resources Modeling Framework (NextGen), which allows different models to be run in different portions of the country.
The hydrofabric data and associated attribute information used to run NextGen are available on lynker-spatial.
The aim of this wiki is to show how the attribute data used to downscale forcings and run models like LSTM (Long Short-Term Memory), TOPMODEL, CFE (Conceptual Functional Equivalent), and NOAH-OWP can be clustered to define areas with similar hydrologic properties.
The goal is to provide a starting point from which model formulations can be assigned and refined.
Starting with the model attributes shared with the NextGen hydrofabric, the following steps outline how to produce clusters:
- Normalization and Scaling: We rescale the data to a standard range (0 to 1) and normalize it so that all features have unit variance and approximately follow a normal distribution. This prevents features with larger magnitudes from dominating the learning process and ensures that all features contribute equally to the clustering (a code sketch covering these preprocessing and training steps follows the SOM step below).
- Principal Component Analysis (PCA): We reduce the 39 features to 20 principal components. Reducing the dimensionality of the data with PCA helps remove noise, redundancy, and multicollinearity, and the simplified data lead to a more efficient Self-Organizing Map (SOM) model with lower computational requirements.
- Self-Organizing Map (SOM): SOMs are robust in clustering because they preserve the topology of the data, handle complex data distributions, and remain reliable in dynamic environments. We use the Best Matching Units (BMUs) from a SOM as inputs for K-means, Birch, or Gaussian Mixture clustering algorithms, which enhances efficiency, improves cluster starting points, captures complex patterns, and yields more interpretable clusters. This integration leverages the strengths of both the SOM and the subsequent clustering method. The number of neurons was set to 10,000, following the general rule of thumb of using more than 5*sqrt(n) neurons, where n is the number of observations. We can plot the quantization and topographic error of the SOM at each step to understand how the training evolves.
As seen, both the quantization and topographic error level out around the 500-epoch mark.
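The snippet below is a minimal sketch of these preprocessing and training steps in Python. The wiki does not prescribe a specific SOM implementation, so MiniSom, the MinMax + Standard scaler combination, and the hyperparameters (other than the 100 x 100 grid, 20 components, and ~500 epochs) are illustrative assumptions; `attrs` is a hypothetical DataFrame holding the 39 divide attributes.

```python
# Minimal sketch of the preprocessing and SOM training steps (illustrative only).
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from minisom import MiniSom

# attrs: hypothetical DataFrame with one row per divide and 39 numeric attribute columns.
# attrs = pd.read_parquet("divide_attributes.parquet")
X = attrs.to_numpy(dtype=float)

# Normalization and scaling: rescale to [0, 1], then standardize to unit variance.
X = MinMaxScaler().fit_transform(X)
X = StandardScaler().fit_transform(X)

# PCA: reduce the 39 attributes to 20 principal components.
X_pca = PCA(n_components=20).fit_transform(X)

# SOM: 100 x 100 grid = 10,000 neurons (> 5*sqrt(n) for CONUS-scale n).
som = MiniSom(100, 100, X_pca.shape[1], sigma=2.0, learning_rate=0.5,
              neighborhood_function="gaussian", random_seed=42)
som.pca_weights_init(X_pca)

# Track quantization and topographic error over 500 passes through the data.
q_errors, t_errors = [], []
for epoch in range(500):
    som.train(X_pca, num_iteration=len(X_pca), random_order=True)
    q_errors.append(som.quantization_error(X_pca))
    t_errors.append(som.topographic_error(X_pca))
```

Plotting `q_errors` and `t_errors` against the epoch number gives the type of training curve discussed above.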
- Finding the optimum number of clusters: Once the SOM is trained, we use the following scores to identify the optimum number of clusters:
- Silhouette score: quantifies the cohesion and separation of clusters; higher values indicate better-defined clusters.
- Calinski-Harabasz index: measures the ratio of inter-cluster variance to intra-cluster variance; higher values signify more distinct clusters.
- Davies-Bouldin index: evaluates the average similarity between each cluster and its most similar one; lower values correspond to better-defined clusters.
These metrics offer diverse perspectives on the quality of clustering solutions, and together they help us find a balance between cohesion, separation, and cluster distinctiveness. The image below shows the k-means scoring results for these three metrics.
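A sketch of how such a scoring sweep could look, continuing from the SOM sketch above. Clustering the SOM codebook vectors (the neuron weights) rather than the raw samples is an assumption consistent with the SOM + K-means workflow described here, and the k range is arbitrary.

```python
# Sweep candidate cluster counts with K-means on the SOM codebook vectors and
# compute the three scores above.
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

codebook = som.get_weights().reshape(-1, X_pca.shape[1])  # (10000, 20) neuron weights

scores = {}
for k in range(5, 31):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(codebook)
    scores[k] = {
        "silhouette": silhouette_score(codebook, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(codebook, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(codebook, labels),        # lower is better
    }
```

Plotting these scores against k produces curves like those in the figure, from which an elbow can be read.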
The optimum number of clusters appears to be 18 according to the Elbow method.
Then we can overlay these cluster labels on the coordinates of the winning neurons to get an overview of how the samples are distributed across the map.
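Continuing the sketch, each divide can be assigned the cluster of its BMU and the result drawn on the SOM grid; `k_opt = 18` follows the elbow result above, and the jittered scatter is just one illustrative way to draw the overlay.

```python
# Assign each divide the cluster of its BMU and overlay the result on the SOM grid.
import matplotlib.pyplot as plt

k_opt = 18
kmeans = KMeans(n_clusters=k_opt, n_init=10, random_state=42).fit(codebook)

# Cluster label of every neuron (i, j), then of every sample via its winning neuron.
neuron_cluster = kmeans.labels_.reshape(100, 100)
bmus = np.array([som.winner(x) for x in X_pca])            # (n_samples, 2) grid coordinates
sample_cluster = neuron_cluster[bmus[:, 0], bmus[:, 1]]

# Small jitter keeps samples that share a neuron visible on the scatter plot.
jitter = np.random.default_rng(0).uniform(-0.4, 0.4, size=bmus.shape)
plt.figure(figsize=(8, 8))
plt.scatter(bmus[:, 0] + jitter[:, 0], bmus[:, 1] + jitter[:, 1],
            c=sample_cluster, s=2, cmap="tab20")
plt.title("Samples on the SOM grid, coloured by SOM-Kmeans cluster")
plt.show()
```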
18 different clusters were identified by SOM-Kmeans. The spatial representation of these clusters is shown in the figure below for CONUS.
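To reproduce a CONUS map like the one referenced above, the per-divide cluster labels can be joined back to the hydrofabric divide polygons. The geopackage path, the `divides` layer, and the `divide_id` column below are assumptions about the lynker-spatial hydrofabric layout, not details taken from this wiki; `attrs.index` is assumed to hold the divide IDs.

```python
# Join the per-divide cluster labels back to the hydrofabric divide polygons.
import geopandas as gpd
import pandas as pd

divides = gpd.read_file("conus_nextgen.gpkg", layer="divides")  # hypothetical file/layer
labels = pd.DataFrame({"divide_id": attrs.index, "cluster": sample_cluster})
divides = divides.merge(labels, on="divide_id", how="left")

ax = divides.plot(column="cluster", categorical=True, cmap="tab20",
                  legend=True, figsize=(12, 7))
ax.set_title("SOM-Kmeans clusters across CONUS")
```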