Handling Highly Imbalanced Flood Data Using K-Means Clustering in Skyline Query Dominance Testing
Published 2026-06-11
Keywords
- Flood, Imbalanced Data, K-Means Clustering, Skyline Query
Abstract
The skyline query is a recommendation algorithm that selects objects based on multi-attribute preferences. A key challenge is that its results can be highly imbalanced, with only a small number of objects meeting the preferred criteria. This imbalance reduces the reliability of spatial decision-making, including flood vulnerability assessment. This study addresses the issue by applying a modified Sort-Filter Skyline method that takes maximum and minimum attribute preferences into account during sorting. The skyline output is strongly imbalanced: only 18 areas are identified as flood-prone, compared to 1,574 non-flood-prone areas. To mitigate this, K-Means clustering is applied as a refinement step. The Elbow and Gap Statistic methods both recommend three clusters as optimal, whereas the Silhouette method suggests eight. Cluster distribution analysis shows that three clusters produce a more balanced representation, with Scheme 1 and Scheme 3 showing better balance ratios and lower variation than Scheme 2. Clustering into three groups therefore yields a more representative mapping of flood-prone areas.
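To make the dominance test concrete, the following is a minimal sketch of a Sort-Filter Skyline pass with mixed maximum and minimum preferences. It is not the study's implementation: the function names, the sum-based sort key, and the two illustrative attributes (rainfall to maximize, elevation to minimize) are assumptions chosen for the example, and the sample values are invented.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def to_min_space(point: Point, prefs: Sequence[str]) -> Point:
    """Negate 'max' attributes so that smaller is better on every axis."""
    return tuple(-v if pref == "max" else v for v, pref in zip(point, prefs))


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every axis and strictly better on one.

    Both points must already be in min-space.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sfs_skyline(points: Sequence[Point], prefs: Sequence[str]) -> List[Point]:
    """Sort-Filter Skyline: presort by a monotone score, then filter.

    The score here is the attribute sum in min-space (an assumption). If a
    dominates b, a's sum is strictly smaller, so a is always visited first
    and an accepted point never has to be removed later.
    """
    keyed = [(to_min_space(p, prefs), p) for p in points]
    keyed.sort(key=lambda pair: sum(pair[0]))

    skyline: List[Tuple[Point, Point]] = []
    for key, original in keyed:
        # Compare only against points already accepted into the skyline.
        if not any(dominates(s_key, key) for s_key, _ in skyline):
            skyline.append((key, original))
    return [original for _, original in skyline]


# Hypothetical areas as (rainfall_mm, elevation_m); values are invented.
areas = [(300, 5), (250, 10), (320, 8), (200, 3), (280, 12)]
prefs = ("max", "min")  # higher rainfall and lower elevation are preferred
print(sfs_skyline(areas, prefs))
```

In practice the attributes would usually be normalized before summing so that no single attribute dominates the sort order. Because only non-dominated points survive the filter, the skyline set is typically small relative to the full dataset, which is the source of the imbalance described in the abstract.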