Analysis of Key Features in PCOS Diagnosis Using Random Forest and XGBoost with SMOTE and SHAP
Published 2026-06-11
Keywords
- Machine Learning, PCOS, SHAP, SMOTE, XGBoost
Abstract
Polycystic Ovary Syndrome (PCOS) is a hormonal disorder in women of reproductive age characterized by irregular menstrual cycles, hyperandrogenism, and polycystic ovarian morphology. Diagnosis is challenging because its symptoms overlap with those of other endocrine disorders. This study proposes an interpretable machine learning approach for PCOS diagnosis using Random Forest and XGBoost. The Synthetic Minority Oversampling Technique (SMOTE) was applied to handle class imbalance, and SHapley Additive exPlanations (SHAP) were used to make the models interpretable. The dataset comprised 541 samples with 45 clinical and hormonal features; the data were preprocessed and model hyperparameters were tuned with GridSearchCV. XGBoost with SMOTE and GridSearchCV achieved the best performance, with 93% accuracy, 92% precision, 89% recall, and 90% F1-score. Random Forest obtained comparable results, with 93% accuracy, 94% precision, 87% recall, and 90% F1-score. SHAP analysis highlighted key features such as follicle count, Anti-Müllerian Hormone (AMH), skin darkening, weight gain, and irregular cycles. Global SHAP interpretation identified the most influential predictors, while local SHAP explanations provided patient-specific insight into individual predictions, improving transparency. The consistency of the SHAP results with the Rotterdam criteria supports the model’s clinical validity and strengthens trust in AI-assisted tools. Overall, combining SMOTE, GridSearchCV, and SHAP not only improved predictive performance but also yielded transparent outcomes, indicating potential use in early PCOS screening.
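The reported F1-scores can be sanity-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below is illustrative only: it uses the rounded percentages from the abstract, not the study's raw predictions, and the names `f1_score`, `reported`, and `recomputed` are assumptions introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as stated in the abstract (assumed to be for the positive class).
reported = {
    "XGBoost": {"precision": 0.92, "recall": 0.89, "f1": 0.90},
    "Random Forest": {"precision": 0.94, "recall": 0.87, "f1": 0.90},
}

# Recompute F1 from the rounded precision and recall of each model.
recomputed = {
    name: f1_score(m["precision"], m["recall"]) for name, m in reported.items()
}

for name, value in recomputed.items():
    stated = reported[name]["f1"]
    print(f"{name}: recomputed F1 = {value:.3f}, reported F1 = {stated:.2f}")
```

Under these assumptions, both recomputed values round to 0.90, which agrees with the F1-scores stated for XGBoost and Random Forest.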