Clustering-Based Knowledge Discovery in Breast Cancer: Insights from a Local Clinical Dataset

Dehghantanha, Oveis; Mehrshad, Nasser; Bakhshali, Roksana; Sebzari, Ahmad Reza

doi:10.22061/jecei.2025.11787.835


Journal of Electrical and Computer Engineering Innovations (JECEI)
Article 11, Volume 14, Issue 1, Farvardin 2026, Pages 117-144
Article Type: Original Research Paper
DOI: 10.22061/jecei.2025.11787.835
Authors
Oveis Dehghantanha¹; Nasser Mehrshad¹*; Roksana Bakhshali²; Ahmad Reza Sebzari³
¹Department of Electrical and Computer Engineering, University of Birjand, Birjand, Iran.
²Omid Cancer Center, Ahvaz, Iran.
³Department of Internal Medicine, School of Medicine, Cellular and Molecular Research Center, Valiasr Hospital, Birjand University of Medical Sciences, Birjand, Iran.
Received: 04 Ordibehesht 1404; Revised: 24 Tir 1404; Accepted: 11 Mordad 1404 (Solar Hijri calendar)
Abstract
Background and Objectives: Understanding the heterogeneity of breast cancer is crucial for improving treatment strategies. This study investigates the application of K-Means and Hierarchical Clustering to a local dataset of breast cancer patients from Iranmehr Hospital, Birjand, Iran, with the primary goal of identifying potential patient subgroups based on their clinical and treatment characteristics for knowledge discovery. The potential of these subgroups to inform future research on personalized treatment approaches is explored.
Methods: A retrospective dataset comprising pathological and clinical information was analyzed using K-Means and Agglomerative Hierarchical Clustering to identify patient subgroups. The optimal number of clusters was consistently determined to be two (k=2) for both methods based on rigorous internal validation metrics (Elbow Method, Silhouette Analysis, Calinski-Harabasz Index, and Largest Jump Analysis for Hierarchical Clustering). Statistical tests (ANOVA and Chi-squared) were employed to assess significant differences in features across the identified clusters from both K-Means and Hierarchical analyses, providing insights into the key factors differentiating these groups. Internal cluster validity was assessed using Silhouette Score and Calinski-Harabasz Index.
Results: The K-Means analysis identified two clusters exhibiting significant differences in characteristics such as age, chemotherapy session intensity, menopausal status, nodal involvement, and biomarker expression (ER, PR, HER2, Ki67). The Hierarchical Clustering also yielded two clusters with varying characteristics, and a comparison between the two methods highlighted both similarities and differences in the identified patient stratifications. The overall agreement between K-Means and Hierarchical Clustering was quantified by an Adjusted Rand Index (ARI) of 0.4697.
Conclusion: Both K-Means and Hierarchical Clustering effectively revealed potential patient subgroups within the studied dataset, highlighting the heterogeneity of breast cancer presentation and treatment at a local level. These clusters exhibited statistically significant differences across key clinical and treatment features. Future research is needed to validate these findings in larger, multi-center studies, explore the clinical significance of these subgroups in terms of treatment outcomes, and compare the effectiveness of different clustering methodologies for this purpose.
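The Adjusted Rand Index reported in the Results measures how far two partitions of the same patients agree beyond what chance would produce: 1 means identical groupings (up to relabeling), values near 0 mean chance-level agreement. The following is a minimal sketch of the standard pair-counting formula, not the authors' code. The function name `adjusted_rand_index` and the toy label vectors are illustrative assumptions and are not taken from the study data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two flat partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        # Fewer than two items: no pairs to compare, treat as perfect agreement.
        return 1.0

    # Contingency counts: items in cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of item pairs grouped together in both, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # upper bound on agreement
    if max_index == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy labels for 8 patients (NOT the study data).
kmeans_labels = [0, 0, 0, 0, 1, 1, 1, 1]
hier_labels = [0, 0, 0, 1, 1, 1, 1, 1]
ari = adjusted_rand_index(kmeans_labels, hier_labels)  # 48/97, about 0.495
```

In this toy case a single patient changes cluster between the two methods, yet the ARI already falls to roughly 0.49. This shows how strongly the index penalizes disagreement on small samples. An ARI of 0.4697, as reported above, is therefore compatible with two stratifications that overlap substantially but are far from identical.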
Keywords
Breast Cancer; Knowledge Discovery; Clustering; K-Means Clustering; Hierarchical Clustering
