Enhancing cardiovascular risk assessment with advanced data balancing and domain knowledge-driven explainability

Yang, Fan; Qiao, Yanan; Hajek, Petr; Abedin, Mohammad

doi:10.1016/j.eswa.2024.124886

Journal article 1015 views 362 downloads

Expert Systems with Applications, Volume: 255, Start page: 124886

Swansea University Author: Mohammad Abedin

PDF | Version of Record

© 2024 The Author(s). This is an open access article under the CC BY license

DOI (Published version): 10.1016/j.eswa.2024.124886

Abstract

In medical risk prediction, such as predicting heart disease, machine learning (ML) classifiers must achieve high accuracy, precision, and recall to minimize the chances of incorrect diagnoses or treatment recommendations. However, real-world datasets often have imbalanced data, which can affect cla...

Published in:	Expert Systems with Applications
ISSN:	0957-4174
Published:	Elsevier BV 2024
URI:	https://cronfa.swan.ac.uk/Record/cronfa67523

first_indexed	2024-09-02T14:23:54Z
last_indexed	2024-11-25T14:20:20Z
id	cronfa67523
recordtype	SURis
fullrecord	<?xml version="1.0"?><rfc1807><datestamp>2024-10-30T13:02:25.2823340</datestamp><bib-version>v2</bib-version><id>67523</id><entry>2024-09-02</entry><title>Enhancing cardiovascular risk assessment with advanced data balancing and domain knowledge-driven explainability</title><swanseaauthors><author><sid>4ed8c020eae0c9bec4f5d9495d86d415</sid><ORCID>0000-0002-4688-0619</ORCID><firstname>Mohammad</firstname><surname>Abedin</surname><name>Mohammad Abedin</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2024-09-02</date><deptcode>CBAE</deptcode><abstract>In medical risk prediction, such as predicting heart disease, machine learning (ML) classifiers must achieve high accuracy, precision, and recall to minimize the chances of incorrect diagnoses or treatment recommendations. However, real-world datasets often have imbalanced data, which can affect classifier performance. Traditional data balancing methods can lead to overfitting and underfitting, making it difficult to identify potential health risks accurately. Early prediction of heart attacks is of paramount importance, and researchers have developed ML-based systems to address this problem. However, much of the existing ML research is based on a single dataset, often ignoring performance evaluation across multiple datasets. As the demand for interpretable ML models grows, model interpretability becomes central to revealing insights and feature effects within predictive models. To address these challenges, we present a novel data balancing technique that uses a divide-and-conquer strategy with the K-Means clustering algorithm to segment the dataset. The performance of our approach is highlighted through comparisons with established techniques, which demonstrate the superiority of our proposed method. To address the challenge of inter-dataset discrepancies, we use two different datasets. 
Our holistic pipeline, strengthened by the innovative balancing technique, effectively addresses performance discrepancies, culminating in a significant improvement from 81% to 90%. Furthermore, through advanced statistical analysis, it has been determined that the 95% confidence interval for the AUC metric of our method ranges from 0.8187 to 0.8411. This observation serves to underscore the consistency and reliability of our approach, demonstrating its ability to achieve high performance across a range of scenarios. Incorporating Explainable AI (XAI), we examine the feature rankings and their contributions within the best performing Random Forest model. While the domain expert feedback is consistent with the explanatory power of XAI, some differences remain. Nevertheless, a remarkable convergence in feature ranking and weighting is observed, bridging the insights from XAI tools and domain expert perspectives.</abstract><type>Journal Article</type><journal>Expert Systems with Applications</journal><volume>255</volume><journalNumber/><paginationStart>124886</paginationStart><paginationEnd/><publisher>Elsevier BV</publisher><placeOfPublication/><isbnPrint/><isbnElectronic/><issnPrint>0957-4174</issnPrint><issnElectronic/><keywords>Heart disease risk, Data balancing, Performance discrepancy, Explainability, Expert system, Domain knowledge</keywords><publishedDay>1</publishedDay><publishedMonth>12</publishedMonth><publishedYear>2024</publishedYear><publishedDate>2024-12-01</publishedDate><doi>10.1016/j.eswa.2024.124886</doi><url/><notes/><college>COLLEGE NAME</college><department>Management School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>CBAE</DepartmentCode><institution>Swansea University</institution><apcterm>SU Library paid the OA fee (TA Institutional Deal)</apcterm><funders>This research is supported by the Natural Science Basic Research Program of Shaanxi [Program No. 2023-JC-YB-490]. 
This research is also supported by the Research Fund of Guangxi Key Lab of Multi-source Information Mining &amp; Security (MIMS24-06). This research is also supported by “the Fundamental Research Funds for the Central Universities, JLU” (93K172024K12).</funders><projectreference/><lastEdited>2024-10-30T13:02:25.2823340</lastEdited><Created>2024-09-02T15:22:43.9758569</Created><path><level id="1">Faculty of Humanities and Social Sciences</level><level id="2">School of Management - Accounting and Finance</level></path><authors><author><firstname>Fan</firstname><surname>Yang</surname><orcid>0000-0003-1842-1084</orcid><order>1</order></author><author><firstname>Yanan</firstname><surname>Qiao</surname><orcid>0000-0002-5739-355x</orcid><order>2</order></author><author><firstname>Petr</firstname><surname>Hajek</surname><orcid>0000-0001-5579-1215</orcid><order>3</order></author><author><firstname>Mohammad</firstname><surname>Abedin</surname><orcid>0000-0002-4688-0619</orcid><order>4</order></author></authors><documents><document><filename>67523__31447__95b1a0a6699e460eb261795f7bece18c.pdf</filename><originalFilename>67523.VOR.pdf</originalFilename><uploaded>2024-09-24T09:49:22.7080309</uploaded><type>Output</type><contentLength>3847182</contentLength><contentType>application/pdf</contentType><version>Version of Record</version><cronfaStatus>true</cronfaStatus><documentNotes>© 2024 The Author(s). This is an open access article under the CC BY license</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>http://creativecommons.org/licenses/by/4.0/</licence></document></documents><OutputDurs/></rfc1807>
title	Enhancing cardiovascular risk assessment with advanced data balancing and domain knowledge-driven explainability
author_id_str_mv	4ed8c020eae0c9bec4f5d9495d86d415
author_id_fullname_str_mv	4ed8c020eae0c9bec4f5d9495d86d415_***_Mohammad Abedin
author	Mohammad Abedin
author2	Fan Yang Yanan Qiao Petr Hajek Mohammad Abedin
format	Journal article
container_title	Expert Systems with Applications
container_volume	255
container_start_page	124886
publishDate	2024
institution	Swansea University
issn	0957-4174
doi_str_mv	10.1016/j.eswa.2024.124886
publisher	Elsevier BV
college_str	Faculty of Humanities and Social Sciences
hierarchytype
hierarchy_top_id	facultyofhumanitiesandsocialsciences
hierarchy_top_title	Faculty of Humanities and Social Sciences
hierarchy_parent_id	facultyofhumanitiesandsocialsciences
hierarchy_parent_title	Faculty of Humanities and Social Sciences
department_str	School of Management - Accounting and Finance{{{_:::_}}}Faculty of Humanities and Social Sciences{{{_:::_}}}School of Management - Accounting and Finance
document_store_str	1
active_str	0
description	In medical risk prediction, such as predicting heart disease, machine learning (ML) classifiers must achieve high accuracy, precision, and recall to minimize the chances of incorrect diagnoses or treatment recommendations. However, real-world datasets often have imbalanced data, which can affect classifier performance. Traditional data balancing methods can lead to overfitting and underfitting, making it difficult to identify potential health risks accurately. Early prediction of heart attacks is of paramount importance, and researchers have developed ML-based systems to address this problem. However, much of the existing ML research is based on a single dataset, often ignoring performance evaluation across multiple datasets. As the demand for interpretable ML models grows, model interpretability becomes central to revealing insights and feature effects within predictive models. To address these challenges, we present a novel data balancing technique that uses a divide-and-conquer strategy with the K-Means clustering algorithm to segment the dataset. The performance of our approach is highlighted through comparisons with established techniques, which demonstrate the superiority of our proposed method. To address the challenge of inter-dataset discrepancies, we use two different datasets. Our holistic pipeline, strengthened by the innovative balancing technique, effectively addresses performance discrepancies, culminating in a significant improvement from 81% to 90%. Furthermore, through advanced statistical analysis, it has been determined that the 95% confidence interval for the AUC metric of our method ranges from 0.8187 to 0.8411. This observation serves to underscore the consistency and reliability of our approach, demonstrating its ability to achieve high performance across a range of scenarios. Incorporating Explainable AI (XAI), we examine the feature rankings and their contributions within the best performing Random Forest model. While the domain expert feedback is consistent with the explanatory power of XAI, some differences remain. Nevertheless, a remarkable convergence in feature ranking and weighting is observed, bridging the insights from XAI tools and domain expert perspectives.
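
Illustrative sketch (not part of the record): the description mentions a divide-and-conquer balancing strategy that segments the dataset with K-Means. The record does not say how each segment is then balanced, so the Python below is only a rough sketch under stated assumptions: it clusters the samples with a plain Lloyd's K-Means and randomly oversamples the minority class inside each cluster. The function names kmeans and balance_by_cluster, the per-cluster random oversampling, and all parameters are hypothetical and are not the authors' published method.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def balance_by_cluster(points, labels, k=2, seed=0):
    """ASSUMPTION: balance by random oversampling inside each K-Means cluster."""
    rng = random.Random(seed)
    assign = kmeans(points, k, seed=seed)
    out_x, out_y = list(points), list(labels)
    for c in range(k):
        by_class = defaultdict(list)
        for p, y, a in zip(points, labels, assign):
            if a == c:
                by_class[y].append(p)
        if len(by_class) < 2:
            continue  # a single-class cluster has nothing to balance against
        target = max(len(v) for v in by_class.values())
        for y, members in by_class.items():
            # Duplicate random minority samples until the class reaches target.
            for _ in range(target - len(members)):
                out_x.append(rng.choice(members))
                out_y.append(y)
    return out_x, out_y
```

Calling balance_by_cluster(points, labels, k=2) on a list of feature tuples and a parallel list of class labels returns the original samples followed by the duplicated minority samples. Balancing inside clusters rather than globally keeps the added samples close to the local structure of the data, which is one plausible reading of the divide-and-conquer idea.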
published_date	2024-12-01T06:23:26Z
_version_	1867858005040562176
score	11.108426

Similar Items