Journal article

Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis

Shang-ming Zhou, Fabiola Fernandez-Gutierrez, Jonathan Kennedy, Roxanne Cooksey, Mark Atkinson, Spiros Denaxas, Stefan Siebert, William G. Dixon, Terence W. O’Neill, Ernest Choy, Cathie Sudlow, Sinead Brophy, (UK Biobank Follow-up and Outcomes Group)

PLOS ONE, Volume: 11, Issue: 5, Start page: e0154515

Swansea University Authors: Shang-ming Zhou, Mark Atkinson, Sinead Brophy

  • ZhouDefiningdiseaseVoR.PDF

    PDF | Version of Record

    © 2016 Zhou et al. This is an open access article distributed under the terms of the Creative Commons Attribution License CC-BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Abstract

Objectives: 1) To use a data-driven method to examine clinical codes (risk factors) of a medical condition in primary care electronic health records (EHRs) that can accurately predict a diagnosis of the condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheu...

Published in: PLOS ONE
ISSN: 1932-6203
Published: Public Library of Science (PLoS) 2016

URI: https://cronfa.swan.ac.uk/Record/cronfa27734
first_indexed 2016-05-07T01:08:30Z
last_indexed 2023-01-31T03:35:10Z
id cronfa27734
recordtype SURis
fullrecord <?xml version="1.0"?><rfc1807><datestamp>2023-01-30T16:04:23.7152909</datestamp><bib-version>v2</bib-version><id>27734</id><entry>2016-05-06</entry><title>Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis</title><swanseaauthors><author><sid>118578a62021ba8ef61398da0a8750da</sid><ORCID>0000-0002-0719-9353</ORCID><firstname>Shang-ming</firstname><surname>Zhou</surname><name>Shang-ming Zhou</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>8f85ae301cc97a48eaf58fe343c5a797</sid><ORCID>0000-0003-4237-3588</ORCID><firstname>Mark</firstname><surname>Atkinson</surname><name>Mark Atkinson</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>84f5661b35a729f55047f9e793d8798b</sid><ORCID>0000-0001-7417-2858</ORCID><firstname>Sinead</firstname><surname>Brophy</surname><name>Sinead Brophy</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2016-05-06</date><deptcode>BMS</deptcode><abstract>Objectives1) To use data-driven method to examine clinical codes (risk factors) of a medical condition in primary care electronic health records (EHRs) that can accurately predict a diagnosis of the condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheumatoid arthritis using primary care EHRs.MethodsThis study linked routine primary and secondary care EHRs in Wales, UK. 
A machine learning based scheme was used to identify patients with rheumatoid arthritis from primary care EHRs via the following steps: i) selection of variables by comparing relative frequencies of Read codes in the primary care dataset associated with disease case compared to non-disease control (disease/non-disease based on the secondary care diagnosis); ii) reduction of predictors/associated variables using a Random Forest method, iii) induction of decision rules from decision tree model. The proposed method was then extensively validated on an independent dataset, and compared for performance with two existing deterministic algorithms for RA which had been developed using expert clinical knowledge.ResultsPrimary care EHRs were available for 2,238,360 patients over the age of 16 and of these 20,667 were also linked in the secondary care rheumatology clinical system. In the linked dataset, 900 predictors (out of a total of 43,100 variables) in the primary care record were discovered more frequently in those with versus those without RA. These variables were reduced to 37 groups of related clinical codes, which were used to develop a decision tree model. The final algorithm identified 8 predictors related to diagnostic codes for RA, medication codes, such as those for disease modifying anti-rheumatic drugs, and absence of alternative diagnoses such as psoriatic arthritis. 
The proposed data-driven method performed as well as the expert clinical knowledge based methods.ConclusionData-driven scheme, such as ensemble machine learning methods, has the potential of identifying the most informative predictors in a cost-effective and rapid way to accurately and reliably classify rheumatoid arthritis or other complex medical conditions in primary care EHRs.</abstract><type>Journal Article</type><journal>PLOS ONE</journal><volume>11</volume><journalNumber>5</journalNumber><paginationStart>e0154515</paginationStart><paginationEnd/><publisher>Public Library of Science (PLoS)</publisher><placeOfPublication/><isbnPrint/><isbnElectronic/><issnPrint/><issnElectronic>1932-6203</issnElectronic><keywords/><publishedDay>31</publishedDay><publishedMonth>12</publishedMonth><publishedYear>2016</publishedYear><publishedDate>2016-12-31</publishedDate><doi>10.1371/journal.pone.0154515</doi><url/><notes/><college>COLLEGE NANME</college><department>Biomedical Sciences</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>BMS</DepartmentCode><institution>Swansea University</institution><degreesponsorsfunders>RCUK, MR/K006525/1</degreesponsorsfunders><apcterm/><funders/><projectreference/><lastEdited>2023-01-30T16:04:23.7152909</lastEdited><Created>2016-05-06T10:11:18.3605508</Created><path><level id="1">Faculty of Medicine, Health and Life Sciences</level><level id="2">Swansea University Medical School - Health Data 
Science</level></path><authors><author><firstname>Shang-ming</firstname><surname>Zhou</surname><orcid>0000-0002-0719-9353</orcid><order>1</order></author><author><firstname>Fabiola</firstname><surname>Fernandez-Gutierrez</surname><order>2</order></author><author><firstname>Jonathan</firstname><surname>Kennedy</surname><order>3</order></author><author><firstname>Roxanne</firstname><surname>Cooksey</surname><order>4</order></author><author><firstname>Mark</firstname><surname>Atkinson</surname><orcid>0000-0003-4237-3588</orcid><order>5</order></author><author><firstname>Spiros</firstname><surname>Denaxas</surname><order>6</order></author><author><firstname>Stefan</firstname><surname>Siebert</surname><order>7</order></author><author><firstname>William G.</firstname><surname>Dixon</surname><order>8</order></author><author><firstname>Terence W.</firstname><surname>O&#x2019;Neill</surname><order>9</order></author><author><firstname>Ernest</firstname><surname>Choy</surname><order>10</order></author><author><firstname>Cathie</firstname><surname>Sudlow</surname><order>11</order></author><author><firstname>Sinead</firstname><surname>Brophy</surname><orcid>0000-0001-7417-2858</orcid><order>12</order></author><author><firstname>(UK Biobank Follow-up and Outcomes</firstname><surname>Group)</surname><order>13</order></author></authors><documents><document><filename>0027734-14072016180208.PDF</filename><originalFilename>ZhouDefiningdiseaseVoR.PDF</originalFilename><uploaded>2016-07-14T18:02:08.0000000</uploaded><type>Output</type><contentLength>1157217</contentLength><contentType>application/pdf</contentType><version>Version of Record</version><cronfaStatus>true</cronfaStatus><embargoDate>2016-07-14T00:00:00.0000000</embargoDate><documentNotes>&#xA9; 2016 Zhou et al. 
This is an open access article distributed under the terms of the Creative Commons Attribution License CC-BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/</licence></document></documents><OutputDurs/></rfc1807>
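The rfc1807 record above can be read with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the `RECORD` string is an abbreviated copy made for illustration (a few fields and only the first two authors), not the full record, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative copy of the record: selected fields only,
# and just the first two entries of the <authors> list.
RECORD = """<rfc1807>
  <id>27734</id>
  <title>Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis</title>
  <journal>PLOS ONE</journal>
  <doi>10.1371/journal.pone.0154515</doi>
  <authors>
    <author><firstname>Shang-ming</firstname><surname>Zhou</surname><order>1</order></author>
    <author><firstname>Fabiola</firstname><surname>Fernandez-Gutierrez</surname><order>2</order></author>
  </authors>
</rfc1807>"""

root = ET.fromstring(RECORD)

# Simple scalar fields are direct children of the root element.
title = root.findtext("title")
doi = root.findtext("doi")

# <authors> holds the full author list; sort by the numeric <order> field.
authors = sorted(
    root.find("authors").findall("author"),
    key=lambda a: int(a.findtext("order")),
)
names = [f"{a.findtext('firstname')} {a.findtext('surname')}" for a in authors]

print(title)
print(doi)
print(names)
```

Note that the full record contains two author lists: `<swanseaauthors>` (institutional authors only) and `<authors>` (all authors in publication order). The sketch reads the latter.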
title Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis
author_id_str_mv 118578a62021ba8ef61398da0a8750da
8f85ae301cc97a48eaf58fe343c5a797
84f5661b35a729f55047f9e793d8798b
author_id_fullname_str_mv 118578a62021ba8ef61398da0a8750da_***_Shang-ming Zhou
8f85ae301cc97a48eaf58fe343c5a797_***_Mark Atkinson
84f5661b35a729f55047f9e793d8798b_***_Sinead Brophy
author Shang-ming Zhou
Mark Atkinson
Sinead Brophy
author2 Shang-ming Zhou
Fabiola Fernandez-Gutierrez
Jonathan Kennedy
Roxanne Cooksey
Mark Atkinson
Spiros Denaxas
Stefan Siebert
William G. Dixon
Terence W. O’Neill
Ernest Choy
Cathie Sudlow
Sinead Brophy
(UK Biobank Follow-up and Outcomes Group)
format Journal article
container_title PLOS ONE
container_volume 11
container_issue 5
container_start_page e0154515
publishDate 2016
institution Swansea University
issn 1932-6203
doi_str_mv 10.1371/journal.pone.0154515
publisher Public Library of Science (PLoS)
college_str Faculty of Medicine, Health and Life Sciences
hierarchytype
hierarchy_top_id facultyofmedicinehealthandlifesciences
hierarchy_top_title Faculty of Medicine, Health and Life Sciences
hierarchy_parent_id facultyofmedicinehealthandlifesciences
hierarchy_parent_title Faculty of Medicine, Health and Life Sciences
department_str Swansea University Medical School - Health Data Science{{{_:::_}}}Faculty of Medicine, Health and Life Sciences{{{_:::_}}}Swansea University Medical School - Health Data Science
document_store_str 1
active_str 0
description Objectives: 1) To use a data-driven method to examine clinical codes (risk factors) of a medical condition in primary care electronic health records (EHRs) that can accurately predict a diagnosis of the condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheumatoid arthritis (RA) using primary care EHRs.
Methods: This study linked routine primary and secondary care EHRs in Wales, UK. A machine learning based scheme was used to identify patients with rheumatoid arthritis from primary care EHRs via the following steps: i) selection of variables by comparing the relative frequencies of Read codes in the primary care dataset in disease cases versus non-disease controls (disease/non-disease based on the secondary care diagnosis); ii) reduction of predictors/associated variables using a Random Forest method; iii) induction of decision rules from a decision tree model. The proposed method was then extensively validated on an independent dataset, and its performance was compared with that of two existing deterministic algorithms for RA which had been developed using expert clinical knowledge.
Results: Primary care EHRs were available for 2,238,360 patients over the age of 16, and of these 20,667 were also linked in the secondary care rheumatology clinical system. In the linked dataset, 900 predictors (out of a total of 43,100 variables) in the primary care record were discovered more frequently in those with versus those without RA. These variables were reduced to 37 groups of related clinical codes, which were used to develop a decision tree model. The final algorithm identified 8 predictors related to diagnostic codes for RA, medication codes, such as those for disease modifying anti-rheumatic drugs, and absence of alternative diagnoses such as psoriatic arthritis. The proposed data-driven method performed as well as the expert clinical knowledge based methods.
Conclusion: Data-driven schemes, such as ensemble machine learning methods, have the potential to identify the most informative predictors in a cost-effective and rapid way to accurately and reliably classify rheumatoid arthritis or other complex medical conditions in primary care EHRs.
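Step i of the method described in the abstract (comparing relative frequencies of codes in cases versus controls) can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_codes`, the `min_ratio` threshold, and the code labels in the toy data are all hypothetical, and the abstract does not state what selection criterion was actually used. Steps ii and iii (Random Forest reduction and decision tree rule induction) are not shown.

```python
from collections import Counter

def select_codes(case_records, control_records, min_ratio=2.0):
    """Return codes that are relatively more frequent in cases than controls.

    Each record is the list of codes seen for one patient. A code counts
    once per patient. `min_ratio` is an assumed, illustrative threshold.
    """
    case_counts = Counter(c for codes in case_records for c in set(codes))
    control_counts = Counter(c for codes in control_records for c in set(codes))
    n_cases, n_controls = len(case_records), len(control_records)

    selected = []
    for code, count in case_counts.items():
        case_freq = count / n_cases
        control_freq = control_counts.get(code, 0) / n_controls
        # Keep codes never seen in controls; otherwise compare the ratio
        # of relative frequencies against the threshold.
        if control_freq == 0 or case_freq / control_freq >= min_ratio:
            selected.append(code)
    return sorted(selected)

# Hypothetical toy data with made-up code labels (not real Read codes).
cases = [["RA_DX", "DMARD"], ["RA_DX", "NSAID"], ["DMARD", "NSAID"]]
controls = [["NSAID"], ["NSAID", "PSA_DX"], ["STATIN"], ["NSAID"]]

selected = select_codes(cases, controls)
print(selected)
```

In this toy data the NSAID label appears in roughly the same share of cases and controls, so it is filtered out, while the two labels seen only in cases are retained as candidate predictors.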
published_date 2016-12-31T03:33:42Z
_version_ 1763751412350582784
score 11.030209