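Abstract: Few studies predict disease onset risk from complete coronary heart disease (CHD) electronic medical records (EMRs). To address this gap, this paper takes a document-level approach and combines the word segmentation tool Jieba, the Term Frequency-Inverse Document Frequency (TF-IDF) text vectorization method, and the Gradient Boosting Tree (GBT) binary classifier into Jieba-TF-IDF-GBT, a disease onset risk prediction model that can analyze complete CHD EMRs. On a real EMR dataset, the model achieves an F1 score of 0.976, and an F1 score of 0.977 in the "label reversal" experiment. The experiments show that the model performs well and is applicable to risk prediction for other diseases, providing a useful reference for machine-learning-based disease prediction.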
Keywords: natural language processing; coronary heart disease; electronic medical records; disease prediction
CLC number:
Document code: A
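Funding: Local High-Level University Construction Project of Shanghai University of Medicine & Health Sciences (22MC2022001); Shanghai Key Discipline of Public Health Project (GWVI-11.1-490); Shanghai Three-Year Action Plan for Public Health System Construction (GWVI-6); 2024 Health Policy Research Project of the Shanghai Municipal Health Commission (2024HP72); Research on Key Technologies for Typical Chronic Disease Management Applications Based on Machine Learning Methods from a Data-Element Perspective: A Case Study of Chongming District (CKY2024-65)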
Coronary Heart Disease Risk Prediction Analysis Based on Jieba-TF-IDF-GBT Model
HUANG Yifan1, LIU Hong1,2, XIONG Honglin3,4, FEI Zhewei2, FAN Chongjun1
(1.Business School, University of Shanghai for Science & Technology, Shanghai 200093, China; 2.Information Management Section, Shanghai University of Medicine & Health Sciences Affiliated Chongming Hospital, Shanghai 202150, China; 3.Collaborative Innovation Center for Biomedicine, Shanghai University of Medicine & Health Sciences, Shanghai 201318, China; 4.Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200030, China)
huangyifan8426@163.com; liuh@sumhs.edu.cn; honyex@126.com; zheweifei@163.com; fan.chongjun@163.com
Abstract: To address the scarcity of research that uses complete electronic medical records (EMRs) for coronary heart disease (CHD) risk prediction, this study adopts a document-level approach. By integrating the Jieba word segmentation tool, Term Frequency-Inverse Document Frequency (TF-IDF) text vectorization, and the Gradient Boosting Tree (GBT) binary classification model, we propose the Jieba-TF-IDF-GBT model, which analyzes complete CHD EMRs to predict disease onset risk. On a real-world EMR dataset, the model achieves an F1-score of 0.976, and an F1-score of 0.977 in the "label reversal" experiment. The results indicate that the model performs well and can be applied to risk prediction for other diseases, offering a useful reference for machine learning-based disease prediction.
Keywords: natural language processing; coronary heart disease; electronic medical records; disease prediction
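
To make the three stages named in the abstract concrete (segmentation, TF-IDF vectorization, gradient-boosted classification, scored by F1), the following is a minimal pure-Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the toy records and labels are made up, the documents are given already segmented (the step Jieba would perform on raw Chinese text), the TF-IDF uses one common smoothed IDF variant, and the booster uses depth-1 regression stumps fitted to squared-error residuals rather than whatever GBT configuration the paper used. The function names `tfidf_vectors`, `fit_stump`, `fit_gbt`, `predict`, and `f1_score` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy records, already split into words. In the pipeline described
# above, this segmentation step would be performed by Jieba on the raw EMR text.
docs = [
    ["chest", "pain", "hypertension", "smoking"],
    ["chest", "tightness", "palpitations"],
    ["cough", "fever", "sore", "throat"],
    ["abdominal", "pain", "fever"],
]
labels = [1, 1, 0, 0]  # made-up labels: 1 = CHD record, 0 = other


def tfidf_vectors(token_docs):
    """Turn tokenised documents into TF-IDF rows (smoothed IDF variant)."""
    vocab = sorted({t for d in token_docs for t in d})
    n = len(token_docs)
    df = Counter(t for d in token_docs for t in set(d))  # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for d in token_docs:
        tf = Counter(d)
        rows.append([tf[t] / len(d) * idf[t] for t in vocab])
    return vocab, rows


def fit_stump(X, residuals):
    """Find the single-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lm, rm)
    if best is None:
        raise ValueError("no feature can split the data")
    return best[1:]


def fit_gbt(X, y, n_rounds=20, lr=0.5):
    """Boosting loop: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        pred = [p + lr * (lm if row[j] <= thr else rm) for row, p in zip(X, pred)]
    return base, lr, stumps


def predict(model, X):
    """Sum the stump outputs and threshold the score at 0.5."""
    base, lr, stumps = model
    out = []
    for row in X:
        score = base + sum(lr * (lm if row[j] <= thr else rm) for j, thr, lm, rm in stumps)
        out.append(1 if score >= 0.5 else 0)
    return out


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


vocab, X = tfidf_vectors(docs)
model = fit_gbt(X, labels)
print(f1_score(labels, predict(model, X)))
```

The final line scores the model on its own training records, which only checks that the pieces fit together; it says nothing about the F1 values reported above, which come from the authors' real EMR data. In practice one would more likely rely on established library implementations of each stage than on hand-written versions like these.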