Software Engineering

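Cite this article:

HUANG Yifan, LIU Hong, XIONG Honglin, FEI Zhewei, FAN Chongjun. Coronary Heart Disease Risk Prediction Analysis Based on Jieba-TF-IDF-GBT Model[J]. Software Engineering, 2025, 28(10): 52-57.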

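Coronary Heart Disease Risk Prediction Analysis Based on Jieba-TF-IDF-GBT Model

HUANG Yifan¹, LIU Hong¹,², XIONG Honglin³,⁴, FEI Zhewei², FAN Chongjun¹

(1. Business School, University of Shanghai for Science & Technology, Shanghai 200093, China;
2. Information Management Section, Shanghai University of Medicine & Health Sciences Affiliated Chongming Hospital, Shanghai 202150, China;
3. Collaborative Innovation Center for Biomedicine, Shanghai University of Medicine & Health Sciences, Shanghai 201318, China;
4. Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200030, China)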
huangyifan8426@163.com; liuh@sumhs.edu.cn; honyex@126.com; zheweifei@163.com; fan.chongjun@163.com

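Abstract: Few studies predict disease risk from complete coronary heart disease (CHD) electronic medical records (EMRs). To address this gap, this study takes a document-level approach and combines the Jieba word segmentation tool, Term Frequency-Inverse Document Frequency (TF-IDF) text vectorization, and the Gradient Boosting Tree (GBT) binary classifier into Jieba-TF-IDF-GBT, a disease risk prediction model that can analyze complete CHD EMRs. On a real EMR dataset the model achieves an F1-score of 0.976, and an F1-score of 0.977 in the “label reversal” experiment. The experiments show that the model performs well and is applicable to risk prediction for other diseases, offering a useful reference for machine-learning-based disease prediction.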

Keywords: natural language processing; coronary heart disease; electronic medical records; disease prediction

CLC number: Document code: A

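Funding: Shanghai University of Medicine & Health Sciences Local High-Level University Construction Project (22MC2022001); Shanghai Key Discipline of Public Health Project (GWVI-11.1-490); Shanghai Three-Year Action Plan for Public Health System Construction Project (GWVI-6); Shanghai Municipal Health Commission 2024 Health Policy Research Project (2024HP72); Research on Key Technologies for Typical Chronic Disease Management Applications Based on Machine Learning Methods from the Perspective of Data Elements: A Case Study of Chongming District (CKY2024-65)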

Coronary Heart Disease Risk Prediction Analysis Based on Jieba-TF-IDF-GBT Model

HUANG Yifan¹, LIU Hong¹,², XIONG Honglin³,⁴, FEI Zhewei², FAN Chongjun¹

(1.Business School, University of Shanghai for Science & Technology, Shanghai 200093, China;
2.Information Management Section, Shanghai University of Medicine & Health Sciences Affiliated Chongming Hospital, Shanghai 202150, China;
3.Collaborative Innovation Center for Biomedicine, Shanghai University of Medicine & Health Sciences, Shanghai 201318, China;
4.Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200030, China)

Abstract: Addressing the scarcity of research utilizing complete electronic medical records (EMRs) for coronary heart disease (CHD) risk prediction, this study adopts a document-level approach. By integrating the Jieba word segmentation tool, Term Frequency-Inverse Document Frequency (TF-IDF) text vectorization, and the Gradient Boosting Tree (GBT) binary classification model, we propose the Jieba-TF-IDF-GBT framework for analyzing comprehensive CHD EMRs to predict disease risk. The model demonstrates outstanding performance on real-world EMR datasets, achieving an F1-score of 0.976 and an F1-score of 0.977 in “label reversal” experiments. Results confirm its superior predictive capability and generalizability to other disease risk prediction studies, offering valuable insights for machine learning-based medical forecasting.

Keywords: natural language processing; coronary heart disease; electronic medical records; disease prediction
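
The three-stage pipeline named in the abstract (Jieba segmentation, TF-IDF vectorization, GBT binary classification) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation. The abstract does not say which libraries or hyperparameters were used, so the use of scikit-learn, the `tokenize` helper, the toy records, and every parameter value are assumptions. If `jieba` is not installed, the sketch falls back to one token per character so that it still runs.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

try:
    import jieba  # Chinese word segmentation

    def tokenize(text):
        # jieba.lcut returns the segmented words as a list; drop whitespace tokens
        return [tok for tok in jieba.lcut(text) if tok.strip()]
except ImportError:
    def tokenize(text):
        # Fallback (assumption, not part of the paper): one token per character
        return [ch for ch in text if ch.strip()]

# Hypothetical toy records for illustration only; NOT data from the paper.
# Label 1 = CHD-related record, 0 = other record.
records = [
    "患者胸痛三天，心电图提示心肌缺血",
    "患者活动后胸闷气短，冠脉造影示狭窄",
    "患者咳嗽发热两天，胸片提示肺部感染",
    "患者右膝关节疼痛，无胸闷胸痛",
] * 5
labels = [1, 1, 0, 0] * 5

model = Pipeline([
    # token_pattern=None because a custom tokenizer replaces the default regex
    ("tfidf", TfidfVectorizer(tokenizer=tokenize, token_pattern=None, lowercase=False)),
    # n_estimators and random_state are illustrative choices, not the paper's settings
    ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

model.fit(records, labels)
pred = model.predict(records)

# F1 on the training records only; a real evaluation needs a held-out test split
train_f1 = f1_score(labels, pred)
print(f"training F1: {train_f1:.3f}")
```

Wrapping the vectorizer and classifier in a single `Pipeline` keeps the TF-IDF vocabulary fitted only on the data passed to `fit`, which would matter once a proper train/test split is introduced.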
