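Cite this article: HUANG Yifan, LIU Hong, XIONG Honglin, FEI Zhewei, FAN Chongjun. Coronary Heart Disease Risk Prediction Analysis Based on Jieba-TF-IDF-GBT Model[J]. Software Engineering, 2025, 28(10): 52-57.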
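Coronary Heart Disease Risk Prediction Analysis Based on the Jieba-TF-IDF-GBT Model
HUANG Yifan1, LIU Hong1,2, XIONG Honglin3,4, FEI Zhewei2, FAN Chongjun1
(1. Business School, University of Shanghai for Science & Technology, Shanghai 200093;
2. Information Management Section, Shanghai University of Medicine & Health Sciences Affiliated Chongming Hospital, Shanghai 202150;
3. Collaborative Innovation Center for Biomedicine, Shanghai University of Medicine & Health Sciences, Shanghai 201318;
4. Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200030)
huangyifan8426@163.com; liuh@sumhs.edu.cn; honyex@126.com; zheweifei@163.com; fan.chongjun@163.com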
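Abstract: Few studies predict disease onset risk from complete coronary heart disease (CHD) electronic medical records (EMRs). To address this gap, this work takes a document-level perspective and combines the word segmentation tool Jieba, Term Frequency-Inverse Document Frequency (TF-IDF) text vectorization, and the Gradient Boosting Tree (GBT) binary classifier into Jieba-TF-IDF-GBT, a disease onset risk prediction model that can analyze complete CHD EMRs. On a real EMR dataset the model achieves an F1 score of 0.976, and an F1 score of 0.977 in the "label reversal" experiment. The experiments indicate that the model performs well and can be applied to onset risk prediction for other diseases, offering a useful reference for machine-learning-based disease prediction.
Keywords: natural language processing  coronary heart disease  electronic medical records  disease prediction
CLC number:     Document code: A
Funding: Shanghai University of Medicine & Health Sciences Local High-Level University Construction Project (22MC2022001); Shanghai Key Discipline of Public Health Project (GWVI-11.1-490); Shanghai Three-Year Action Plan for Public Health System Construction (GWVI-6); Shanghai Municipal Health Commission 2024 Health Policy Research Project (2024HP72); Research on Key Technologies for Typical Chronic Disease Management Applications Based on Machine Learning from a Data-Element Perspective: A Case Study of Chongming District (CKY2024-65)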
Coronary Heart Disease Risk Prediction Analysis Based on Jieba-TF-IDF-GBT Model
HUANG Yifan1, LIU Hong1,2, XIONG Honglin3,4, FEI Zhewei2, FAN Chongjun1
(1.Business School, University of Shanghai for Science & Technology, Shanghai 200093, China;
2.Information Management Section, Shanghai University of Medicine & Health Sciences Affiliated Chongming Hospital, Shanghai 202150, China;
3.Collaborative Innovation Center for Biomedicine, Shanghai University of Medicine & Health Sciences, Shanghai 201318, China;
4.Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200030, China)
Abstract: Addressing the scarcity of research utilizing complete electronic medical records (EMRs) for coronary heart disease (CHD) risk prediction, this study adopts a document-level approach. By integrating the Jieba word segmentation tool, Term Frequency-Inverse Document Frequency (TF-IDF) text vectorization, and the Gradient Boosting Tree (GBT) binary classification model, we propose the Jieba-TF-IDF-GBT framework for analyzing comprehensive CHD EMRs to predict disease risk. The model demonstrates outstanding performance on real-world EMR datasets, achieving an F1-score of 0.976 and an F1-score of 0.977 in "label-reversal" experiments. Results confirm its superior predictive capability and generalizability to other disease risk prediction studies, offering valuable insights for machine learning-based medical forecasting.
Keywords: natural language processing  coronary heart disease  electronic medical records  disease prediction
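
The three stages named in the abstract (Jieba segmentation, TF-IDF vectorization, GBT binary classification) can be illustrated with a minimal sketch. It assumes scikit-learn's `TfidfVectorizer` and `GradientBoostingClassifier` as stand-ins. The abstract does not say which implementation, features, or hyperparameters the authors used, and the six short "records" below are invented toy strings, not the paper's data.

```python
# Illustrative sketch only: toy data and hyperparameters are assumptions,
# not the paper's dataset or settings.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

try:
    import jieba  # Chinese word segmentation

    def tokenize(text):
        # Segment a whole record into words, dropping whitespace-only tokens.
        return [tok for tok in jieba.lcut(text) if tok.strip()]
except ImportError:
    # Fallback so the sketch still runs without jieba: one token per character.
    def tokenize(text):
        return [ch for ch in text if ch.strip()]

# Hypothetical miniature "records"; 1 = CHD, 0 = non-CHD.
records = [
    "患者胸痛三天,心电图提示心肌缺血",
    "反复胸闷气促,冠状动脉造影显示狭窄",
    "活动后胸痛,既往高血压病史",
    "咳嗽咳痰一周,胸片提示肺部感染",
    "右下腹疼痛,超声提示阑尾炎",
    "发热伴咽痛两天,扁桃体肿大",
]
labels = [1, 1, 1, 0, 0, 0]

# Document-level pipeline: each full record becomes one TF-IDF vector,
# which the gradient boosting classifier maps to a binary label.
model = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, token_pattern=None, lowercase=False),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
model.fit(records, labels)
predictions = model.predict(records)
```

Here the model is fit and evaluated on the same toy strings purely to show the data flow. A real study would hold out test records and report F1 on those, as the abstract does.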


