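Abstract: This paper applies the BERTopic model to mine topics from popular-science medical articles on the Haodf.com (好大夫在线) platform, with the aim of helping patients retrieve medical information more efficiently and helping medical practitioners track how medical topics evolve, thereby supporting progress in healthcare. Because medical texts are information-dense and highly specialized, the study combines data preprocessing, the pre-trained embedding model ERNIE-Health, and careful tuning of model parameters to address the limitations of the traditional LDA (Latent Dirichlet Allocation) model on medical text processing tasks. Experimental results show that BERTopic identifies 220 topics; evaluated with the OCTIS (Optimizing and Comparing Topic models Is Simple) framework, it achieves a topic diversity score of 0.662 and a coherence score of 0.991, significantly improving the accuracy and reliability of topic mining. The study is of considerable significance for in-depth knowledge mining in medical big data.
Keywords: BERTopic; medical science popularization; topic mining; topic modeling; natural language processing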
CLC number: TP391
Document code: A
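Funding: General Project of Humanities and Social Sciences Research of the Ministry of Education (23YJCZH281); Shanghai Philosophy and Social Science Planning Project (2022ZGL010); Open Project of the Key Laboratory of Information Network Security, Ministry of Public Security (C23600)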
BERTopic for Topic Mining in Medical Articles: Applications and Analysis
SONG Junjie¹, YIN Pei¹,², DENG Shiyu¹, YUAN Yixin¹
(1. Business School, University of Shanghai for Science and Technology, Shanghai 200093, China; 2. School of Intelligent Emergency Management, University of Shanghai for Science and Technology, Shanghai 200093, China)
sjj1622829175@163.com; pyin@usst.edu.cn; dengshiyu982460692@163.com; 18874666376@163.com
Abstract: This study uses the BERTopic model to mine topics from popular-science medical articles on the "Haodf.com" platform. The aim is to make medical information retrieval more efficient for patients and to help healthcare professionals track how medical topics develop, thereby supporting progress in healthcare. To handle the large volume and domain-specific terminology of medical texts, the study combines data preprocessing, the pre-trained embedding model ERNIE-Health, and careful tuning of model parameters, addressing the limitations of the traditional LDA (Latent Dirichlet Allocation) model on medical text processing tasks. Experimental results show that the BERTopic model identifies 220 topics. Evaluated with the OCTIS (Optimizing and Comparing Topic models Is Simple) framework, the model achieves a topic diversity score of 0.662 and a coherence score of 0.991, significantly improving the accuracy and reliability of topic mining. This research is of substantial importance for in-depth knowledge discovery in medical big data.
Keywords: BERTopic; medical science popularization; topic mining; topic modeling; natural language processing
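
To make the reported topic diversity score easier to interpret, the following is a minimal sketch of how such a score is commonly defined: the fraction of unique words among the top-k words of all topics, so that 1.0 means no topic shares a top word with another. The function name `topic_diversity`, the `top_k` parameter, and the toy topics are illustrative assumptions, not the authors' code or data; the sketch also assumes every topic lists at least `top_k` words.

```python
from typing import List


def topic_diversity(topics: List[List[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    Illustrative helper (hypothetical name), assuming each topic is a list
    of words ordered by importance and has at least top_k entries.
    """
    if not topics:
        raise ValueError("at least one topic is required")
    # Collect the top_k words of each topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    # Unique words divided by total words: 1.0 means no overlap at all.
    return len(set(top_words)) / len(top_words)


# Toy example (made-up topics): "blood" appears in both topics,
# so 5 of the 6 top words are unique.
toy_topics = [
    ["heart", "blood", "pressure"],
    ["blood", "sugar", "insulin"],
]
print(topic_diversity(toy_topics, top_k=3))
```

Under this definition, a score of 0.662 would mean that roughly two thirds of the top words across all topics are distinct, with the remainder shared between topics.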