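Abstract: With the arrival of the digital era of Internet-based health care, health data are growing massively. To address the terminology standardization problem for heterogeneous data in medical data integration applications, a technique is proposed that uses PubMedBERT to compute semantic similarity and thereby align medical terms. It uses a model pre-trained on the medical domain, combines it with an abbreviation expansion method to enhance semantic information, and is compared with traditional similarity calculation models and with BERT (Bidirectional Encoder Representations from Transformers) and its variants. Experiments on the test corpus show that, after abbreviation expansion, the TOP1 accuracy of the PubMedBERT pre-trained model improves by 18.79%, and the TOP1, TOP3, TOP5, and TOP10 accuracies of the PubMedBERT model reach 78.49%, 85.69%, 87.44%, and 89.54%, respectively, better than the other models compared. The method can provide an intelligent solution for medical term alignment.
Keywords: semantic similarity; term alignment; abbreviation expansion; PubMedBERT
CLC number: TP391.1
Document code: A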
Research on Medical Term Alignment Method Based on PubMedBERT Pre-training Model
WANG Yiru, ZHENG Jianli, ZHOU Haoran
(School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
e_wangyiru@163.com; zhengjianli163@163.com; zhouhaoran1908@163.com
Abstract: With the arrival of the digital era of Internet-based health care, health data are growing massively. To address the terminology standardization problem for heterogeneous data in medical data integration applications, this paper proposes a method that uses PubMedBERT to compute semantic similarity for medical term alignment. The method uses a model pre-trained on the medical domain and enhances semantic information through abbreviation expansion. It is compared with traditional similarity calculation models and with BERT (Bidirectional Encoder Representations from Transformers) and its variants. Experiments on the test corpus show that abbreviation expansion improves the TOP1 accuracy of the PubMedBERT pre-trained model by 18.79%, and that the TOP1, TOP3, TOP5, and TOP10 accuracies of the PubMedBERT model reach 78.49%, 85.69%, 87.44%, and 89.54%, respectively, outperforming the other models compared. The method can provide an intelligent solution for medical term alignment.
Keywords: semantic similarity; term alignment; abbreviation expansion; PubMedBERT
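
The pipeline summarized in the abstract (expand abbreviations, rank standard terms by semantic similarity, score by TOP-k accuracy) can be sketched roughly as follows. This is a minimal illustration only: a character-trigram count vector stands in for PubMedBERT embeddings so that the code runs without downloading a model, and the abbreviation table, term list, and all function names are hypothetical rather than taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation table; a real system would use a curated medical resource.
ABBREVIATIONS = {
    "mi": "myocardial infarction",
    "htn": "hypertension",
    "ckd": "chronic kidney disease",
}


def expand_abbreviations(term, table=ABBREVIATIONS):
    """Lowercase the term and replace each known abbreviation with its long form."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(table.get(tok, tok) for tok in tokens)


def embed(text):
    """Stand-in embedding (assumption): character-trigram counts, not PubMedBERT."""
    padded = f" {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, standard_terms, expand=True):
    """Return the standard terms sorted by decreasing similarity to the query."""
    text = expand_abbreviations(query) if expand else query.lower()
    query_vec = embed(text)
    scored = [(cosine(query_vec, embed(t.lower())), t) for t in standard_terms]
    return [t for _, t in sorted(scored, key=lambda pair: -pair[0])]


def top_k_accuracy(pairs, standard_terms, k, expand=True):
    """Fraction of (query, gold) pairs whose gold term is among the top k candidates."""
    hits = sum(
        gold in rank_candidates(query, standard_terms, expand)[:k]
        for query, gold in pairs
    )
    return hits / len(pairs)


STANDARD_TERMS = [
    "acute myocardial infarction",
    "hypertension",
    "chronic kidney disease",
    "migraine",
]
TEST_PAIRS = [
    ("acute MI", "acute myocardial infarction"),
    ("HTN", "hypertension"),
    ("CKD stage 3", "chronic kidney disease"),
]

if __name__ == "__main__":
    for k in (1, 3):
        print(k, top_k_accuracy(TEST_PAIRS, STANDARD_TERMS, k))
```

To approximate the setting described in the abstract, `embed` would be replaced by sentence vectors from a PubMedBERT checkpoint; the ranking and TOP-k scoring logic would stay the same.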