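Abstract: Existing Wikipedia-based similarity measures require cumbersome preprocessing and heavy computation. To address this, this paper treats Wikipedia as an ontology, introduces feature-based word semantic computation, and proposes a fast Wikipedia-based method for computing word similarity. Exploiting the link structure of Wikipedia pages, the method takes a page's incoming and outgoing links as feature values to build a feature vector model, and computes the semantic similarity of two words from the correlation coefficient of the feature vectors of their corresponding pages. The paper also improves the Wikipedia disambiguation algorithm: when handling polysemous words, it reduces interference from sense pages with low social recognition, which further improves accuracy. Tests on the Miller & Charles (MC30) and Rubenstein & Goodenough (RG65) test sets show that the Wikipedia link feature approach is feasible for computing similarity, and support the soundness of the proposed computation strategy and the improved disambiguation algorithm.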
Keywords: semantic similarity; Wikipedia; link-based; feature-based
CLC number: TP391
Document code: A
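Funding: Guangxi Higher Education Scientific Research Project, "Research on an Educational Technology Standard Ontology Model Based on Description Logic" (Grant No. ZD2014129).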
A Semantic Similarity Calculation Based on the Features of Wikipedia Links
ZHANG Bo
(School of Mathematics & Computer Science, Hezhou University, Hezhou 542899, China)
Abstract: Measuring semantic similarity is a fundamental problem in natural language processing. Because Wikipedia is openly edited, covers a huge vocabulary, and is updated rapidly, a growing body of research and applications builds on it. This paper proposes a page-link approach to calculating word semantic similarity that uses Wikipedia as its data resource. The approach improves on the Wikipedia Link Vector Model (WLVM), which takes outgoing links as the feature vector, by using both the incoming and outgoing links of a page as feature values; it then calculates the semantic similarity between two words by measuring the similarity of the feature sets of their corresponding pages. The method also improves disambiguation page processing by reducing interference from pages with low social recognition. Tests on the Miller & Charles (MC30) and Rubenstein & Goodenough (RG65) benchmarks verify the validity of the method for measuring word semantic similarity.
Keywords: word similarity; Wikipedia; link-based; feature-based
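
The link-feature idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's actual formula: it assumes binary feature vectors and uses cosine similarity as a stand-in for the feature-vector correlation measure, and it assumes that incoming and outgoing links are kept as distinct features. The function names `link_features` and `similarity` and the toy link sets are hypothetical.

```python
import math


def link_features(in_links, out_links):
    """Build one feature set for a page from its incoming and outgoing links.

    Assumption: link direction is kept, so a page that appears as both an
    in-link and an out-link contributes two distinct features.
    """
    return {("in", p) for p in in_links} | {("out", p) for p in out_links}


def similarity(features_a, features_b):
    """Cosine similarity of two binary feature vectors stored as sets.

    Assumption: cosine similarity stands in for the paper's correlation measure.
    """
    if not features_a or not features_b:
        return 0.0
    shared = len(features_a & features_b)
    return shared / math.sqrt(len(features_a) * len(features_b))


# Toy, made-up link sets for two pages (not real Wikipedia data).
car = link_features({"Vehicle", "Road"}, {"Engine", "Wheel", "Vehicle"})
automobile = link_features({"Vehicle", "Road"}, {"Engine", "Wheel"})
forest = link_features({"Tree"}, {"Ecosystem"})

print(similarity(car, automobile))
print(similarity(car, forest))
```

In this toy data the two vehicle pages share most of their link features and score high, while the unrelated page shares none and scores zero. A disambiguation step, such as discarding sense pages with few incoming links, would run before this comparison; the exact criterion is not given in the abstract.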