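Abstract: Traditional text similarity computation is based on the vector space model (VSM), in which a text is mapped to a vector of independent, mutually unrelated words. Long novels have more complex constituent elements and tighter contextual links than ordinary texts; traditional algorithms ignore the contextual links between terms and produce high-dimensional vectors, so their efficiency and accuracy are unsatisfactory. This paper therefore computes the similarity of long novels based on common word sets and applies a context-constraint check to those sets to obtain groups of closely related words, which serve as the main features of a novel. Experimental results show a large improvement for some types of novels.
Keywords: common word set; novel similarity; context constraint
CLC number: TP391.1
Document code: A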
Similarity of Long Novels Based on Common Word Sets
GUO Tao, BA Yuanjie, LI Shaoang
(School of Computer Science, Jilin University, Changchun 130012, China)
Abstract: Traditional text similarity computation is based on the Vector Space Model (VSM), in which a text is mapped to a vector of independent, unrelated words. Novels have more complex elements and much closer contextual links than ordinary texts, yet the traditional algorithm ignores the context of words and produces high-dimensional vectors, so its efficiency and accuracy are not ideal. This paper therefore calculates the similarity of novels based on the common word set and applies a context constraint check to the common word set to obtain a more closely related word set as the main feature of a novel. Experimental results show that the effect is greatly improved for some types of novels.
Keywords: common word set; novel similarity; context constraint
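
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the tokenizer, the form of the context constraint, or the scoring formula, so the whitespace tokenizer, the `window` parameter, the adjacency rule in `context_constrained`, and the Jaccard-style score in `similarity` are all assumptions made only for illustration.

```python
def tokenize(text):
    # Assumption: whitespace tokenization; a real system for Chinese novels
    # would need a word segmenter instead.
    return text.lower().split()


def common_words(tokens_a, tokens_b):
    # The common word set: words that occur in both texts.
    return set(tokens_a) & set(tokens_b)


def context_constrained(tokens, candidates, window=3):
    # Hypothetical context constraint: keep a candidate word only if some
    # occurrence of it lies within `window` tokens of the next occurrence
    # of a different candidate word in the same text.
    positions = [(i, t) for i, t in enumerate(tokens) if t in candidates]
    kept = set()
    for (i, t), (j, u) in zip(positions, positions[1:]):
        if t != u and j - i <= window:
            kept.update((t, u))
    return kept


def similarity(text_a, text_b, window=3):
    # Assumed score: constrained common words divided by the size of the
    # combined vocabulary (a Jaccard-style ratio).
    a, b = tokenize(text_a), tokenize(text_b)
    common = common_words(a, b)
    if not common:
        return 0.0
    kept = (context_constrained(a, common, window)
            & context_constrained(b, common, window))
    return len(kept) / len(set(a) | set(b))


if __name__ == "__main__":
    s1 = "the king drew his sword"
    s2 = "the king raised his sword high"
    print(similarity(s1, s2, window=2))
```

Unlike a VSM representation, this sketch never builds a vector over the full vocabulary: only the words shared by both texts are examined, and the window check discards shared words that appear in isolation from other shared words.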