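Abstract: Applying text mining techniques to sports hot-topic analysis can provide more useful information for the development of the sports field. This paper proposes a Chinese text keyword extraction method based on TF-IDF (Term Frequency-Inverse Document Frequency) and TextRank. The method first preprocesses the text through word segmentation and stop-word removal. It then uses the TF-IDF algorithm to compute the importance of each word and normalizes the values; in parallel, it uses the TextRank algorithm to weigh the relationships between words, computes a score for each word, and normalizes those scores. Finally, it takes a weighted sum of the TF-IDF values and the TextRank scores to obtain a combined weight for each word and returns the N keywords with the highest weights. In terms of F1 score, the combined TF-IDF and TextRank method achieved better results when 5 keywords were selected; compared with using only TF-IDF or only TextRank, its keyword extraction accuracy improved by about 40% and 32%, respectively. The method effectively improves the accuracy and efficiency of keyword extraction.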
Keywords: TF-IDF; TextRank; sports news; keyword extraction
CLC number: TP391.1
Document code: A
A Chinese Text Keyword Extraction Method Based on the Combination of TF-IDF and TextRank: A Case Study of Sports News
LAN Xiaofang1, LIU Zhuo2, XU Zhihao1, XIAO Yi2
(1.Oriental College of Science and Technology, Hunan Agricultural University, Changsha 410128, China; 2.College of Information and Intelligent, Hunan Agricultural University, Changsha 410128, China)
lanxf@stu.hunau.edu.cn; fetty_max@163.com; guapideyouxiang@stu.hunau.edu.cn; xiaoyi@hunau.edu.cn
Abstract: Using text mining techniques for sports hot-topic analysis can provide more useful information for the development of the sports field. This paper proposes a method for extracting keywords from Chinese text based on TF-IDF (Term Frequency-Inverse Document Frequency) and TextRank. The method first preprocesses the text by segmenting it into words and removing stop words. It then calculates the importance of each word using the TF-IDF algorithm and normalizes the values. In parallel, the TextRank algorithm is used to weigh the relationships between words and calculate a score for each word, and these scores are also normalized. Finally, the TF-IDF values and TextRank scores are combined in a weighted sum to obtain a comprehensive weight for each word, and the N keywords with the highest weights are returned. In terms of F1 score, the combined TF-IDF and TextRank method achieved better results when 5 keywords were selected; compared with using only TF-IDF or only TextRank, keyword extraction accuracy increased by about 40% and 32%, respectively. The method effectively improves the accuracy and efficiency of keyword extraction.
Keywords: TF-IDF; TextRank; sports news; keyword extraction
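
The pipeline summarized in the abstract (preprocessing, normalized TF-IDF, normalized TextRank, weighted sum, top-N selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions that the abstract does not specify: the input is taken to be already segmented into words, the IDF uses a smoothed variant, normalization is min-max, the co-occurrence window is 3, the damping factor is 0.85, and the weight `alpha` is 0.5. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def preprocess(tokens, stop_words):
    """Drop stop words from an already segmented token list (segmentation is assumed done)."""
    return [t for t in tokens if t not in stop_words]


def tf_idf(doc, corpus):
    """TF-IDF of each word in `doc` against `corpus` (a list of token lists)."""
    n_docs = len(corpus)
    scores = {}
    for word, count in Counter(doc).items():
        df = sum(1 for d in corpus if word in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF: an assumed variant
        scores[word] = (count / len(doc)) * idf
    return scores


def textrank(doc, window=3, damping=0.85, iterations=50):
    """TextRank scores on an undirected co-occurrence graph (window and damping are assumed)."""
    neighbors = defaultdict(set)
    for i, word in enumerate(doc):
        for other in doc[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    vocab = set(doc)
    scores = {w: 1.0 for w in vocab}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's current score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in vocab
        }
    return scores


def min_max(scores):
    """Min-max normalization to [0, 1] (the normalization scheme is an assumption)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {w: 1.0 for w in scores}
    return {w: (s - lo) / (hi - lo) for w, s in scores.items()}


def extract_keywords(doc, corpus, top_n=5, alpha=0.5):
    """Weighted sum of normalized TF-IDF and TextRank; `alpha` is an assumed weight."""
    tfidf = min_max(tf_idf(doc, corpus))
    rank = min_max(textrank(doc))
    combined = {w: alpha * tfidf[w] + (1 - alpha) * rank[w] for w in tfidf}
    # Sort by descending weight, breaking ties alphabetically for determinism.
    return sorted(combined, key=lambda w: (-combined[w], w))[:top_n]


if __name__ == "__main__":
    stop_words = {"the"}  # toy stop-word list for illustration only
    raw = "team wins the final match team coach praises the team".split()
    doc = preprocess(raw, stop_words)
    corpus = [doc, ["league", "announces", "schedule"], ["player", "signs", "contract"]]
    print(extract_keywords(doc, corpus, top_n=3))
```

The toy English tokens stand in for segmented Chinese words. In practice the token lists would come from a Chinese word segmenter, and `alpha` would be tuned on labeled data rather than fixed at 0.5.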