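Abstract: To address the low efficiency and high difficulty of identifying technical topics in patents, this paper proposes FPC-Kmeans++ (Kmeans plus plus with feature phrase clusters), a method for patent clustering analysis and technical topic recognition that uses feature phrases, rather than conventional word segmentation results, as the basis for patent data analysis. The method is tested empirically on unmanned aerial vehicle (UAV) patents. The experimental results show that, compared with the traditional Kmeans++ (Kmeans plus plus) and LDAKmeans++ (Kmeans plus plus with Latent Dirichlet Allocation) methods, it determines the optimal number of topics more accurately and yields more clearly layered clusters, demonstrating its advantage in patent topic recognition. In addition, the proposed NER-FPP (Named Entity Recognition with Feature Phrase Probability) algorithm outperforms the other compared algorithms in extracting patent feature phrases, achieving the highest F1 score, 93.36%.
Keywords: topic recognition; patent clustering; NER; TF-IDF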
CLC number: TP391.1
Document code: A
Funding: General Program of the National Social Science Fund of China, 2022 (22BGL282)
Research on Patent Clustering Analysis and Technical Topic Recognition Based on FPC-Kmeans++ in the Unmanned Aerial Vehicle Field
LIU Jun1, WANG Xiulai1,2
(1. School of Computer, Nanjing University of Information Science and Technology, Nanjing 210044, China; 2. Nanjing Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing 210016, China)
20211249335@nuist.edu.cn; wangxiulai@126.com
Abstract: To address the low efficiency and high difficulty of patent technical topic recognition, this paper proposes FPC-Kmeans++ (Kmeans Plus Plus with Feature Phrase Clusters), a patent clustering analysis and technical topic recognition method that uses feature phrases instead of traditional word segmentation results as the basis for patent data analysis. The method is tested empirically on Unmanned Aerial Vehicle (UAV) patents. The experimental results show that, compared with the traditional Kmeans++ and LDAKmeans++ (Kmeans Plus Plus with Latent Dirichlet Allocation) methods, the proposed method determines the optimal number of topics more accurately and produces more clearly layered clusters, demonstrating its advantages in patent topic recognition. Furthermore, the proposed NER-FPP (Named Entity Recognition with Feature Phrase Probability) algorithm performs best among the compared algorithms in extracting patent feature phrases, with the highest F1 score, 93.36%.
Keywords: topic recognition; patent clustering; NER; TF-IDF