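Abstract: To make effective use of known information and cluster data quickly, a grid-based semi-supervised density peak clustering (GS-DPC) algorithm is proposed. The dataset is partitioned with a statistical information grid (STING), the number of data points falling in each grid cell is taken as that cell's local density, and a representative point is computed for each cell. Cluster centers are then determined from the local density and relative distance values, and a pairwise constraint set guides the clustering process to produce the final result. Experiments show that the average running time of GS-DPC is 32 percentage points lower than that of the density peak clustering (DPC) algorithm. Across six datasets, GS-DPC attains an average accuracy (ACC) of about 0.84, an average adjusted mutual information (AMI) of about 0.68, and an average adjusted Rand index (ARI) of about 0.67, indicating that GS-DPC clusters data quickly and effectively with good results.
Keywords: density peak clustering; grid; semi-supervised; STING; pairwise constraint
CLC number: TP399
Document code: A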
A Grid-based Semi-supervised Density Peak Clustering Algorithm
YANG Jinrui1, LIU Ji1,2
(1. School of Statistics & Data Science, Xinjiang University of Finance & Economics, Urumqi 830012, China; 2. Xinjiang Social & Economic Statistics & Big Data Application Research Center, Xinjiang University of Finance & Economics, Urumqi 830012, China)
1519188386@qq.com; Liuji5000@126.com
Abstract: To cluster data quickly while making effective use of known information, a Grid-based Semi-supervised Density Peak Clustering (GS-DPC) algorithm is proposed. The algorithm partitions the dataset with a statistical information grid (STING), takes the number of data points in each grid cell as that cell's local density, and computes a representative point for each cell. Cluster centers are determined from the local density and relative distance values, and a pairwise constraint set then guides the clustering process to produce the final result. Experimental results show that the average running time of GS-DPC is 32 percentage points lower than that of the Density Peak Clustering (DPC) algorithm. On six datasets, GS-DPC achieves an average accuracy (ACC) of about 0.84, an average Adjusted Mutual Information (AMI) of about 0.68, and an average Adjusted Rand Index (ARI) of about 0.67, indicating that it clusters data quickly and effectively.
Keywords: density peak clustering; grid; semi-supervised; STING; pairwise constraint
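
The grid-density and center-selection steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: square cells of a fixed `cell_size`, the cell mean as the representative point, the product of density and relative distance as the ranking score (a common DPC heuristic), and all function names. The pairwise-constraint step is omitted because the abstract does not specify how the constraints are applied.

```python
from collections import defaultdict
from math import dist, floor


def grid_cells(points, cell_size):
    """Group points by the index of the square grid cell they fall in."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(floor(c / cell_size) for c in p)
        cells[key].append(p)
    return cells


def cell_summaries(cells):
    """Return a (representative point, density) pair for every non-empty cell.

    Density is the number of points in the cell. The representative is
    assumed here to be the mean of those points.
    """
    summaries = []
    for members in cells.values():
        n = len(members)
        rep = tuple(sum(coord) / n for coord in zip(*members))
        summaries.append((rep, n))
    return summaries


def relative_distances(summaries):
    """Distance from each representative to the nearest one ranked denser.

    Ties in density are broken by list order. The densest representative
    gets its largest distance to any other representative (DPC convention).
    """
    n = len(summaries)
    order = sorted(range(n), key=lambda i: -summaries[i][1])
    delta = [0.0] * n
    for rank in range(1, n):
        i = order[rank]
        delta[i] = min(dist(summaries[i][0], summaries[j][0])
                       for j in order[:rank])
    if n > 1:
        top = order[0]
        delta[top] = max(dist(summaries[top][0], summaries[j][0])
                         for j in order[1:])
    return delta


def pick_centers(summaries, delta, k):
    """Pick the k representatives with the largest density * distance score."""
    scores = [rho * d for (_, rho), d in zip(summaries, delta)]
    top = sorted(range(len(summaries)), key=lambda i: -scores[i])[:k]
    return [summaries[i][0] for i in top]
```

Because densities and distances are computed per occupied cell rather than per data point, the pairwise distance step scales with the number of non-empty cells, which is one plausible source of the reported time savings over point-level DPC.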