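Abstract: To address the random selection of initial cluster centers in the traditional k-means algorithm, an improved k-means algorithm is proposed. First, a weight variable is defined whose value equals the sample density multiplied by the inter-cluster distance and divided by the average intra-cluster sample distance; cluster centers are chosen by maximum weight, which overcomes the instability of random center selection. The algorithm is then parallelized on the Hadoop platform using the Map-Reduce framework. Finally, taking Nantong bus IC card swipe records as an example, the improved k-means clustering algorithm is used to analyze the records. Experiments show that the improved k-means algorithm runs stably and reliably on the Hadoop platform and achieves good clustering results. |
Keywords: MapReduce; improved k-means algorithm; k-means; clustering |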
CLC number: TP301
Document code: A
|
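Funding: This work was supported by the Nantong Science and Technology Funding Project "Research on the Application of BP Neural Network Technology in Smart Bus IC Cards" (Project No. MS12017026-4). |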
|
Study on the Application of Improved K-means Clustering Algorithm in the Data Analysis of Bus IC Cards |
YANG Jianbing
|
(Nantong Science and Technology College, Nantong 226007, China)
|
Abstract: To address the random selection of initial cluster centers in the traditional k-means algorithm, this paper proposes an improved k-means algorithm. First, a weight variable is defined: the weight equals the sample density multiplied by the distance between clusters, divided by the average distance between samples within a cluster. Cluster centers are chosen by maximum weight, which overcomes the instability caused by choosing them at random. The algorithm is then parallelized on the Hadoop platform using the Map-Reduce framework. Finally, taking the Nantong bus IC card records as an example, the improved k-means clustering algorithm is used to analyze the card-swipe records. Experiments show that the improved k-means algorithm runs stably and reliably on the Hadoop platform and achieves a good clustering effect. |
Keywords: MapReduce; improved k-means algorithm; k-means; clustering |
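
The weight-based choice of initial centers described in the abstract can be illustrated with a minimal sketch. The abstract gives only the form of the weight (density × inter-cluster distance ÷ average intra-cluster distance), so every concrete definition below is an assumption made for illustration, not the paper's own formulation: density is taken as the number of samples within a hypothetical `radius`, the intra-cluster term as the mean distance to those neighbours, and the inter-cluster term as the distance to the nearest already-chosen center (set to 1 for the first center). The function name `initial_centers` is likewise hypothetical.

```python
import math


def initial_centers(points, k, radius):
    """Pick k initial centers by maximum weight (illustrative sketch only).

    Assumed definitions (not taken from the paper):
      density[i]    = number of other samples within `radius` of sample i
      mean_intra[i] = mean distance from sample i to those neighbours
      separation    = distance to the nearest already-chosen center
                      (1.0 while no center has been chosen yet)
      weight        = density * separation / mean_intra
    """
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]

    neighbours = [
        [j for j in range(n) if j != i and d[i][j] <= radius] for i in range(n)
    ]
    density = [len(nb) for nb in neighbours]
    mean_intra = [
        sum(d[i][j] for j in nb) / len(nb) if nb else math.inf
        for i, nb in enumerate(neighbours)
    ]

    chosen = []
    for _ in range(k):
        best, best_w = None, -1.0
        for i in range(n):
            if i in chosen:
                continue
            separation = min(d[i][c] for c in chosen) if chosen else 1.0
            # Small epsilon guards against duplicate points (mean_intra == 0).
            w = density[i] * separation / (mean_intra[i] + 1e-12)
            if w > best_w:
                best, best_w = i, w
        chosen.append(best)
    return [points[i] for i in chosen]


# Two well-separated groups of five samples each.
samples = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
]
centers = initial_centers(samples, k=2, radius=1.0)
```

Because the weight is computed deterministically from the data, repeated runs on the same samples return the same initial centers, which is the stability property the abstract attributes to the approach. The resulting centers would then seed the usual k-means iterations.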