摘 要: 闭频繁项集包含了关于频繁项集的完整信息,可显著减少频繁项集挖掘所产生的模式数量,在一定程度 上降低了内存开销、提高了时间效率。数据流的特性决定了它需要更高效的挖掘算法,为此使用分治策略,提出一种并 行化闭频繁项集挖掘算法PCFI。该算法采用垂直数据格式存储项集的事务,通过对事务集的集合运算,可快速得到项 集的支持度计数,合并具有相同事务集的频繁项,得到初始生成子,降低了搜索空间的规模。采用分治策略对初始生成 子进行并行处理,得到约简前序集和约简后序集,在挖掘过程中不断地对每一生成子的搜索空间进行减枝,得到更小的 约简后序集,从而减少对冗余数据的处理。实验分析表明,该算法的性能优于先前设计的算法。 |
关键词: 数据流;滑动窗口;垂直数据格式;并行计算;闭频繁项集 |
中图分类号: TP311.5
文献标识码: A
|
|
Parallel Mining Algorithm of Closed Frequent Itemsets in the Data Stream |
FENG Zhonghui,YIN Shaohong
|
( School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China)
|
Abstract: The closed frequent itemsets contain complete information about frequent itemsets,which can significantly reduce the number of patterns generated by frequent itemsets mining,to a certain extent,decreasing the memory overhead and improving the time efficiency.The characteristics of the data stream determine that it needs a more efficient mining algorithm.To solve this problem,the paper proposes a parallel closed frequent itemsets mining algorithm,PCFI.This algorithm uses the vertical data format to store the items in a set.By collecting the set of transactions,the support counts of the items can be quickly obtained,and the frequent items with the same set of transactions are merged to obtain the initial generation and reduce the size of the search space.The partitioning strategy is adopted to process the initial generator in parallel,and the sets of pre-reduction sequence and the post-reduction sequence are obtained.In the mining process,the search space of each generator is continuously reduced,and the reduction sequence set becomes smaller,thus reducing the redundant data processing.Experimental analysis shows that the performance of this algorithm is superior to the previously designed algorithm. |
Keywords: data stream;sliding window model;vertical data format;parallel computing;closed frequent itemsets |