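Abstract: In the era of big data, data is growing far faster than manual analysis can keep up with, so building a mechanism for hot word discovery and visualization is both urgent and important. This paper studies the MapReduce computing framework on the Hadoop big data platform together with the TF-IDF algorithm, and presents a concrete implementation of TF-IDF on the Hadoop distributed parallel computing platform. This parallel algorithm serves as the core algorithm of the hot word discovery technique under a big data architecture, and visualization tools are then used to analyze and process the results. The results show that the parallel TF-IDF algorithm discovers hot words in large-scale data well and that it is more efficient than the traditional single-machine algorithm.
Keywords: Hadoop; TF-IDF parallelization; hot word discovery; visualization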
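CLC number: TP391
Document code: A
Funding: 2017 Luoyang Social Science Planning Project, "Research on Hot Word Discovery and Visualization Technology under Big Data Architecture".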
Research on Hot Word Discovery and Visualization Technology Based on Big Data Architecture
HU Ruijuan
(Information Engineering University, Zhengzhou 450000, China)
Abstract: In the era of big data, data is growing far faster than manual analysis can handle, so building a hot word discovery and visualization mechanism is both urgent and important. By studying the MapReduce computing framework on the Hadoop platform and the TF-IDF algorithm, this paper gives a concrete implementation of TF-IDF on the Hadoop distributed parallel computing platform, uses this parallel algorithm as the core of hot word discovery under a big data architecture, and then uses visualization tools to display and analyze the results. The results show that the parallel TF-IDF algorithm finds hot words in large volumes of data effectively and is more efficient than traditional single-machine algorithms.
Keywords: Hadoop; TF-IDF parallelization; hot word discovery; visualization