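Abstract: The default block placement policy of the Hadoop Distributed File System (HDFS) balances storage reliability against read/write speed, but it does not aim to get the best performance out of the cluster. To address this problem, an optimized block placement algorithm is proposed. The algorithm defines two metrics for each data block, its query rate and its average read time, which are used to assess how much the tasks running on the cluster demand that block. While complying with the basic rules of the default HDFS placement algorithm, it analyzes block demand and then recomputes block locations, moving the most heavily demanded data to the nodes that can process it fastest. Experimental data show that the algorithm improves overall cluster performance by more than 20%. The optimized block placement algorithm is effective and does not increase cluster bandwidth usage.
Keywords: HDFS; data block; placement strategy; performance optimization
CLC number: TP311.1
Document code: A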
An Optimized Hadoop Data Placement Strategy
WU Yue
(State Forestry and Grassland Administration Industrial Development Planning Institute, Beijing 100010, China)
wuyue98@126.com
Abstract: The default data block placement strategy of the Hadoop Distributed File System (HDFS) balances data storage reliability against read/write speed, but it does not take optimal cluster performance into account. This paper proposes an optimized data block placement algorithm to address this issue. The algorithm defines two indicators for each data block, query rate and average read time, to evaluate how heavily the tasks executed on the cluster demand that block. While still satisfying the basic rules of the default HDFS placement algorithm, it analyzes block demand, recalculates block placement, and transfers the data in highest demand to the nodes that can process them fastest. Experimental results show that the algorithm improves overall cluster performance by more than 20%. The optimized data block placement algorithm is effective and does not increase cluster bandwidth usage.
Keywords: HDFS; data blocks; placement strategy; performance optimization