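Abstract: To prevent the large-scale data loss that hard disk failures can cause, this paper proposes a Random Forest-based method for predicting hard disk faults and thereby lowering the risk of data loss. First, in the data preprocessing stage, feature mapping is applied to the data used. Second, decision trees are built and selected to construct a Random Forest prediction model, which predicts the interval in which the hard disk fault rate falls from the selected feature attributes; changes in these attributes also reflect the trend of the fault rate. Finally, the parameters of the constructed Random Forest model are tuned by testing different values of the n_estimators parameter. Experimental results show that, compared with XGBoost (Extreme Gradient Boosting) and LSTM (Long Short-Term Memory), the F1 value (F-Measure) of the proposed method is higher by 0.93% and 1.84%, respectively. After different parameter values of the Random Forest prediction model are tested, the final accuracy reaches 98.18%, which is 1.23% higher than with the default value. This shows that the method predicts the hard disk fault rate more accurately and reflects how the fault rate changes with the feature attributes.
Keywords: Random Forest; hard disk fault rate; fault rate prediction; feature mapping; S.M.A.R.T attributes
CLC number: TP391
Document code: A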
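Funding: Natural Science Foundation of Hebei Province (F2022208002); Key Project of Science and Technology Research of Higher Education Institutions of Hebei Province (ZD2021048)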
Research on Hard Disk Fault Rate Prediction Based on Random Forest
ZHANG Yongqiang (1,4), KONG Junjun (1), CUI Yao (2), LI Xiangnan (3,4)
(1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China; 2. Shijiazhuang Changhong Intelligent Technology Co., Ltd., Shijiazhuang 050004, China; 3. Shijiazhuang Chunxiao Internet Information Technology Co., Ltd., Shijiazhuang 050061, China; 4. Hebei Technology Innovation Center of Intelligent IoT, Shijiazhuang 050018, China)
zyq@hebust.edu.cn; kjunjun555@163.com; cuiyao@changhong.cc; xiangnan.li@chunxiao.net
Abstract: To address hard disk faults that result in large-scale data loss, this paper proposes a Random Forest-based method to predict hard disk faults and reduce the risk of data loss. First, in the data preprocessing stage, feature mapping is applied to the data used. Second, decision trees are constructed and selected to build a Random Forest model that predicts the interval in which the hard disk fault rate falls, based on the selected feature attributes; changes in these attributes reflect the trend of the fault rate. Finally, the parameters of the constructed Random Forest model are tuned by testing different values of the n_estimators parameter. The experimental results show that, compared with XGBoost (Extreme Gradient Boosting) and LSTM (Long Short-Term Memory), the F1 value (F-Measure) of the proposed method is 0.93% and 1.84% higher, respectively. In addition, after different parameter values of the Random Forest model are tested, the final accuracy reaches 98.18%, which is 1.23% higher than with the default value. These results indicate that the proposed method predicts the hard disk fault rate more accurately and reflects how the fault rate changes with the feature attributes.
Keywords: Random Forest; hard disk fault rate; fault rate prediction; feature mapping; S.M.A.R.T attributes
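The tuning step summarized in the abstract (trying several n_estimators values and comparing accuracy and F1) can be illustrated with a small, standard-library-only sketch. Everything here is an assumption made for illustration: the data are synthetic, the labeling rule is hypothetical, and each "tree" is a one-split decision stump fitted on a bootstrap sample with a random feature subset, which is a much simplified stand-in for the full decision trees of a Random Forest. The parameter name n_estimators matches the one used by scikit-learn's RandomForestClassifier, although the abstract does not name the library used.

```python
import random
from collections import Counter

def make_data(n, rng):
    """Synthetic stand-in for mapped features; label 1 = higher fault-rate interval."""
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(4)]
        # Hypothetical labeling rule plus noise; not derived from real disk data.
        y = 1 if x[0] + 0.5 * x[1] + rng.gauss(0, 0.1) > 0.75 else 0
        rows.append((x, y))
    return rows

def fit_stump(rows, feature):
    """Best single-threshold split on one feature.

    Returns (training_errors, feature, threshold, left_label, right_label).
    """
    best = None
    for x, _ in rows:
        t = x[feature]
        left = [y for xi, y in rows if xi[feature] <= t]
        right = [y for xi, y in rows if xi[feature] > t]
        if not right:
            continue  # threshold at the maximum value gives no real split
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, t, l_lab, r_lab)
    if best is None:  # all sampled values identical: predict the majority label
        maj = Counter(y for _, y in rows).most_common(1)[0][0]
        best = (sum(y != maj for _, y in rows), feature, float("inf"), maj, maj)
    return best

def fit_forest(rows, n_estimators, rng, max_features=2):
    """Bagged stumps: bootstrap sample plus random feature subset per estimator."""
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_estimators):
        sample = [rng.choice(rows) for _ in rows]            # bootstrap sample
        feats = rng.sample(range(n_features), max_features)  # random feature subset
        stumps = [fit_stump(sample, f) for f in feats]
        forest.append(min(stumps, key=lambda s: s[0]))       # keep the best stump
    return forest

def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= t else r for _, f, t, l, r in forest)
    return votes.most_common(1)[0][0]

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for the positive class: F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1

data_rng = random.Random(0)
train, test = make_data(150, data_rng), make_data(100, data_rng)
y_test = [y for _, y in test]

results = {}
for n in (1, 5, 25):  # candidate n_estimators values
    forest = fit_forest(train, n, random.Random(1))
    preds = [predict(forest, x) for x, _ in test]
    results[n] = accuracy_and_f1(y_test, preds)
    print(f"n_estimators={n}: accuracy={results[n][0]:.3f}, F1={results[n][1]:.3f}")
```

The numbers printed by this sketch depend entirely on the synthetic data and say nothing about the results reported in the abstract; the point is only the shape of the procedure: fix the data split, vary n_estimators, and compare the same metrics across settings.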