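Abstract: Random sampling in the ReliefF algorithm may draw unrepresentative samples, and the algorithm does not account for correlation between features. To address both problems, a ReliefF feature selection algorithm based on redundancy analysis is proposed. First, the sampling strategy of ReliefF is improved. Second, the feature weight sequence is divided into several subsets, on which the maximal information coefficient and the Pearson coefficient are used jointly to measure feature correlation, with corresponding sampling ratios set to remove redundant features. The improved algorithm is compared with other feature selection algorithms. The results show that, relative to traditional ReliefF, classification accuracy improves by 0.63% to 12.10% on LightGBM (Light Gradient Boosting Machine) and by 0.92% to 9.06% on SVM (Support Vector Machine). The classification accuracy of the improved algorithm is clearly higher than that of the other feature selection algorithms, and the algorithm effectively removes redundant information while still considering the correlation between features and labels.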
Keywords: feature selection; ReliefF algorithm; maximal information coefficient; redundancy analysis
CLC number: TP181
Document code: A
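Funding: European Union project (598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP); National Natural Science Foundation of China (61602064); Sichuan Provincial Department of Science and Technology projects (2021YFH0107, 2022YFS0544, 2022NSFSC0571); Chengdu University of Information Technology Science and Technology Innovation Capability Improvement Program, "Research on Optimization of Disease Risk Assessment and Prediction for Large-Scale Medical Data" (KYQN202223)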
Improved ReliefF Feature Selection Algorithm Based on Analysis of Redundancy
LI Lijun1,4, ZHANG Haiqing1,4, LI Daiwei1,4, XIANG Xiaoming2, YU Xi3
(1. School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China; 2. Sichuan Meteorological Observation and Data Centre, Chengdu 610072, China; 3. Stirling College, Chengdu University, Chengdu 610106, China; 4. Sichuan Province Engineering Technology Research Center of Support Software of Informatization Application, Chengdu 610225, China)
2432094015@qq.com; zhanghq@cuit.edu.cn; ldwcuit@cuit.edu.cn; micxiang@foxmail.com; yuxi@cdu.edu.cn
Abstract: This paper proposes a ReliefF feature selection algorithm based on redundancy analysis to address two problems in the ReliefF algorithm: random sampling may draw non-representative samples, and the correlation between features is not considered. First, the sampling strategy of ReliefF is improved. Then the feature weight sequence is divided into several subsets, the maximal information coefficient and the Pearson coefficient are used jointly to measure feature correlation, and corresponding sampling ratios are set to eliminate redundant features. The improved algorithm is compared with other feature selection algorithms. The results show that, compared with traditional ReliefF, the classification accuracy of the improved algorithm increases by 0.63% to 12.10% on LightGBM (Light Gradient Boosting Machine) and by 0.92% to 9.06% on SVM (Support Vector Machine). The classification accuracy of the improved algorithm is clearly higher than that of the other feature selection algorithms, and the algorithm effectively eliminates redundant information while still considering the correlation between features and labels.
Keywords: feature selection; ReliefF algorithm; maximal information coefficient; analysis of redundancy
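
As an illustration of the redundancy-removal step summarized in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that relevance weights (for example, from ReliefF) are already available, splits the weight-ranked features into contiguous groups, and within each group greedily keeps features that are not strongly correlated with those already kept, up to a per-group ratio. Only the Pearson coefficient is used here; the maximal information coefficient would need an external library and is omitted. The names `pearson`, `split_ranked`, `filter_redundant`, the `threshold` value, and the greedy within-group rule are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(vx * vy)


def split_ranked(ranked, n_groups):
    """Split a ranked list into contiguous, near-equal groups."""
    size = math.ceil(len(ranked) / n_groups)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def filter_redundant(columns, weights, ratios, threshold=0.9):
    """Hypothetical redundancy filter over relevance-ranked features.

    columns:   dict mapping feature name -> list of values
    weights:   dict mapping feature name -> relevance weight
    ratios:    fraction of features to keep in each group (assumed form
               of the "sampling ratios" mentioned in the abstract)
    threshold: |r| at or above which two features count as redundant
    """
    # Rank features by descending relevance weight.
    ranked = sorted(weights, key=weights.get, reverse=True)
    groups = split_ranked(ranked, len(ratios))
    selected = []
    for group, ratio in zip(groups, ratios):
        quota = math.ceil(ratio * len(group))
        kept = []
        for name in group:
            if len(kept) >= quota:
                break
            # Keep a feature only if it is not strongly correlated
            # with any feature already kept in the same group.
            if all(abs(pearson(columns[name], columns[k])) < threshold
                   for k in kept):
                kept.append(name)
        selected.extend(kept)
    return selected
```

With four toy features ranked `a, b, c, d`, where `b` is an exact multiple of `a`, and ratios `[1.0, 0.5]`, this sketch drops `b` as redundant in the first group and keeps only `c` from the second, returning `['a', 'c']`. Checking redundancy only within a group, rather than against all previously selected features, is a design choice of this sketch.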