Software Engineering

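Cite this article:

LI Lijun, ZHANG Haiqing, LI Daiwei, XIANG Xiaoming, YU Xi. Improved ReliefF Feature Selection Algorithm Based on Analysis of Redundancy[J]. Software Engineering, 2023, 26(11): 48-51.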

Improved ReliefF Feature Selection Algorithm Based on Analysis of Redundancy

LI Lijun¹,⁴, ZHANG Haiqing¹,⁴, LI Daiwei¹,⁴, XIANG Xiaoming², YU Xi³

(1. School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China;
2. Sichuan Meteorological Observation and Data Centre, Chengdu 610072, China;
3. Stirling College, Chengdu University, Chengdu 610106, China;
4. Sichuan Province Engineering Technology Research Center of Support Software of Informatization Application, Chengdu 610225, China)

2432094015@qq.com; zhanghq@cuit.edu.cn; ldwcuit@cuit.edu.cn; micxiang@foxmail.com; yuxi@cdu.edu.cn

Abstract: Random sampling in the ReliefF algorithm can draw unrepresentative samples, and the algorithm does not consider correlation between features. To address both problems, this paper proposes a ReliefF feature selection algorithm based on redundancy analysis. First, the sampling strategy of ReliefF is improved. Second, the feature weight sequence is divided into several subsets; for each subset, the maximal information coefficient (MIC) and the Pearson coefficient jointly measure feature correlation, and a corresponding sampling ratio is set to eliminate redundant features. The improved algorithm is compared with other feature selection algorithms. Relative to traditional ReliefF, classification accuracy improves by 0.63% to 12.10% on LightGBM (Light Gradient Boosting Machine) and by 0.92% to 9.06% on SVM (Support Vector Machine). The improved algorithm achieves clearly higher classification accuracy than the other feature selection algorithms compared, and it effectively eliminates redundant information while still accounting for the correlation between features and labels.

Keywords: feature selection; ReliefF algorithm; maximal information coefficient; redundancy analysis

CLC number: TP181    Document code: A

Funding: European Union project (598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP); National Natural Science Foundation of China (61602064); Sichuan Provincial Department of Science and Technology projects (2021YFH0107, 2022YFS0544, 2022NSFSC0571); Chengdu University of Information Technology Science and Technology Innovation Capability Enhancement Program project, "Research on Optimization of Disease Risk Assessment and Prediction for Large-Scale Medical Data" (KYQN202223)
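
The redundancy-removal step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the feature weights have already been computed (for example by ReliefF), uses only the Pearson coefficient (MIC is omitted because it typically needs a third-party library), and replaces the paper's per-subset sampling ratio, which the abstract does not specify, with a hypothetical correlation `threshold`. The names `drop_redundant`, `n_subsets`, and `threshold` are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature has no linear relationship
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, weights, n_subsets=2, threshold=0.9):
    """Return names of features kept after subset-wise redundancy filtering.

    columns: dict mapping feature name -> list of values (one per sample)
    weights: dict mapping feature name -> relevance weight (e.g. from ReliefF)
    n_subsets and threshold are hypothetical parameters, not from the paper.
    """
    # Rank features from most to least relevant.
    ranked = sorted(weights, key=weights.get, reverse=True)
    # Split the ranked sequence into contiguous subsets of near-equal size.
    size = max(1, -(-len(ranked) // n_subsets))  # ceiling division
    kept = []
    for start in range(0, len(ranked), size):
        subset_kept = []
        for name in ranked[start:start + size]:
            # Keep a feature only if it is not strongly correlated with a
            # higher-weighted feature already kept in the same subset.
            if all(abs(pearson(columns[name], columns[k])) < threshold
                   for k in subset_kept):
                subset_kept.append(name)
        kept.extend(subset_kept)
    return kept


# Toy data: "b" is an exact multiple of "a" and should be dropped.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
    "d": [1.0, 3.0, 2.0, 4.0],
}
weights = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4}
selected = drop_redundant(columns, weights)
print(selected)
```

Filtering within each weight-ranked subset, rather than across all features at once, keeps the comparison local to features of similar relevance; how the paper tunes this per subset is not stated in the abstract.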
