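Abstract: Missing data are unavoidable in real-world work and daily life. To keep information complete for subsequent statistical analysis, it is especially important to predict and fill in missing values as accurately as possible. Using two simulated data sets that follow a Gaussian distribution and a Gamma distribution respectively, together with a real data set of life expectancy for selected African countries, three missing ratios (5%, 10%, and 20%) are preset, and computer software is used to compare the statistical results of four interpolation methods. The experiments show that, on the simulated data, autoregressive modeling interpolation and mean interpolation perform slightly better overall than nearest-neighbor interpolation and linear regression interpolation. On the real data, nearest-neighbor interpolation and linear regression interpolation outperform the former two when the missing ratio is low; when the missing ratio is high, the results do not differ noticeably from those on the simulated data.
Keywords: missing data; interpolation methods; autoregressive modeling
CLC number: TP399
Document code: A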
Comparative Analysis of the Performance of Interpolation Methods for Missing Data
XU Hongyan1, SUN Yunshan2, QIN Qilin1, ZHU Mingtao2
(1. School of Science, Tianjin University of Commerce, Tianjin 300134, China; 2. School of Information Engineering, Tianjin University of Commerce, Tianjin 300134, China)
2552727224@qq.com; sunyunshan@tjcu.edu.cn; 3099141857@qq.com; 648191948@qq.com
Abstract: Missing data are inevitable. To preserve information integrity and support subsequent statistical analysis, it is particularly important to predict and fill in missing values as accurately as possible. Based on two simulated data sets that follow a Gaussian distribution and a Gamma distribution respectively, and a real data set of life expectancy for some African countries, three missing ratios of 5%, 10% and 20% are preset, and the statistical results of four interpolation methods are compared and analyzed using computer software. The experimental results show that, on the simulated data, autoregressive modeling interpolation and mean interpolation perform slightly better overall than K-nearest neighbor interpolation and linear regression interpolation. On the real data, K-nearest neighbor interpolation and linear regression interpolation are better than the former two when the proportion of missing data is low; when the missing ratio is high, the results show no obvious difference from those obtained on the simulated data.
Keywords: missing data; interpolation method; autoregressive modeling