摘 要: 针对动物图像存在图像背景复杂多变、类间特征差异小、类内特征差异大的特点,提出多尺度可变ViT(Vision Transformer)图像识别模型。在ViT模型的基础上,融合卷积神经网络多层特征图,并提出可变注意力机制,使模型能较好地融合图像的局部特征和全局特征,能较好地识别图像中各种尺度的动物。构建包含90种类别、共21 142张图像的动物数据集,在数据集上进行实验的结果表明,所提出的模型取得了90.34%和97.59%的Top-1准确率和Top-5准确率。
关键词: 动物图像;ViT;可变注意力机制;多层特征图
CLC number: TP39
Document code: A
Multi-Scale Adaptable Vision Transformer and Its Application in Animal Image Recognition
XIA Yifan, WANG Duanhong, LI Jilong, JIANG Feng
(Taizhou Institute of Sci. & Tech., NJUST, Taizhou 225300, China)
1115000760@qq.com; 1530891210@qq.com; 2419267020@qq.com; jf@nustti.edu.cn
Abstract: Animal images are characterized by complex and variable backgrounds, small inter-class feature differences, and large intra-class feature differences. To address these problems, this paper proposes a multi-scale adaptable ViT (Vision Transformer) image recognition model. Building on the ViT model, the proposed model fuses multi-layer feature maps from a convolutional neural network and introduces an adaptable attention mechanism, so that it can effectively integrate the local and global features of an image and recognize animals at various scales. An animal dataset containing 90 categories and a total of 21 142 images is constructed, and experiments on this dataset show that the proposed model achieves a Top-1 accuracy of 90.34% and a Top-5 accuracy of 97.59%.
Keywords: animal image; ViT; adaptable attention mechanism; multi-layer feature map