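Abstract: To address the tendency of traditional multi-label classification models to overlook label correlations, and the rising cost of label annotation, a novel multi-label image classification method is proposed that combines prompt learning with a cross-attention mechanism and is trained with partial labels. Specifically, prompts are first combined with labels to generate text inputs, which a pre-trained text encoder encodes to extract text features. Images are then fed to an image encoder. Learnable prompts are added to both the text and image encoders to enhance model performance. In addition, a cross-attention mechanism is used to promote information exchange between the modalities and thereby improve classification. Experiments show that, on the PASCAL Visual Object Classes (VOC2007) dataset with 90% of the ground-truth labels available, the model reaches an mAP of 94.6%.
Keywords: multi-label classification; prompt learning; attention mechanism
CLC number: TP391
Document code: A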
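Funding: Ningbo Science and Technology Program (2022Z082, 2023Z069); Natural Science Foundation of Zhejiang Province (LY23F020014); Suzhou Science and Technology Program (ZXL2023176)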
Multi-label Image Classification Method Based on Prompt Learning and Attention Mechanism
WANG Rui1, WU Fangyu2, ZHANG Bailing1,3
(1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China; 2. School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou 215123, China; 3. School of Computing and Data Engineering, NingboTech University, Ningbo 315100, China)
Ruiwang2301@163.com; fangyu.wu02@xjtlu.edu.cn; bailing.zhang@nit.zju.edu.cn
Abstract: Traditional multi-label classification models tend to ignore label correlations, and the cost of label annotation keeps rising. To address both problems, this paper proposes a novel multi-label image classification method that combines prompt learning with a cross-attention mechanism and is trained on partially labeled data. Specifically, the method first generates textual inputs by combining prompts with labels and encodes them with a pre-trained text encoder to extract text features. Next, images are fed to an image encoder. Learnable prompts are incorporated into both the text and image encoders to enhance model performance. Additionally, a cross-attention mechanism is employed to facilitate interaction between the two modalities and improve classification performance. Experimental results show that the model achieves a mean Average Precision (mAP) of 94.6% on the PASCAL Visual Object Classes (VOC2007) dataset when 90% of the true labels are used.
Keywords: multi-label classification; prompt learning; attention mechanism
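
To make the two central ideas of the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes single-head scaled dot-product cross-attention in which each label's text feature acts as a query over image patch features, and a binary cross-entropy loss that skips unannotated labels. The function names (`cross_attention`, `partial_bce`), the toy feature vectors, and the dot-product scoring step are all hypothetical choices made for illustration; the abstract does not specify these details.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(label_feats, patch_feats):
    """Each label text feature (query) attends over image patch
    features, which serve as both keys and values (an assumption)."""
    d = len(patch_feats[0])
    attended = []
    for q in label_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in patch_feats]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, patch_feats))
                         for j in range(d)])
    return attended

def partial_bce(logits, targets):
    """Binary cross-entropy averaged over annotated labels only.
    A target of None marks a label with no annotation."""
    losses = []
    for z, t in zip(logits, targets):
        if t is None:
            continue  # unannotated label contributes nothing
        p = 1.0 / (1.0 + math.exp(-z))
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses) if losses else 0.0

# Toy data (hypothetical): 2 label text features, 3 image patch features.
label_feats = [[1.0, 0.0], [0.0, 1.0]]
patch_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

attended = cross_attention(label_feats, patch_feats)
# Score each label by the dot product of its text feature and the
# image feature it attended to (one possible scoring choice).
logits = [sum(a * b for a, b in zip(q, f))
          for q, f in zip(label_feats, attended)]
# The second label is unannotated, so only the first enters the loss.
loss = partial_bce(logits, [1, None])
```

Because unannotated labels are dropped before averaging, the loss depends only on the labels that are actually known, which is the property that lets training proceed with, for example, 90% of the labels available.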