Abstract: Code classification is a fundamental task in software development and management, as it facilitates code reuse, comprehension, search, and maintenance. Existing supervised learning methods require a large amount of labeled data for training, and the cost of data annotation is high. To address this problem, a code classification method based on a pre-trained model is proposed. First, the code is preprocessed by removing whitespace, filtering out low-frequency tokens, and similar steps. Second, a BERT-based pre-trained model (CodeBERT) is used to extract syntactic, semantic, and contextual features of the code from unlabeled samples. Finally, a code classifier is fine-tuned on a small labeled sample set on top of the pre-trained model. Experimental results show that the method achieves good results even with few training epochs, and its F1 score is about 12% higher than that of the Text Convolutional Neural Network (Text-CNN) method.
Keywords: code representation; code classification; pre-trained model
CLC number: TP311
Document code: A
Fund projects: Postgraduate Research and Practice Innovation Program of Jiangsu Province (2021XKT1392); Jiangsu Province College Students' Innovation and Entrepreneurship Training Program (202010320035Z); Jiangsu Province Modern Educational Technology Research Project (2022-R-102067); Jiangsu Province "14th Five-Year Plan" Education Science Project (D/2021/01/139)
Research on Code Classification Based on Pre-trained Model
LIANG Yao, HONG Qingcheng, WANG Xia, XIE Chunli
(School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China)
1726093250@qq.com; 2396769801@qq.com; 6020040016@jsnu.edu.cn; 6020030132@jsnu.edu.cn
Abstract: Code classification is a fundamental task in software development and management, as it facilitates code reuse, comprehension, search, and maintenance. Existing supervised approaches to code classification require a large amount of labeled data, and the cost of data annotation is high. To address this problem, this paper proposes a code classification method based on a pre-trained model. First, the code is preprocessed by removing whitespace and low-frequency tokens. Second, a BERT-based pre-trained model (CodeBERT) is adopted to extract syntactic, semantic, and contextual features of the code from unlabeled samples. Finally, a code classifier is fine-tuned on a small labeled sample set on top of the pre-trained model. Experimental results show that the method achieves good results even with few training epochs, and its F1 score is about 12% higher than that of the Text Convolutional Neural Network (Text-CNN) method.
Keywords: code representation; code classification; pre-trained model
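To make the pipeline summarized in the abstract concrete, the following is a minimal sketch of one fine-tuning step of CodeBERT for code classification, using the HuggingFace transformers library and the public microsoft/codebert-base checkpoint. The class count, label, learning rate, and preprocessing shown here are illustrative assumptions, not the exact configuration reported in the paper.

# Minimal sketch: one fine-tuning step of CodeBERT for code classification.
# Assumptions: HuggingFace transformers, the public microsoft/codebert-base
# checkpoint, and a placeholder class count/label; not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=10)  # 10 classes is a placeholder

def preprocess(source):
    # Mirror the whitespace-removal preprocessing step; low-frequency-token
    # filtering would be applied at the corpus level before tokenization.
    return " ".join(source.split())

snippet = "int add(int a, int b) { return a + b; }"
inputs = tokenizer(preprocess(snippet), truncation=True,
                   max_length=512, return_tensors="pt")
label = torch.tensor([3])  # hypothetical class index for this snippet

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**inputs, labels=label).loss  # cross-entropy over class logits
loss.backward()
optimizer.step()

At inference time, the predicted class is simply the argmax over the logits of the classification head added on top of the pre-trained encoder.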