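Cite this article: GUO Qun, ZHANG Huaxiong, WANG Bo, WANG Xinyi. Content and Contextual Sensitive Personal Information Entity Recognition Method[J]. Software Engineering, 2025, 28(2): 6-9.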
Content and Contextual Sensitive Personal Information Entity Recognition Method
GUO Qun1, ZHANG Huaxiong1, WANG Bo2, WANG Xinyi2
(1.School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China;
2.Hangzhou DtDream Technologies Co., Ltd., Hangzhou 310013, China)
202120503035@mails.zstu.edu.cn; zhxhz@zstu.edu.cn; 12541421@qq.com; 13738100147@163.com
Abstract: Existing methods cannot effectively recognize structurally complex sensitive personal information entities in unstructured text. To address this problem, this paper proposes a sensitive personal information entity recognition method based on content and context. On the one hand, rule matching is used to detect sensitive entity types with predictable patterns. On the other hand, an entity relation classification model built on a word-pair relation classification architecture (ELECTRA-W2NER, EW2NER) is constructed to detect sensitive entity types with complex patterns. EW2NER uses the latest ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model for word embeddings and adopts the entity relation classification architecture to extract both flat and overlapping sensitive personal information entities in a unified manner. The model achieves an F1 score of 97.05% on a Chinese sensitive-data dataset, outperforming the ExSense (Extract Sensitive Information from Unstructured Data) model.
Keywords: sensitive information detection; named entity recognition; pattern matching; deep learning
CLC number: TP391.1    Document code: A
Funding: "Pioneer" and "Leading Goose" R&D Program of the Zhejiang Provincial Department of Science and Technology (2024C01019, 2022C01220)
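The rule-matching branch mentioned in the abstract, which handles entity types with predictable patterns, might look roughly like the sketch below. The abstract does not list the paper's actual rules, so the three patterns (email address, mainland China mobile number, 18-character resident ID number), the `RuleHit` type, and the `match_rules` function are all illustrative assumptions. The EW2NER model branch is not sketched here.

```python
import re
from typing import List, NamedTuple


class RuleHit(NamedTuple):
    """One rule-based detection: entity label plus character span."""
    label: str
    start: int
    end: int
    text: str


# Hypothetical patterns for illustration only; they are NOT taken from the paper.
RULES = {
    # Simplified email pattern (does not cover every valid address).
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 11-digit mainland China mobile number, not embedded in a longer digit run.
    "CN_MOBILE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # 18-character resident ID number: 17 digits plus a digit or X check character.
    "CN_ID_CARD": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def match_rules(text: str) -> List[RuleHit]:
    """Run every rule over the text and return hits ordered by start offset."""
    hits = []
    for label, pattern in RULES.items():
        for m in pattern.finditer(text):
            hits.append(RuleHit(label, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h.start)


if __name__ == "__main__":
    for hit in match_rules("Contact zhang@example.com or 13800138000."):
        print(hit)
```

Rules of this kind only inspect the content of a candidate span. Entity types whose form varies too much for a fixed pattern are the ones the abstract assigns to the context-aware model.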


